Re: [Groff] mom : unicode in .INCLUDE'd files
From: Ingo Schwarze
Subject: Re: [Groff] mom : unicode in .INCLUDE'd files
Date: Sun, 23 Jul 2017 15:02:10 +0200
User-agent: Mutt/1.6.2 (2016-07-01)
Hi Ralph,
Ralph Corderoy wrote on Sun, Jul 23, 2017 at 12:38:16PM +0100:
> UTF-8 comes along and groff can't adopt it because it's already
> taken an incompatible fork.
In theory, that's true.
If you see a two-, three-, or four-byte sequence that forms a
valid UTF-8 character, it could theoretically be a sequence of
two, three, or four ISO-LATIN-1 characters.
In practice, these combinations of ISO-LATIN-1 characters are
nonsensical and simply do not occur in real-world files.
So, you can simply read the file up to the first byte with the
high bit set. If that starts a valid UTF-8 sequence, the file
is UTF-8. Otherwise, it is ISO-LATIN-1. I'm doing exactly that
in mandoc, and I have never seen a misclassification in practice.
Groff could do the same and remain backward-compatible.
That doesn't even require a heavy, sophisticated library like
uchardet.
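
To make the heuristic above concrete, here is a rough sketch in Python.
It is not mandoc's actual code (mandoc is written in C), and the
function name guess_encoding is made up for this illustration. It
assumes that "valid UTF-8 sequence" means what Python's strict decoder
accepts, and it reports pure-ASCII input separately because such input
never reaches a byte with the high bit set.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'us-ascii', 'utf-8', or 'iso-8859-1' from the first non-ASCII byte.

    Hypothetical helper illustrating the heuristic; not taken from mandoc.
    """
    for i, byte in enumerate(data):
        if byte < 0x80:
            continue  # plain ASCII so far, keep scanning
        # First byte with the high bit set: the decision is made here.
        if 0xC2 <= byte <= 0xDF:
            trailing = 1  # lead byte of a two-byte sequence
        elif 0xE0 <= byte <= 0xEF:
            trailing = 2  # lead byte of a three-byte sequence
        elif 0xF0 <= byte <= 0xF4:
            trailing = 3  # lead byte of a four-byte sequence
        else:
            return "iso-8859-1"  # cannot start a UTF-8 sequence
        sequence = data[i:i + 1 + trailing]
        try:
            # Strict decoding rejects truncated, overlong, and surrogate forms.
            sequence.decode("utf-8")
        except UnicodeDecodeError:
            return "iso-8859-1"
        return "utf-8"
    return "us-ascii"  # no byte with the high bit set at all


if __name__ == "__main__":
    print(guess_encoding("café".encode("utf-8")))
    print(guess_encoding("café".encode("latin-1")))
```

Only the first non-ASCII sequence is inspected, so the cost is a single
pass up to that point, and a file that is ASCII throughout is handled
the same way under either interpretation.
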
If somebody insists on processing a maliciously crafted ISO-LATIN-1
file where the first non-ASCII byte sequence looks like UTF-8, they
will have to put a charset annotation into the file or use a -K
option. But that won't get in the way of processing historical
files because those just won't contain such nonsense.
Of course, to process native UTF-16 on Windows, preconv will be
needed just like now. No change there. Oh, maybe things get easier
even on Windows, because you gain the additional option to use
"iconv -t UTF-8" just like for any other text file and don't
necessarily need the special preconv(1) tool any longer.
Yours,
Ingo