Re: 1.23 prints some strange error


From: Walter Alejandro Iglesias
Subject: Re: 1.23 prints some strange error
Date: Wed, 25 Oct 2023 14:25:42 +0200

On Wed, Oct 25, 2023 at 05:03:36AM -0500, G. Branden Robinson wrote:
> Hi Walter & Dave,
> 
> At 2023-09-11T19:45:30+0200, Walter Alejandro Iglesias wrote:
> > If instead of sourcing hyphen.tr from my macros with .mso I source it
> > directly from the roff document with .so those error messages
> > disappear.
> 
> As Dave mentioned, this is explained by soelim(1) not being run on the
> "macro sourced" file.  As a rule, I think files to be read with the
> `mso` request should be in plain ASCII only.  The whole point of a macro
> file suitable for general use is that it...gets used generally, which
> means that documents employing a variety of input encodings might employ
> it.  You therefore should use the lowest common denominator character
> encoding for it: ASCII.  (Strictly, ISO 646:1991-IRV.)
> 
> That doesn't mean you have to do much more work or spend a lot of time
> staring at groff_char(7) and learning the special character identifiers
> for the upper half of ISO 8859-1.  You can still have your macro sourced
> file in Latin-1; just run preconv over it stand-alone as a converter.
> 
> $ printf '.ds aunt la t\\355a\n' > family.mso.in
> $ preconv -e latin1 family.mso.in > family.mso
> 
> Part of the preconv(1) man page is likely worth reviewing.
> 
>    iconv support
> [...]
>        The use of iconv means that characters in the input that encode
>        invalid code points for that encoding may be dropped from the
>        output stream or mapped to the Unicode replacement character
>        (U+FFFD).  Compare the following examples using the input “café”
>        (note the “e” with an acute accent), which due to its short
>        length challenges inference of the encoding used.
>               printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
>               printf 'caf\351\n' | preconv -e us-ascii
>               printf 'caf\351\n' | preconv -e latin-1
>        The fate of the accented “e” differs in each case.  In the first,
>        uchardet fails to detect an encoding (though the library on your
>        system may behave differently) and preconv falls back to the
>        locale settings, where octal 351 starts an incomplete UTF‐8
>        sequence and results in the Unicode replacement character.  In
>        the second, it is not a representable character in the declared
>        input encoding of US‐ASCII and is discarded by iconv.  In the
>        last, it is correctly detected and mapped.
> [...]
>    Limitations
>        preconv cannot perform any transformation on input that it cannot
>        see.  Examples include files that are interpolated by
>        preprocessors that run subsequently, including soelim(1); files
>        included by troff itself through “so” and similar requests; and
>        string definitions passed to troff through its -d command‐line
>        option.
> 
> Maybe I should add my admonition above about macro-sourced files to this
> man page.
> 
> At 2023-09-12T11:16:58+0200, Walter Alejandro Iglesias wrote:
> > I cleaned up the quoted text a bit to make room for the following.  Here
> > we go:
> > 
> >   $ uname -a
> >   Linux bell 6.4.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.13-1 (2023-08-31) x86_64 GNU/Linux
> >   $ groff --version | head -1
> >   GNU groff version 1.23.0
> >   $ mkdir test
> >   $ cd test
> >   $ cat << EOF > doc.tr
> >   .mso list.tr
> >   EOF
> >   $ cat << EOF > list.tr
> >   .hw a-hí
> >   .hw a-ño
> >   .hw ár-bol
> >   .hw cu-brí-a
> >   .hw e-té-re-o
> >   .hw ca-mión
> >   .hw ú-te-ro
> >   .hw pin-güi-no
> >   EOF
> >   $ GROFF_TMAC_PATH=. nroff doc.tr
> >   troff:./list.tr:1: error: expected ordinary or special character, got an escaped '%'
> >   troff:./list.tr:4: error: expected ordinary or special character, got an escaped '%'
> 
> This transcript isn't as useful as it could be, because it didn't
> disclose to me what character encoding was used for list.tr on the file
> system.  Running the file(1) command on it and sharing that would help.

I think I've said several times that list.tr is a UTF-8 file.  And I
wouldn't trust file(1) on that.
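
For anyone who wants a check that doesn't rely on file(1)'s guessing, a
strict decode with iconv(1) is one option.  This is only a sketch: the
function name is_utf8 and the two sample files are made up for
illustration, and note that a pure ASCII file passes the test too, since
ASCII is a subset of UTF-8.

```shell
# Succeeds only if every byte sequence in the file decodes as UTF-8.
# (is_utf8 is a hypothetical helper name, not part of groff.)
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Two throwaway samples: the same .hw line with the i-acute stored as
# UTF-8 (bytes C3 AD) and as Latin-1 (byte ED).
tmp=$(mktemp -d)
printf '.hw a-h\303\255\n' > "$tmp/utf8.tr"
printf '.hw a-h\355\n'     > "$tmp/latin1.tr"

for f in "$tmp/utf8.tr" "$tmp/latin1.tr"; do
    if is_utf8 "$f"; then
        echo "${f##*/}: decodes as UTF-8"
    else
        echo "${f##*/}: does not decode as UTF-8"
    fi
done
rm -rf "$tmp"
```

Running "is_utf8 list.tr" on the real file would then give a yes/no
answer that could be pasted into a bug report.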

> 
> > As you see, from the UTF-8 chars used in Spanish (á, é, í, ó, ú, ü,
> > ñ), groff seems to only have problems with the 'í' in particular.
> > Let's try another test using preconv(1).
> 
> preconv is probably using iconv(3) on your system ("preconv --version"
> will tell you).  iconv's heuristics for guessing the encoding are opaque
> to groff (and to me).

On OpenBSD, preconv (1.22.4) is compiled without iconv.

I had to downgrade Devuan to stable, which comes with groff 1.22.4 and a
preconv compiled *with* iconv.  I cannot reproduce the bug there.  So
this has all the hallmarks of a regression; in your place I'd try to
figure out which patch between 1.22.4 and the current version
introduced it.

I know that my bug report isn't as helpful as it could be, but right now
I'm doing other things, sorry.


> 
> > The errors remain.  Finally, I told you that changing .mso request to
> > .so made the error messages disappear, that's because in my Makefile I
> > run soelim(1) before.  Last test:
> > 
> >   $ cat << EOF > doc.tr
> >   .hla es
> >   .so list.tr               \" notice here I changed the request
> >   Ahí, el árbol nos cubría con su sombra.
> >   Un pingüino pasaba caminando por la playa.
> >   EOF
> >   $ preconv -e UTF-8 doc.tr | nroff | cat -s
> >   troff:./list.tr:1: error: expected ordinary or special character, got an escaped '%'
> >   troff:./list.tr:3: error: expected ordinary or special character, got an escaped '%'
> >   Ahí, el árbol nos cubría con su sombra.  Un pingüino pasaba cami‐
> >   nando por la playa.
> >   $ soelim doc.tr | preconv -e UTF-8 | nroff | cat -s
> >   Ahí, el árbol nos cubría con su sombra.  Un pingüino pasaba cami‐
> >   nando por la playa.
> > 
> > This last command throws no error because soelim(1) allows
> > preconv(1) to process the list.tr file.
> 
> Right, I think that's the right strategy precisely.  You can maintain
> the file you want to `mso` in version control in whatever character
> encoding is comfortable for you--I'd store it as an ".in" file and have
> make(1) run preconv(1) over it when constructing documents that use it.
> 
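
The ".in" approach described above could be scripted roughly as follows.
This is a sketch under assumptions: the function name build_macros and
the "name.in -> name" naming convention are invented here, and the
default encoding of UTF-8 is a guess that should be changed to whatever
the sources really use.  Only "preconv -e", which appears in the
commands quoted in this thread, is relied on.

```shell
# Run preconv over every *.in macro source in a directory, writing the
# converted file next to it without the .in suffix.
# $1 = directory, $2 = input encoding (assumed UTF-8 if omitted).
build_macros() {
    dir=$1
    enc=${2:-UTF-8}
    for src in "$dir"/*.in; do
        [ -e "$src" ] || continue    # glob matched nothing
        preconv -e "$enc" "$src" > "${src%.in}" || return 1
    done
}
```

With a file list.tr.in in the current directory, "build_macros ." would
produce a list.tr that a later "GROFF_TMAC_PATH=. nroff doc.tr" can pick
up through .mso.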
> > Anyways.  My doubt comes from the fact that so far (with groff 1.22.4
> > under OpenBSD) I haven't needed to preprocess that .hw list with
> > preconv,
> 
> OpenBSD is notoriously minimalistic.  You might see if `preconv
> --version` there reports use of iconv...except...uh, I think revealing
> that information is something I added _after_ the groff 1.22.4 release.

Answered above.

> 
> So here's another paragraph from preconv(1) that might explain the
> behavior on OpenBSD.
> 
>    iconv support
>        While preconv recognizes all of the coding tags listed above, it
>        is capable on its own of interpreting only three encodings:
>        Latin‐1, code page 1047, and UTF‐8.  If iconv support is
>        configured at compile time and available at run time, all others
>        are passed to iconv library functions, which may recognize many
>        additional encoding strings.  The command “preconv -v” discloses
>        whether iconv support is configured.
> 
> Unfortunately I don't know of an example of an encoding name that is a
> reliable test for iconv support being absent.
> 
> > and that only the 'í' (iacute) triggers the error.
> 
> I think this might be explained by iconv(3)'s heuristic approach.
> 
> On my system, I confirmed that nothing crazy was going on with the
> following experiments.
> 
> $ printf 'caf\351\n' | preconv -e latin1
> .lf 1 -
> caf\[u00E9]
> $ printf 'la t\355a\n' | preconv -e latin1 | nroff | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff -K latin1 | head -n 1
> la tía
> $ printf 'la t\355a\n' | nroff | head -n 1
> la tía
> 
> At 2023-10-05T10:45:32+0200, Walter Alejandro Iglesias wrote:
> > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > here) ...
> > 
> >   $ preconv -e utf8 list_in_latin1.tr
> > 
> > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
> 
> Yes, because the `-e` flag _describes the character encoding of the
> input_.
> 
> Description
>        preconv reads each file, converts its encoded characters to a
>        form troff(1) can interpret, and sends the result to the standard
>        output stream.
> [...]
> Options
> [...]
>        -e encoding
>               Skip detection and assume encoding; see groff’s -K option.
> 
> Do not try to tell preconv the desired character encoding of the
> _output_; that's not its job.  Its job is to normalize the input so that
> GNU troff(1) can read it.
> 
> The character encoding of the output is inapplicable to GNU troff(1)
> itself; it, like all device-independent troffs, writes an ASCII-encoded
> plain text file.  An output driver like grotty(1) translates troff(1)
> output into whatever is appropriate for the device, which is why groff's
> terminal output devices are named things like "ascii", "latin1" and
> "utf8".
> 
> At 2023-10-12T16:46:07-0500, Dave Kemper wrote:
> > On 10/5/23, Walter Alejandro Iglesias <wai@roquesor.com> wrote:
> > > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > > here) ...
> > >
> > >   $ preconv -e utf8 list_in_latin1.tr
> > >
> > > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
> > 
> > Yes, this would be expected to not work.  preconv's "-e" option
> > specifies the *input* encoding.  So if the input file is in Latin-1,
> > but you tell preconv that it's in UTF-8, you'd expect things to go
> > awry.
> 
> Right.
> 
> > But that's not the full explanation: *all* Latin-1 characters are
> > multiple bytes when encoded as UTF-8.
> 
> Strictly, Latin-1 is an 8-bit character encoding.  You might say here
> "all characters from the Unicode Latin-1 extension block" instead.
> 
> Ya know, if you're a stickler.
> 
> > So if iacute (Latin-1 0xED) is misread in the way Bjarni describes,
> > the same should happen to all the other Latin-1 characters as well.
> > The fact groff is treating one Latin-1 character differently from the
> > others carries the whiff of a bug.
> 
> I'm prepared to chalk this up to iconv heuristic conversion in the
> absence of other information.  See my attempted reproducers above.
> 
> Regards,
> Branden
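
On why only the 'í' triggers the error, a byte dump of the UTF-8
encodings hints at an answer.  Among the Spanish letters in the list,
'í' is the only one whose second byte is 0xAD, which in Latin-1 is the
soft hyphen.  An assumption here, not confirmed in this thread, is that
troff, reading the file without preconv, treats that byte as the
hyphenation control \%, which would match the wording of the error.  The
sketch below uses only printf(1) and od(1); the helper name hex_of is
made up for illustration.

```shell
# Print the bytes produced by a printf format string as lowercase hex.
# $1 holds octal escapes, e.g. '\303\255' for a UTF-8 i-acute.
hex_of() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
}

# UTF-8 encodings of the accented letters used in the word list,
# written as octal escapes so this script stays plain ASCII.
for pair in 'a-acute:\303\241' 'e-acute:\303\251' 'i-acute:\303\255' \
            'o-acute:\303\263' 'u-acute:\303\272' 'u-dieresis:\303\274' \
            'n-tilde:\303\261'; do
    name=${pair%%:*}
    hex=$(hex_of "${pair#*:}")
    case $hex in
        *ad) note='  <- second byte is 0xAD (Latin-1 soft hyphen)' ;;
        *)   note='' ;;
    esac
    printf '%-12s %s%s\n' "$name" "$hex" "$note"
done
```

Only the i-acute line gets flagged; all the other letters end in a byte
that Latin-1 assigns to an ordinary printable character.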



-- 
Walter


