[bug #58930] take baby steps toward Unicode


From: Dave
Subject: [bug #58930] take baby steps toward Unicode
Date: Mon, 10 Aug 2020 10:56:08 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0

URL:
  <https://savannah.gnu.org/bugs/?58930>

                 Summary: take baby steps toward Unicode
                 Project: GNU troff
            Submitted by: barx
            Submitted on: Mon 10 Aug 2020 09:56:06 AM CDT
                Category: Core
                Severity: 3 - Normal
              Item Group: New feature
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
         Planned Release: None

    _______________________________________________________

Details:

One small change that would improve groff's Unicode support would be to
recognize Unicode versions of things groff already knows how to do.

Four examples:

==== U+00A0 NO-BREAK SPACE ====

This character is in the Latin-1 character set, which groff recognizes, and
when groff's input is in Latin-1 encoding, it correctly handles this character
(though I'm not certain whether it interprets it as "\~" or "\ ").

But if the input is in some other encoding, preconv converts the character into
the string "\[u00A0]", which groff does _not_ recognize.  In macro space, a
simple

.char \[u00A0] \~

is enough to take care of this; presumably the equivalent mechanism to make
the code handle it internally is just as simple.
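
As a minimal sketch of that macro-space workaround in context: the input below
assumes the text has already passed through preconv, so the no-break space
arrives spelled as "\[u00A0]".  The sample sentence and the line length are
made up for illustration.

```groff
.\" Sketch only: assumes preconv has already turned U+00A0 into \[u00A0].
.\" Map the preconv spelling onto groff's unbreakable, adjustable space.
.char \[u00A0] \~
.\" A short line length makes a break next to the mapped space easy to spot.
.ll 25n
The report was filed by Mr.\[u00A0]Smith on behalf of the project.
```

Formatting this with something like "groff -Tutf8" should show whether "Mr."
and "Smith" stay on the same line.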

==== U+200B ZERO WIDTH SPACE ====

This is another character implemented by an existing groff escape (\:) but
unrecognized as "\[u200B]".

In this case, the simple, obvious, elegant solution that worked above:

.char \[u200B] \:

stupidly, irritatingly, and undocumentedly doesn't work.  (.char being unable
to map something to an escape, or at least to this particular escape, is
another bug--either in the implementation, or the lack of documentation of the
restriction--for another day.)
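
For comparison, here is a small sketch of \: written directly in the input,
which is the behavior a working "\[u200B]" mapping would presumably have to
reproduce.  The path and the line length are arbitrary and chosen only for
illustration.

```groff
.\" Illustrative input only; the path and the line length are arbitrary.
.ll 20n
.\" \: marks a spot where the line may break without adding a hyphen.
See /usr/\:share/\:doc/\:groff/\:examples for details.
```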

==== U+202F NARROW NO-BREAK SPACE ====

Groff has two nonbreaking thin spaces, \| (one-sixth em) and \^ (one-twelfth
em).  It is perhaps unclear which of these groff should map "\[u202F]" to, but
either one would be an improvement over its current mapping to the warning
"can't find special character `u202F'".
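
A macro-space stopgap along the same lines as the U+00A0 one might look like
the sketch below.  It assumes, without having verified it, that .char accepts
\| and \^ the way it accepts \~; the sample sentence is invented.

```groff
.\" Sketch only: assumes .char accepts \| and \^ as it does \~ (unverified).
.\" Define just one of the two candidate mappings.
.char \[u202F] \|
.\" Alternative: map to the narrower space instead.
.\" .char \[u202F] \^
.ll 20n
The package weighs 12\[u202F]kg and ships in 3\[u202F]days.
```

Either definition should at least silence the "can't find special character"
warning; which width is the better match is the open question raised above.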

==== U+2011 NON-BREAKING HYPHEN ====

I deem this change "extra credit" as it's the least likely to be easily
implementable, groff syntax having no direct correlate.  Groff can only (via
\%) make an entire "word" (sequence of non-whitespace, including hyphens)
unbreakable, but has no easy way to support a mix of breaking and nonbreaking
hyphens in the same word, such as making the first hyphen of "jack-in-the-box"
nonbreaking but the other two breakable.  (This can be done with a mix of \%
and \: escapes, as "\%jack-in-\:the-\:box" -- or even, taking advantage of the
bug/quirk Branden discovered
<http://lists.gnu.org/archive/html/groff/2020-07/msg00047.html>, as
"\%jack-in-\:the-box" -- but this is not obvious.)  So it's possible, but
convoluted, to represent "jack\[u2011]in-the-box" in groff syntax; whether
this means it's equally convoluted in the underlying code, or whether the code
actually does have the concept of a nonbreaking hyphen but just doesn't expose
a direct representation of it to user space, I cannot guess.
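
The two spellings can be compared side by side with a sketch like the
following.  The surrounding words and the line length are arbitrary; only the
"\%jack-in-\:the-\:box" form comes from the description above.

```groff
.\" Sketch only: a short line length makes breaks at hyphens likely.
.ll 12n
.\" Plain spelling: any of the three hyphens may end a line.
Here is a jack-in-the-box toy.
.br
.\" Workaround: \% suppresses breaking within the word, and \: re-enables
.\" a break after the second and third hyphens only.
Here is a \%jack-in-\:the-\:box toy.
```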
