[bug #58930] take baby steps toward Unicode

bug-groff

From:	G. Branden Robinson
Subject:	[bug #58930] take baby steps toward Unicode
Date:	Sat, 28 May 2022 14:52:58 -0400 (EDT)

Follow-up Comment #14, bug #58930 (project groff):

Hi Dave,

[comment #13:]

> "The input sequence '\[u00A0]' is _syntactically_ valid...but like
> '\[uFFFF]' and '\[u0000]', it's not _meaningful_"
>
> This is true of the current implementation but less true conceptually:
> U+0000 and U+FFFF are not meaningful input characters to groff, but U+00A0
> is, and users ideally ought to be able to specify the character as
> \[u00A0].

I guess what we need to do here is be clearer about whether \[u...] escape
sequences are intended to represent input characters or desired output
glyphs.  The distinction comes into sharp relief here.

If the former, then you're right--\[u...] is a way of getting around groff's
narrow-character input interpretation.

But if the latter, then there are many other Unicode code points that don't
represent things we can ask a font to draw for us.  In groff, with two
exceptions, if an ordinary or special character doesn't correspond to
something that "puts ink on the page", it isn't a glyph.  In
device-independent output, spaces of all kinds are represented with horizontal
or vertical motions, not glyph-writing commands.
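
To make the "ink on the page" distinction concrete, here is a rough Python
sketch that sorts the code points mentioned in this thread by Unicode general
category.  This is an illustration only: the classify() function and its
three labels are hypothetical, and groff does not necessarily decide anything
this way.  It also ignores other non-drawing categories, such as format
characters.

```python
import unicodedata


def classify(code_point: int) -> str:
    """Roughly sort a code point by whether a font could draw it.

    Hypothetical helper for illustration; this is not groff's logic.
    """
    category = unicodedata.category(chr(code_point))
    if category == "Zs":
        # Space separators: nothing to draw, only a horizontal advance.
        return "space (a motion, not a glyph)"
    if category in ("Cc", "Cn", "Cs"):
        # Controls, unassigned code points/noncharacters, and surrogates.
        return "not drawable"
    return "glyph candidate"


if __name__ == "__main__":
    for cp in (0x0041, 0x00A0, 0x2001, 0x0000, 0xFFFF):
        print(f"U+{cp:04X}: {classify(cp)}")
```

Under this proxy, U+00A0 and U+2001 land in the "space" bucket, while U+0000
and U+FFFF land in the "not drawable" one.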

The two exceptions are only partial ones: \| and \^.  One might interpret them
as special character escape sequences like \- or \_, but they aren't.  They
become horizontal motions.  It is true that these two sequences can be defined
as if they were special characters in font description files to customize
their widths (groff_font(5) discusses this).  With my present understanding, I
think this is a bit of a bodge, and as far as I've seen, no one has ever taken
advantage of this configurability.  (Possibly because *roff font description
files have been regarded as a bit esoteric.)  This approach does not
generalize well to the many additional space code points in Unicode.  U+2001,
"EM QUAD", is just one example.
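
To get a feel for how many space code points are involved, the following
Python sketch lists every code point in the Basic Multilingual Plane whose
general category is Zs (space separator).  The space_separators() function is
a hypothetical helper, and the exact list depends on the Unicode version
bundled with the Python interpreter that runs it.

```python
import unicodedata


def space_separators(limit: int = 0x10000) -> list[int]:
    """Return code points below `limit` with general category Zs."""
    return [
        cp for cp in range(limit)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


if __name__ == "__main__":
    for cp in space_separators():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Each entry in that list would need its own width if the \| and \^ approach
were extended to cover it.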

_Maybe_ there's a good reason to have \^ and \| customizable in this way--I
don't feel like I have a command of the history of this topic.

If you or anybody has some knowledge to bring to light here, I'd appreciate
it!


> 
> But this is an edge case I don't intend to pursue.  Users who want to stick
> to pure-ASCII input have the escape sequence \~ to specify the nonbreaking
> space, so don't need the alternate spelling \[u00A0].


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58930>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
