From: G. Branden Robinson
Subject: Re: zero-width space (was Re: How to print a literal '.' as the first character in a line?)
Date: Sat, 4 Jun 2022 20:58:14 -0500

Hi Ingo,

At 2022-06-05T00:19:30+0200, Ingo Schwarze wrote:
[...]
> the groff documentation uses the term "break" very
> consistently, defining it as starting a new output line even though
> the current output line is not yet full.

Yes, and I have striven to reinforce that consistency wherever possible.

[snipping some points I have no issue with]

> In very few cases, the groff documentation uses the term "break the
> input line" to mean "start a new input line".  There is a small risk
> that might cause confusion with "breaks" in the normal sense, but i
> see no general way to avoid that risk.  In any case, all such places
> i saw clearly use the qualifier "input", so careful readers should
> not get confused.

Yes; and I have sometimes hesitated over this matter.  After doing so
much work to establish the semantics of the word "break" in new material
I've written for roff(7) and our Texinfo manual, I jealously guard that
clarity and don't want to give it away by applying the word to other
concepts.

At the same time, I need to talk about line endings in the input stream
sometimes.

> So, to summarize, groff documentation consistently uses the word
> "break" for "line break", almost always in the sense of output line
> break and in a few clearly qualified cases for "input line break".
> 
> From this perspective, it is indeed unfortunate terminology to
> call \& a "non-printing input break" because it has no relation
> whatsoever to breaking the input line, nor to a "break" in the
> general sense, i.e. breaking the output line.
> 
> I do realize the change was committed on Sat Aug 15 22:08:01 2020,
> nearly two years ago, but when issues aren't noticed soon, finding
> them later is still better than never.

They were noticed, and discussed, at the time[1].  "Non-printing input
break" isn't my favorite term either, but I think it's _less_ misleading
than "zero-width space".

The alternatives all seem more horrible, including that in the topic,
to which I shall return.  But here are some others I've toyed with.

"formatting state reset [escape sequence]"
"tokenization break" (at least it's SHORT)
"dummy token"
"dummy character"

For this discussion I'll offer "token break".  It's short, even shorter
than "zero-width space", and much more so than the current term.

> In all ways i'm aware of, it behaves exactly like a horizontal
> spacing escape sequence (except that its width is zero) and
> exactly like a character (except that it prints an empty glyph
> of width zero).  So both "zero-width space" and "non-printing
> zero-width character" would seem accurate to me.  The former has
> the advantage of being shorter and agreeing with traditional
> terminology.

Any look at the device-independent output of groff or Heirloom troff
reveals problems with this operational definition.

* It doesn't put a glyph of any kind on the output, not even a blank
  one.
* Deri's right to point out that `\Z` is zero-width; in fact, if you
  give it an empty argument, it puts nothing on the output stream, not
  even a zero motion.

> It's slightly unfortunate that Unicode uses the character name "ZERO
> WIDTH SPACE" for what groff (more appropriately) calls the
> "non-printing break point" (\:), but i would consider consistency
> within the roff domain more important than using the same terms as
> Unicode.

Unicode has committed multiple sins in its life but we have to be
realistic--it has a bigger audience than any *roff implementation, and
bigger than mandoc.  I can't in good conscience adopt this term without
putting in a footnote (in Texinfo) or parenthetical (in a man page)
explaining the deviation or disparity.

Our mission statement says we're trying to increase the quality of our
integration with and support for Unicode.  Stubbornly insisting on
surprising semantics for an already frustrating code point is not a step
in that direction, in my view.

I think it's better just to have our own term, and avoid that confusion.

> Consequently, i'm *not* advocating calling \& a "zero-width
> non-joiner"

That would be particularly bad since in its kerning-defeat application,
it's not merely a non-joiner, but an outright separator.[2]

> or a "zero width no-break space" even though both would be more
> precise if we were aiming for Unicode-compatible terminology.  Then
> again, if people worry a lot about U+200B, then calling it a "zero
> width no-break space" is still much better than calling it some kind
> of a "break".

I guess you won't be fond of "token[ization] break", then.  :-/

> The argument "it is not a space because it doesn't move and it is
> not a character because it doesn't print anything" reminds me a bit
> of the argument "0 is not a number because there is nothing there",

This argument is spurious.  I am thinking in concrete, measurable terms
of (1) input parser state and (2) device-independent output.

Any term that a novice can grasp that doesn't mislead them with respect
to either of those domains is a good candidate in my opinion.

A big problem with "zero-width space" is that it falsifies the statement
that adding a newline or multiple (regular) space characters after a
candidate end-of-sentence character results in inter-sentence spacing
being added.  (Unless there's a break for some other reason, of course.)

A novice could quite easily reason that something we go to the trouble
of _calling a space_ behaves like one--but it doesn't.  Zero-width?
Sure.  But _this_ space _cancels_ end-of-sentence detection.
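
A minimal sketch of that difference, assuming groff's default
end-of-sentence handling; the sentences are invented for illustration
and are not among the exhibits attached to this message:

```roff
.\" The period ends the input line, so it is taken as the end of a
.\" sentence and inter-sentence space is added before "Smith".
We spoke with Dr.
Smith yesterday.
.br
.\" Here \& follows the period.  End-of-sentence detection is
.\" cancelled, so only an ordinary word space precedes "Smith".
We spoke with Dr.\&
Smith yesterday.
```

A real space after the period (or the newline alone) leaves the
sentence ending in place; only the `\&` variant suppresses it.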

> yet mathematicians certainly call it a number all the same because
> zero can be used in the same ways as a number.  The reason why \&
> works for the escaping purposes it is used for is quite similar: it
> is treated as if it were a space or character except that it doesn't
> print nor move.  In all these cases, you can do the same escaping with
> some other spacing escape sequence or with some other character if you
> don't object to moving or printing a bit.
> 
> So, i'd say, let's call it a day (err, a space).  It certainly
> is *not* a break in any of the senses familiar from the groff
> documentation.

I'm not wedded to the word "break" here; I think it's a lesser evil, if
adequately qualified.  I am strongly opposed to "zero-width space" in
part _because_ it is so seductively simple.

Many people will think upon reading it that they know what it means.

Too much of the time, they'll be wrong.

It dishonors the art of technical writing to hand the reader a wire
whisk and invite him to stick the thing in his open skull, stirring
vigorously.

Here are 3 input exhibits.  They are intended to show the similarities
and differences between formatting a "zero-width space", an ordinary
character, and an escape sequence (`\Z`) that resets the drawing
position after its delimited argument is interpreted--here given an
empty argument.

I'm attaching their device-independent outputs from groff and Heirloom
Doctools troff.  I invite readers to predict that output before looking
at them, and see how their predictions bear out.

-- zwsp1.roff --
\&
.pl \n(nlu

-- zwsp2.roff --
a
.pl \n(nlu

-- zwsp3.roff --
\Z@@
.pl \n(nlu

There are some differences between the implementations, immaterial to
this argument (but nevertheless interesting to me).

> P.S.
> Note that i'm not saying Branden is making our documentation worse,
> quite to the contrary.  This looks like an unusual slip to me.

I appreciate the qualified compliment. ;-)

While I have you, Deri raised a point in Savannah #62251 that I'd like
your feedback on.  He feels the weight of a TeX installation; other
people probably will too.  It involved re-complexifying doc/doc.am a
little bit, but I assuredly don't want to resurrect '--with-doc'.

https://savannah.gnu.org/bugs/?62551

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2020-07/msg00047.html
    https://git.savannah.gnu.org/cgit/groff.git/commit/?id=36b5c8852af098e6f06dbe2ab8e452a45b43d315
[2] If one thinks in haste, one might assume there is a dichotomy here,
    and that a non-joiner is necessarily a separator.  But this is not
    true.  By analogy to (perhaps) more familiar *roff usage, an input
    character can be a candidate end-of-sentence character (.!?), an
    end-of-sentence cancellation character (letters, numbers, most
    punctuation), or _neither_: *)"'] and some special character escape
    sequences.

Attachments (text documents): zwsp1.grout, zwsp1.heirloom-out,
zwsp2.grout, zwsp2.heirloom-out, zwsp3.grout, zwsp3.heirloom-out
