groff-commit

[groff] 01/02: preconv(1): Make style and content fixes.


From: G. Branden Robinson
Subject: [groff] 01/02: preconv(1): Make style and content fixes.
Date: Mon, 18 May 2020 05:22:23 -0400 (EDT)

gbranden pushed a commit to branch master
in repository groff.

commit edc510870925625e367d81af9ee6e2347013862f
Author: G. Branden Robinson <address@hidden>
AuthorDate: Sun May 17 12:49:10 2020 +1000

    preconv(1): Make style and content fixes.
    
    * Name: Shorten and generalize (apropos/whatis) summary.  Thanks to the
      .lf requests, preconv does more than just input transcoding.
    * Synopsis: Use hyphenated-phrase form for option arguments instead of
      pseudo-C symbol names.
    * Synopsis: Add missing .YS call.
    * Description: Recast language.
    * Options: Replace sentence about optional whitespace (an implied,
      common practice in Unix command parsers) with a brief notice about the
      early-exiting options (-h, --help, -v, --version).  Recast some
      language.  Reveal mnemonic behind "-r" option.  Sort options in
      English lexicographical order.
    * Usage: Make explicit that the algorithm stops encoding determination
      upon the first successful approach, so this doesn't have to be
      repeated later.  Note that Unicode BOMs can be detected in UTF-8,
      UTF-16, and UTF-32 input.
    * Byte order mark: Delete section.  People should be more familiar with
      Unicode now than they were ~20 years ago, and can consult Unicode's
      own extensive documentation for details.
    * Coding tags: Recast to reflect the fact that "coding tag" is a bit of
      jargon particular to preconv (and possibly some GNU Emacs users).
      Note restrictions on coding tag detection (both \" and \# are
      recognized, but control and escape characters must be the default).
    * Put each recognized coding tag on a line by itself for easier
      synchrony with the source code (and better roff style).
    * iconv support: Retitle from "iconv Issues".  Explain what becomes of
      invalid code points when using iconv.  (The issue does not arise
      without iconv support, because in the encodings preconv supports
      directly, all code points are presumed valid [cf. invalid or
      incomplete UTF-8 _sequences_].)
---
 src/preproc/preconv/preconv.1.man | 375 ++++++++++++++++++++++++--------------
 1 file changed, 240 insertions(+), 135 deletions(-)

diff --git a/src/preproc/preconv/preconv.1.man b/src/preproc/preconv/preconv.1.man
index 357ecaf..d4ec56b 100644
--- a/src/preproc/preconv/preconv.1.man
+++ b/src/preproc/preconv/preconv.1.man
@@ -1,7 +1,6 @@
 .TH preconv @MAN1EXT@ "@MDATE@" "groff @VERSION@"
 .SH Name
-preconv \- convert encoding of input files to something GNU troff \
-understands
+preconv \- prepare files for typesetting with GNU roff
 .
 .
 .\" Save and disable compatibility mode (for, e.g., Solaris 10/11).
@@ -37,10 +36,11 @@ understands
 .
 .SY preconv
 .OP \-dr
-.OP \-D default_encoding
+.OP \-D default-encoding
 .OP \-e encoding
 .RI [ file
 \&.\|.\|.\&]
+.YS
 .
 .SY preconv
 .B \-h
@@ -59,70 +59,102 @@ understands
 .SH Description
 .\" ====================================================================
 .
-.B preconv
-reads
-.I files
-and converts its encoding(s) to a form GNU
-.BR troff (@MAN1EXT@)
-can process, sending the data to standard output.
+.I preconv
+reads each
+.IR file ,
+converts its encoded characters to a form
+.IR groff (@MAN1EXT@)
+can interpret,
+and sends the result to the standard output stream.
+.
+Currently,
+this means that code points in the range 0\[en]127
+(in US-ASCII,
+ISO\~8859,
+or Unicode)
+remain as-is and the remainder are converted to the
+.I groff
+\[lq]special character\[rq] form
+.RB \[lq] \[rs][\c
+.BI u XXXX ]\c
+\[rq],
+where
+.I XXXX
+is a hexadecimal number of four to six digits corresponding to a Unicode
+code point.
+.
+By default,
+.I preconv
+also inserts a
+.I roff
+.B .lf
+request at the beginning of each
+.IR file ,
+identifying it for the benefit of later processing
+(including diagnostic messages);
+the
+.B \-r
+option suppresses this behavior.
 .
-Currently, this means ASCII characters and \[oq]\e[uXXXX]\[cq]
-entities, where \[oq]XXXX\[cq] is a hexadecimal number with four to
-six digits, representing a Unicode input code.
 .
-Normally,
-.B preconv
-should be invoked with the
+.PP
+In typical usage scenarios,
+.I preconv
+need not be run directly;
+instead it should be invoked with the
 .B \-k
-and
+or
 .B \-K
 options of
-.BR groff .
+.IR groff .
 .
 .
 .\" ====================================================================
 .SH Options
 .\" ====================================================================
 .
-Whitespace is permitted between a command-line option and its argument.
+.B \-h
+and
+.B \-\-help
+display a usage message,
+while
+.B \-v
+and
+.B \-\-version
+show version information;
+all exit afterward.
 .
 .
 .TP
 .B \-d
-Emit debugging messages to standard error (mainly the used encoding).
+Emit debugging messages to the standard error stream.
+.
 .
 .TP
-.BI \-D encoding
-Specify default encoding if everything fails (see below).
+.BI \-D\~ default-encoding
+Report
+.I default-encoding
+if all detection methods fail.
+.
 .
 .TP
-.BI \-e encoding
-Specify input encoding explicitly, overriding all other methods.
+.BI \-e\~ encoding
+Override detection procedure and assume
+.IR encoding .
 .
 This corresponds to
-.BR groff 's
-.BI \-K encoding
+.IR groff 's
+.RB \[lq] \-K
+.IR encoding \[rq]
 option.
 .
-Without this switch,
-.B preconv
-uses the algorithm described below to select the input encoding.
-.
-.TP
-.B \-\-help
-.TQ
-.B \-h
-Print a help message and exit.
 .
 .TP
 .B \-r
-Do not add \&.lf requests.
-.
-.TP
-.B \-\-version
-.TQ
-.B \-v
-Print the version number and exit.
+Write files \[lq]raw\[rq];
+do not add
+.B .lf
+requests.
 .
 .
 .\" ====================================================================
@@ -130,7 +162,8 @@ Print the version number and exit.
 .\" ====================================================================
 .
 .I preconv
-tries to find the input encoding with the following algorithm.
+tries to find the input encoding with the following algorithm,
+stopping at the first success.
 .
 .
 .IP 1.
@@ -140,40 +173,34 @@ use it.
 .
 .
 .IP 2.
-Otherwise,
-check whether the input starts with a Unicode Byte Order Mark
-(BOM,
-see below).
+Check whether the input starts with a Unicode Byte Order Mark.
 .
-If found,
-use it.
+If so,
+determine the encoding as UTF-8,
+UTF-16,
+or UTF-32 accordingly.
 .
 .
 .IP 3.
-Otherwise,
-if the input stream is seekable,
-check whether there is a recognized GNU\~Emacs coding tag
-(see below)
-in either the first or second input line.
+If the input stream is seekable,
+check the first and second input lines for a recognized GNU\~Emacs
+file-local variable identifying the character encoding,
+here referred to as the \[lq]coding tag\[rq] for brevity.
 .
 If found,
 use it.
 .
 .
 .IP 4.
-Otherwise,
-if the input stream is seekable,
-if the
+If the input stream is seekable,
+and if the
 .I uchardet
 library is available on the system,
 use it to try to infer the encoding of the file.
 .
 .
 .IP 5.
-If
-.I uchardet
-fails,
-and the
+If the
 .B \-D
 option specifies an encoding,
 use it.
@@ -186,9 +213,8 @@ unless the locale is
 \[lq]C\[rq],
 \[lq]POSIX\[rq],
 or empty,
-in which case assume \[lq]Latin-1\[rq]
-(ISO 8859-1)
-as the input file encoding.
+in which case assume Latin-1
+(ISO\~8859-1).
 .
 .
 .PP
@@ -219,37 +245,6 @@ environment variable which is equivalent to its option
 .
 .
 .\" ====================================================================
-.SS "Byte order mark"
-.\" ====================================================================
-.
-The Unicode Standard defines character U+FEFF as the Byte Order Mark
-(BOM).
-.
-On the other hand, value U+FFFE is guaranteed not be a Unicode character at
-all.
-.
-This allows detection of the byte order within the data stream (either
-big-endian or little-endian), and the MIME encodings \%\[oq]UTF-16\[cq]
-and \%\[oq]UTF-32\[cq] mandate that the data stream starts with U+FEFF.
-.
-Similarly, the data stream encoded as \%\[oq]UTF-8\[cq] might start
-with a BOM (to ease the conversion from and to \%UTF-16 and \%UTF-32).
-.
-In all cases, the byte order mark is
-.I not
-part of the data but part of the encoding protocol; in other words,
-.BR preconv 's
-output doesn't contain it.
-.
-.
-.PP
-Note that U+FEFF not at the start of the input data actually is
-emitted; it has then the meaning of a \[oq]zero width no-break
-space\[cq] character \[en] something not needed normally in
-.BR groff .
-.
-.
-.\" ====================================================================
 .SS "Coding tags"
 .\" ====================================================================
 .
@@ -272,72 +267,133 @@ supports the coding tag convention
 (with some restrictions)
 used by GNU\~Emacs.
 .
-Coding tags in GNU\~Emacs are indicated in specially-marked regions of
-an input file designated for \[lq]file-local variables\[rq].
+These are indicated in specially-marked regions of an input file
+designated for \[lq]file-local variables\[rq].
 .
 .
 .PP
 .I preconv
-recognizes the following syntax form if it occurs in a
+interprets the following syntax if it occurs in a
 .I roff
 comment
 in the first or second line of the input file.
 .
+Both \[lq]\[rs]"\[rq] and \[lq]\[rs]#\[rq] comment forms are recognized,
+but the control
+(or non-breaking control)
+character must be the default and must begin the line.
+.
+Similarly,
+the escape character must be the default.
+.
 .
 .RS
 .EX
-.B .\[rs]" \-*\- \c
-.RB \&.\|.\|.\& ;\~\c
+.B \-*\- \c
+.RB [.\|.\|. ; ]\~\c
 .B coding: \c
 .IB encoding ;\~\c
-\&.\|.\|.\& \c
+[.\|.\|.] \c
 .B \-*\-
 .EE
 .RE
 .
 .
 .PP
-The only tag
+The only variable
 .I preconv
 interprets is \[lq]coding\[rq],
 which can take the values listed below.
 .
 .
 .PP
-The following list comprises all MIME \[lq]charset\[rq] tags
-(either lowercase or uppercase)
-supported by
+The following list comprises all MIME \[lq]charset\[rq] parameter values
+recognized,
+case-insensitively,
+by
 .IR preconv .
 .
 .RS
-\%big5, \%cp1047, \%euc\-jp, \%euc\-kr, \%gb2312, \%iso\-8859\-1,
-\%iso\-8859\-2, \%iso\-8859\-5, \%iso\-8859\-7, \%iso\-8859\-9,
-\%iso\-8859\-13, \%iso\-8859\-15, \%koi8\-r, \%us\-ascii, \%utf\-8,
-\%utf\-16, \%utf\-16be, \%utf\-16le
+\%big5,
+\%cp1047,
+\%euc\-jp,
+\%euc\-kr,
+\%gb2312,
+\%iso\-8859\-1,
+\%iso\-8859\-2,
+\%iso\-8859\-5,
+\%iso\-8859\-7,
+\%iso\-8859\-9,
+\%iso\-8859\-13,
+\%iso\-8859\-15,
+\%koi8\-r,
+\%us\-ascii,
+\%utf\-8,
+\%utf\-16,
+\%utf\-16be,
+\%utf\-16le
 .RE
 .
 .
 .PP
 In addition,
-the following list of other tags is recognized,
+the following list of other coding tags is recognized,
 each of which is mapped to an appropriate value from the list above.
 .
 .RS
-\%ascii, \%chinese\-big5, \%chinese\-euc, \%chinese\-iso\-8bit,
-\%cn\-big5, \%cn\-gb, \%cn\-gb\-2312, \%cp878, \%csascii,
-\%csisolatin1, \%cyrillic\-iso\-8bit, \%cyrillic\-koi8, \%euc\-china,
-\%euc\-cn, \%euc\-japan, \%euc\-japan\-1990, \%euc\-korea,
-\%greek\-iso\-8bit, \%iso\-10646/utf8, \%iso\-10646/utf\-8,
-\%iso\-latin\-1, \%iso\-latin\-2, \%iso\-latin\-5, \%iso\-latin\-7,
-\%iso\-latin\-9, \%japanese\-euc, \%japanese\-iso\-8bit, \%jis8, \%koi8,
-\%korean\-euc, \%korean\-iso\-8bit, \%latin\-0, \%latin1, \%latin\-1,
-\%latin\-2, \%latin\-5, \%latin\-7, \%latin\-9, \%mule\-utf\-8,
-\%mule\-utf\-16, \%mule\-utf\-16be, \%mule\-utf\-16\-be,
-\%mule\-utf\-16be\-with\-signature, \%mule\-utf\-16le,
-\%mule\-utf\-16\-le, \%mule\-utf\-16le\-with\-signature, \%utf8,
-\%utf\-16\-be, \%utf\-16\-be\-with\-signature,
-\%utf\-16be\-with\-signature, \%utf\-16\-le,
-\%utf\-16\-le\-with\-signature, \%utf\-16le\-with\-signature
+\%ascii,
+\%chinese\-big5,
+\%chinese\-euc,
+\%chinese\-iso\-8bit,
+\%cn\-big5,
+\%cn\-gb,
+\%cn\-gb\-2312,
+\%cp878,
+\%csascii,
+\%csisolatin1,
+\%cyrillic\-iso\-8bit,
+\%cyrillic\-koi8,
+\%euc\-china,
+\%euc\-cn,
+\%euc\-japan,
+\%euc\-japan\-1990,
+\%euc\-korea,
+\%greek\-iso\-8bit,
+\%iso\-10646/utf8,
+\%iso\-10646/utf\-8,
+\%iso\-latin\-1,
+\%iso\-latin\-2,
+\%iso\-latin\-5,
+\%iso\-latin\-7,
+\%iso\-latin\-9,
+\%japanese\-euc,
+\%japanese\-iso\-8bit,
+\%jis8,
+\%koi8,
+\%korean\-euc,
+\%korean\-iso\-8bit,
+\%latin\-0,
+\%latin1,
+\%latin\-1,
+\%latin\-2,
+\%latin\-5,
+\%latin\-7,
+\%latin\-9,
+\%mule\-utf\-8,
+\%mule\-utf\-16,
+\%mule\-utf\-16be,
+\%mule\-utf\-16\-be,
+\%mule\-utf\-16be\-with\-signature,
+\%mule\-utf\-16le,
+\%mule\-utf\-16\-le,
+\%mule\-utf\-16le\-with\-signature,
+\%utf8,
+\%utf\-16\-be,
+\%utf\-16\-be\-with\-signature,
+\%utf\-16be\-with\-signature,
+\%utf\-16\-le,
+\%utf\-16\-le\-with\-signature,
+\%utf\-16le\-with\-signature
 .RE
 .
 .
@@ -353,20 +409,68 @@ are disregarded for the purpose of comparison with the above tags.
 .
 .
 .\" ====================================================================
-.SS "iconv Issues"
+.SS "iconv support"
 .\" ====================================================================
 .
-.B preconv
-by itself only supports three encodings: \%latin-1, cp1047, and \%UTF-8;
-all other encodings are passed to the
-.B iconv
+.I preconv
+itself only supports three encodings:
+Latin-1,
+code page 1047,
+and UTF-8.
+.
+If
+.I iconv
+support is configured at compile time and available at run time,
+all other encodings are passed to
+.I iconv
 library functions.
 .
-At compile time it is searched and checked for a valid
-.B iconv
-implementation; a call to \[oq]preconv \-\-version\[cq] shows whether
-.B iconv
-is used.
+The command
+.RB \[lq] preconv\~\-v \[rq]
+discloses whether
+.I iconv
+support is configured.
+.
+.
+.PP
+The use of
+.I iconv
+means that characters in the input that encode invalid code points for
+that encoding may be dropped from the output stream or mapped to the
+Unicode replacement character
+(U+FFFD).
+.
+Compare the following examples using the input \[lq]caf\['e]\[rq]
+(note the \[lq]e\[rq] with an acute accent),
+which due to its short length challenges inference of the encoding used.
+.
+.RS
+.EX
+printf \[aq]caf\[rs]351\[rs]n\[aq] | LC_ALL=en_US.UTF\-8 preconv
+printf \[aq]caf\[rs]351\[rs]n\[aq] | preconv \-e us\-ascii
+printf \[aq]caf\[rs]351\[rs]n\[aq] | preconv \-e latin\-1
+.EE
+.RE
+.
+The fate of the accented \[lq]e\[rq] differs in each case.
+.
+In the first,
+.I uchardet
+fails to detect an encoding
+(though the library on your system may behave differently)
+and
+.I preconv
+falls back to the locale settings,
+where octal 351 starts an incomplete UTF-8 sequence and results in the
+Unicode replacement character.
+.
+In the second,
+it is not a representable character in the declared input encoding of
+US-ASCII and is discarded by
+.IR iconv .
+.
+In the last,
+it is correctly detected and mapped.
 .
 .
 .\" ====================================================================
@@ -383,6 +487,7 @@ is used.
 .
 .
 .\" Local Variables:
+.\" fill-column: 72
 .\" mode: nroff
 .\" End:
-.\" vim: set filetype=groff:
+.\" vim: set filetype=groff textwidth=72:

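A note for readers of the patch: the conversion rule stated in the
revised Description (code points 0-127 pass through unchanged, and
everything else becomes a \[uXXXX] special character with four to six
hexadecimal digits) can be illustrated with a short Python sketch.
This is not preconv's implementation.  The function name to_groff is
hypothetical, and the use of uppercase hexadecimal digits is an
assumption made for the illustration.

```python
def to_groff(text: str) -> str:
    # Hypothetical helper: mimic the output form the man page describes.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Code points 0-127 remain as-is.
            out.append(ch)
        else:
            # At least four hex digits; code points above U+FFFF
            # naturally take five or six.  Uppercase is an assumption.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


if __name__ == "__main__":
    print(to_groff("caf\u00e9"))  # prints caf\[u00E9]
```

The sketch starts from already-decoded text, so it sidesteps the
question the Usage section addresses: which encoding to decode the
input bytes with.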

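The first-success detection order described in the Usage section can
be sketched in the same way.  The Python below covers only the
explicit encoding (-e), the byte order mark, the coding tag, and the
-D default.  It omits the uchardet and locale steps and simply ends
with Latin-1, as the man page describes for the "C" locale.  All
function names are hypothetical, and the regular expressions are a
simplified reading of the restrictions in the patch (default control
character at the start of the line, default escape character, and a
\" or \# comment); preconv's own parser may differ in detail.

```python
import codecs
import re

# Longer BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
]

# A roff comment at line start containing an Emacs "-*- ... -*-" region.
TAG_RE = re.compile(rb"""^[.'][ \t]*\\["#].*?-\*-(.*?)-\*-""")
# The "coding" variable inside that region.
CODING_RE = re.compile(rb"(?:^|;)\s*coding:\s*([\w./-]+)")


def detect_bom(data: bytes):
    """Return the encoding family implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def detect_coding_tag(data: bytes):
    """Look for a coding tag in the first or second input line."""
    for line in data.split(b"\n", 2)[:2]:
        region = TAG_RE.match(line)
        if region:
            coding = CODING_RE.search(region.group(1))
            if coding:
                # Tags are compared case-insensitively.
                return coding.group(1).decode("ascii").lower()
    return None


def guess_encoding(data: bytes, explicit=None, default=None):
    """Stop at the first method that succeeds (simplified sketch)."""
    return (explicit
            or detect_bom(data)
            or detect_coding_tag(data)
            or default
            or "latin-1")
```

For example, guess_encoding() on input whose first line is
'.\" -*- mode: nroff; coding: utf-8 -*-' returns "utf-8" unless an
explicit encoding is passed or the data starts with a BOM.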
