From 058bc8352d1ecf04be41eb24e77ddd79ee1c0faf Mon Sep 17 00:00:00 2001
From: Assaf Gordon <address@hidden>
Date: Sat, 25 Feb 2017 01:08:28 -0500
Subject: [PATCH] doc: expand "locale considerations" (multibyte) section

Show examples of processing valid and invalid characters.
Mention \L,\U for s/// command.
Combines reports from:
 https://bugs.debian.org/500501
 https://lists.gnu.org/archive/html/coreutils/2017-02/msg00039.html

* doc/sed.texi (Locale Considerations): Expand section.
* doc/config.texi: Add variables to render unicode characters portably.
---
 doc/config.texi |  32 +++++++++
 doc/sed.texi    | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 234 insertions(+), 7 deletions(-)

diff --git a/doc/config.texi b/doc/config.texi
index 42c0b76..2ff9440 100644
--- a/doc/config.texi
+++ b/doc/config.texi
@@ -33,3 +33,35 @@
 @end macro
 @end ifcommandnotdefined
 @end ifset
+
+
+@c define variables that will render as characters
+@c on both HTML (with @U{}) and PDF (with Greek symbols).
+@c Use with: @value{ucsigma}
+@c
+@c Based on:
+@c http://lists.gnu.org/archive/html/help-texinfo/2012-06/msg00004.html
+@iftex
address@hidden ucsigma @address@hidden
+@end iftex
+@ifnottex
+@set ucsigma @U{03A3}
+@end ifnottex
+
+@iftex
address@hidden lcsigma @address@hidden
+@end iftex
+@ifnottex
+@set lcsigma @U{03C3}
+@end ifnottex
+
+@c Unicode Replacement Character (U+FFFD):
+@c no easy/portable TeX equivalent, so use another
+@c distinct symbol (which will be rendered very differently
+@c than ASCII characters in @examples).
+@iftex
address@hidden unicodeFFFD @address@hidden
+@end iftex
+@ifnottex
+@set unicodeFFFD @U{FFFD}
+@end ifnottex
diff --git a/doc/sed.texi b/doc/sed.texi
index 92bff01..e00eb36 100644
--- a/doc/sed.texi
+++ b/doc/sed.texi
@@ -2441,7 +2441,7 @@ $ seq 10 | sed -n '6,~4p'
 * regexp extensions::        Additional regular expression commands
 * Back-references and Subexpressions:: Back-references and Subexpressions
 * Escapes::                  Specifying special characters
-* Locale Considerations::
+* Locale Considerations::    Multibyte characters and locale considerations
 @end menu
 
 @node Regular Expressions Overview
@@ -3357,16 +3357,207 @@ a^c
 
 
 @node Locale Considerations
-@section Locale Considerations
+@section Multibyte characters and Locale Considerations
 
-TODO: fix following paragraphs (copied verbatim from 'bracket
-expression' section).
+@command{sed} processes valid multibyte characters in multibyte locales
+(e.g. @code{UTF-8}).  @footnote{Some regexp edge cases depend on the
+operating system and libc implementation.  The examples shown are known
+to work as expected on GNU/Linux systems using glibc.}
 
-TODO: mention locale support is heavily dependent on the OS/libc, not on sed.
address@hidden The following example uses the Greek letter Capital Sigma
+(@value{ucsigma},
+Unicode code point @code{0x03A3}). In a @code{UTF-8} locale,
+@command{sed} correctly processes the Sigma as one character despite
+it being 2 octets (bytes):
 
-The current locale affects the characters matched by @command{sed}'s
-regular expressions.
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ locale | grep LANG
+LANG=en_US.UTF-8
+
+$ printf 'a\u03A3b'
+a@value{ucsigma}b
+
+$ printf 'a\u03A3b' | sed 's/./X/g'
+XXX
+
+$ printf 'a\u03A3b' | od -tx1 -An
+ 61 ce a3 62
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
+To force @command{sed} to process octets separately, use the @code{C} locale
+(also known as the @code{POSIX} locale):
+
address@hidden on
address@hidden on
address@hidden
+$ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
+XXXX
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Invalid multibyte characters
+
+@command{sed}'s regular expressions @emph{will not} match
+invalid multibyte sequences in a multibyte locale.
+
address@hidden
+In the following examples, the octet @code{0xCE} is
+an incomplete multibyte character (shown here as @value{unicodeFFFD}).
+The regular expression @samp{.} does not match it:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEb\n'
+a@value{unicodeFFFD}b
+
+$ printf 'a\xCEb\n' | sed 's/./X/g'
+X@value{unicodeFFFD}X
+
+$ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
+  58  ce  58  0a
+   X      X   \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Similarly, the 'catch-all' regular expression @samp{.*} will not
+match the entire line:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
+  ce  63  0a
+       c  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
+@command{sed} offers the special @command{z} command, which clears the
+current pattern space regardless of invalid multibyte characters
+(i.e. it works like @code{s/.*//} but also removes invalid multibyte
+characters):
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
+  0a
+  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Alternatively, force the @code{C} locale to process
+each octet separately (every octet is a valid character in the @code{C}
+locale):
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
+  0a
+  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
+
+@command{sed}'s inability to process invalid multibyte characters
+can be used to detect such invalid sequences in a file.
+In the following examples, @code{\xCE\xCE} is an invalid
+multibyte sequence, while @code{\xCE\xA3} is a valid multibyte sequence
+(the Greek Sigma character).
+
address@hidden
+The following @command{sed} program removes all valid
+characters using @code{s/.//g}.  Any content left in the pattern space
+(the invalid characters) is appended to the hold space using the
+@code{H} command.  On the last line (@code{$}), the hold space is retrieved
+(@code{x}), newlines are removed (@code{s/\n//g}), and any remaining
+octets are printed unambiguously (@code{l}).  Thus, any invalid
+multibyte sequences will be printed as octal values:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
+
+$ cat invalid.txt
+ab
+c
+@value{unicodeFFFD}@value{unicodeFFFD}de
+@value{ucsigma}f
 
+$ sed -n 's/.//g ; H ; $@{x;s/\n//g;l@}' invalid.txt
+\316\316$
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden With a few more commands, @command{sed} can print
+the exact number of the line that contains the invalid characters (line 3).
+These characters can then be removed by forcing @code{C} locale
+and using octal escape sequences:
+
address@hidden on
address@hidden on
address@hidden
+$ sed -n 's/.//g;=;l' invalid.txt | paste - -  | awk '$2!="$"'
+3       \316\316$
+
+$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Upper/Lower case conversion
+
+
+@command{sed}'s substitute command (@code{s}) supports upper/lower
+case conversions using the @code{\U} and @code{\L} codes.
+These conversions support multibyte characters:
+
address@hidden on
address@hidden on
address@hidden
+$ printf 'ABC\u03a3\n'
+ABC@value{ucsigma}
+
+$ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
+abc@value{lcsigma}
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
address@hidden "s" Command}.
+
+
address@hidden Multibyte regexp character classes
+
+@c TODO: fix following paragraphs (copied verbatim from 'bracket
+@c expression' section).
 
 In other locales, the sorting sequence is not specified, and
 @samp{[a-d]} might be equivalent to @samp{[abcd]} or to
@@ -3389,11 +3580,15 @@ in the current locale.
 
 TODO: show example of collation
 
address@hidden on
address@hidden on
 @example
 # TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
 $ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
 clichX
 @end example
address@hidden off
address@hidden off
 
 
 @node advanced sed
-- 
2.10.2