From 058bc8352d1ecf04be41eb24e77ddd79ee1c0faf Mon Sep 17 00:00:00 2001
From: Assaf Gordon
Date: Sat, 25 Feb 2017 01:08:28 -0500
Subject: [PATCH] doc: expand "locale considerations" (multibyte) section

Show examples of processing valid and invalid characters.
Mention \L,\U for s/// command.

Combines reports from:
https://bugs.debian.org/500501
https://lists.gnu.org/archive/html/coreutils/2017-02/msg00039.html

* doc/sed.texi (Locale Consideration): Expand section.
* doc/config.texi: Add variables to render unicode characters portably.
---
 doc/config.texi |  32 +++++++++
 doc/sed.texi    | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 234 insertions(+), 7 deletions(-)

diff --git a/doc/config.texi b/doc/config.texi
index 42c0b76..2ff9440 100644
--- a/doc/config.texi
+++ b/doc/config.texi
@@ -33,3 +33,35 @@
 @end macro
 @end ifcommandnotdefined
 @end ifset
+
+
address@hidden define variables that will render as characters
address@hidden on both HTML (with @U{}) and PDF (with greek symbols).
address@hidden Use with: @value{ucsigma}
address@hidden
address@hidden Based on:
address@hidden http://lists.gnu.org/archive/html/help-texinfo/2012-06/msg00004.html
address@hidden
address@hidden ucsigma @address@hidden
address@hidden iftex
address@hidden
address@hidden ucsigma @U{03A3}
address@hidden ifnottex
+
address@hidden
address@hidden lcsigma @address@hidden
address@hidden iftex
address@hidden
address@hidden lcsigma @U{03C3}
address@hidden ifnottex
+
address@hidden Unicode Replacement Character (U+FFFD):
address@hidden no easy/portable tex equivalent, so use another
address@hidden distinct symbol (which will be rendered very differently
address@hidden than ASCII characters in @examples).
address@hidden
address@hidden unicodeFFFD @address@hidden
address@hidden iftex
address@hidden
address@hidden unicodeFFFD @U{FFFD}
address@hidden ifnottex
diff --git a/doc/sed.texi b/doc/sed.texi
index 92bff01..e00eb36 100644
--- a/doc/sed.texi
+++ b/doc/sed.texi
@@ -2441,7 +2441,7 @@ $ seq 10 | sed -n '6,~4p'
 * regexp extensions:: Additional regular expression commands
 * Back-references and Subexpressions:: Back-references and Subexpressions
 * Escapes:: Specifying special characters
-* Locale Considerations::
+* Locale Considerations:: Multibyte characters and locale considerations
 @end menu

 @node Regular Expressions Overview
@@ -3357,16 +3357,207 @@ a^c
 @node Locale Considerations
address@hidden Locale Considerations
address@hidden Multibyte characters and Locale Considerations

-TODO: fix following paragraphs (copied verbatim from 'bracket
-expression' section).
address@hidden processes valid multibyte characters in multibyte locales
+(e.g. @code{UTF-8}). @footnote{Some regexp edge cases depend on the
+operating system and libc implementation. The examples shown are known
+to work as expected on GNU/Linux systems using glibc.}

-TODO: mention locale support is heavily dependent on the OS/libc, not on sed.
address@hidden The following example uses the Greek letter Capital Sigma
+(@value{ucsigma},
+Unicode code point @code{0x03A3}). In a @code{UTF-8} locale,
address@hidden correctly processes the Sigma as one character despite
+it being 2 octets (bytes):

-The current locale affects the characters matched by @command{sed}'s
-regular expressions.
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ locale | grep LANG
+LANG=en_US.UTF-8
+
+$ printf 'a\u03A3b'
address@hidden
+
+$ printf 'a\u03A3b' | sed 's/./X/g'
+XXX
+
+$ printf 'a\u03A3b' | od -tx1 -An
+ 61 ce a3 62
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
+To force @command{sed} to process octets separately, use the @code{C} locale
+(also known as the @code{POSIX} locale):
+
address@hidden on
address@hidden on
address@hidden
+$ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
+XXXX
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Invalid multibyte characters
+
address@hidden's regular expressions @emph{will not} match
+invalid multibyte sequences in a multibyte locale.
+
address@hidden
+In the following examples, the octet @code{0xCE} is
+an incomplete multibyte character (shown here as @value{unicodeFFFD}).
+The regular expression @samp{.} does not match it:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEb\n'
address@hidden
+
+$ printf 'a\xCEb\n' | sed 's/./X/g'
address@hidden
+
+$ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
+  58  ce  58  0a
+   X       X  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Similarly, the 'catch-all' regular expression @samp{.*} will not
+match the entire line:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
+  ce  63  0a
+       c  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
address@hidden offers the special @command{z} which can clear the
+current pattern space regardless of invalid multibyte characters
+(i.e. it works like @code{s/.*//} but will also remove invalid multibyte
+characters):
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
+  0a
+  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Alternatively, force the @code{C} locale to process
+each octet separately (every octet is a valid character in the @code{C}
+locale):
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
+  0a
+  \n
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
+
address@hidden's inability to process invalid multibyte characters
+can be used to detect such invalid sequences in a file.
+In the following examples, @code{\xCE\xCE} is an invalid
+multibyte sequence, while @code{\xCE\xA3} is a valid multibyte sequence
+(of the Greek Sigma character).
+
address@hidden
+The following @command{sed} program removes all valid
+characters using @code{s/.//g}. Any content left in the pattern space
+(the invalid characters) is added to the hold space using the
address@hidden command. On the last line (@code{$}), the hold space is retrieved
+(@code{x}), newlines are removed (@code{s/\n//g}), and any remaining
+octets are printed unambiguously (@code{l}). Thus, any invalid
+multibyte sequences will be printed as octal values:
+
address@hidden on
address@hidden on
address@hidden
address@hidden
+$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt
+
+$ cat invalid.txt
+ab
+c
address@hidden@value{unicodeFFFD}de
address@hidden
+$ sed -n 's/.//g ; H ; address@hidden;s/\n//g;address@hidden' invalid.txt
+\316\316$
address@hidden group
address@hidden example
address@hidden off
address@hidden off
+
address@hidden With a few more commands, @command{sed} can print
+the exact line number which contains the invalid characters (line 3).
+These characters can then be removed by forcing @code{C} locale
+and using octal escape sequences:
+
address@hidden on
address@hidden on
address@hidden
+$ sed -n 's/.//g;=;l' invalid.txt | paste - - | awk '$2!="$"'
+3 \316\316$
+
+$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
address@hidden example
address@hidden off
address@hidden off
+
address@hidden Upper/Lower case conversion
+
+
address@hidden's substitute command (@code{s}) supports upper/lower
+case conversions using @code{\U},@code{\L} codes.
+These conversions support multibyte characters:
+
address@hidden on
address@hidden on
address@hidden
+$ printf 'ABC\u03a3\n'
address@hidden
+
+$ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
address@hidden
address@hidden example
address@hidden off
address@hidden off
+
address@hidden
address@hidden "s" Command}.
+
+
address@hidden Multibyte regexp character classes
+
address@hidden TODO: fix following paragraphs (copied verbatim from 'bracket
address@hidden expression' section).

 In other locales, the sorting sequence is not specified, and
 @samp{[a-d]} might be equivalent to @samp{[abcd]} or to
@@ -3389,11 +3580,15 @@ in the current locale.

 TODO: show example of collation

address@hidden on
address@hidden on
 @example
 # TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
 $ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
 clichX
 @end example
address@hidden off
address@hidden off

 @node advanced sed
--
2.10.2