tr(1) with multibyte character support


From: Assaf Gordon
Subject: tr(1) with multibyte character support
Date: Fri, 15 Sep 2017 01:15:57 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

Hello,

I'm looking into adding multibyte support to tr(1), and would
appreciate some feedback.


1. "-C" vs "-c"
---------------

The POSIX tr(1) page says:
"-c  Complement the set of values specified by string1.
 -C  Complement the set of characters specified by string1."
( http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html )

This I take to mean:
    "-c" is single bytes (=values) regardless of locale,
    "-C" is multibyte characters, depending on locale.

First,
Is the above correct?


Second,
Assuming it is correct, is the following expected output correct?

The UTF-8 sequence '\316\243' is U+03A3 GREEK CAPITAL LETTER SIGMA 'Σ'.
The UTF-8 sequence '\316\250' is U+03A8 GREEK CAPITAL LETTER PSI 'Ψ'.

POSIX unibyte locale and lower-case "-c":

  printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250'
  => '\316\316\250'


UTF-8 locale but lower-case "-c", input set should be treated
as two separate single-byte octets:

  printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250'
  => '\316\316\250'

POSIX unibyte locale and upper-case "-C", input set should be treated
as two separate single-byte octets:

  printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250'
  => '\316\316\250'


UTF-8 locale with upper-case "-C", input set is one multibyte character:

  printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dC '\316\250'
  => '\316\250'
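
For reference, the two C-locale cases above can already be checked
against the current byte-oriented tr. This is only a minimal sketch,
assuming GNU tr (which accepts "-C" as a synonym for "-c") together with
od and xargs; the "octal" helper is a name made up for illustration. The
UTF-8 cases describe proposed behavior and cannot be verified this way.

```shell
# Hypothetical helper: print stdin as space-separated octal byte values.
octal() {
  od -An -v -to1 | xargs
}

# C locale, lower-case -c: keep only the byte values \316 and \250.
printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250' | octal
# prints: 316 316 250

# C locale, upper-case -C: same result, since every character is one byte.
printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250' | octal
# prints: 316 316 250
```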





2. Invalid multibyte sequences in SET1/SET2 parameters
------------------------------------------------------

I assume that invalid multibyte sequences in the *input* file
must be output as-is (in accordance with other coreutils programs).

However, what about invalid sequences in SET1/SET2 parameters?
Can we reject them (and fail/refuse to run)?

That is, in POSIX locale, both of these are valid and mean the same
thing (delete two octet values):

     LC_ALL=C tr -d '\316\250'
     LC_ALL=C tr -d '\250\316'

But in a UTF-8 locale, should we accept the invalid sequence:

     LC_ALL=en_US.UTF8 tr -d '\250\316'

and treat it (silently) as two separate octets, or should we exit with
an error message (e.g. "SET1 is not valid in this locale")?
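
To make the distinction concrete, here is one way to test from the shell
whether a byte sequence is well-formed UTF-8. It is only a sketch: it
assumes an iconv(1) that exits non-zero on an illegal input sequence (as
the glibc one does), and "is_valid_utf8" is a hypothetical helper, not
something tr provides or how tr would necessarily do the check.

```shell
# Hypothetical helper: exit status 0 only if stdin is well-formed UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# '\316\250' is the complete two-byte encoding of U+03A8.
printf '\316\250' | is_valid_utf8 && echo 'valid'

# '\250\316' starts with a continuation byte, so it is rejected.
printf '\250\316' | is_valid_utf8 || echo 'invalid'
```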




3. Backward incompatibility
---------------------------

Also related to the previous item,
I think tr(1) might be a case where adding multibyte support could break
existing scripts, and be seen as a regression by users.
If someone has been using commands like
   tr -d '\200-\377'
   tr -d '\316\250'
and these have worked for many years regardless of locale, adding
multibyte support might change their behavior.

What do you think? Perhaps this usage is not so common, and it won't be
too big of a disruption?
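
Whatever is decided, scripts that rely on byte-wise behavior can make
that explicit by pinning the locale. A minimal sketch, assuming the C
locale stays byte-oriented; "strip_high_bytes" is a name made up for
illustration:

```shell
# Hypothetical wrapper: delete every byte in the range 0x80-0xFF,
# forcing the C locale so the SET is always read as byte values.
strip_high_bytes() {
  LC_ALL=C tr -d '\200-\377'
}

# Both bytes of U+03A3 ('\316\243') are removed, leaving plain ASCII.
printf 'a\316\243b\n' | strip_high_bytes
# prints: ab
```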





thanks for reading,
 - assaf










