coreutils
multibyte support (round 4) - tr


From: Assaf Gordon
Subject: multibyte support (round 4) - tr
Date: Mon, 11 Dec 2017 02:14:11 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

Hello,

Some progress on multibyte support: partial multibyte processing in 'tr'. Currently only delete and squeeze work (and not efficiently);
translation and the -C/-c differences are not implemented yet.

The patch is getting too big to attach, so it is available here:
https://files.housegordon.org/src/coreutils-multibyte-2017-12-11.patch.xz
(perhaps a non-master branch on the savannah git would be better?)

The patch includes all previous code, and the last four commits
are the 'tr' implementation. Below are commit messages with examples.

For those interested, past information is available here:
  https://crashcourse.housegordon.org/coreutils-multibyte-support.html

Comments welcome,
 - assaf


====
commit f6bb9c906eaf1644f18e66fedc211a8de91057d1
Author: Assaf Gordon <address@hidden>
Date:   Fri Dec 8 21:39:45 2017 -0700

    tr: add --debug option

    Prints the content of the SET(s).
    In future patches, print multibyte-related information.

    Example:
       $ ./src/tr --debug -d 'A-Z[:digit:]\250'
       ./src/tr: hard_LC_COLLATE: yes
       ./src/tr: operating mode: delete (-d)
       ./src/tr: set: set1
       ./src/tr:   logical length: 37
       ./src/tr:   indefinite repeats: no
       ./src/tr:   has_equiv_class: no
       ./src/tr:   has_char_class: yes
       ./src/tr:   has_restricted_char_class: yes
       ./src/tr:   SpecList:
       ./src/tr:     RANGE: 'A'-'Z' (0x41 - 0x5a)
       ./src/tr:     CHAR_CLASS: [:digit:]
       ./src/tr:     NORMAL_CHAR: '' (0xa8)



commit c40e2aebe07f57b23614bb764959ece1f2156944
Author: Assaf Gordon <address@hidden>
Date:   Fri Dec 8 23:37:36 2017 -0700

    tr: support multibyte characters in SETs parameters

    The typical tr command line is 'tr SET1 SET2'
    (or 'tr -d SET1' 'tr -ds SET1 SET2' etc.)

    Previously there were only 5 types of elements in SETs:
      single character (=octet)
      range
      repeated character (=octet)
      character class (e.g. [:alpha:])
      equivalence class (e.g. [=e=])

    This adds a new element type: wide character.
    These are stored only if:
    1. The current locale supports multibyte characters
    2. The multibyte sequence is valid
    3. The sequence is indeed multibyte (single octets are stored
       as before)
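
The three conditions above can be illustrated with a small shell sketch. This is not code from the patch: `classify_set_element` is a hypothetical helper, it assumes UTF-8 is the only multibyte charmap of interest, it assumes an `iconv` command is available for validation, and it looks at a single candidate element at a time.

```shell
# Hypothetical helper, not part of the patch: classify one SET element.
#   $1 = locale charmap (for example the output of `locale charmap`)
#   $2 = the raw bytes of a single candidate element
classify_set_element() {
    charmap=$1
    bytes=$2

    # Condition 1: the locale must be multibyte (this sketch only knows UTF-8).
    case $charmap in
        UTF-8|UTF8|utf8) ;;
        *) echo octet; return ;;
    esac

    # Condition 2: the sequence must be valid in that encoding.
    # Assumption: iconv exits non-zero on invalid or truncated input.
    if ! printf '%s' "$bytes" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo octet
        return
    fi

    # Condition 3: a single octet is stored as before.
    n=$(printf '%s' "$bytes" | wc -c | tr -d ' ')
    if [ "$n" -gt 1 ]; then
        echo wide
    else
        echo octet
    fi
}

classify_set_element UTF-8 "$(printf '\316\250')"            # the two octets of Ψ
classify_set_element UTF-8 A                                 # a single octet
classify_set_element UTF-8 "$(printf '\316')"                # truncated sequence
classify_set_element ANSI_X3.4-1968 "$(printf '\316\250')"   # non-multibyte charmap
```

Under these assumptions only the first call should be classified as a wide character; the other three fall back to octet handling.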

    Multibyte characters can only be specified in multibyte locales,
    either with new-style shell escapes ($'...') or by entering the
    character directly:
        LC_ALL=en_CA.UTF-8 tr -d $'\316\250'
        LC_ALL=en_CA.UTF-8 tr -d 'Ψ'

    Escape sequences (which are un-escaped by tr itself) are never treated
    as multibyte characters. The following always deletes two octets
    (\316 and \250) regardless of active locale:
        tr -d '\316\250'

    This likely deviates from POSIX; it is discussed here:
    https://lists.gnu.org/r/coreutils/2017-09/msg00028.html
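
To make the distinction concrete, the sketch below shows what tr receives as its SET argument in each case, and what octet-wise deletion does. It runs against whatever stock `tr` is installed, not the patched build, and uses `printf` with octal escapes as a portable stand-in for the shell's `$'\316\250'` quoting. The variable names are illustrative only.

```shell
# Bytes that reach tr's argv in each case.
shell_escaped=$(printf '\316\250')   # two raw octets: the UTF-8 encoding of Ψ
tr_escaped='\316\250'                # eight ASCII characters, un-escaped later by tr

printf '%s' "$shell_escaped" | wc -c   # byte count of the shell-escaped form
printf '%s' "$tr_escaped" | wc -c      # byte count of the tr-escaped form

# Octet-wise deletion removes \316 and \250 independently of each other.
# Ш is \320\250 in UTF-8, so only its second octet is deleted and a
# stray \320 is left behind.
printf '\316\250' | LC_ALL=C tr -d '\316\250' | od -An -tx1
printf '\320\250' | LC_ALL=C tr -d '\316\250' | od -An -tx1
```

The last command is the reason the two forms are worth keeping apart: an octet-level SET can split an unrelated multibyte character, whereas a SET entered as the character itself is intended to match only that character.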



commit 3083161add6a2f14a32718bf755cc2d3da2e8765
Author: Assaf Gordon <address@hidden>
Date:   Sat Dec 9 00:58:05 2017 -0700

    tr: optimize by skipping multibyte processing if possible

    Under certain conditions it is safe to process the input as octets,
    skipping multibyte decoding and validation.

    These conditions are discussed here (bottom of text):
    https://lists.gnu.org/r/coreutils/2017-09/msg00028.html

    An undocumented option (tr ---force-multibyte) disables the optimization
    in order to exercise the MB code path.



commit c5f812bab3602613e1140bd5d9e92d14097bc8dd
Author: Assaf Gordon <address@hidden>
Date:   Sat Dec 9 02:22:11 2017 -0700

    tr: implement multibyte delete/squeeze

    The following examples work:

    Delete character class:

      $ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -d '[:alpha:]'
      123

      $ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -d '[:lower:]'
      AЩΣ123ΠĚ

    Delete + complement:

      $ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc '[:lower:][:cntrl:]'
      aщπfg

      $ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc '[:upper:][:cntrl:]'
      AЩΣΠĚ

      $ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc 'Σ'
      Σ

    Squeeze repeated characters:

      $ echo "ЩЩЩщщщщ" | ./src/tr -s 'щ'
      ЩЩЩщ

      $ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
                          | ./src/tr -s '[:lower:]'
      aAAAAЩЩЩЩщΠΠΠΠπfĚg

    Squeeze + complement:

      $ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
                          | ./src/tr -c -s '[:lower:]'
      aaaAЩщщщщΠππππfĚg

    Delete + Squeeze:

      $ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
                 | ./src/tr -d -s '[:upper:]' '[:lower:]'
      aщπfg





