coreutils

Re: multibyte processing - handling invalid sequences (long)


From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 23:23:22 -0400

Hello,

> On Jul 21, 2016, at 06:08, Pádraig Brady <address@hidden> wrote:
> [...]
> It seems like --normalization={NFKD,NFKD,NFC,NFD} functionality would
> also be quite cohesive in such a util.

Attached is an improved version with Unicode normalization support.

Before continuing with the rest (more tests, documentation, news, etc.),
it's worth discussing whether this is the path to take, or whether we want to
add this to each individual utility instead.
Also, do we keep these option names or change them?
For example, 'uconv' uses different terminology for handling invalid sequences:
stop, skip, substitute, escape (corresponding to abort, discard, replace, recode
below).
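
To make the four policies concrete, here is a minimal Python sketch that maps them onto Python's UTF-8 decoder error handlers. This is only an analogy, not the mbfix implementation: the `fix` function, the `mbfix-recode` handler name, and the policy table are hypothetical names made up for illustration, and Python's treatment of truncated multibyte sequences may differ from what the patch does.

```python
import codecs


def recode_handler(err):
    # Hypothetical analogue of --recode-format='<0x%02x>':
    # render each invalid octet with a printf-style format.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("<0x%02x>" % b for b in bad), err.end


# Register the handler under an illustrative (made-up) name.
codecs.register_error("mbfix-recode", recode_handler)

# Assumed mapping from the policy names in this thread to Python handlers.
POLICIES = {
    "abort": "strict",         # stop at the first invalid sequence
    "discard": "ignore",       # drop invalid octets
    "replace": "replace",      # substitute U+FFFD
    "recode": "mbfix-recode",  # escape invalid octets as text
}


def fix(data, policy="replace"):
    """Decode UTF-8 input, handling invalid octets per the given policy."""
    return data.decode("utf-8", errors=POLICIES[policy])


sample = b"ab\xffcd"  # 0xff can never appear in valid UTF-8
print(repr(fix(sample, "discard")))
print(repr(fix(sample, "replace")))
print(repr(fix(sample, "recode")))
```

With the 'abort' policy the same input raises UnicodeDecodeError, which roughly corresponds to uconv's 'stop'.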

To keep the implementation simple, Unicode normalization requires a UTF-8
locale. Is this an acceptable requirement?

And of course, what about the name?

Comments welcome,
 - assaf




Example (from the book 'Unicode Explained'):
===========
$ printf '\uFB01anc\u00E9\n'
fiancé

$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfd | od -An -tx1c
  ef  ac  81  61  6e  63  65  cc  81  0a
   ?   ? 201   a   n   c   e   ? 201  \n

$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfc | od -An -tx1c
  ef  ac  81  61  6e  63  c3  a9  0a
   ?   ? 201   a   n   c   ?   ?  \n

$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkd | od -An -tx1c
  66  69  61  6e  63  65  cc  81  0a
   f   i   a   n   c   e   ? 201  \n

$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkc | od -An -tx1c
  66  69  61  6e  63  c3  a9  0a
   f   i   a   n   c   ?   ?  \n

$ ./src/mbfix --help
Usage: ./src/mbfix [OPTION]... [FILE]...
Fix and adjust multibyte character in files

Mandatory arguments to long options are mandatory for short options too.
  -A, --abort          same as --policy=abort
  -C, --recode         same as --policy=recode
  -c, --check          validate input, no output
  -D, --discard        same as --policy=discard
  -n, --normalization=NORM
                       apply unicode normalization NORM:, one of:
                       nfd, nfc, nfkd, nfkc. Normalization requires
                       UTF-8 locales.
  -p, --policy=POLICY  invalid-input policy: discard, abort
                       replace (default), recode
  -R, --replace        same as --policy=replace
      --replace-char=N
                       with 'replace' policy, use unicode character N
                       (default: 0xFFFD 'REPLACEMENT CHARACTER')
      --recode-format=FMT
                       with 'recode' policy, recode invalid octets
                       using FMT printf-format (default: '<0x%02x>')
  -v, --verbose        report location of invalid input
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/mbfix>
or available locally via: info '(coreutils) mbfix invocation'
====
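
For anyone who wants to cross-check the four normalization runs in the example above without building the patch, the following sketch uses Python's standard `unicodedata` module on the same input string. It is an independent check, not how mbfix does it; the helper name `nf_hex` is made up for illustration.

```python
import unicodedata

# Same input as the printf example: U+FB01 (fi ligature) + "anc" + U+00E9 + newline.
text = "\ufb01anc\u00e9\n"


def nf_hex(form, s=text):
    """Normalize s to the given form and return its UTF-8 bytes as hex."""
    data = unicodedata.normalize(form, s).encode("utf-8")
    return " ".join("%02x" % b for b in data)


for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(form.lower().ljust(5), nf_hex(form))
```

The hex bytes printed for each form should match the corresponding 'od' lines above: only the compatibility forms (nfkd, nfkc) split the ligature into 'f' and 'i', and only the decomposed forms (nfd, nfkd) separate the 'e' from its combining acute accent.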


Attachment: 0001-mbfix-a-new-program-to-fix-invalid-multibyte-files.patch.xz
Description: Binary data

