sed-devel

Re: more on failing test 'invalid-mb-seq-UMR.sh'


From: Assaf Gordon
Subject: Re: more on failing test 'invalid-mb-seq-UMR.sh'
Date: Fri, 17 Jun 2016 00:06:34 -0400

Hello,

> On Jun 5, 2016, at 01:16, Assaf Gordon <address@hidden> wrote:
> 
> The test 'invalid-mb-seq-UMR.sh' still fails on few systems even with the 
> latest update [1].

The search continues...

I think I noticed strange (wrong?) behavior, specific to Mac OS X (and perhaps
a few other OSes), with the ja_JP.eucJP and ja_JP.sjis locales.
It seems that with these locales, mbrtowc(3) returns incorrect results.
Test program attached: it calls mbrtowc(3) to convert a string starting
with '\262' (=\xB2).
On its own, this byte is an invalid UTF-8 sequence, but it is a valid
ja_JP.shiftjis character.
I haven't yet found an authoritative answer as to whether it is a valid
ja_JP.eucJP character, but I suspect it is not:

  $ env printf '\262' | iconv -f EUC-JP -t UTF-16BE
  iconv: (stdin):1:0: incomplete character or shift sequence


Tested with:
    gcc -o test-ilseq test-ilseq.c
    for l in $(locale -a | grep 'ja_JP\.') ; do
       echo LOCALE=$l ; LC_ALL=$l ./test-ilseq
    done


On Ubuntu 14.04, results seem correct:

  LOCALE=ja_JP.eucjp
  test-ilseq: mbrtowc failed (n=-1): Invalid or incomplete multibyte or wide character
  LOCALE=ja_JP.shiftjis
  mbtowc returned 1, wc = 65394 / ff72
  LOCALE=ja_JP.utf8
  test-ilseq: mbrtowc failed (n=-1): Invalid or incomplete multibyte or wide character


On Mac OS X, results are strange:
1. The conversion succeeds in 'eucJP', and also produces 2 characters.
   This is one source of the failing sed test (invalid-mb-seq-UMR.sh),
   as it consumes 1 byte from the input string and produces two bytes.

2. The conversion is incorrect in 'SJIS': it should return the 2-byte value
   0xFF72, not the single byte 0xB2 (which is just copied from the input).

  LOCALE=ja_JP.eucJP
  mbtowc returned 2, wc = 45795 / b2e3
  LOCALE=ja_JP.SJIS
  mbtowc returned 1, wc = 178 / b2
  LOCALE=ja_JP.UTF-8
  test-ilseq: mbrtowc failed (n=-1): Illegal byte sequence



Possible solutions:
1. change the locale tested in 'invalid-mb-seq-UMR.sh' to ja_JP.UTF-8
2. use gnulib's mbrtowc() in such cases (though this is quite hard to detect
   if the system doesn't have ja_JP.eucJP locales).


to be continued,
 - assaf


Attachment: test-ilseq.c
Description: Binary data



