Re: more on failing test 'invalid-mb-seq-UMR.sh'
From: Assaf Gordon
Subject: Re: more on failing test 'invalid-mb-seq-UMR.sh'
Date: Fri, 17 Jun 2016 00:06:34 -0400
Hello,
> On Jun 5, 2016, at 01:16, Assaf Gordon <address@hidden> wrote:
>
> The test 'invalid-mb-seq-UMR.sh' still fails on few systems even with the
> latest update [1].
The search continues...
I think I noticed a strange (wrong?) behavior, specific to Mac OS X (and perhaps
a few other OSes), with the ja_JP.eucJP and ja_JP.SJIS locales.
With these locales, mbrtowc(3) seems to return incorrect results.
Test program attached: it calls mbrtowc(3) to convert a string starting
with '\262' (=\xB2).
This byte is invalid at the start of a UTF-8 character, but is a valid
ja_JP.shiftjis character.
I haven't yet found an authoritative answer as to whether it is a valid
ja_JP.eucJP character, but I suspect it is not:
$ env printf '\262' | iconv -f EUC-JP -t UTF-16BE
iconv: (stdin):1:0: incomplete character or shift sequence
Tested with:
gcc -o test-ilseq test-ilseq.c
for l in $(locale -a | grep 'ja_JP\.') ; do
echo LOCALE=$l ; LC_ALL=$l ./test-ilseq
done
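
For readers without the attachment, here is a minimal sketch of what such a probe could look like. It is not the attached test-ilseq.c: the helper name probe_first_char, its return codes, and the exact message wording are assumptions made for illustration. Only mbrtowc(3) and the '\262' input come from the description above.

```c
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from the attached test-ilseq.c.
   Switch to LOCALE_NAME ("" means: take it from the environment, e.g.
   LC_ALL), decode the first multibyte character of BUF (LEN bytes),
   and print the result.
   Returns the number of bytes consumed, -1 for an invalid sequence,
   -2 for an incomplete one, or -3 if the locale is unavailable.  */
int
probe_first_char (const char *locale_name, const char *buf, size_t len,
                  wchar_t *wc)
{
  mbstate_t state;
  size_t n;

  if (setlocale (LC_ALL, locale_name) == NULL)
    {
      printf ("locale '%s' is not available\n", locale_name);
      return -3;
    }

  memset (&state, 0, sizeof state);   /* start from the initial state */
  errno = 0;
  n = mbrtowc (wc, buf, len, &state);

  if (n == (size_t) -1)
    {
      /* Invalid sequence: mbrtowc sets errno to EILSEQ.  */
      printf ("mbrtowc failed (n=-1): %s\n", strerror (errno));
      return -1;
    }
  if (n == (size_t) -2)
    {
      /* LEN bytes form only a prefix of a valid character.  */
      printf ("mbrtowc: incomplete character (n=-2)\n");
      return -2;
    }

  /* Success: n is the number of input bytes consumed (0 for a NUL).  */
  printf ("mbrtowc returned %d, wc = %lu / %lx\n",
          (int) n, (unsigned long) *wc, (unsigned long) *wc);
  return (int) n;
}
```

Called from a small main() as probe_first_char ("", bytes, strlen (bytes), &wc), with bytes holding the string that starts with '\262', it can be driven by the LC_ALL loop above. Note that the return value on success is the number of input bytes consumed, not the size of the resulting wide character.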
On Ubuntu 14.04, results seem correct:
LOCALE=ja_JP.eucjp
test-ilseq: mbrtowc failed (n=-1): Invalid or incomplete multibyte or wide
character
LOCALE=ja_JP.shiftjis
mbtowc returned 1, wc = 65394 / ff72
LOCALE=ja_JP.utf8
test-ilseq: mbrtowc failed (n=-1): Invalid or incomplete multibyte or wide
character
On Mac OS X, results are strange:
1. The conversion succeeds in 'eucJP', and also produces 2 characters.
This is a source of the failing test in sed (invalid-mb-seq-UMR.sh),
as it consumes 1 byte from the input string and produces two bytes.
2. The conversion is incorrect in 'SJIS': it should return the 2-byte value
0xFF72, not the one-byte value 0xB2 (which is just copied from the input).
LOCALE=ja_JP.eucJP
mbtowc returned 2, wc = 45795 / b2e3
LOCALE=ja_JP.SJIS
mbtowc returned 1, wc = 178 / b2
LOCALE=ja_JP.UTF-8
test-ilseq: mbrtowc failed (n=-1): Illegal byte sequence
A solution might be to either:
1. change the locale tested in 'invalid-mb-seq-UMR.sh' to ja_JP.UTF-8
2. use gnulib's mbrtowc() in such cases (though this is quite hard to detect if
the system doesn't have ja_JP.eucJP locales).
to be continued,
- assaf
test-ilseq.c
Description: Binary data