sed-devel
Re: more on failing test 'invalid-mb-seq-UMR.sh'


From: Assaf Gordon
Subject: Re: more on failing test 'invalid-mb-seq-UMR.sh'
Date: Fri, 17 Jun 2016 00:39:08 -0400

Corrected mistake below:

> On Jun 17, 2016, at 00:06, Assaf Gordon <address@hidden> wrote:
> 
> [...]
> On Mac OS X, results are strange:
> 1.  The conversion succeeds in 'eucJP', and also produces 2 characters.
> This is a source of the failed test in sed (invalid-mb-seq-UMR.sh),
> as this consumes 1 byte from the input string, and produces two bytes.
> 

The above is incorrect. Should've said:

On Mac OS X,
mbrtowc(3) with input = '\262c' incorrectly returned 2,
meaning it *consumed* two bytes and returned wide-char = 0xB2E3.
Later on, the wide-char is converted back to a multibyte character,
resulting in a 2-byte string.

To expand:

The test 'invalid-mb-seq-UMR.sh' uses '\262C' as input (with additional
upper-case conversion \U ).

On most GNU/Linux systems, the flow is:
1. read '\262c'
2. it is detected as invalid multibyte
3. one byte '\262' is consumed, and written as-is.
4. the next byte 'c' is consumed, and written (as upper case).
5. The final output is '\262C' (0xB2 0x43).
6. Test passes.
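
The flow above can be sketched in C. This is only an illustration of the
"copy undecodable bytes as-is" idea, not sed's actual code; the function
name upcase_with_byte_fallback and its buffer contract are made up for
this example.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not taken from sed): upper-case `len` bytes of `in`
   into `out`, copying bytes that cannot be decoded unchanged.
   `out` must have room for at least len * MB_LEN_MAX bytes.
   Returns the number of bytes written. */
size_t upcase_with_byte_fallback(const char *in, size_t len, char *out)
{
    mbstate_t in_state, out_state;
    size_t written = 0;

    assert(in != NULL && out != NULL);
    memset(&in_state, 0, sizeof in_state);
    memset(&out_state, 0, sizeof out_state);

    while (len > 0) {
        wchar_t wc;
        char buf[MB_LEN_MAX];
        size_t n = mbrtowc(&wc, in, len, &in_state);
        size_t m;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: emit one byte as-is
               and restart decoding from a clean state. */
            out[written++] = *in++;
            len--;
            memset(&in_state, 0, sizeof in_state);
            continue;
        }
        if (n == 0)
            n = 1;  /* a decoded NUL still occupies one input byte */

        m = wcrtomb(buf, (wchar_t)towupper((wint_t)wc), &out_state);
        if (m == (size_t)-1) {
            /* Upper-case form is not encodable: keep the original bytes. */
            memcpy(out + written, in, n);
            written += n;
            memset(&out_state, 0, sizeof out_state);
        } else {
            memcpy(out + written, buf, m);
            written += m;
        }
        in += n;
        len -= n;
    }
    return written;
}
```

To try it against the test input, call setlocale(LC_ALL, "") first and pass
"\262c" with a length of 2. Whether the first byte takes the fallback branch
depends entirely on what the platform's mbrtowc(3) returns for that pair.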

On Mac OS X, the flow is:
1. read '\262c'
2. it is detected as a valid 2-byte multibyte sequence, with a wide-char value of 0xB2E3
3. 2 bytes are consumed ('\262' and 'c').
4. the wide-char is converted back to multibyte 0xB2 0xE3 and written to output.
5. the test fails, since the output is 0xB2 0xE3 instead of the expected 0xB2 0x43.
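
A small probe along these lines can show which of the two flows a given
system follows. It is a sketch only: the helper names probe_first_char and
report_262c are invented here, and the locale name passed in (for example
"ja_JP.eucJP") is an assumption that may be spelled differently or be
missing on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the first character of `s` (n bytes) in the
   current locale and return the raw result of mbrtowc(3). */
size_t probe_first_char(const char *s, size_t n, wchar_t *wc)
{
    mbstate_t state;

    assert(s != NULL && wc != NULL);
    memset(&state, 0, sizeof state);
    return mbrtowc(wc, s, n, &state);
}

/* Hypothetical helper: switch to `locale_name` and report how the two
   bytes 0xB2 'c' are decoded there. */
void report_262c(const char *locale_name)
{
    wchar_t wc = 0;
    size_t r;

    if (setlocale(LC_ALL, locale_name) == NULL) {
        printf("%s: locale not available\n", locale_name);
        return;
    }

    r = probe_first_char("\262c", 2, &wc);
    if (r == (size_t)-1)
        printf("%s: invalid sequence\n", locale_name);
    else if (r == (size_t)-2)
        printf("%s: incomplete sequence\n", locale_name);
    else
        printf("%s: consumed %zu byte(s), wide char 0x%lx\n",
               locale_name, r, (unsigned long)wc);
}
```

Calling report_262c("ja_JP.eucJP") from main() would be expected to report an
invalid sequence where the first flow applies, and two consumed bytes where
the Mac OS X behavior described above applies.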


to be continued,
 - assaf




