bug-gnulib

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Sat, 22 Jul 2023 02:33:04 +0200

[Quick answer on this part:]

Paul Eggert wrote:
> What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell 
> how long the invalid sequence is?

It gets this info from mbrtoc32, which on most platforms gets this info
from mbrtowc. This multibyte scanner knows when the bytes it has seen
so far constitute
  - a complete character, or
  - an invalid character, or
  - an incomplete character (i.e. if additional bytes may lead to a
    complete character).
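
The three outcomes above correspond to the return values of the standard C
function mbrtowc: (size_t) -1 means invalid, (size_t) -2 means incomplete, and
any other value means a complete character was recognized. What follows is a
minimal sketch, not the mbiterf code: it calls mbrtowc directly rather than
mbrtoc32, and the names seq_state and classify_bytes are made up for
illustration.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names, for illustration only. */
enum seq_state { SEQ_COMPLETE, SEQ_INVALID, SEQ_INCOMPLETE };

/* Classify the N bytes starting at S, interpreted in the current locale's
   encoding and starting from the initial conversion state.  */
enum seq_state
classify_bytes (const char *s, size_t n)
{
  mbstate_t state;
  wchar_t wc;
  size_t ret;

  memset (&state, 0, sizeof state);   /* initial conversion state */
  ret = mbrtowc (&wc, s, n, &state);

  if (ret == (size_t) -1)
    return SEQ_INVALID;      /* the bytes cannot start any valid character */
  if (ret == (size_t) -2)
    return SEQ_INCOMPLETE;   /* more bytes may still complete a character */
  return SEQ_COMPLETE;       /* 0 for the null character, else bytes consumed */
}
```

The result depends on the locale selected with setlocale; without such a call
the program runs in the "C" locale, where an ASCII byte such as 'a' is a
complete character.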

For example, for EUC-JP, it has these scanning rules:
  1st byte in [0x00..0x7F] => complete character
  1st byte in [0x8E..0x8F] ∪ [0xA1..0xA8] ∪ [0xB0..0xFE]
       => incomplete character with 1 byte so far
    1st byte != 0x8F and 2nd byte in [0xA1..0xFE]
        => either complete or invalid character
    1st byte == 0x8F and 2nd byte in {0xA2} ∪ [0xA6..0xA7] ∪ [0xA9..0xAB] ∪ [0xB0..0xFE]
        => incomplete character with 2 bytes so far
    1st byte == 0x8F and 2nd byte has another value
        => invalid character
  1st byte has another value
      => invalid character
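
The EUC-JP rules above can be written down as a small classifier over a
prefix of one or two bytes. This is only a sketch of the listed rules, not
the code of any libc: the names eucjp_state and eucjp_classify_prefix are
hypothetical. Two cases are not spelled out in the rules and are handled
by assumption: a first byte other than 0x8F followed by a second byte outside
[0xA1..0xFE] is treated as invalid, and the third byte of a 0x8F sequence is
not modeled at all.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum eucjp_state
{
  EUCJP_COMPLETE,            /* complete character */
  EUCJP_COMPLETE_OR_INVALID, /* a table lookup would decide which */
  EUCJP_INCOMPLETE,          /* additional bytes may complete a character */
  EUCJP_INVALID,             /* invalid character */
  EUCJP_NOT_MODELED          /* prefix length not covered by this sketch */
};

static int
in_range (unsigned char b, unsigned char lo, unsigned char hi)
{
  return lo <= b && b <= hi;
}

/* Classify a prefix P of N bytes, N being 1 or 2.  */
enum eucjp_state
eucjp_classify_prefix (const unsigned char *p, size_t n)
{
  unsigned char b1, b2;

  if (n < 1 || n > 2)
    return EUCJP_NOT_MODELED;

  b1 = p[0];
  if (b1 <= 0x7F)
    return EUCJP_COMPLETE;              /* the 1st byte alone is a character */

  if (!(in_range (b1, 0x8E, 0x8F) || in_range (b1, 0xA1, 0xA8)
        || in_range (b1, 0xB0, 0xFE)))
    return EUCJP_INVALID;               /* 1st byte has another value */

  if (n == 1)
    return EUCJP_INCOMPLETE;            /* incomplete, 1 byte so far */

  b2 = p[1];
  if (b1 != 0x8F)
    /* Assumption: a 2nd byte outside [0xA1..0xFE] is invalid.  */
    return in_range (b2, 0xA1, 0xFE) ? EUCJP_COMPLETE_OR_INVALID
                                     : EUCJP_INVALID;

  if (b2 == 0xA2 || in_range (b2, 0xA6, 0xA7) || in_range (b2, 0xA9, 0xAB)
      || in_range (b2, 0xB0, 0xFE))
    return EUCJP_INCOMPLETE;            /* incomplete, 2 bytes so far */

  return EUCJP_INVALID;                 /* 0x8F followed by another value */
}
```

For instance, the single byte 0xA4 is classified as incomplete, the pair
0xA4 0xA2 as complete-or-invalid, and the pair 0x8F 0xA1 as invalid.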

Similarly for each other encoding.

Bruno





