bug-sed

bug#37634: Non-charset characters are not recognized.


From: Assaf Gordon
Subject: bug#37634: Non-charset characters are not recognized.
Date: Sat, 5 Oct 2019 19:37:11 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

tag 37634 notabug
close 37634
stop

Hello,

On 2019-10-05 8:26 a.m., address@hidden wrote:
> sed doesn't recognize non-charset characters as . (dot), eg with
> LC_CTYPE=en_US.UTF-8:
>
> # printf "ABCD\n" | sed 's/B.*C//'
> Output: AD
> # printf "AB\x8eCD\n" | sed 's/B.*C//'
> Output: AB�CD

That is a correct observation, and the behavior is by design.
POSIX mandates that "." can only match valid characters in the current locale.
As you observed, "\x8e" by itself is not a valid UTF-8 character.
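
One way to confirm that a stream contains such a byte is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is only a sketch: it assumes an iconv that rejects invalid input sequences (as glibc's does), and the variable name "valid" is made up for the illustration.

```shell
# \216 is the octal spelling of byte 0x8e (POSIX printf has no \x escape).
# iconv exits non-zero when the input is not valid in the source encoding.
if printf 'AB\216CD\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid=yes
else
    valid=no
fi
echo "valid UTF-8: $valid"
```

With valid input such as "ABCD" the same check is expected to succeed.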

> I also tried something like [^E]* instead of .* but that also does not work.
> I think sed should recognize \x8e is not a C or newline even though it's
> not in the character set.

Changing the current behavior would be backwards-incompatible and would
violate the standard, so it is not likely to happen
(but, if you want, you are welcome to suggest such changes to POSIX
at https://www.opengroup.org/austin/ ).

As a side note, this is exactly why GNU sed introduced the non-standard 'z' command to clear the pattern space: because a simple 's/.*//' won't
work with invalid UTF-8 input in UTF-8 locales.
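
A small sketch of the 'z' command, assuming GNU sed (the command is a GNU extension and is not available in other implementations). The substitution after 'z' is only there to make the emptied pattern space visible; the variable name "cleared" is made up for the illustration.

```shell
# 'z' empties the pattern space regardless of which bytes it holds,
# so the following s/^$/EMPTY/ sees an empty line and replaces it.
cleared=$(printf 'AB\216CD\n' | sed 'z;s/^$/EMPTY/')
echo "$cleared"    # prints: EMPTY
```

Unlike 's/.*//', this does not depend on every byte in the line being a valid character in the current locale.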

> With
>
> # printf "AB\x8eCD\n" | LC_CTYPE=C sed 's/B.*C//'
> Output: AD
>
> it works but, that's a bit non-intuitive, because normally one wants to
> have UTF8-charset and sed to function correctly

Here's the contradiction: "to function correctly" also assumes the input
is valid (=valid UTF-8 stream, without invalid multibyte sequences).

Technically speaking, streams with invalid UTF-8 bytes are not text -
they are considered "binary" data.
And as you wrote, forcing "C" locale allows you to handle individual byte values, including those that are invalid in other locales.
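
The byte-wise behavior can be sketched as follows. The example sets LC_ALL rather than LC_CTYPE because LC_ALL, when present in the environment, overrides LC_CTYPE; the variable names "out" and "out2" are made up for the illustration.

```shell
# \216 is the octal spelling of byte 0x8e (POSIX printf has no \x escape).
# In the C locale every byte is a character, so "." matches 0x8e.
out=$(printf 'AB\216CD\n' | LC_ALL=C sed 's/B.*C//')
echo "$out"     # prints: AD

# The same holds for a negated bracket expression such as [^E].
out2=$(printf 'AB\216CD\n' | LC_ALL=C sed 's/B[^E]*C//')
echo "$out2"    # prints: AD
```

The locale assignment applies only to that one sed invocation, so the rest of the shell session keeps its UTF-8 settings.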

> is there an
> other regex similar to . that can recognize such characters?

By definition, invalid multibyte sequences are not "characters",
and so will not be matched by sed's regular expressions under
multibyte locales.


As such, I'm closing this as "not a bug", but discussion can continue by replying to this thread.

regards,
 - assaf




