bug#37634: Non-charset characters are not recognized.

bug-sed

From:	Assaf Gordon
Subject:	bug#37634: Non-charset characters are not recognized.
Date:	Sat, 5 Oct 2019 19:37:11 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

tag 37634 notabug
close 37634
stop

Hello,

On 2019-10-05 8:26 a.m., address@hidden wrote:

sed doesn't recognize non-charset characters as . (dot), eg with
LC_CTYPE=en_US.UTF-8:

# printf "ABCD\n" | sed 's/B.*C//'
Output: AD
# printf "AB\x8eCD\n" | sed 's/B.*C//'
Output: AB�CD


That is a correct observation, and it is by design.

POSIX mandates that "." can only match valid characters in the current locale.

As you observed, "\x8e" by itself is not a valid UTF-8 character.
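
One way to check whether a stream is valid UTF-8 before handing it to sed is to pass it through iconv. This is only a sketch, assuming an iconv utility that exits non-zero on an illegal input sequence (glibc's does). The helper name is_valid_utf8 is made up for illustration and is not part of sed. The byte is written in octal (\216 = 0x8e) because not every printf understands \x escapes.

```shell
# Hypothetical helper: succeeds only if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Plain ASCII is valid UTF-8.
if printf 'ABCD\n' | is_valid_utf8; then r1=valid; else r1=invalid; fi

# A lone 0x8e byte is a continuation byte with no lead byte.
if printf 'AB\216CD\n' | is_valid_utf8; then r2=valid; else r2=invalid; fi

echo "$r1 $r2"
```

If the second check reports "invalid", the input is not text in a UTF-8 locale, and "." will not match the offending byte.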

I also tried something like [^E]* instead of .* but that also does not work.
I think sed should recognize \x8e is not a C or newline even though it's
not in the character set.

Changing the current behavior will be backwards-incompatible and
standard-breaking, and is not likely to happen
(but, if you want, you are welcome to suggest such changes to POSIX
at https://www.opengroup.org/austin/ ).

As a side note, this is exactly why GNU sed introduced the non-standard
'z' command to clear the pattern space: because a simple 's/.*//' won't
work with invalid UTF-8 input in UTF-8 locales.
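
A small sketch of the 'z' command, assuming GNU sed (other sed implementations may not have it). The variable name z_out and the EMPTY marker are only for illustration. The byte is written in octal (\216 = 0x8e) so that the example does not depend on printf supporting \x.

```shell
# 'z' empties the pattern space without going through the regex matcher,
# so the invalid byte does not get in the way. The following s command
# then sees an empty line and replaces it with a visible marker.
z_out=$(printf 'AB\216CD\n' | sed 'z;s/^$/EMPTY/')
echo "$z_out"
```

The marker is printed because the pattern space really is empty after 'z', whatever bytes it held before.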

With

# printf "AB\x8eCD\n" | LC_CTYPE=C sed 's/B.*C//'
Output: AD

it works but, that's a bit non-intuitive, because normally one wants to
have UTF8-charset and sed to function correctly


Here's the contradiction: "to function correctly" also assumes the input
is valid (=valid UTF-8 stream, without invalid multibyte sequences).

Technically speaking, streams with invalid UTF-8 bytes are not text -
they are considered "binary" data.

And as you wrote, forcing "C" locale allows you to handle individual
byte values, including those that are invalid in other locales.
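
As a sketch of that byte-oriented approach: the command below uses LC_ALL rather than LC_CTYPE, on the assumption that LC_ALL may already be set in the environment, in which case it would override LC_CTYPE. The variable name c_out is only for illustration, and \216 is the octal form of 0x8e.

```shell
# In the C locale every byte is a character, so ".*" spans the 0x8e byte.
c_out=$(printf 'AB\216CD\n' | LC_ALL=C sed 's/B.*C//')
echo "$c_out"
```

The locale override applies only to this one sed invocation; the rest of the pipeline keeps the UTF-8 locale.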

is there an
other regex similar to . that can recognize such characters?


By definition, invalid multibyte sequences are not "characters",
and so will not be recognized by sed's regular expression under
multibyte locales.

As such, I'm closing this as "not a bug" but discussion can continue by
replying to this thread.


regards,
 - assaf
