UTF8 above U+10FFFF treated inconsistently
From: Jason C. Kwan
Subject: UTF8 above U+10FFFF treated inconsistently
Date: Fri, 10 Sep 2021 04:14:46 +0000 (UTC)
The earliest specs of UTF8 allowed codes of up to 6 bytes. However, the
Unicode Consortium has since amended that to only spec it up to U+10FFFF, or
1,114,111 in decimal, which is still the limit as of Unicode 13. The "invalid
UTF8" in question is \366\254\271\230, with a hypothetical Unicode integer
value of 1,756,760, or U+1ACE58 in hex:
$ echo ' 06*(64^3) + 44*(64^2) + 57*(64^1) + 24*(64^0)' | bc
1756760
$ echo 'obase=16; 1756760' | bc
1ACE58
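For anyone without bc handy, the same arithmetic can be sketched in plain POSIX shell. This is only an illustration: the helper name decode4 is made up here and is not part of gawk or any tool mentioned in this message. It does no validation of the lead or continuation bytes; it just masks off the marker bits and combines the payload bits.

```shell
# decode4 (hypothetical helper): combine the payload bits of a 4-byte
# UTF-8-style sequence. Arguments are the four bytes in hex, without "0x".
decode4() {
    b1=$((0x$1)); b2=$((0x$2)); b3=$((0x$3)); b4=$((0x$4))
    # Lead byte 11110xxx carries 3 bits; each continuation 10xxxxxx carries 6.
    echo $(( ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6) | (b4 & 0x3F) ))
}

cp=$(decode4 F6 AC B9 98)
printf '%d U+%X\n' "$cp" "$cp"     # prints: 1756760 U+1ACE58

# Anything above 0x10FFFF (1114111) is outside the current UTF-8 range.
[ "$cp" -gt $((0x10FFFF)) ] && echo "above U+10FFFF"
```

The highest sequence still in range, F4 8F BF BF, comes out as 1114111 with the same helper.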
I'd imagine the issue might impact any hypothetical character from U+110000 to
the end of the 6-byte spec.
Some parts of gawk are correct, such as failing ($0 ~ /^.*$/), failing
match( ) on the same criteria, sprintf( ) spitting out one character at a time
(sprintf("%.1s") dumps out the first item: either a multi-byte UTF8 character,
if it's a well-formed sequence, or just the first byte of any nature, ASCII or
8-bit), and also properly splitting it into a 9-cell array in split( ).
I haven't tested gensub( ), but at least sub( ) and gsub( ) show inconsistent
treatment (row 14 below): gsub( ) splits the line into 6 elements, as if the 4
bytes \xF6 \xAC \xB9 \x98 together comprise a valid UTF8 character. As a
result, length( ) properly errors out, because gsub( ) gave the illusion that
the byte sequence is well-formed UTF8 that's safe for length( ) to measure
directly.
$ gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 | gsed -n
'562368p' | gawkfx -e '{ s0=$0; k=strdump($0,"x"); gsub(/[%]/," \\x",k);
print strdump($0) " :: " k " :: valid end-to-end ? " ($0 ~ /^.*$/) } { bytes =
match($0, /$/); while (s0!="") { print sub("^"sprintf("%.1s",s0),"",s0) " --> "
strdump(s0) ; }; print (s0=="") } END { print NR, sum, bytes; print " s0 : ["
s0 "]"; orig0=$0; gsub(/./, "< & >"); gsub(/\037/, "\\037"); gsub(/[\000]/,
"\\000"); print ; print length(orig9) }' | gcat -n
gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
     1  \073\145\037\366\254\271\230\131\000 :: \x3B \x65 \x1F \xF6 \xAC \xB9 \x98 \x59 \x00 :: valid end-to-end ? 0
     2  1 --> \145\037\366\254\271\230\131\000
     3  1 --> \037\366\254\271\230\131\000
     4  1 --> \366\254\271\230\131\000
     5  1 --> \254\271\230\131\000
     6  1 --> \271\230\131\000
     7  1 --> \230\131\000
     8  1 --> \131\000
     9  1 --> \000
    10  1 -->
    11  1
    12  1==10
    13   s0 : []
    14  < ; >< e >< \037 >< ???? >< Y >< \000 >
    15  9
One can verify this with a regex constant that hard-codes the UTF8 spec
(explicitly skipping the high and low surrogates reserved for UTF16). A
properly formed sequence will result in 0 bytes reported after the gsub( )
instead of 4:
gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 | gsed -n '562368p'
| gawk -e '{ print match($0,/$/) -1 ;
gsub(sprintf("[%c-%c%c-%c%c-%c]",0x00,0x7F,0x80,0xD7FF,0xE0000,0x10FFFF),"");
print match($0,/$/) -1 }'
9
4
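The range test that such a regex is meant to encode can also be written out on integer code points. A rough sketch in plain POSIX shell, assuming the usual definition of a Unicode scalar value (0 through 0x10FFFF, minus the surrogate block 0xD800 through 0xDFFF). The function name is_scalar_value is hypothetical, not something gawk provides.

```shell
# is_scalar_value (hypothetical helper): succeed if the integer argument
# (decimal or 0x-prefixed hex) is a Unicode scalar value.
is_scalar_value() {
    _v=$(($1))
    # Must lie within 0..0x10FFFF.
    [ "$_v" -ge 0 ] && [ "$_v" -le $((0x10FFFF)) ] || return 1
    # Must not fall inside the UTF-16 surrogate block 0xD800..0xDFFF.
    [ "$_v" -lt $((0xD800)) ] || [ "$_v" -gt $((0xDFFF)) ]
}

for x in 0x59 0xD800 0x10FFFF 0x1ACE58; do
    if is_scalar_value "$x"; then
        echo "$x valid"
    else
        echo "$x invalid"
    fi
done
```

Under that assumption, 0x59 and 0x10FFFF pass, while 0xD800 and the 0x1ACE58 value from this report do not.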