Re: UTF8 above U+10FFFF treated inconsistently
From: arnold
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Mon, 27 Dec 2021 08:15:43 -0700
User-agent: Heirloom mailx 12.5 7/5/10
Hello.
Please see https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00080.html
in which I state:
1. It's a libc issue, there is nothing I can do, and
2. Since you are on a Mac, you are NOT using the GNU C library.
You may wish to report a bug to the GLIBC developers; that would be
nice of you, but it won't help you as long as you are using Mac OS.
I do not plan to try to report anything to the GLIBC developers;
doing so is a waste of time that I don't have to spare.
Please do not continue to report this issue.
Best wishes,
Arnold
"Jason C. Kwan" <jasonckwan@yahoo.com> wrote:
> Hi
>
> To follow up on the previous report, even in the latest version of gawk, I'm
> noticing the error. Here's the code for full replication of the issue. The
> test string, on top of valid ASCII and 2 valid Unicode characters (one
> 2-byte, one 3-byte), also intentionally includes
>
> * a 4-byte sequence that would encode U+110000 if that were a valid code
> point (it's 1 over the max of U+10FFFF)
>
> * a 4-byte Unicode-look-alike sequence, but definitely invalid as it
> begins with \366; its hypothetical UTF-8 code point would be U+1987FF,
> ord# 1,673,215
> * a 3-byte sequence that resides within the UTF-16 surrogates region
>
> * a 3rd extra continuation byte right after a valid 2-byte sequence
> * a 2-byte sequence supposed to represent U+0088 (\xC2\x88, \302\210),
> but intentionally using the earlier, Unicode-invalid lead byte \300
> and finally,
> * a short-changed 3-byte sequence that is missing the second continuation
> byte
> The 2 valid multi-byte UTF8 code-points inside the string are U+076D (ARABIC
> LETTER SEEN WITH TWO DOTS VERTICALLY ABOVE) and U+B000 (Korean Hangul
> Syllable Ggwem)
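
The code points named for these sequences can be reproduced with plain bit arithmetic. The sketch below only strips the UTF-8 framing bits and prints the payload; it performs no validation at all. raw_codepoint is a hypothetical helper name, and the sketch assumes a POSIX shell whose arithmetic accepts octal and hex constants.

```shell
#!/bin/sh
# Print the payload bits of a UTF-8-shaped byte sequence as U+XXXX.
# No validation is done: overlong forms, surrogates and out-of-range
# values are decoded as if they were legal.
# Arguments are the bytes as C-style octal constants (leading 0),
# matching the \ooo escapes used in the test string.
# raw_codepoint is a hypothetical name, not part of gawk or libc.
raw_codepoint() {
    case $# in
        2) cp=$(( (($1 & 0x1F) << 6) | ($2 & 0x3F) )) ;;
        3) cp=$(( (($1 & 0x0F) << 12) | (($2 & 0x3F) << 6) | ($3 & 0x3F) )) ;;
        4) cp=$(( (($1 & 0x07) << 18) | (($2 & 0x3F) << 12) | (($3 & 0x3F) << 6) | ($4 & 0x3F) )) ;;
        *) echo "usage: raw_codepoint BYTE BYTE [BYTE [BYTE]]" >&2; return 1 ;;
    esac
    printf 'U+%04X\n' "$cp"
}

raw_codepoint 0353 0200 0200        # U+B000   (valid, 3 bytes)
raw_codepoint 0335 0255             # U+076D   (valid, 2 bytes)
raw_codepoint 0364 0220 0200 0200   # U+110000 (one past U+10FFFF)
raw_codepoint 0366 0230 0237 0277   # U+1987FF (lead byte \366 is never valid)
raw_codepoint 0355 0271 0272        # U+DE7A   (inside the surrogate range)
```

Run the same way, `raw_codepoint 0300 0210` prints U+0008: the payload bits of \300\210 are those of U+0008 rather than U+0088, although the sequence is invalid UTF-8 under either reading.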
> The test code (also included as attachment, along with my zsh's screen
> outputs) is included below
>
> A fully functional hex and octal encoder was included for your convenience.
> The correct # of characters is 17, as confirmed by gnu-wc. Byte count is 36.
> However, as you can see, each function reports something different, and
> they frequently do not agree with each other.
> split() is now correct at 33.
> However, length() and gsub(/./,"&") should be 17 instead of 33 and 20,
> respectively.
> As for match($0,/.*/), it's supposed to start at the first position of a
> valid code point, and should stop at the last valid code point that's
> contiguous from RSTART. If I'm counting correctly, it should end at
> capital letter "A" (\101 :: ord 65 :: \x41), so RSTART should be 1, and
> RLENGTH should be 7. However, it currently reports 25.
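
The expected figures quoted above (36 bytes, 17 characters, a leading run of 7) can be cross-checked without involving any libc multibyte routine. The sketch below is one way to do it, assuming the strict UTF-8 rules of RFC 3629 and a POSIX od and awk; it works on raw byte values, so the locale does not matter. utf8_census is a hypothetical helper name, and counting each undecodable byte as one invalid unit is a policy choice of this sketch, not something gawk or glibc is known to do.

```shell
#!/bin/sh
# Classify stdin under strict UTF-8 (RFC 3629): no overlong forms,
# no surrogates, nothing above U+10FFFF.
# utf8_census is a hypothetical helper name.
# Policy assumption: a byte that cannot start or complete a valid
# sequence is counted as exactly one invalid byte.
utf8_census() {
    od -An -v -tu1 | LC_ALL=C awk '
    { for (i = 1; i <= NF; i++) b[++n] = $i + 0 }   # collect byte values
    END {
        i = 1; valid = 0; invalid = 0; prefix = -1
        while (i <= n) {
            c = b[i]; len = 0; lo = 128; hi = 191
            # Sequence length and allowed range of the second byte.
            if (c < 128) len = 1
            else if (c >= 194 && c <= 223) len = 2
            else if (c >= 224 && c <= 239) { len = 3; if (c == 224) lo = 160; if (c == 237) hi = 159 }
            else if (c >= 240 && c <= 244) { len = 4; if (c == 240) lo = 144; if (c == 244) hi = 143 }
            ok = (len > 0 && i + len - 1 <= n)
            for (k = 1; ok && k < len; k++) {
                x = b[i + k]
                if (k == 1) { if (x < lo || x > hi) ok = 0 }
                else if (x < 128 || x > 191) ok = 0
            }
            if (ok) { valid++; i += len }
            else {
                # Remember how many valid characters preceded the first bad byte.
                if (prefix < 0) prefix = valid
                invalid++; i++
            }
        }
        if (prefix < 0) prefix = valid
        printf "bytes=%d valid_chars=%d invalid_bytes=%d leading_valid_chars=%d\n", n + 0, valid, invalid, prefix
    }'
}

# The test string from the report, written with the same octal escapes.
printf 'apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR\355\271\272z\335\255\232F\300\210Q\343\207W' | utf8_census
```

For the test string this prints bytes=36 valid_chars=17 invalid_bytes=16 leading_valid_chars=7. Note that 17 + 16 = 33, which would account for the split() figure if every invalid byte is treated as its own element; that reading is an inference, not something confirmed here.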
> if you run this command
> gsub( /.+/, "\f&\f") gsub(/[\f]+/, "\f")
> then it's obvious where gsub() is counting incorrectly :
> 4
> apple뀀A????XYZ????JR???zݭ ?
> F ?? Q
> ? W
> It's clumping multiple invalid code points all within the first group. In
> another view:
> gsub(/.../,"\f&\f"); app le뀀 A????X YZ????
> JR??? zݭ?F??Q?W
> It's supposed to add the vertical form feeds only when it can find 3
> consecutive valid code points.
> The bytes between A and X aren't valid, so those should've been grouped
> together; ditto for the 3rd item after YZ, after JR, and the tail group is
> just clumped together.
> ::::====:::::======:::====
> [ The code also prints itself on the terminal, as well as the test string,
> a full cell-by-cell display of what array splitting looks like, and finally
> a full printout of the hex and octal mapping tables to ensure they're
> accurately mapping the 8 bytes ]
>
>
> gprintf '\33c\e[3J'; echo;
> str1="apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR\355\271\272z\335\255\232F\300\210Q\343\207W";
> gprintf '%s\n' "test string :: ${str1}"; echo; gprintf "${str1}" | gwc -lcm
> ; echo ; cmd=' gprintf "${str1}" | gawk -e '\''function hexencode(str,chr) {
> for(chr in b2hex) { if (chr!~/[[:alnum:]%\\]/) { gsub(chr,b2hex[chr],str) }
> }; return str } function octencode(str,chr) { gsub(/\\/,b2oct["\\\\"],str);
> gsub(/[0-7]/,"\\06&",str); for(chr in b2oct) { if(chr!~/[0-7\\]/) {
> gsub(chr,b2oct[chr],str) str } }; return str } BEGIN {
> offset=-4^4;for(x=0;x<256;x++) {
> byte=sprintf("%c",x+offset);b2hex[byte]=sprintf("\\x%.2X",x);b2oct[byte]=sprintf("\\%03o",x)
> }; spc1="/\\^[]";spc2="~!@#%&_-{}:;\42\47\140 <>,$.|()*+=?";
> for(x=length(spc1);x;x--) { byte=substr(spc1,x,1);
> b2hex[("\\"(byte))]=b2hex[byte]; b2oct[("\\"(byte))]=b2oct[byte]; delete
> b2hex[byte]; delete b2oct[byte] }; for(x=length(spc2);x;x--) {
> byte=substr(spc2,x,1); b2hex[("["(byte)"]")]=b2hex[byte];
> b2oct[("["(byte)"]")]=b2oct[byte]; delete b2hex[byte]; delete b2oct[byte] } }
> function printtables() { PROCINFO["sorted_in"]="@val_num_asc";cnt=4; for(x in
> b2oct) { printf(" %-4s:%s:%s |%s",(x~/[\040-\176]/) ? x :
> "[.]",b2hex[x],b2oct[x],--cnt?"":ORS); if(!cnt) { cnt=4 } } } {
> printf("%cinput :: |%s|%c%c non-ALNUM-hex :: %s%c%cfull-octal :: %s%c%c", 10,
> $0, 10, 10, hexencode($0), 10, 10, octencode($0), 10, 10); print "byte count
> via match($0,/$/)-1 :: " , match($0,/$/)-1; print "gsub(/./,\"&\") :: " ,
> gsub(/./,"&"); match($0,/.*/); print "match($0,/.*/) :: ",RSTART, RLENGTH;
> print "length() :: ",length($0); print "split to array using empty-RE :: ",
> nx=split($0, arr, //); print ORS; print "($0~/^.+$/) :: " ($0~/^.+$/); print
> ORS; print "match($0,/.?$/) :: ",match($0,/.?$/); print ORS;
> for(x=1;x<=nx;x++) { printf("array cell # [ %2d ] <| %-6s | %16s | %16s
> |>\n", x, xa = arr[x], hexencode(xa), octencode(xa)); xa=""} } END {
> printtables() } '\'' 2>&1 | gcat -n ; echo; uname -a; echo; locale; echo;
> gawk -V; echo '; echo $'\n'"command is :: "$'\n'$'\n'"${cmd}"$'\n'; eval
> "${cmd}"; echo
> And the system config is :
>
>
>
> Darwin JCK-MBP18-Retina-13.local 20.6.0 Darwin Kernel Version 20.6.0: Mon Aug
> 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64 x86_64
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=
> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
> Copyright (C) 1989, 1991-2021 Free Software Foundation.
>
>
> Thanks for your time,
> Jason
>
> On Saturday, October 2, 2021, 02:09:46 AM EDT, Nethox <nethox+awk@gmail.com>
> wrote:
>
>
> 2021-09-29T21:29:55-06:00, <arnold@skeeve.com>:
> > Asserts are for errors in code, not errors in data. mbrlen() has to
> > return an error to user code, not fail in an assertion.
> Yes. I meant the assert as a postcondition in the "recognized" cases, where
> glibc's full decoder/validator code should never reach with any of those 13
> invalid bytes.
>