Re: UTF8 above U+10FFFF treated inconsistently

bug-gawk

From:	arnold
Subject:	Re: UTF8 above U+10FFFF treated inconsistently
Date:	Mon, 27 Dec 2021 08:15:43 -0700
User-agent:	Heirloom mailx 12.5 7/5/10

Hello.

Please see https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00080.html
in which I state:

1. It's a libc issue, there is nothing I can do, and
2. Since you are on a Mac, you are NOT using the GNU C library.

You may wish to report a bug to the GLIBC developers; that would be
nice of you but it won't help you as long as you are using Mac OS.

I do not plan to try to report anything to the GLIBC developers;
doing so is a waste of time that I don't have to spare.

Please do not continue to report this issue.

Best wishes,

Arnold

"Jason C. Kwan" <jasonckwan@yahoo.com> wrote:

> Hi
>
> To follow up on the previous report: even in the latest version of gawk, I'm
> still noticing the error. Here's the code for full replication of the issue.
> The test string, on top of valid ASCII and 2 valid Unicode characters (one
> 2-byte, one 3-byte), also intentionally includes
>
>     * a 4-byte sequence that would encode U+110000 if that were a valid code
>       point (it's 1 over the max of U+10FFFF)
>     * a 4-byte Unicode look-alike sequence that is definitely invalid, as it
>       begins with \366; its hypothetical UTF-8 code point would be U+1987FF,
>       ord# 1,673,215
>     * a 3-byte sequence that resides within the UTF-16 surrogates region
>     * a 3rd, extra continuation byte right after a valid 2-byte sequence
>     * a 2-byte sequence supposedly representing U+0088 | \xC2\x88 | \302\210,
>       but intentionally using the earlier Unicode-invalid byte of \300
>
>   and finally,
>
>     * a short-changed 3-byte sequence that is missing the second continuation
>       byte
>
> The 2 valid multi-byte UTF-8 code points inside the string are U+076D (ARABIC
> LETTER SEEN WITH TWO DOTS VERTICALLY ABOVE) and U+B000 (Korean Hangul
> Syllable Ggwem).
>
> The test code (also included as an attachment, along with my zsh screen
> output) is included below.
>
> A fully functional hex and octal encoder is included for your convenience.
> The correct number of characters is 17, as confirmed by GNU wc. The byte
> count is 36. However, as you can see, each function reports something
> different, and they frequently do not agree with each other.
>
> split() is now correct at 33;
> however, length() and gsub(/./,"&") should be 17 instead of 33 and 20
> respectively.
>
> As for match($0,/.*/), it's supposed to start at the first position of a
> valid code point, and should stop at the last valid code point that's
> contiguous from RSTART. If I'm counting correctly, it should end at
> capital letter "A" ( \101 :: ord-65 :: \x41), so RSTART should be 1, and
> RLENGTH should be 7. However, it currently reports 25.
>
> If you run this command:
>  gsub(   /.+/, "\f&\f") gsub(/[\f]+/,    "\f")
> then it's obvious where gsub() is counting incorrectly:
> 4
> apple뀀A????XYZ????JR???zݭ                          ?                          
>  F                            ??                              Q               
>                 ?                                W                            
> It's clumping multiple invalid code points all within the first group. In
> another view:
> gsub(/.../,"\f&\f"); app   le뀀       A????X             YZ????                
>    JR???                        zݭ?F??Q?W                               
> It's supposed to add the vertical form feeds only when it can find 3
> consecutive valid code points.
> What lies between A and X isn't valid, so those should not have been grouped
> together; ditto for the 3rd item after YZ, and after JR, and the tail group
> is just clumped together.
> ::::====:::::======:::====
>
> [ The code also self-prints on the terminal, as well as the test string, a
> full cell-by-cell display of what the array splitting looks like, and
> finally, a full printout of the hex and octal mapping tables to ensure
> they're accurately mapping the 8 bytes. ]
>
>
> gprintf '\33c\e[3J'; echo; 
> str1="apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR\355\271\272z\335\255\232F\300\210Q\343\207W";
>  gprintf '%s\n' "test string :: ${str1}"; echo; gprintf "${str1}" | gwc -lcm 
> ; echo ; cmd=' gprintf "${str1}" | gawk -e '\''function hexencode(str,chr) { 
> for(chr in b2hex) { if (chr!~/[[:alnum:]%\\]/) { gsub(chr,b2hex[chr],str) } 
> }; return str } function octencode(str,chr) { gsub(/\\/,b2oct["\\\\"],str); 
> gsub(/[0-7]/,"\\06&",str); for(chr in b2oct) { if(chr!~/[0-7\\]/) { 
> gsub(chr,b2oct[chr],str) str } }; return str } BEGIN { 
> offset=-4^4;for(x=0;x<256;x++) { 
> byte=sprintf("%c",x+offset);b2hex[byte]=sprintf("\\x%.2X",x);b2oct[byte]=sprintf("\\%03o",x)
>  }; spc1="/\\^[]";spc2="~!@#%&_-{}:;\42\47\140 <>,$.|()*+=?"; 
> for(x=length(spc1);x;x--) { byte=substr(spc1,x,1); 
> b2hex[("\\"(byte))]=b2hex[byte]; b2oct[("\\"(byte))]=b2oct[byte]; delete 
> b2hex[byte]; delete b2oct[byte] }; for(x=length(spc2);x;x--) { 
> byte=substr(spc2,x,1); b2hex[("["(byte)"]")]=b2hex[byte]; 
> b2oct[("["(byte)"]")]=b2oct[byte]; delete b2hex[byte]; delete b2oct[byte] } } 
> function printtables() { PROCINFO["sorted_in"]="@val_num_asc";cnt=4; for(x in 
> b2oct) { printf(" %-4s:%s:%s |%s",(x~/[\040-\176]/) ? x : 
> "[.]",b2hex[x],b2oct[x],--cnt?"":ORS); if(!cnt) { cnt=4 } } } { 
> printf("%cinput :: |%s|%c%c non-ALNUM-hex :: %s%c%cfull-octal :: %s%c%c", 10, 
> $0, 10, 10, hexencode($0), 10, 10, octencode($0), 10, 10); print "byte count 
> via match($0,/$/)-1 :: " , match($0,/$/)-1; print "gsub(/./,\"&\") :: " , 
> gsub(/./,"&"); match($0,/.*/); print "match($0,/.*/) :: ",RSTART, RLENGTH; 
> print "length() :: ",length($0); print "split to array using empty-RE :: ", 
> nx=split($0, arr, //); print ORS; print "($0~/^.+$/) :: " ($0~/^.+$/); print 
> ORS; print "match($0,/.?$/) :: ",match($0,/.?$/); print ORS; 
> for(x=1;x<=nx;x++) { printf("array cell # [ %2d ] <| %-6s | %16s | %16s 
> |>\n", x, xa = arr[x], hexencode(xa), octencode(xa)); xa=""} } END { 
> printtables() } '\'' 2>&1 | gcat -n ; echo; uname -a; echo; locale; echo; 
> gawk -V; echo '; echo $'\n'"command is :: "$'\n'$'\n'"${cmd}"$'\n'; eval 
> "${cmd}"; echo
> And the system config is :
>
>
>
> Darwin JCK-MBP18-Retina-13.local 20.6.0 Darwin Kernel Version 20.6.0: Mon Aug 
> 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64 x86_64
>
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=
>
> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
> Copyright (C) 1989, 1991-2021 Free Software Foundation.
>
>
> Thanks for your time,
> Jason
>
> On Saturday, October 2, 2021, 02:09:46 AM EDT, Nethox <nethox+awk@gmail.com> 
> wrote: 
>
>
> 2021-09-29T21:29:55-06:00, <arnold@skeeve.com>:
> > Asserts are for errors in code, not errors in data. mbrlen() has to
> > return an error to user code, not fail in an assertion.
> Yes. I meant the assert as a postcondition in the "recognized" cases, where
> glibc's full decoder/validator code should never reach with any of those 13
> invalid bytes.
>
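
For readers who want to check the figures in the quoted report (36 bytes, 17 valid characters, a valid leading run of 7) without involving gawk's locale handling or any libc decoder, here is a minimal sketch. It assumes a POSIX sh with od and awk available. The helper name utf8_report and the strict well-formedness rules it applies (no overlong forms, no UTF-16 surrogates, nothing above U+10FFFF) are assumptions of this sketch, not code taken from gawk or glibc.

```shell
#!/bin/sh
# Test string from the report, written with octal escapes (36 bytes).
str='apple\353\200\200A\364\220\200\200XYZ\366\230\237\277JR\355\271\272z\335\255\232F\300\210Q\343\207W'

# utf8_report is a hypothetical helper name. od dumps the bytes as decimal
# numbers, so awk only ever sees ASCII digits and the current locale
# cannot influence the result.
utf8_report() {
    printf "$1" | od -An -v -tu1 | awk '
    # True if byte k exists and is a continuation byte (0x80..0xBF).
    function cont(k) { return k <= n && b[k] >= 128 && b[k] <= 191 }

    { for (i = 1; i <= NF; i++) b[++n] = $i + 0 }   # collect all byte values

    END {
        i = 1
        while (i <= n) {
            c = b[i]; len = 0
            if (c < 128) len = 1                      # ASCII
            else if (c >= 194 && c <= 223) len = 2    # lead bytes C2..DF
            else if (c >= 224 && c <= 239) len = 3    # lead bytes E0..EF
            else if (c >= 240 && c <= 244) len = 4    # lead bytes F0..F4
            ok = (len > 0)
            for (k = 1; ok && k < len; k++)
                if (!cont(i + k)) ok = 0              # missing continuation
            if (ok && c == 224 && b[i+1] < 160) ok = 0   # overlong 3-byte form
            if (ok && c == 237 && b[i+1] > 159) ok = 0   # UTF-16 surrogate
            if (ok && c == 240 && b[i+1] < 144) ok = 0   # overlong 4-byte form
            if (ok && c == 244 && b[i+1] > 143) ok = 0   # above U+10FFFF
            if (ok) { valid++; if (!broken) prefix++; i += len }
            else    { broken = 1; i++ }               # skip one bad byte
        }
        printf "bytes=%d valid=%d prefix=%d\n", n, valid, prefix
    }'
}

utf8_report "$str"
```

On the test string this prints bytes=36 valid=17 prefix=7, which matches the byte count, the character count, and the expected RLENGTH given in the report. Because it treats each bad byte as a skipped unit, it says nothing about how any particular libc groups invalid bytes.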
