bug-gawk

From: arnold
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Thu, 31 Aug 2023 22:28:27 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hi Ed.

This was a really interesting corner case. Good catch. The fix
is attached and will be in git eventually.

Thanks for the report!

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> Configuration Information [Automatically generated, do not change]:
> Machine: x86_64
> OS: cygwin
> Compiler: gcc
> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
>   -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
>   --param=ssp-buffer-size=4
>   -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
>   -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
>   -DNDEBUG
> uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64 
> 2023-08-17 17:02 UTC x86_64 Cygwin
> Machine Type: x86_64-pc-cygwin
>
> Gawk Version: 5.2.2
>
> Attestation 1:
>          I have read 
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>          Yes
>
> Attestation 2:
>          I have not modified the sources before building gawk.
>          True
>
> Description:
>          Different string handling functions produce different results 
> for multi-byte characters.
>
> Repeat-By:
>          Without "-b":
>
>          $ awk 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>          1
>          1
>          4
>
>          Note that length() thinks that string is 1 character, the first 
> call to match() agrees, but then the 2nd call to match() thinks it's 3 
> characters (since RSTART tells us the "end of string" is at position 4).
>
>          Now with "-b" ("Cause gawk to treat all input data as 
> single-byte characters" per 
> https://www.gnu.org/software/gawk/manual/gawk.html#Options):
>
>          $ awk -b 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
>          3
>          3
>          4
>
>          Note that length() now thinks that string is 3 characters, the 
> first call to match() agrees again, and then the 2nd call to match() now 
> also agrees.
>
>          Per the manual, "in gawk, length(), substr(), split(), match()
> and the other string functions ... all work in terms of characters in
> the local character set, and not in terms of bytes." (from
> https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html),
> so I was expecting more consistent results between those 3 function
> calls, and that they'd basically always agree with length()'s results.
> It may just be match() that has an issue; I haven't noticed a problem
> with any other function, but I haven't been looking for one.
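
For readers following along, the byte/character arithmetic behind the
report can be sketched outside gawk. This is only an illustration in
Python, not gawk's implementation and not the attached fix. It assumes
a UTF-8 locale and assumes that a consistent match(str, /$/) would
report RSTART as length(str) + 1, as it does in the -b run above. The
variable names are made up for the example.

```python
# The octal escapes from the report, "\342\200\257", are the bytes E2 80 AF.
raw = b"\342\200\257"

# Decoded as UTF-8, those three bytes form a single code point,
# U+202F (NARROW NO-BREAK SPACE).
text = raw.decode("utf-8")

byte_len = len(raw)    # corresponds to length(str) with -b
char_len = len(text)   # corresponds to length(str) without -b

# Assumption: an end-of-string match starts one position past the last
# unit, so the position depends on whether units are bytes or characters.
end_pos_bytes = byte_len + 1
end_pos_chars = char_len + 1

print(byte_len, char_len, end_pos_bytes, end_pos_chars)
```

Under those assumptions the character-based end position is 2, whereas
the run without -b reported 4, which is the byte-based position. That
mismatch is the inconsistency described in the report.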

Attachment: fix.diff
Description: Text document

