inconsistency with counting characters vs bytes for multi-byte characters

bug-gawk

From:	Ed Morton
Subject:	inconsistency with counting characters vs bytes for multi-byte characters
Date:	Thu, 31 Aug 2023 19:41:04 -0500
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc

Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1 -DNDEBUG

uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64 2023-08-17 17:02 UTC x86_64 Cygwin

Machine Type: x86_64-pc-cygwin

Gawk Version: 5.2.2

Attestation 1:
        I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
        Yes

Attestation 2:
        I have not modified the sources before building gawk.
        True

Description:

Different string handling functions produce different results for multi-byte characters.


Repeat-By:
        Without "-b":

$ awk 'BEGIN{str="\342\200\257"; print length(str);match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'

        1
        1
        4

Note that length() thinks the string is 1 character and the first call to match() agrees, but the 2nd call to match() thinks it's 3 characters (since RSTART tells us the "end of string" is at position 4).

Now with "-b" ("Cause gawk to treat all input data as single-byte characters" per https://www.gnu.org/software/gawk/manual/gawk.html#Options):

$ awk -b 'BEGIN{str="\342\200\257"; print length(str);match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'

        3
        3
        4

Note that length() now thinks the string is 3 characters, the first call to match() agrees again, and the 2nd call to match() now also agrees.

Per the manual, "in gawk, length(), substr(), split(), match() and the other string functions ... all work in terms of characters in the local character set, and not in terms of bytes" (from https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html), so I was expecting more consistent results between those 3 function calls, and that they'd basically all always agree with length()'s result. It may just be match() that has an issue; I haven't noticed a problem with any other function, but I haven't been looking for one.
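
For reference, a minimal sketch for confirming what the test string above actually contains. It assumes a POSIX shell with printf, wc and od available; str_bytes is a hypothetical helper name introduced only for this illustration and is not part of the original report. The three octal escapes are the bytes e2 80 af, which is the UTF-8 encoding of the single character U+202F, so 1 is the character count and 3 is the byte count.

```shell
# Hypothetical helper: emit the test string from the report.
# POSIX printf expands \ddd octal escapes in its format string.
str_bytes() { printf '\342\200\257'; }

# Count the bytes; this does not depend on the locale.
str_bytes | wc -c

# Dump the bytes in hex to see the UTF-8 sequence.
str_bytes | od -An -tx1
```

The byte count and the hex dump are locale-independent, unlike the gawk results without "-b", which depend on the character set of the current locale.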
