bug-gawk

inconsistency with counting characters vs bytes for multi-byte characters


From: Ed Morton
Subject: inconsistency with counting characters vs bytes for multi-byte characters
Date: Thu, 31 Aug 2023 19:41:04 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1 -DNDEBUG
uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64 2023-08-17 17:02 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin

Gawk Version: 5.2.2

Attestation 1:
        I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
        Yes

Attestation 2:
        I have not modified the sources before building gawk.
        True

Description:
        Different string handling functions produce different results for multi-byte characters.

Repeat-By:
        Without "-b":

        $ awk 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
        1
        1
        4

        Note that length() reports the string as 1 character and the first call to match() agrees (RLENGTH is 1), but the 2nd call to match() implies the string is 3 characters long, since RSTART puts the "end of string" at position 4.
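For reference, the three octal escapes in the test string are the bytes E2 80 AF, which is the UTF-8 encoding of the single character U+202F (NARROW NO-BREAK SPACE). A minimal sketch to confirm the byte sequence, assuming a POSIX shell with od and tr available; show_bytes is a hypothetical helper name used only for this illustration:

```shell
# Hypothetical helper: dump the bytes produced by the octal escapes as hex,
# with od's spacing and newline stripped so only the hex digits remain.
show_bytes() {
    printf '\342\200\257' | od -An -tx1 | tr -d ' \n'
}

show_bytes; echo
```

This prints e280af: three bytes encoding one character, which is why the character-based and byte-based counts are 1 and 3.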

        Now with "-b" ("Cause gawk to treat all input data as single-byte characters" per https://www.gnu.org/software/gawk/manual/gawk.html#Options):

        $ awk -b 'BEGIN{str="\342\200\257"; print length(str); match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
        3
        3
        4

        Note that length() now reports the string as 3 characters, the first call to match() agrees again (RLENGTH is 3), and the 2nd call to match() now agrees as well.

        Per the manual "in gawk, length(), substr(), split(), match() and the other string functions ... all work in terms of characters in the local character set, and not in terms of bytes." (from https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html), so I was expecting consistent results from those 3 function calls, with all of them agreeing with length()'s result. The problem may be limited to match(). I haven't noticed it with any other function, but I haven't been looking for it.
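The comparison above can be condensed into one line of output. The following is a minimal sketch, assuming a POSIX awk on the PATH; end_positions is a hypothetical helper name, and LC_ALL=C is set only so that the output is reproducible regardless of which locales are installed. It prints length(), the end-of-string position reported by match(str, /$/), and the position derived from length(str) + 1, which by construction is always in the same units as length():

```shell
# Hypothetical helper (not part of gawk): compare the end-of-string position
# reported by match() with the one derived from length().
# LC_ALL=C forces single-byte semantics so the output is reproducible.
end_positions() {
    LC_ALL=C awk 'BEGIN {
        str = "\342\200\257"      # bytes E2 80 AF (UTF-8 for U+202F)
        match(str, /$/)           # empty match at the end of the string
        printf "length=%d end_by_match=%d end_by_length=%d\n", length(str), RSTART, length(str) + 1
    }'
}

end_positions
```

In the C locale this prints length=3 end_by_match=4 end_by_length=4, so the two end positions agree. Removing LC_ALL=C and running under a UTF-8 locale is the case reported above, where length() is 1 but match() puts the end of the string at position 4.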

