inconsistency with counting characters vs bytes for multi-byte characters
From: Ed Morton
Subject: inconsistency with counting characters vs bytes for multi-byte characters
Date: Thu, 31 Aug 2023 19:41:04 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
--param=ssp-buffer-size=4
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
-DNDEBUG
uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64
2023-08-17 17:02 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin
Gawk Version: 5.2.2
Attestation 1:
I have read
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
Yes
Attestation 2:
I have not modified the sources before building gawk.
True
Description:
Different string handling functions produce different results
for multi-byte characters.
Repeat-By:
Without "-b":
$ awk 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
1
1
4
Note that length() reports the string as 1 character and the first
call to match() agrees, but the 2nd call to match() treats it as 3
characters (RSTART tells us the "end of string" is at position 4).
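For reference, the three octal escapes \342\200\257 are the bytes E2 80 AF, which is the UTF-8 encoding of the single character U+202F (NARROW NO-BREAK SPACE). Counted in characters the string has length 1 and its end is at position 2; counted in bytes it has length 3 and its end is at position 4. A minimal sketch that confirms the byte count with tools that do not depend on the locale (byte_count is a hypothetical helper name used only for this illustration, not part of gawk):

```shell
# byte_count is a hypothetical helper for this illustration, not a gawk feature.
# It prints the number of bytes in its first argument, ignoring the locale.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

# Build the same string as in the report: bytes E2 80 AF.
str=$(printf '\342\200\257')

byte_count "$str"; echo             # prints 3
printf '%s' "$str" | od -An -tx1    # dumps the individual bytes in hex
```

So a result of 4 from match(str,/$/) corresponds to byte counting (3 + 1), while length() returning 1 corresponds to character counting.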
Now with "-b" ("Cause gawk to treat all input data as
single-byte characters" per
https://www.gnu.org/software/gawk/manual/gawk.html#Options):
$ awk -b 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
3
3
4
Note that length() now reports the string as 3 characters, the
first call to match() agrees again, and the 2nd call to match() now
agrees as well.
Per the manual "in gawk, length(), substr(), split(), match()
and the other string functions ... all work in terms of characters in
the local character set, and not in terms of bytes." (from
https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html)
so I was expecting consistent results between those 3 function
calls, with all of them agreeing with length()'s result.
It may be only match() that has the issue. I haven't noticed a problem
with any other function, but I haven't been looking for one.
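The expectation above can be written as a check: whatever unit awk counts in, match(str,/$/) should set RSTART to length(str) + 1. The sketch below wraps the test case from this report in a shell function; check_positions is a hypothetical name, and any arguments are passed straight through to awk as options. The values it prints depend on the awk implementation, its version, and the current locale, so no output is shown here.

```shell
# check_positions is a hypothetical wrapper for this illustration.
# Arguments (for example -b, which is gawk-specific) are passed to awk.
check_positions() {
    awk "$@" 'BEGIN {
        str = "\342\200\257"           # same test string as in the report
        n = length(str)                # length in whatever unit awk uses
        match(str, /.+/); rl = RLENGTH # length of a match over the whole string
        match(str, /$/);  rs = RSTART  # position of the end of the string
        printf "length=%d RLENGTH=%d RSTART=%d expected_RSTART=%d\n", n, rl, rs, n + 1
        exit (rs != n + 1)             # non-zero exit status on disagreement
    }'
}

# Run in the current locale and flag a mismatch without aborting the shell.
check_positions || echo "RSTART disagrees with length + 1"
```

With gawk, running check_positions -b repeats the check in byte mode, which is the second case shown above.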