Question on using gsub() on UTF-8 strings
From: pjfarley3
Subject: Question on using gsub() on UTF-8 strings
Date: Sun, 23 Jun 2024 00:09:11 -0400
Environment: I am running the latest gawk 5.3.0-2 from ezwinports on an
up-to-date Win10 system and I do have the correct libgcc and stdc++
libraries in the same bin directory as the gawk executable.
gawk --version yields:
GNU Awk 5.3.0, API 4.0, (GNU MPFR 4.0.2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2023 Free Software Foundation.
I have a set of UTF-8 encoded text files where I need to do some character
analysis based on blank vs non-blank **characters** (not bytes).
To that end I used the following code to create vectors of zeroes and ones
corresponding to blank (actually "[:space:]") characters vs non-blank
characters.
gsub(/[^[:space:]]/, "1,", input_line)
gsub(/[[:space:]]/, "0,", input_line)
This works fine for pure-ASCII (single-byte) characters, but when presented
with UTF-8 multi-byte sequences, gsub seems to treat each byte of the UTF-8
characters as a separate character.
I tried the Windows "chcp 65001" setting, and also tried these
environment settings in a "test.cmd" batch file:
setlocal
set LANGUAGE=en_US.UTF-8
set LANG=%LANGUAGE%
set LC_ALL=%LANGUAGE%
chcp 65001
(test gawk commands here)
endlocal
Nothing I have tried so far seems to make gsub treat the multi-byte
sequences as a single logical character.
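One way to check whether gawk itself has picked up a multibyte locale, independent of gsub(), is to ask for the length of the bullet written as explicit bytes. This is only a diagnostic sketch; the function name utf8_aware is made up for illustration and is not part of gawk.

```awk
# Hypothetical helper (not part of gawk): returns 1 when gawk counts
# characters, 0 when it counts bytes.
function utf8_aware() {
    # "\342\200\242" is the three UTF-8 bytes E2 80 A2 (U+2022 BULLET) in octal.
    # length() is 1 in a UTF-8 locale and 3 when strings are handled as bytes.
    return length("\342\200\242") == 1
}

BEGIN {
    print "length of bullet = " length("\342\200\242")
    print (utf8_aware() ? "character mode" : "byte mode")
}
```

If this reports byte mode, the LANG/LC_ALL settings are presumably not taking effect for gawk, and the [:space:] bracket expressions will be applied byte by byte.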
Would someone please show me the correct magic invocation to make gsub
respect multi-byte UTF-8 sequences as single characters on a Windows
platform?
Test gawk program:
BEGIN {
    # Note: the first character is UTF-8 \xE2\x80\xA2 = U+2022 BULLET
    utfstr = "•  An "
    nutf = gsub(/[^[:space:]]/, "1,", utfstr)
    nutf += gsub(/[[:space:]]/, "0,", utfstr)
    print "gsubutf=" nutf ",utfstr=" utfstr
    ascstr = ".  An "   # here the BULLET is replaced by a period
    nasc = gsub(/[^[:space:]]/, "1,", ascstr)
    nasc += gsub(/[[:space:]]/, "0,", ascstr)
    print "gsubasc=" nasc ",ascstr=" ascstr
    getline             # reads the test file with a BULLET character
    ninp = gsub(/[^[:space:]]/, "1,", $0)
    ninp += gsub(/[[:space:]]/, "0,", $0)
    print "gsubinp=" ninp ",inpstr=" $0
    exit
}
Test gawk input file (one line with leading BULLET and trailing blank
character):
•  An 
Test output:
gsubutf=8,utfstr=1,1,1,0,0,1,1,0,
gsubasc=6,ascstr=1,0,0,1,1,0,
gsubinp=8,inpstr=1,1,1,0,0,1,1,0,
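The counts of 8 versus 6 show the three UTF-8 bytes of the bullet being matched one at a time. Until the locale question is sorted out, a possible workaround is to delete the UTF-8 continuation bytes first, so that each multi-byte character is represented by its lead byte alone. This is a sketch under assumptions, not a confirmed fix: char_mask is a hypothetical name, and the approach assumes that gawk is handling the data as bytes and that the input is valid UTF-8.

```awk
# Hypothetical helper: returns "1," for each non-blank character and "0," for
# each blank one, counting a UTF-8 multi-byte sequence once.
function char_mask(s) {
    # Only needed when gawk counts bytes: the 3-byte bullet then has length 3.
    if (length("\342\200\242") == 3) {
        # Drop continuation bytes (0x80-0xBF) so only the lead byte of each
        # multi-byte character remains. The regex is given as a string so it
        # is compiled only when this branch actually runs.
        gsub("[\200-\277]", "", s)
    }
    gsub(/[^[:space:]]/, "1,", s)
    gsub(/[[:space:]]/, "0,", s)
    return s
}

BEGIN {
    # bullet, two blanks, "An", one blank
    print char_mask("\342\200\242  An ")
}
```

One limitation of this design: in byte mode, non-ASCII blanks such as U+00A0 NO-BREAK SPACE are reduced to their lead byte and will most likely be counted as non-blank, so it only suits data whose blanks are all ASCII.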
Cmd.exe commands used to test:
gawk -f testgsub.awk testutf8.txt
gawk --posix -f testgsub.awk testutf8.txt
Nothing I have tried seems to work, so I appreciate any RTFM or help you can
offer.
Peter