help-gawk
Question on using gsub() on UTF-8 strings


From: pjfarley3
Subject: Question on using gsub() on UTF-8 strings
Date: Sun, 23 Jun 2024 00:09:11 -0400

Environment: I am running the latest gawk 5.3.0-2 from ezwinports on an
up-to-date Win10 system and I do have the correct libgcc and stdc++
libraries in the same bin directory as the gawk executable.

 

gawk --version yields:

 

GNU Awk 5.3.0, API 4.0, (GNU MPFR 4.0.2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2023 Free Software Foundation.

 

I have a set of UTF-8 encoded text files where I need to do some character
analysis based on blank vs non-blank **characters** (not bytes).

 

To that end I used the following code to create vectors of zeroes and ones
corresponding to blank (actually "[:space:]") characters vs non-blank
characters.

 

    gsub(/[^[:space:]]/, "1,", input_line)
    gsub(/[[:space:]]/, "0,", input_line)
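
To make the intent concrete, here is a minimal self-contained sketch of what those two calls do to a pure-ASCII line. The sample string and the counter variable n are made up for this illustration; they are not taken from my real data.

```awk
BEGIN {
    # Hypothetical sample: two non-blanks, two blanks, one non-blank, one blank
    input_line = "ab  c "

    # Pass 1: every non-blank character becomes "1,"
    n = gsub(/[^[:space:]]/, "1,", input_line)

    # Pass 2: every blank character becomes "0,".  The text inserted by
    # pass 1 contains no blanks, so only the original blanks are touched.
    n += gsub(/[[:space:]]/, "0,", input_line)

    print n, input_line    # prints: 6 1,1,0,0,1,0,
}
```

The order of the two passes matters: if the blanks were replaced first, the inserted "0," text would itself be matched by the non-blank pattern in the second pass.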

 

This works fine for pure-ASCII (single-byte) characters, but when presented
with UTF-8 multi-byte sequences, gsub seems to treat each byte of the UTF-8
characters as a separate character.
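
A quick way to check which mode gawk is actually running in, sketched under the assumption that length() counts characters in a multibyte locale and bytes otherwise (which is how I read the gawk manual). The string is written with octal escapes so the result does not depend on how the script file itself is encoded; the variable names are only for this illustration.

```awk
BEGIN {
    # The three UTF-8 bytes E2 80 A2 (U+2022 BULLET), written as octal escapes
    bullet = "\342\200\242"

    n = length(bullet)
    if (n == 1)
        print "length=1: gawk sees one multibyte character"
    else
        print "length=" n ": gawk is counting bytes"
}
```

If this reports 3, gawk is not using a UTF-8 locale at all, and gsub() splitting the BULLET into three separate matches would be the expected consequence rather than a gsub() problem as such.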

 

I tried the Windows "chcp 65001" code-page setting, and also tried these
environment settings in a "test.cmd" batch file:

 

setlocal
set LANGUAGE=en_US.UTF-8
set LANG=%LANGUAGE%
set LC_ALL=%LANGUAGE%
chcp 65001
(test gawk commands here)
endlocal

 

Nothing I have tried so far seems to make gsub treat the multi-byte
sequences as a single logical character.

 

Would someone please show me the correct magic invocation to make gsub
respect multi-byte UTF-8 sequences as single characters on a Windows
platform?

 

Test gawk program:

 

BEGIN {
    # Note the first character is UTF-8 \xE2\x80\xA2 = U+2022 BULLET
    utfstr = "•  An "
    nutf = gsub(/[^[:space:]]/, "1,", utfstr)
    nutf += gsub(/[[:space:]]/, "0,", utfstr)
    print "gsubutf=" nutf ",utfstr=" utfstr

    ascstr = ".  An " # here the BULLET is replaced by a period
    nasc = gsub(/[^[:space:]]/, "1,", ascstr)
    nasc += gsub(/[[:space:]]/, "0,", ascstr)
    print "gsubasc=" nasc ",ascstr=" ascstr

    getline  # reads the first line of the test file, which starts with a BULLET
    ninp = gsub(/[^[:space:]]/, "1,", $0)
    ninp += gsub(/[[:space:]]/, "0,", $0)
    print "gsubinp=" ninp ",inpstr=" $0

    exit
}

 

Test gawk input file (one line with leading BULLET and trailing blank
character):

•  An 

 

Test output:

 

gsubutf=8,utfstr=1,1,1,0,0,1,1,0,
gsubasc=6,ascstr=1,0,0,1,1,0,
gsubinp=8,inpstr=1,1,1,0,0,1,1,0,

 

Cmd.exe commands used to test:

 

gawk -f testgsub.awk testutf8.txt
gawk --posix -f testgsub.awk testutf8.txt

 

Nothing I have tried seems to work, so I appreciate any RTFM or help you can
offer.

 

Peter

 


