help-gawk
Re: Question on using gsub() on UTF-8 strings


From: Eli Zaretskii
Subject: Re: Question on using gsub() on UTF-8 strings
Date: Sun, 23 Jun 2024 09:03:49 +0300

> From: <pjfarley3@earthlink.net>
> Date: Sun, 23 Jun 2024 00:09:11 -0400
> 
> Environment: I am running the latest gawk 5.3.0-2 from ezwinports on an
> up-to-date Win10 system and I do have the correct libgcc and stdc++
> libraries in the same bin directory as the gawk executable.
> 
> gawk --version yields:
> 
> GNU Awk 5.3.0, API 4.0, (GNU MPFR 4.0.2, GNU MP 6.1.2)
> Copyright (C) 1989, 1991-2023 Free Software Foundation.
> 
> I have a set of UTF-8 encoded text files where I need to do some character
> analysis based on blank vs non-blank **characters** (not bytes).
> 
> To that end I used the following code to create vectors of zeroes and ones
> corresponding to blank (actually "[:space:]") characters vs non-blank
> characters.
> 
>     gsub(/[^[:space:]]/, "1,", input_line)
>     gsub(/[[:space:]]/, "0,", input_line)
> 
> This works fine for pure-ASCII (single-byte) characters, but when presented
> with UTF-8 multi-byte sequences, gsub seems to treat each byte of the UTF-8
> characters as a separate character.
> 
> I tried using the Windows "chcp 65001" setting, and even tried these
> environment settings in a "test.cmd" batch file:
> 
> setlocal
> set LANGUAGE=en_US.UTF-8
> set LANG=%LANGUAGE%
> set LC_ALL=%LANGUAGE%
> chcp 65001
> (test gawk commands here)
> endlocal
> 
> Nothing I have tried so far seems to make gsub treat the multi-byte
> sequences as a single logical character.
> 
> Would someone please show me the correct magic invocation to make gsub
> respect multi-byte UTF-8 sequences as single characters on a Windows
> platform?

There's nothing you or I can do about this, unfortunately: Windows
doesn't (yet) support UTF-8 encoding in a way that would let Gawk on
Windows process UTF-8 encoded text.
Setting the codepage to 65001, the UTF-8 codepage, does not cause the
Windows library functions used by Gawk to be aware of UTF-8 multibyte
encodings.  Also, environment variables LANG and LC_ALL do not affect
Windows programs at all (unlike on Posix platforms).
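
To illustrate the byte-versus-character difference described above, here
is a minimal sketch in Python rather than Gawk.  It is only an analogy:
the helper name blank_vector is hypothetical, and Python's regex engine
stands in for the library functions Gawk uses.  The same pair of
substitutions is applied once to decoded text and once to the raw UTF-8
bytes.

```python
import re

def blank_vector(line):
    """Map every non-blank unit to "1," and every blank unit to "0,".

    For a str the units are characters; for bytes they are single bytes,
    which is the behavior reported in the question.
    """
    if isinstance(line, bytes):
        marked = re.sub(rb"\S", b"1,", line)
        return re.sub(rb"\s", b"0,", marked)
    marked = re.sub(r"\S", "1,", line)
    return re.sub(r"\s", "0,", marked)

text = "\u00e9 a"  # e-acute, space, "a": 3 characters, 4 bytes in UTF-8
print(blank_vector(text))                  # 1,0,1,
print(blank_vector(text.encode("utf-8")))  # b'1,1,0,1,'
```

The byte-wise result has one extra "1," because the two bytes of the
e-acute are each counted as a separate non-blank unit.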

So the only recommendation I have is to recode the text in some
single-byte codepage supported by Windows, preferably your system
codepage, and then Gawk should work.
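
The recoding step could be sketched as follows, again in Python and
under stated assumptions: cp1252 is only a guess at the system codepage
(substitute your own), and the function name recode_to_single_byte is
hypothetical.  Characters that the target codepage cannot represent are
replaced with "?", so any non-ASCII whitespace outside that codepage
would then be counted as non-blank.

```python
def recode_to_single_byte(utf8_data, codepage="cp1252"):
    """Recode UTF-8 bytes into a single-byte codepage.

    Each character becomes exactly one byte; characters missing from
    the codepage are encoded as b"?".
    """
    text = utf8_data.decode("utf-8")
    return text.encode(codepage, errors="replace")

sample = "\u00e9 a".encode("utf-8")  # e-acute, space, "a" as UTF-8
recoded = recode_to_single_byte(sample)
print(len(sample), len(recoded))     # 4 3
```

After recoding, every character occupies one byte, so byte-wise
processing and character-wise processing give the same counts.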

Or do this on GNU/Linux.

Or use Emacs (which does support UTF-8 even on Windows).

Sorry.


