Re: Question on using gsub() on UTF-8 strings

help-gawk

From:	Eli Zaretskii
Subject:	Re: Question on using gsub() on UTF-8 strings
Date:	Sun, 23 Jun 2024 09:03:49 +0300

> From: <pjfarley3@earthlink.net>
> Date: Sun, 23 Jun 2024 00:09:11 -0400
> 
> Environment: I am running the latest gawk 5.3.0-2 from ezwinports on an
> up-to-date Win10 system and I do have the correct libgcc and stdc++
> libraries in the same bin directory as the gawk executable.
> 
> gawk --version yields:
> 
> GNU Awk 5.3.0, API 4.0, (GNU MPFR 4.0.2, GNU MP 6.1.2)
> Copyright (C) 1989, 1991-2023 Free Software Foundation.
> 
> I have a set of UTF-8 encoded text files where I need to do some character
> analysis based on blank vs non-blank **characters** (not bytes).
> 
> To that end I used the following code to create vectors of zeroes and ones
> corresponding to blank (actually "[:space:]") characters vs non-blank
> characters.
> 
>     gsub(/[^[:space:]]/, "1,", input_line)
>     gsub(/[[:space:]]/, "0,", input_line)
> 
> This works fine for pure-ASCII (single-byte) characters, but when presented
> with UTF-8 multi-byte sequences, gsub seems to treat each byte of the UTF-8
> characters as a separate character.
> 
> I tried using the Windows "chcp 65001" setting, and even tried these
> environment settings in a "test.cmd" batch file:
> 
> setlocal
> set LANGUAGE=en_US.UTF-8
> set LANG=%LANGUAGE%
> set LC_ALL=%LANGUAGE%
> chcp 65001
> (test gawk commands here)
> endlocal
> 
> Nothing I have tried so far seems to make gsub treat the multi-byte
> sequences as a single logical character.
> 
> Would someone please show me the correct magic invocation to make gsub
> respect multi-byte UTF-8 sequences as single characters on a Windows
> platform?

There's nothing you or I can do about this, unfortunately: Windows
doesn't (yet) support UTF-8 encoding in a way that would allow Gawk
on Windows to process UTF-8 encoded text.
Setting the codepage to 65001, the UTF-8 codepage, does not cause the
Windows library functions used by Gawk to be aware of UTF-8 multibyte
encodings.  Also, the environment variables LANG and LC_ALL do not
affect Windows programs at all (unlike on POSIX platforms).
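
To make the byte-versus-character difference concrete, here is a rough
sketch in Python rather than Gawk.  The function names are made up for
illustration, and Python's str.isspace() is only an approximation of
what [:space:] matches, so treat it as a picture of the symptom, not
as a description of Gawk's internals.

```python
def vector_by_character(line: str) -> str:
    """Emit "0," per blank and "1," per non-blank, one entry per character."""
    return "".join("0," if ch.isspace() else "1," for ch in line)


def vector_by_byte(line: str) -> str:
    """Same idea, but one entry per UTF-8 byte (the behaviour reported above)."""
    raw = line.encode("utf-8")
    ascii_blanks = b" \t\n\r\v\f"
    # Iterating over bytes yields integers; `in` tests membership by byte value.
    return "".join("0," if b in ascii_blanks else "1," for b in raw)


sample = "\u00e9 a"  # e-acute, space, "a": 3 characters but 4 UTF-8 bytes
print(vector_by_character(sample))  # 1,0,1,
print(vector_by_byte(sample))       # 1,1,0,1,
```

The extra "1," in the second line of output comes from the second
byte of the two-byte UTF-8 sequence for the accented letter.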

So the only recommendation I have is to recode the text in some
single-byte codepage supported by Windows, preferably your system
codepage, and then Gawk should work.
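
As one possible way to do that recoding, here is a minimal Python
sketch.  It assumes the target is codepage 1252 (substitute your own
system codepage), and the function name and the choice of "?" as a
stand-in are mine, not anything Gawk or Windows prescribes.
Characters the codepage cannot represent become "?" (or a plain space
if they are whitespace), so the blank versus non-blank distinction
survives and every character ends up as exactly one byte.

```python
def recode_utf8_to_codepage(data: bytes, codepage: str = "cp1252") -> bytes:
    """Recode UTF-8 bytes into a single-byte codepage, one byte per character."""
    text = data.decode("utf-8")
    out = []
    for ch in text:
        try:
            out.append(ch.encode(codepage))
        except UnicodeEncodeError:
            # Not representable: keep blanks blank, mark everything else with "?".
            out.append(b" " if ch.isspace() else b"?")
    return b"".join(out)


# In-memory example: e-acute exists in cp1252, the CJK character and the
# em space (U+2003) do not.
sample = "\u00e9 \u4e2d\u2003a".encode("utf-8")
print(recode_utf8_to_codepage(sample))  # b'\xe9 ? a'
```

For real files you would read the input in binary mode, pass the bytes
through the function, and write the result to a new file for Gawk to
read.  The substitution is lossy, so keep the original UTF-8 files.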

Or do this on GNU/Linux.

Or use Emacs (which does support UTF-8 even on Windows).

Sorry.
