Re: Question on using gsub() on UTF-8 strings
From: Eli Zaretskii
Subject: Re: Question on using gsub() on UTF-8 strings
Date: Sun, 23 Jun 2024 09:03:49 +0300
> From: <pjfarley3@earthlink.net>
> Date: Sun, 23 Jun 2024 00:09:11 -0400
>
> Environment: I am running the latest gawk 5.3.0-2 from ezwinports on an
> up-to-date Win10 system and I do have the correct libgcc and stdc++
> libraries in the same bin directory as the gawk executable.
>
> gawk --version yields:
>
> GNU Awk 5.3.0, API 4.0, (GNU MPFR 4.0.2, GNU MP 6.1.2)
> Copyright (C) 1989, 1991-2023 Free Software Foundation.
>
> I have a set of UTF-8 encoded text files where I need to do some character
> analysis based on blank vs non-blank **characters** (not bytes).
>
> To that end I used the following code to create vectors of zeroes and ones
> corresponding to blank (actually "[:space:]") characters vs non-blank
> characters.
>
> gsub(/[^[:space:]]/, "1,", input_line)
> gsub(/[[:space:]]/, "0,", input_line)
>
> This works fine for pure-ASCII (single-byte) characters, but when presented
> with UTF-8 multi-byte sequences, gsub seems to treat each byte of the UTF-8
> characters as a separate character.
>
> I tried using the Windows "chcp 65001" setting, and even tried these
> environment settings in a "test.cmd" batch file:
>
> setlocal
> set LANGUAGE=en_US.UTF-8
> set LANG=%LANGUAGE%
> set LC_ALL=%LANGUAGE%
> chcp 65001
> (test gawk commands here)
> endlocal
>
> Nothing I have tried so far seems to make gsub treat the multi-byte
> sequences as a single logical character.
>
> Would someone please show me the correct magic invocation to make gsub
> respect multi-byte UTF-8 sequences as single characters on a Windows
> platform?
There's nothing you or I can do about this, unfortunately: Windows
doesn't (yet) support UTF-8 encoding in a way that would allow
Gawk on Windows to process UTF-8 encoded text.
Setting the codepage to 65001, the UTF-8 codepage, does not cause the
Windows library functions used by Gawk to be aware of UTF-8 multibyte
encodings. Also, environment variables LANG and LC_ALL do not affect
Windows programs at all (unlike on Posix platforms).
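To make the reported symptom concrete, here is a minimal Python sketch (not Gawk itself) that mimics the two gsub() calls from the question, once on raw bytes and once on decoded characters. The function names are hypothetical, and the explicit ASCII whitespace class is an assumption standing in for [:space:]; a real locale may classify more characters as blank.

```python
import re

# ASCII whitespace only; an assumed stand-in for awk's [:space:].
NON_BLANK_B = rb"[^ \t\n\r\f\v]"
BLANK_B = rb"[ \t\n\r\f\v]"
NON_BLANK_S = r"[^ \t\n\r\f\v]"
BLANK_S = r"[ \t\n\r\f\v]"


def blank_vector_bytes(data: bytes) -> str:
    """Byte-wise matching: each byte of a UTF-8 sequence counts separately."""
    out = re.sub(NON_BLANK_B, b"1,", data)
    out = re.sub(BLANK_B, b"0,", out)
    # Only the bytes '0', '1' and ',' remain, so ASCII decoding is safe.
    return out.decode("ascii")


def blank_vector_chars(data: bytes) -> str:
    """Character-wise matching: decode UTF-8 first, then substitute."""
    text = data.decode("utf-8")
    out = re.sub(NON_BLANK_S, "1,", text)
    out = re.sub(BLANK_S, "0,", out)
    return out


if __name__ == "__main__":
    line = "é b".encode("utf-8")      # 3 characters, 4 bytes
    print(blank_vector_bytes(line))   # 1,1,0,1,
    print(blank_vector_chars(line))   # 1,0,1,
```

The byte-wise result has one extra "1," because "é" occupies two bytes in UTF-8, which matches the behavior described in the question.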
So the only recommendation I have is to recode the text in some
single-byte codepage supported by Windows, preferably your system
codepage, and then Gawk should work.
Or do this on GNU/Linux.
Or use Emacs (which does support UTF-8 even on Windows).
Sorry.
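The recoding suggestion above could be sketched as follows, in Python rather than Gawk. This is only an illustration under stated assumptions: the function name is hypothetical, and cp1252 is assumed as the target codepage, so substitute whatever your system codepage actually is. Characters that the codepage cannot represent are replaced with "?", which keeps one byte per character but loses the original character (a non-ASCII blank such as U+3000 would become a non-blank "?").

```python
def recode_to_single_byte(data: bytes, codepage: str = "cp1252") -> bytes:
    """Convert UTF-8 bytes to a single-byte codepage.

    'codepage' is an assumption; pick the system codepage in practice.
    Unrepresentable characters become b"?" so that every character
    still maps to exactly one byte.
    """
    text = data.decode("utf-8")
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    utf8_line = "é b".encode("utf-8")       # 4 bytes in UTF-8
    recoded = recode_to_single_byte(utf8_line)
    print(len(utf8_line), len(recoded))     # 4 3
```

After such a conversion each character is a single byte, so byte-wise matching and character-wise matching give the same counts.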