Re: [IC/Bugs] Uninterpreted byte ranges in REs
From: Aharon Robbins
Subject: Re: [IC/Bugs] Uninterpreted byte ranges in REs
Date: Fri, 31 Oct 2008 14:19:55 +0200
Hi. Re this:
> Date: Thu, 30 Oct 2008 17:24:39 -0300
> From: Jorge Stolfi <address@hidden>
> To: address@hidden
> Subject: [IC/Bugs] Uninterpreted byte ranges in REs
>
> Dear gawk Maintainers,
>
> Not long ago, the interpretation of ranges like '[A-Z]' in gawk regular
> expressions was changed from plain byte-code order to locale-sensitive
> collating sequence order.
This is per POSIX; I do understand your problem though.
> While this change was probably welcomed by many users, it unfortunately
> broke many existing scripts. Worse, there seems to be no decent way to get
> the old interpretation of RE ranges, in cases where it is needed.
>
> For example, here is some code from a script that used to work in
> 2005:
>
> # Remove funny characters:
> gsub(/[\001-\037\177-\240]/, " ", $0); # Controls, NBSP
>
> The version of "gawk" that I am using now (GNU Awk 3.1.5) complains
>
> gawk: myscript:12: fatal: Invalid collation character: /[-- ]/
>
> Here is a minimal command line that triggers that error message:
>
> gawk '/[\177-\240]/{ }'
>
> Here is a more meaningful example:
>
> echo "FOO @" | tr '@' '\203' | gawk '/[\177-\240]/{print;}'
>
> The error message gets printed when LANG=C and LC_ALL=C, and also when
> LANG=POSIX and LC_ALL=POSIX. The "--traditional" switch makes no
> difference.
This is surprising. In particular, it works as expected for me
under Linux. How are you setting LC_ALL? If you're using Bash,
try
export LC_ALL=C
as a standalone statement and then run your test.
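A minimal sketch of that sequence, assuming a Bourne-style shell such as Bash and a gawk on your PATH (the `result` variable is only there for illustration and is not part of your script):

```shell
# Export the override on its own line so every later child process,
# including gawk at the end of a pipeline, inherits it.
export LC_ALL=C

# Show the locale-related variables that child processes will see.
env | grep -E '^(LANG|LC_)'

# Re-run the test: feed one line containing byte \203 to gawk.
# If the C locale is in effect, the range should be taken in byte order
# and the line should match.
result=$(printf 'FOO \203\n' | gawk '/[\177-\240]/ { print "matched" }')
echo "result: ${result:-no match}"
```

The reason for the standalone statement is that a prefix assignment such as `LC_ALL=C echo ... | gawk ...` applies only to the first command in the pipeline, so gawk would still run under whatever locale the shell already had.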
What kind of system are you using? Linux, or some other Unix variant?
If so, how was gawk compiled?
> In general, the only way to get the old semantics seems to be to list all
> octal codes in the desired range:
>
> ( echo "FOO @"; echo "BAR %" ) | tr '@%' '\177\203' |
> gawk '/[\177\200\201\202\203\204\205\206\207\210\211...\237\240]/{print;}'
This works but it should not be necessary if you use LC_ALL=C.
I would prefer to find out why LC_ALL=C isn't working for you before trying
to modify gawk.
Thanks,
Arnold