Re: bracket expansions and "rational range" (was: bug#25048: --with-included-regex vs. e-acute...)
Wed, 30 Nov 2016 01:51:24 -0700
Heirloom mailx 12.4 7/29/08
Assaf Gordon <address@hidden> wrote:
> > We SHOULD be adjusting more and more GNU tools to honor rational range
> > behavior, at least as an option, even if that means that e-acute can
> > never be matched to [d-f].
> I'm working on improving the sed manual,
> and just copied some parts from the grep manual.
> Specifically about section "bracket expansions":
> > In other locales, the sorting sequence is not specified, and ‘[a-d]’
> > might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might
> > fail to match any character, or the set of characters that it matches
> > might even be erratic. To obtain the traditional interpretation of
> > bracket expressions, you can use the ‘C’ locale by setting the LC_ALL
> > environment variable to the value ‘C’.
> Do you recommend rephrasing it in other ways, perhaps mentioning
> "Rational Range Interpretation" ?
There is text relating to this in the gawk manual. You may want to
peruse what that has to say and borrow what's appropriate.
> I should probably compile a list of combinations of os/libc/locale/gnulib
> under which sed does not behave with rational range. With the addition
> of the DFA engine (with fallback to the previous engine) it makes things
> ever more confusing (for me, at least).
So, for exactly this reason, gawk always uses the regex that comes
with its source code. This way I know that I will get consistent
results on ALL platforms.
The "price" of not using GLIBC regex is that gawk does not support
equivalence classes and collating sequences. In over two decades,
no one has ever complained. (:-)
In my opinion, sed should always use GNULIB regex, since it does
do Rational Range Interpretation (RRI). Advantages:
1. No out-of-sync interactions with dfa.c, which is also (and only!) RRI
2. Consistent behavior across all platforms (less confusing to you
and your users)
3. Less configuration machinery to have to maintain
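For readers less familiar with the term, here is a rough sketch of what
Rational Range Interpretation means in practice: a range such as [d-f]
is decided purely by comparing character code points, never by locale
collation order. The sketch is in Python only for convenience; it is
not taken from sed, gawk, dfa.c or gnulib, and the helper name
rri_in_range is made up for illustration. Python's own re module
happens to resolve bracket ranges by code point as well, so it is used
here as a cross-check.

```python
import re

def rri_in_range(ch: str, lo: str, hi: str) -> bool:
    """Hypothetical helper: rational-range test by raw code point.

    No locale or collation data is consulted; only ord() values are compared.
    """
    return ord(lo) <= ord(ch) <= ord(hi)

# Python's re module also treats bracket ranges as code point ranges.
bracket = re.compile(r"[d-f]")

# "\u00e9" is e-acute (U+00E9), which lies outside U+0064..U+0066.
for ch in ("d", "e", "f", "E", "\u00e9"):
    print(repr(ch), rri_in_range(ch, "d", "f"), bool(bracket.fullmatch(ch)))
```

Under this interpretation e-acute and uppercase E both fall outside
[d-f], regardless of the locale in effect, which is the consistency
argued for above.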
My two cents.