sed-devel

Re: sed 4.7 little behaviour change on regexp interval


From: Assaf Gordon
Subject: Re: sed 4.7 little behaviour change on regexp interval
Date: Sun, 6 Jan 2019 23:50:24 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.0

Hello,

On 2019-01-03 6:17 a.m., Lorenzo Gaggini wrote:
> I recently upgraded to sed version 4.7 from sed 4.5 (I'm on Archlinux)
> and I noticed a little behaviour change on regexp interval.

There are actually a few issues here:

First,
which regex implementation is being used?
It might sound like a silly question, but in fact there are at least two
'competing' implementations:
either your system's native glibc implementation (and then glibc
versions have slight differences), or the internal regex implementation
that comes bundled with sed and grep (and awk etc.).

On most modern GNU/Linux systems, building sed and grep
uses the glibc one. But you can force the internal one
with:

   ./configure --with-included-regex

The resulting binary might behave differently than one built without it.

The internal implementation is routinely sync'd with glibc,
so you could say (in theory) that glibc v2.28 is similar to the current
internal implementation.

But if you have an older glibc version, it might be different.

That's why just saying "grep 3.3" is not sufficient,
and similarly "sed 4.5" and "sed 4.7" are not - it depends on whether
they used the internal (gnulib) regex implementation or glibc's.
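
When reporting this kind of behaviour, it helps to quote the C library
version next to the sed version. A minimal sketch, assuming GNU sed and a
glibc system (regex_report is a hypothetical helper name; on other systems
the fields simply print "unknown"). Note that it cannot tell whether the
binary was built with --with-included-regex; only the build configuration
records that.

```shell
#!/bin/sh
# Hypothetical helper: collect the details worth quoting in a bug report.
regex_report() {
    # First line of `sed --version` names the release (GNU sed only).
    sed_ver=$( { sed --version 2>/dev/null || true; } | head -n 1)
    # On glibc systems this prints something like "glibc 2.28".
    libc_ver=$(getconf GNU_LIBC_VERSION 2>/dev/null || true)
    printf 'sed:  %s\n' "${sed_ver:-unknown (not GNU sed?)}"
    printf 'libc: %s\n' "${libc_ver:-unknown (not glibc?)}"
}

regex_report
```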

I suspect that if you built both sed 4.5 and 4.7 using
"--with-included-regex" you'll get the "Invalid range end" error in both versions.

> sed -e 's/[^a-Z-]//g'
>
> On new sed 4.7 this sed expression gives me an error:
>
> sed: -e expression #1, char 12: Invalid range end

Second,
The issue here is that character "a" (ASCII 97) is larger than
"Z" (ASCII 90) - the start is larger than the end.
It's just as invalid as:

    grep '[F-A]'

Does that mean "A" to "F" (in the "just do what I meant" fashion),
or is this an invalid range?

Note that the ranges '[Z-a]' and '[A-z]' are both valid - the start
character is smaller than the end character.
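
The ordering is easy to check from the shell. A small sketch (code_of is a
hypothetical helper name) using the POSIX printf rule that an argument with a
leading quote is converted to the numeric value of the character after it:

```shell
#!/bin/sh
# Print the numeric code of a single character.
# A leading single quote in the argument makes printf use the
# character's code as the value for %d.
code_of() { printf '%d\n' "'$1"; }

code_of A   # 65
code_of Z   # 90
code_of a   # 97
code_of z   # 122
```

So '[Z-a]' runs from 90 to 97 and '[A-z]' from 65 to 122, while '[a-Z]'
would run from 97 down to 90. Keep in mind that '[A-z]' also covers the six
punctuation characters between "Z" and "a" (codes 91 to 96).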

The current internal gnulib regex implementation forbids ranges such as
'[a-Z]'. I believe the most recent glibc behaves the same.
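
To see what a particular sed binary does, something like the following can
be used. It is only a sketch: range_ok is a hypothetical helper, it assumes
the bracket expression contains no "/" character, and the answer for the
reversed ranges depends on which regex implementation the binary was built
against.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the local sed accepts the bracket
# expression given as $1 (which must not contain a "/").
range_ok() {
    # Empty input: we only care whether the expression compiles.
    printf '' | LC_ALL=C sed -e "s/$1//g" >/dev/null 2>&1
}

for r in '[A-z]' '[Z-a]' '[a-Z]' '[F-A]'; do
    if range_ok "$r"; then
        echo "$r accepted"
    else
        echo "$r rejected"
    fi
done
```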

The problem is compounded by locales - basically,
in any locale except POSIX/C, ranges are poorly defined.
For example, does the range '[a-z]' include "à"?
(This topic is sometimes referred to as "rational ranges";
if you search for it you'll find lots of discussion on the gnulib
and glibc mailing lists.)

In short, if you use ranges, use LC_ALL=C.
Otherwise, using '[[:alpha:]]' is recommended.
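
For the expression from the original report, both options could look like
this. This is a sketch that assumes the intent was "delete everything except
letters and hyphens"; strip_c and strip_alpha are hypothetical names.

```shell
#!/bin/sh
# Option 1: explicit ASCII ranges, with the locale pinned to C so the
# ranges mean exactly A-Z and a-z.
strip_c() { LC_ALL=C sed -e 's/[^A-Za-z-]//g'; }

# Option 2: a character class, which avoids ranges altogether and
# follows the current locale's idea of a letter.
strip_alpha() { sed -e 's/[^[:alpha:]-]//g'; }

echo 'Hello-World_42' | strip_c       # Hello-World
echo 'Hello-World_42' | strip_alpha   # Hello-World
```

In both bracket expressions the trailing "-" is a literal hyphen, not a
range operator.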


Hope this helps,
 - assaf



