bug-gnu-utils

gawk/POSIX regex metacharacter bug


From: Shawn Smout
Subject: gawk/POSIX regex metacharacter bug
Date: Sun, 28 Sep 2003 13:36:04 -0700 (PDT)

I am running Slackware 9.1 with Linux 2.4.22 on a
Pentium 4.  My gawk version is 3.1.3, and I am
reasonably certain it was compiled with gcc 3.2.3.

Gawk normally appears to decide by context whether a
character is a metacharacter, but does not do so in
POSIX compatibility mode.  This is not listed in the
documentation (info or man) as one of the POSIX/GNU
differences.

For this example, the file "file" contains one line:
    {s}

Ordinarily,
    gawk '/{.}/' file
will print:
    {s}
However,
    gawk --posix '/{.}/' file
fails with an invalid regular expression error. 
Apparently gawk normally decides based on context
whether the {} characters are metacharacters or
literal characters; since they are not valid as
metacharacters in this example, gawk interprets them
as literal characters.  In POSIX mode, gawk does not
change its interpretation of the metacharacters based
on context.

The correct POSIX awk syntax is
    awk '/\{.\}/' file
with the metacharacters escaped so they are
interpreted as literals.  This prints
    {s}
This syntax works in gawk in both normal and POSIX
modes.
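
As a minimal sketch of the portable forms, the commands below
recreate the one-line test file and match it two ways: with the
escaped braces described above, and with bracket expressions, in
which { and } are always literal.  The scratch file from mktemp and
the variable names are assumptions for illustration, and "awk" is
assumed to be whatever awk is installed locally.

```shell
# Hypothetical scratch file holding the single test line from the report.
tmp=$(mktemp)
printf '%s\n' '{s}' > "$tmp"

# Escaped braces: the form reported to work in both normal and POSIX modes.
escaped=$(awk '/\{.\}/' "$tmp")

# Bracket expressions: inside [ ], the braces have no special meaning.
bracketed=$(awk '/[{].[}]/' "$tmp")

printf 'escaped:   %s\nbracketed: %s\n' "$escaped" "$bracketed"
rm -f "$tmp"
```

Neither form depends on the regex engine guessing whether a brace
begins an interval expression, so the result should not depend on
whether POSIX mode is in effect.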

The problem here is not the discrepancy between normal
and POSIX modes; I am fully aware that most such
discrepancies are deliberate.  However, this
particular one is not documented, which is a major
problem.  I discovered this gawk issue while compiling
third-party software (specifically, the ALSA drivers)
that uses gawk.  I had the POSIXLY_CORRECT environment
variable set, which causes gawk to behave in POSIX
mode, and the compilation failed; it took me a long
time to figure out why.  This problem might never have
existed had the discrepancy been documented; even if it
did exist, it would then become the fault of the
developers for either (a) not checking the
documentation and making sure their code was
compatible with either mode of gawk, or (b) not
informing the user that gawk needed to run in
non-compatible mode.  However, it was not documented,
so there was nothing the developers could have done
about it.

It is bad enough that so much GNU software allows lax
syntax like this.  Allowing context-based
interpretation of metacharacters doesn't add any
functionality at all, because the developer can always
escape the metacharacters to achieve the same result;
it only allows harmful ambiguity, which in turn causes
hard-to-find bugs that never should have been there to
start out with.  If we are ever to have good bug-free
code, we should try to eliminate ambiguity, not
promote it.  However, I would consider the ambiguity
tolerable in the software of others who choose to use
it, if it were documented properly.




