bug-gnu-utils

Re: gawk: Wrong behavior in binary mode


From: Aharon Robbins
Subject: Re: gawk: Wrong behavior in binary mode
Date: Mon, 15 Dec 2008 23:26:33 +0200

Eli,

Hi. Thanks for your note.

> From: Eli Zaretskii <address@hidden>
> To: Aharon Robbins <address@hidden>
> CC: address@hidden, address@hidden
> Subject: Re: gawk: Wrong behavior in binary mode
> Date: Thu, 11 Dec 2008 18:06:36 -0500
>
> > Date: Thu, 11 Dec 2008 05:39:23 +0200
> > From: Aharon Robbins <address@hidden>
> > Cc: address@hidden
> > 
> > Greetings. Re this:
> > 
> > > Date: Mon, 8 Dec 2008 23:27:51 -0200
> > > From: "Carlos G." <address@hidden>
> > > To: address@hidden
> > > Subject: gawk: Wrong behavior in binary mode
> > >
> > > Hi... I think this is a bug.
> > > When working with gawk in binary mode, the length() and index() built-ins
> > > fail with character codes greater than 127(0x7f). For example:
> > >
> > > ....
> > 
> > First, thank you very much for the bug report.
> > 
> > Second, it's not a BINMODE problem; rather it is a problem with locales;
> > the same behavior shows up under Linux which ignores BINMODE.
>
> I actually think that Carlos is right: if the user says she wants the
> bytes treated as bytes, Gawk should not try to treat them as multibyte
> character strings.
>
> I think the patch you posted in a followup is only partially correct:
> it will only work if the stream of bytes is not a valid multibyte
> string.  But what if by chance it is a valid string?  Solving this as
> you did gives unpredictable results, from the point of view of a user
> who does not necessarily know everything about valid and invalid
> multibyte strings.
>
> So I think there should be a way to tell Gawk "hands off my bytes!"
> BINMODE could be just that way (in which case Linux should not ignore
> it), or you can introduce a new variable.
>
> Btw, we've been through these issues in Emacs when Emacs 20 introduced
> multi-lingual support, and Emacs now has a special way of treating raw
> bytes that don't represent multibyte (or otherwise encoded) text.

You raise an interesting issue.

Adding to the BINMODE semantics does not appeal to me, since up to now
it has been concerned only with how to interpret line endings. The name
is perhaps a misnomer, but it's borrowed from mawk (where it was first
implemented) and I'd just as soon not change historical behavior nor
break compatibility with mawk.

The issue really is a locale issue; POSIX basically expects awk to
treat data as multibyte characters. I don't remember what it says when
the bytes don't make a valid string, but the behavior with the patch
makes sense (at least to me!).

Of course, we all know that POSIX didn't get everything right and that
the Real World (tm) is a tougher place to live in, so the idea of
a way to tell awk (or any utility) "hands off my bytes" is appealing.

Fortunately, such a mechanism already exists, and is even standardized.
It's called

        export LC_ALL=C

A little-known fact is that the shell lets you set an environment
variable for just a single command run, leaving the shell's own
environment unchanged.  This is done by

        LC_ALL=C command arg ....

To demonstrate in our case:

        $ export LC_ALL=en_US.utf8
        $ gawk --version
        GNU Awk 3.1.6
        Copyright (C) 1989, 1991-2007 Free Software Foundation.
        ...............
        $ gawk 'BEGIN { print length("\x81\x82\x83\x84") }'
        0
        $ LC_ALL=C gawk 'BEGIN { print length("\x81\x82\x83\x84") }'
        4
        $ echo $LC_ALL
        en_US.utf8
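
The same scoping can be checked without gawk or any particular locale
installed. What follows is only a minimal sketch: the variable names
`child` and `parent` are made up for illustration, and `POSIX` is used as
the starting value simply because it is available everywhere.

```shell
# Give the shell a known starting value (POSIX is an alias for the C locale).
export LC_ALL=POSIX

# Override LC_ALL for one child process only; the child reports what it sees.
child=$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')

# The calling shell's own value is untouched by the one-shot assignment.
parent=$LC_ALL

echo "child saw:  $child"
echo "parent has: $parent"
```

The child should report C while the calling shell still has POSIX, which
is the same effect as the `echo $LC_ALL` check in the transcript above.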

LC_ALL=C could perhaps be documented more fully as a way to provide the
"hands off my bytes" feature, but otherwise I don't see a big need for
Yet Another Magic Variable or for a new command line option.

If I'm missing something really big and obvious, please let me know.

Thanks,

Arnold



