From: arnold
Subject: Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk only does that if FS is single char
Date: Sun, 21 Apr 2019 05:25:43 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hi Ed.

[ BCC to some other awk maintainers, for their interest, and action
  if necessary. ]

Ed Morton <address@hidden> wrote:

> I just came across this where setting RS to null causes FS to include
> `\n` if FS is a single char but not otherwise:
>
>     $ printf '1:2\n3\n' | awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
>     1/3 <1>
>     2/3 <2>
>     3/3 <3>
>
>     $ printf '1::2\n3\n' | awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
>     1/2 <1>
>     2/2 <2
>     3>
>
> with this gawk version:
>
>     $ awk --version
>     GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
>     Copyright (C) 1989, 1991-2018 Free Software Foundation.
>
> and that makes sense given the gawk documentation 
> (https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line) which 
> says (red/underline mine):
>
>     When RS is set to the empty string _and FS is set to a single
>     character_, the newline character always acts as a field separator.
>     This is in addition to whatever field separations result from FS.

This is how Unix awk has behaved since the dawn of time, and how
gawk behaves.  I'm not going to change gawk; see below.

> but the POSIX spec (http://pubs.opengroup.org/onlinepubs/9699919799/) says:
>
>     *RS*
>         The first character of the string value of *RS* shall be the
>         input record separator; a <newline> by default. If *RS* contains
>         more than one character, the results are unspecified. If *RS* is
>         null, then records are separated by sequences consisting of a
>         <newline> plus one or more blank lines, leading or trailing
>         blank lines shall not result in empty records at the beginning
>         or end of the input, and a <newline> shall always be a field
>         separator, no matter what the value of *FS* is.
>
> gawk behaves the way I described with or without the `--posix` flag. 
> Shouldn't it add `\n` as a separator when RS is null regardless of the 
> value of FS like POSIX says? FWIW OSX/BSD awk on MacOS behaves the same 
> way that gawk does, idk about other awks.

The language in POSIX, "no matter what the value of FS is", has been there
since at least the 2004 standard. (I couldn't find anything older online.)

In turn, that language is actually based on the Aho, Kernighan and Weinberger
book, pages 61 and 84, which say the same thing. (!)

As you note, it does imply that RS = "" should cause \n to be a separator
even if FS is a regexp.

HOWEVER, the code in Unix awk (see https://github.com/onetrueawk/awk)
is more like this:

        if (FS is a regexp)
                do regexp field splitting
        else if (FS is " ")
                split on ' ', '\t', and '\n'
        else {
                split on other single character value of FS
                if (RS is null)
                        also split on '\n'
        }

Gawk is essentially the same, although how the code works is different.

Given that the existing practice dates back to at least 1987, over three
decades, I think that changing the code would be the wrong thing to do.

Instead, I will document this discrepancy, and work to get the standard
revised.
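
In the meantime, a script that wants the POSIX-described behavior with
a regexp FS can name the newline in FS itself. The following is a
minimal sketch of that workaround, not something gawk does for you;
fields_posix_style is a hypothetical helper name used only here.

```shell
# Hypothetical helper: split paragraph-mode records on "::" OR newline by
# listing the newline explicitly in the FS regexp.
fields_posix_style() {
    awk -F'::|\n' -v RS= '{ for (i = 1; i <= NF; i++) print i "/" NF, "<" $i ">" }'
}

printf '1::2\n3\n' | fields_posix_style
# 1/3 <1>
# 2/3 <2>
# 3/3 <3>
```

Because the newline is part of the regexp, this does not rely on how a
given awk treats FS when RS is null.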

Thanks!

Arnold


