bug-gawk

Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk on


From: ED MORTON
Subject: Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk only does that if FS is single char
Date: Sun, 21 Apr 2019 07:56:02 -0500 (CDT)

Arnold - sounds good, thanks for looking into it. I think that's the best 
approach since, with the way gawk works today, I can do

$ printf '1:2\n3\n' | awk -F'[:]' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/2 <1>
2/2 <2
3>

whenever I want a single char FS without "\n" getting added. If I couldn't do 
the above, then having to REMOVE an automatically added "\n" when I really only 
wanted ":" as the FS would be a significant pain. Off the top of my head I 
can't think of a way to do it that doesn't involve either converting the "\n"s 
in the record to some control char that I just hope isn't in the input, 
re-splitting the record, and then restoring the "\n"s, or manually writing a 
"while(index($0,FS))substr()s" loop or similar to identify the fields!
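
For illustration, here is a rough sketch of what such a manual loop could look 
like. The function name split_on_char_only and the awk variable "sep" are made 
up for this example. It assumes the separator is a single literal character 
other than a backslash (since -v processes escape sequences), and it is not 
meant as the definitive way to do this:

```shell
# Hypothetical helper: split each paragraph-mode record on one literal
# character only, never on "\n", without relying on FS at all.
split_on_char_only() {
    awk -v RS= -v sep="$1" '
    {
        rest = $0
        n = 0
        # Peel off one field per occurrence of the separator.
        while ((p = index(rest, sep)) > 0) {
            f[++n] = substr(rest, 1, p - 1)
            rest = substr(rest, p + 1)
        }
        f[++n] = rest          # whatever is left is the last field
        for (i = 1; i <= n; i++) print i "/" n, "<" f[i] ">"
    }'
}

printf '1:2\n3\n' | split_on_char_only ':'
```

For this input it should print the same two fields as the -F'[:]' command 
above, with the "\n" kept inside the second field.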

To be honest, I'd rather awk simply NEVER added "\n" to FS when RS="", since 
it's non-intuitive that that would happen and it's trivial to add "\n" to FS 
if I want it. But I don't expect that behavior to change now, and the way gawk 
works today provides a simple workaround, so I agree that just documenting the 
way gawk works and trying to get the standard changed is the way to go.

    Ed.

> On April 21, 2019 at 6:25 AM address@hidden wrote:
> 
> 
> Hi Ed.
> 
> [ BCC to some other awk maintainers, for their interest, and action
>   if necessary. ]
> 
> Ed Morton <address@hidden> wrote:
> 
> > I just came across this where setting RS to null causes FS to include 
> > `\n` if FS is a single char but not otherwise:
> >
> >     $ printf '1:2\n3\n' | awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
> >     1/3 <1>
> >     2/3 <2>
> >     3/3 <3>
> >
> >     $ printf '1::2\n3\n' | awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
> >     1/2 <1>
> >     2/2 <2
> >     3>
> >
> > with this gawk version:
> >
> >     $ awk --version
> >     GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
> >     Copyright (C) 1989, 1991-2018 Free Software Foundation.
> >
> > and that makes sense given the gawk documentation 
> > (https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line) which 
> > says (emphasis mine):
> >
> >     When RS is set to the empty string _and FS is set to a single
> >     character_, the newline character always acts as a field separator.
> >     This is in addition to whatever field separations result from FS.
> 
> This is how Unix awk has behaved since the dawn of time, and how
> gawk behaves.  I'm not going to change gawk; see below.
> 
> > but the POSIX spec (http://pubs.opengroup.org/onlinepubs/9699919799/) says:
> >
> >     *RS*
> >         The first character of the string value of *RS* shall be the
> >         input record separator; a <newline> by default. If *RS* contains
> >         more than one character, the results are unspecified. If *RS* is
> >         null, then records are separated by sequences consisting of a
> >         <newline> plus one or more blank lines, leading or trailing
> >         blank lines shall not result in empty records at the beginning
> >         or end of the input, and a <newline> shall always be a field
> >         separator, no matter what the value of *FS* is.
> >
> > gawk behaves the way I described with or without the `--posix` flag. 
> > Shouldn't it add `\n` as a separator when RS is null regardless of the 
> > value of FS like POSIX says? FWIW OSX/BSD awk on MacOS behaves the same 
> > way that gawk does, idk about other awks.
> 
> The language in POSIX, "no matter what the value of FS is" has been there
> since at least the 2004 standard. (I couldn't find anything older online).
> 
> In turn, that language is actually based on the Aho, Kernighan and Weinberger
> book, pages 61 and 84, which say the same thing. (!)
> 
> As you note, it does imply that RS = "" should cause \n to be a separator
> even if FS is a regexp.
> 
> HOWEVER, the code in Unix awk (see https://github.com/onetrueawk/awk)
> is more like this:
> 
>       if (FS is a regexp)
>               do regexp field splitting
>       else if (FS is " ")
>               split on ' ', '\t', and '\n'
>       else {
>               split on other single character value of FS
>               if (RS is null)
>                       also split on '\n'
>       }
> 
> Gawk is essentially the same, although how the code works is different.
> 
> Given that the existing practice dates back to at least 1987, over three
> decades, I think that changing the code would be the wrong thing to do.
> 
> Instead, I will document this discrepancy, and work to get the standard
> revised.
> 
> Thanks!
> 
> Arnold


