Re: [bug-gawk] Unexpected results with RS="."
From: arnold
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 11:44:12 -0600
User-agent: Heirloom mailx 12.4 7/29/08

I have added a paragraph about this point and pushed it out to Git.
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
> Arnold - thanks for responding. I don't agree that it is clear, as that section
> doesn't state that the 3 possibilities are considered in that order; it sounds
> like they would just be mutually exclusive, but of course they aren't when it
> comes to RS=".". So what happens in gawk when the single char is a regexp
> metacharacter is ambiguous if that's the only statement about the behavior. In
> any case, I didn't even look at the Summary section, as I expected to find
> everything I needed related to this in the main section, 4.1 How Input Is
> Split into Records
> (https://www.gnu.org/software/gawk/manual/gawk.html#Records).
>
> Since a Summary should be just that, I'd have expected the information in
> section 4.14 to be summarized from section 4.1, not additional to it.
> What's stated in 4.14 is fine as a summary, but not adequate if it's the ONLY
> source of info on this. It also doesn't explain how to get an RS that means
> "any single character", and IMHO that is non-obvious (embarrassingly, I had to
> ask at comp.lang.awk, where Janis helped me wrap my head around it as I was
> drawing a blank!).
>
> I see now there's a clear statement of the related behavior for FS in section
> 4.5 Specifying How Fields Are Separated
> (https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
>
>     If FS is any other single character, such as ",", then each
>     occurrence of that character separates two fields. Two consecutive
>     occurrences delimit an empty field. If the character occurs at the
>     beginning or the end of the line, that too delimits an empty field. The
>     space character is the only single character that does not follow these
>     rules.
>
> I think RS deserves the equivalent explanation in section 4.1, plus the example
> of using an RS that's any char (FS doesn't need it since there's no equivalent
> to RT that would be useful in this case, and FPAT="." works as you'd expect, so
> there's no use case for FS="." as a regexp).
>
>     Ed.
>
> On 6/11/2018 1:07 AM, address@hidden wrote:
> > Hi Ed.
> >
> > The behavior is stated clearly, if tersely, in the summary section in the
> > chapter on reading input
> > (https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):
> >
> >
> >     Input is split into records based on the value of RS. The possibilities
> >     are as follows:
> >
> >     Value of RS              Records are split on ...        awk / gawk
> >     Any single character     That character                  awk
> >     The empty string ("")    Runs of two or more newlines    awk
> >     A regexp                 Text that matches the regexp    gawk
> >
> > Thanks,
> >
> > Arnold
> >
> >
> > Ed Morton <address@hidden> wrote:
> >
> >> I was recently surprised by this behavior from gawk 4.2.0:
> >>
> >>     $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <foo
> >>     :>
> >>
> >> I came across this because I was trying to process data 1 char at a time and
> >> thought setting RT to 1 char at a time might be a valid approach rather than
> >> writing a loop. I'm not looking for alternatives, just wondering about this
> >> specific functionality.
> >>
> >> A little investigation shows that it behaves as if I'd used RS='[.]':
> >>
> >>     $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <foo:.>
> >>     2 <bar
> >>     :>
> >>
> >> I expected that RT would take the values f, o, o, \n and every $0 would be the
> >> null string, analogous to what happens when you use 2 "."s:
> >>
> >>     $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <:fo>
> >>     2 <:o
> >>     >
> >>
> >> I assume it does this for compatibility with other awks, where a single-char RS
> >> is always just that literal character, but that seems counter-intuitive given
> >> the way gawk uses RS as a regexp otherwise, and idk how we're supposed to set RS
> >> to "any single character" given this implementation. If RS="." were
> >> interpreted as a normal regexp, then we could use `RS="[.]"` to get a literal
> >> "." just like we do in any other regexp context.
> >>
> >> I've since discovered that I can get the behavior I want with `RS=".{1}"` or
> >> `RS="[[:space:]]|[^[:space:]]"` etc., but it's all pretty kludgy and
> >> non-intuitive.
> >>
> >> I can't find anything in the gawk documentation that states that the above is
> >> expected behavior. Assuming we can't update the code to treat RS="." as if "."
> >> is a regexp metacharacter, for backward compatibility, can we get a statement
> >> saying something clear like "If RS is a single character it will be treated as
> >> a literal character and not a regexp metacharacter" added to the documentation,
> >> and also the example of RS=".{1}" shown as a workaround for the case where the
> >> desired regexp is "a single occurrence of any character"? I can't think of any
> >> other regexp metacharacter that this issue would apply to.
> >>
> >>     Ed.