Arnold - thanks for responding. I don't agree that it's clear, as that
section doesn't state that the 3 possibilities are considered in
that order. It sounds like they would be mutually exclusive, but
of course they aren't when it comes to RS=".", so what happens in
gawk when the single char is a regexp metacharacter is ambiguous if
that's the only statement about the behavior. In any case, I didn't
even look at the Summary section, as I expected to find everything I
needed related to this in the main section, 4.1 How Input Is Split
into Records
(https://www.gnu.org/software/gawk/manual/gawk.html#Records).
Since a Summary should be just that, I'd have expected this particular
information in section 4.14 to be summarized from section 4.1,
not additional to it. What's stated in 4.14 is fine as a summary,
but not adequate if it's the ONLY source of info on this. It also
doesn't explain how to get an RS that means "any single character",
and IMHO that is non-obvious (embarrassingly, I had to ask at
comp.lang.awk, where Janis helped me wrap my head around it as I was
drawing a blank!).
I see now there's a clear statement of the related behavior for FS
in section 4.5 Specifying How Fields Are Separated
(https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
    If FS is any other single character, such as ",", then each
    occurrence of that character separates two fields. Two consecutive
    occurrences delimit an empty field. If the character occurs at the
    beginning or the end of the line, that too delimits an empty field.
    The space character is the only single character that does not
    follow these rules.
I think RS deserves the equivalent explanation in section 4.1, plus
the example of using an RS that's any char (FS doesn't need it since
there's no equivalent to RT that would be useful in this case, and
FPAT="." works as you'd expect, so there's no use case for FS="." as
a regexp).
Ed.
Hi Ed.
The behavior is stated clearly, if tersely, in the summary section in the chapter
on reading input (https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):
Input is split into records based on the value of RS. The possibilities are as follows:

    Value of RS              Records are split on …          awk / gawk
    Any single character     That character                  awk
    The empty string ("")    Runs of two or more newlines    awk
    A regexp                 Text that matches the regexp    gawk
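
As a quick illustration of the three rows, here is a rough sketch that just counts records for each kind of RS. It assumes gawk is installed under the name `gawk`; `count_records` is a made-up helper name, not anything from the manual.

```shell
# Hypothetical helper: report how many records gawk sees on stdin
# for the RS value passed as the first argument.
count_records() {
    gawk -v RS="$1" 'END { print NR }'
}

# Single character: "." is taken as that literal character, not a regexp.
printf 'foo.bar\n' | count_records '.'

# Empty string: records are separated by runs of blank lines.
printf 'x\ny\n\n\nz\n' | count_records ''

# Anything longer than one character: treated as a regexp (gawk).
printf 'a1b22c\n' | count_records '[0-9]+'
```

The first call is the case discussed in this thread: only the literal "." in "foo.bar" ends a record, so two records are counted rather than one per character.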
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
I was recently surprised by this behavior from gawk 4.2.0:
$ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
1 <foo
:>
I came across this because I was trying to process data 1 char at a time and
thought having RT take on 1 char at a time might be a valid approach rather than
writing a loop. I'm not looking for alternatives, just wondering about this
specific functionality.
A little investigation shows that it behaves as if I'd used RS='[.]':
$ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
1 <foo:.>
2 <bar
:>
I expected that RT would take the values f, o, o, \n and every $0 would be the
null string, analogous to what happens when you use 2 "."s:
$ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
1 <:fo>
2 <:o
>
I assume it does this for compatibility with other awks where a single char RS
is always just that literal character but that seems counter-intuitive to the
way gawk uses RS as a regexp otherwise and idk how we're supposed to set the RS
to "any single character" given this implementation whereas if RS="." was
interpreted as a normal regexp then we could use `RS="[.]"` to get a literal "."
just like we do for it in any other regexp context.
I've since discovered that I can get the behavior I want with `RS=".{1}"` or
`RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and non-intuitive.
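
For reference, a minimal sketch of the first workaround, in the same format as the examples above. It assumes gawk with interval expressions enabled (the default in recent versions, as far as I know); the `chars` function name is only for illustration.

```shell
# Hypothetical helper: make every single character a record terminator,
# so $0 is always empty and RT holds one input character per record.
# Assumes gawk (RT is a gawk extension) with interval expressions enabled.
chars() {
    gawk -v RS='.{1}' '{ printf "%d <%s:%s>\n", NR, $0, RT }'
}

printf 'foo' | chars
```

With the three-character input "foo" this should give three records, each with an empty $0 and RT set to f, o and o in turn. Because the RS value is longer than one character, gawk treats it as a regexp instead of a literal ".".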
I can't find anything in the gawk documentation that states that the above is
expected behavior. Assuming we can't update the code to treat RS="." as if "."
is a regexp metacharacter for backward compatibility, can we get a statement
saying something clear like "If RS is a single character it will be treated as a
literal character and not a regexp metacharacter" added to the documentation and
also the example of RS=".{1}" shown as a workaround for the case where the
desired regexp is "a single occurrence of any character"? I can't think of any
other regexp metacharacter that this issue would apply to.
Ed.