Arnold - thanks for responding. I don't agree that it is clear, as that section
doesn't state that the 3 possibilities are considered in that order. It sounds
like they would be mutually exclusive, but of course they aren't when it
comes to RS=".", so what happens in gawk when the single char is also a valid
regexp is ambiguous if that's the only statement about the behavior. In any
case, I didn't even look at the Summary section, as I expected to find
everything I needed related to this in the main section, 4.1 How Input Is Split
into Records (https://www.gnu.org/software/gawk/manual/gawk.html#Records).
Since a Summary should be just that, I'd have expected this particular
information in section 4.14 to be summarized from section 4.1, not additional to it.
What's stated in 4.14 is fine as a summary, but not adequate if it's the ONLY
source of info on this. It also doesn't explain how to get an RS that means "any
single character" and IMHO that is non-obvious (embarrassingly, I had to ask at
comp.lang.awk where Janis helped me wrap my head around it as I was drawing a
blank!).
I see now there's a clear statement of the related behavior for FS in section
4.5 Specifying How Fields Are Separated
(https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
    "If FS is any other single character, such as ",", then each
    occurrence of that character separates two fields. Two consecutive
    occurrences delimit an empty field. If the character occurs at the
    beginning or the end of the line, that too delimits an empty field.
    The space character is the only single character that does not follow
    these rules."
I think RS deserves the equivalent explanation in section 4.1, plus the example
of using an RS that's any char (FS doesn't need it since there's no equivalent
to RT that'd be useful in this case, and FPAT="." works as you'd expect, so
there's no use case for FS="." as a regexp).
    Ed.
On 6/11/2018 1:07 AM, address@hidden wrote:
Hi Ed.
The behavior is stated clearly, if tersely, in the summary section in the
chapter on reading input
(https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):
Input is split into records based on the value of RS. The possibilities
are as follows:
    Value of RS             Records are split on ...        awk / gawk
    Any single character    That character                  awk
    The empty string ("")   Runs of two or more newlines    awk
    A regexp                Text that matches the regexp    gawk
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
I was recently surprised by this behavior from gawk 4.2.0:
    $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo
    :>
I came across this because I was trying to process data 1 char at a time and
thought having gawk set RT to 1 char at a time might be a valid approach rather
than writing a loop. I'm not looking for alternatives, just wondering about this
specific functionality.
A little investigation shows that it behaves as if I'd used RS='[.]':
    $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo:.>
    2 <bar
    :>
I expected that RT would take the values f, o, o, \n and every $0 would be the
null string, analogous to what happens when you use 2 "."s:
    $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
    1 <:fo>
    2 <:o
    >
I assume it does this for compatibility with other awks where a single char RS
is always just that literal character but that seems counter-intuitive to the
way gawk uses RS as a regexp otherwise and idk how we're supposed to set the RS
to "any single character" given this implementation whereas if RS="." was
interpreted as a normal regexp then we could use `RS="[.]"` to get a literal "."
just like we do for it in any other regexp context.
I've since discovered that I can get the behavior I want with `RS=".{1}"` or
`RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and
non-intuitive.
I can't find anything in the gawk documentation that states that the above is
expected behavior. Assuming we can't update the code to treat RS="." as if "."
is a regexp metacharacter for backward compatibility, can we get a statement
saying something clear like "If RS is a single character it will be treated as a
literal character and not a regexp metacharacter" added to the documentation and
also the example of RS=".{1}" shown as a workaround for the case where the
desired regexp is "a single occurrence of any character"? I can't think of any
other regexp metacharacter that this issue would apply to.
    Ed.