[bug-gawk] Unexpected results with RS="."
From: Ed Morton
Subject: [bug-gawk] Unexpected results with RS="."
Date: Sun, 10 Jun 2018 11:28:12 -0500
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0
I was recently surprised by this behavior from gawk 4.2.0:
$ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
1 <foo
:>
I came across this because I was trying to process data 1 char at a time and
thought setting RS so that RT holds 1 char at a time might be a valid approach
rather than writing a loop. I'm not looking for alternatives, just wondering
about this specific functionality.
A little investigation shows that it behaves as if I'd used RS='[.]':
$ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
1 <foo:.>
2 <bar
:>
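The equivalence with RS='[.]' noted above can be checked directly. This is a minimal sketch, assuming gawk is installed under the name `gawk`; `split_on` is a hypothetical helper name introduced only for this illustration, not anything from gawk itself.

```shell
# Hypothetical helper: split a fixed sample input using the given RS
# and show each record together with the text that terminated it (RT).
split_on() {
    printf 'foo.bar\n' | gawk -v RS="$1" '{ print NR, "<" $0 ":" RT ">" }'
}

# If a single-character RS is taken literally, both calls should agree.
if [ "$(split_on '.')" = "$(split_on '[.]')" ]; then
    echo "RS='.' and RS='[.]' split the input identically"
else
    echo "RS='.' and RS='[.]' differ"
fi
```

RT is a gawk extension, so this comparison is only meaningful with gawk.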
I expected that RT would take the values f, o, o, \n and every $0 would be the
null string, analogous to what happens when you use 2 "."s:
$ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
1 <:fo>
2 <:o
>
I assume it does this for compatibility with other awks, where a single-char RS
is always just that literal character. But that seems counter-intuitive given
the way gawk otherwise uses RS as a regexp, and I don't know how we're supposed
to set RS to "any single character" given this implementation. If RS="." were
interpreted as a normal regexp, we could use `RS="[.]"` to get a literal ".",
just as we do in any other regexp context.
I've since discovered that I can get the behavior I want with `RS=".{1}"` or
`RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and non-intuitive.
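As an illustration of the `RS=".{1}"` workaround, here is a minimal sketch, assuming a gawk recent enough to accept interval expressions by default (older releases may need `--re-interval`). `chars` is a hypothetical helper name used only for this example.

```shell
# Hypothetical helper: report each character of stdin via RT.
# With RS='.{1}' the RS is two or more characters long, so gawk treats it
# as a regexp; every record ($0) is empty and RT holds the matched character.
chars() {
    gawk -v RS='.{1}' '{ printf "%d <%s>\n", NR, RT }'
}

# printf is used instead of echo so that no trailing newline is fed in.
printf 'foo' | chars
```

With a trailing newline in the input (as `echo` would produce), the newline would be matched as one more RT value, in the same way it ends up in RT in the RS='..' example.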
I can't find anything in the gawk documentation that states that the above is
expected behavior. Assuming that, for backward compatibility, we can't change
the code to treat the "." in RS="." as a regexp metacharacter, can we get a
clear statement added to the documentation, something like "If RS is a single
character it will be treated as a literal character and not a regexp
metacharacter"? It would also help to show RS=".{1}" as a workaround for the
case where the desired regexp is "a single occurrence of any character". I
can't think of any other regexp metacharacter that this issue would apply to.
Ed.