bug-gnu-utils

Re: ^ in FS


From: Stepan Kasal
Subject: Re: ^ in FS
Date: Wed, 26 Nov 2008 13:25:40 +0100
User-agent: Mutt/1.5.18 (2008-05-17)

Hello,

On Tue, Nov 25, 2008 at 08:29:04PM +0100, Dave B wrote:
> This is actually another good example. [...]
> GNU awk 3.1.6: [...] (imho wrong, [...])
> Bell labs' original awk: [...] (imho correct)

agreed.

> However, I can't find the precise circumstances [...]

I'm afraid I did not select the best examples; thank you for
sending yours.  Asking the right question is the bigger part of the
explanation.  ;-)

> $ echo 'XXf1XXf2XXf3' | awk -v FS='^X+' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
> --><--
> -->f1XXf2XXf3<--

- FS regex is matched against "XXf1XXf2XXf3"; the result is the "XX"
  at the beginning.
- The first field ("") and delimiter ("XX") are stripped.
- FS regex is matched against the remainder ("f1XXf2XXf3"); no match,
  because ^ now anchors at the start of the remainder, which begins
  with "f".
- Hence the whole remainder becomes $2.

> $ echo 'XXf1XXf2XXf3' | gawk -v FS='^X+|k*' '{for(i=1;i<=NF;i++)print "-->"$i"<--"}'
> --><--
> -->f1<--
> -->f2<--
> -->f3<--

- FS regex is matched against "XXf1XXf2XXf3"; the result is the "XX"
  at the beginning.
- The first field ("") and delimiter ("XX") are stripped.
- FS regex is matched against the remainder ("f1XXf2XXf3"); the
  leftmost longest match is the empty string at position 0.
- But an empty delimiter is not allowed, so it is dismissed and the FS
  regex is matched against "1XXf2XXf3" (one char skipped);
  again, the match is the empty string at position 0.
- Again the empty delimiter is dismissed and the FS regex is matched
  against "XXf2XXf3" (one more char skipped); here ^ matches at the
  start of this substring, so the leftmost longest match is "XX".
- Consequently, "f1" becomes $2.
- $2 and the delimiter ("XX") get stripped.
- FS regex is matched against the remainder ("f2XXf3");
  etc.
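
The steps above can be written down as a small Python sketch.  This is
only a model of the behaviour described in this message, not gawk's
actual source: the function name split_like_gawk is made up for the
illustration, and Python's re module is not a leftmost-longest (POSIX)
matcher, although it happens to pick the same matches for the two FS
values used here.

```python
import re

def split_like_gawk(record, fs):
    """Split record by matching fs against the unconsumed remainder.

    Hypothetical model of the steps listed above, not gawk's code.
    """
    pattern = re.compile(fs)
    fields = []
    rest = record
    while True:
        skipped = 0
        match = pattern.search(rest)
        # An empty delimiter is not allowed: skip one character and
        # match again against the shorter substring.  Note that ^ can
        # match at the start of that substring.
        while (match is not None and match.end() == match.start()
               and skipped < len(rest)):
            skipped += 1
            match = pattern.search(rest[skipped:])
        if match is None or match.end() == match.start():
            # No usable delimiter left: the remainder is the last field.
            fields.append(rest)
            return fields
        # Everything before the delimiter is the next field; the field
        # and the delimiter are then stripped from the remainder.
        fields.append(rest[:skipped + match.start()])
        rest = rest[skipped + match.end():]

print(split_like_gawk("XXf1XXf2XXf3", "^X+"))     # ['', 'f1XXf2XXf3']
print(split_like_gawk("XXf1XXf2XXf3", "^X+|k*"))  # ['', 'f1', 'f2', 'f3']
```

The second call reproduces the four fields shown above: once the first
"XX" has been stripped, every later "XX" is found only because ^ is
tested against a substring that happens to start there.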

I hope this explains _what_ gawk does.

A note: the "skip one char" step is correct if the regex does not
use ^, word boundary escapes and such.  In that case, if the leftmost
longest match is empty, there cannot be a longer match from the
beginning, so we are searching for a match at the next position.

So if you are not using ^ or word boundary escapes in FS, gawk's
field splitting is correct.
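
This claim can be checked with a small Python sketch that compares two
strategies: re-matching FS against the remainder (as described above)
and searching the whole record from a moving offset, where ^ can only
match at the real start of the record.  Both function names are
invented for the illustration, Python's re is used as a stand-in for a
POSIX leftmost-longest matcher, and neither function is taken from any
awk implementation.

```python
import re

def split_on_remainder(record, fs):
    """Re-match fs against the unconsumed remainder (^ can match again)."""
    pattern, fields, rest = re.compile(fs), [], record
    while True:
        skipped, match = 0, pattern.search(rest)
        # Dismiss empty delimiters by skipping one character at a time.
        while (match is not None and match.end() == match.start()
               and skipped < len(rest)):
            skipped += 1
            match = pattern.search(rest[skipped:])
        if match is None or match.end() == match.start():
            fields.append(rest)
            return fields
        fields.append(rest[:skipped + match.start()])
        rest = rest[skipped + match.end():]

def split_in_place(record, fs):
    """Search the whole record from an offset (^ matches only at 0)."""
    pattern = re.compile(fs)
    fields, start, pos = [], 0, 0
    while pos <= len(record):
        match = pattern.search(record, pos)
        if match is None:
            break
        if match.end() == match.start():
            pos = match.start() + 1  # empty delimiter: try next position
            continue
        fields.append(record[start:match.start()])
        start = pos = match.end()
    fields.append(record[start:])
    return fields

record = "XXf1XXf2XXf3"
for fs in ("X+", "X+|k*", "^X+|k*"):
    same = split_on_remainder(record, fs) == split_in_place(record, fs)
    print(fs, "same" if same else "different")
```

In this sketch the two strategies agree for "X+" and "X+|k*" and
differ only for "^X+|k*", which is the anchored case discussed above.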

Have a nice day,
        Stepan Kasal



