bug-gawk
Re: feature: expose POSITION of parsed columns (fields) as a variable/fu


From: david kerns
Subject: Re: feature: expose POSITION of parsed columns (fields) as a variable/function?
Date: Thu, 13 May 2021 06:54:52 -0700

You're thinking about awk internals incorrectly... consider the following:

echo "    1     2 3" | awk '{
s=$0; # first save the input record; by now awk has already parsed
      # the record into numbered fields, discarding the repeated whitespace
      # delimiters (unless FS was previously set explicitly)
print $0; # show the record as is
$2="z"; # ANY TIME $<number> appears on the LHS of an assignment,
        # awk internally rebuilds $0 using OFS
print $0; # show $0 reconstructed with OFS
$0=s; # restore $0, which also causes awk to internally reparse the record
      # into numbered fields, again using FS
print $0; # show restored line
}'
    1     2 3
1 z 3
    1     2 3

$2 is not a pointer into the record $0; it is a standalone object.

These properties are all fundamental to awk, and they are why its
performance on very large data sets keeps the language thriving some 40+
years after it was written.
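
A minimal sketch of the same point from the other direction, assuming any POSIX awk: assigning to a field rebuilds $0 with OFS (set to "-" here only to make the rebuild visible), while editing $0 itself with sub() leaves the spacing alone and makes awk reparse the fields.

```shell
# Assigning to $2 rebuilds $0 from the fields, joined with OFS.
echo "    1     2 3" | awk -v OFS="-" '{ $2 = "z"; print }'
# prints: 1-z-3

# sub() with no third argument edits $0 in place; the original spacing
# survives, and awk reparses the fields, so $2 reflects the change.
echo "    1     2 3" | awk '{ sub(/2/, "z"); print; print $2 }'
# prints "    1     z 3", then "z"
```

The sub() pattern here is tied to this particular input; it is not a general way to address a field by number.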


On Thu, May 13, 2021 at 6:15 AM Vla D <dubovo@gmail.com> wrote:

> Example 1: try to modify a (first) column and preserve the remaining
> columns intact - it ruins the formatting of the remaining columns by
> re-joining those via OFS:
>
> > $ echo -e " a  be   c  d  e" | awk '$1="A"'
> > A be c d e
>
> if awk would've had a function/variable telling me that the 2nd column "be"
> starts at 5th character of $0 - I could've done it easily (preserving all
> spaces/tabs/etc):
>
> > $ echo -e " a  be   c  d  e" | awk '{print "A "substr($0,FPOS[2])}BEGIN{FPOS[2]=5}'
> > A be   c  d  e
>
> Example 2: try to join data from columns 3-N by keys stored in columns 1
> and 2 without losing formatting:
>
> > $ echo -e '1 k1   g o d   i s   n o w   h e r e !\n2 k1   p e n   i s   b r o k e n     !'
> > 1 k1   g o d   i s   n o w   h e r e !
> > 2 k1   p e n   i s   b r o k e n     !
> >
> > $ echo -e '1 k1   g o d   i s   n o w   h e r e !\n2 k1   p e n   i s   b r o k e n     !' \
> > | awk '{file=$1;key=$2;$1=$2="";data=$0;O[key]=O[key]data}END{print O["k1"]}'
> >   g o d i s n o w h e r e !  p e n i s b r o k e n !
>
> if awk would've had a function/variable telling me that the 3rd column ("g"
> on 1st line or "p" on 2nd line of input) starts at 8th character:
>
> > $ echo -e '1 k1   g o d   i s   n o w   h e r e !\n2 k1   p e n   i s   b r o k e n     !' \
> > | awk '{data=substr($0,FPOS[3]);O[$2]=O[$2]" "data}END{print O["k1"]}BEGIN{FPOS[3]=8}'
> >  g o d   i s   n o w   h e r e ! p e n   i s   b r o k e n     !
>
> in order to properly "calculate" the position of that 8th character
> on-the-fly - there's quite a big computational overhead (see workarounds
> and the performance of those):
>
>
> https://unix.stackexchange.com/questions/649159/awk-how-can-i-tell-where-column-begins
>
> the odd thing is that in order to access the column $3 - the awk itself has
> to find that exact position, and it KNOWS that it's the 8th character of
> $0, it just doesn't tell us (doesn't expose this valuable information via
> any variable/function) :(
>
> any chance to have this feature? I can add more useful examples (all to
> access a portion of $0 by column-number WITHOUT ruining the formatting of
> the input)
>
> OffTopic1: if awk would've allowed to treat strings as
> pointers-to-first-char - we could've just calculated `1 + $3 - $0` (one
> plus mem address of 8th char minus mem address of 1st char) = 8, but this
> is from non-awk universe
>
> OffTopic2: if split($0,flds,FS,seps) could've been made "lazy", e.g. to
> do the actual parsing only up to a given field at the moment we use
> that field (flds[3]) - this might've added enough performance to the
> workarounds of original issue, BUT this sounds waaay more complex to
> implement than just exposing the desired value as a function....
>
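
For reference, the kind of user-level workaround the quoted message alludes to can be sketched in plain POSIX awk. The helper name fpos() is hypothetical (it is not an awk or gawk built-in), and the sketch assumes the default FS, i.e. fields separated by runs of blanks, tabs, or newlines. It rescans $0 on every call, which is exactly the overhead the request is trying to avoid.

```shell
# fpos(n): 1-based offset in $0 where field n starts, or 0 if the record
# has fewer than n fields. Hypothetical helper; default FS only.
fpos_awk='
function fpos(n,    s, pos, i) {
    s = $0; pos = 1
    for (i = 1; i <= n; i++) {
        if (match(s, /^[ \t\n]+/)) {          # skip separators before field i
            pos += RLENGTH; s = substr(s, RLENGTH + 1)
        }
        if (s == "") return 0                 # record has fewer than n fields
        if (i == n) return pos
        match(s, /^[^ \t\n]+/)                # skip field i itself
        pos += RLENGTH; s = substr(s, RLENGTH + 1)
    }
}'

# Example 1: replace the first column, keep the rest of the record untouched.
echo " a  be   c  d  e" | awk "$fpos_awk"' { print "A " substr($0, fpos(2)) }'

# Example 2: concatenate columns 3..N per key without losing the spacing.
printf '%s\n' '1 k1   g o d   i s   n o w   h e r e !' \
              '2 k1   p e n   i s   b r o k e n     !' |
awk "$fpos_awk"' { O[$2] = O[$2] " " substr($0, fpos(3)) } END { print O["k1"] }'
```

With the sample inputs from the quoted message, fpos(2) yields 5 and fpos(3) yields 8, the same offsets that were hard-coded as FPOS[2] and FPOS[3] there.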

