bug-gawk



From: Andrew J. Schorr
Subject: Re: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 09:17:45 -0500
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

On Thu, Feb 11, 2021 at 10:53:19AM +0100, Ivan Molineris wrote:
> Moreover, one of the biggest drawbacks of gawk in our field is the fact
> that, indicating the columns of the input by numbers often produces hard to
> read scripts.
> For this reason in the wrapper I commonly use it is possible to refer to
> columns not only by number, but also by name.
> 
> For example, if a file is composed like this:
> 
> chromosome     start        end
>       chr1       241      53521
>       chr1       363      43623
>       chr2      5243     234562
> 
> gawk '{l=$2-$1}'
> can be also written as
> gawk '{l=$end-$start}'
> 
> I know that this syntax is not back-compatible, maybe can be improved.
> 
> Do you know if someone has reasoned about a feature like this one in the
> past?

Regarding this point: I often have files like this with a
header title row. I typically do something like this:

gawk '
NR == 1 {
  for (i = 1; i <= NF; i++)
    m[$i] = i
  # optional: check that all required columns are present
  next
}

{
  # to take your example
  l = $m["end"]-$m["start"]
}'

To me, this is more elegant than hardcoding

gawk -vstart=2 -vend=3 'NR > 1 {l = $end-$start}'

Regards,
Andy


