Re: article about gawk best practices in data science and feature proposal

bug-gawk

From:	Jean-Philippe Guérard
Subject:	Re: article about gawk best practices in data science and feature proposal
Date:	Thu, 11 Feb 2021 21:54:42 +0100

> Ivan Molineris <ivan.molineris@gmail.com> wrote:
> > Moreover, one of the biggest drawbacks of gawk in our field is that
> > referring to input columns by number often produces hard-to-read
> > scripts.
> > For this reason, the wrapper I commonly use makes it possible to
> > refer to columns not only by number, but also by name.
> >
> > For example, if a file is composed like this:
> >
> > chromosome     start        end
> >       chr1       241      53521
> >       chr1       363      43623
> >       chr2      5243     234562
> >
> > gawk '{l=$3-$2}'
> > can also be written as
> > gawk '{l=$end-$start}'

This can be done by rewriting the command line to add one variable
assignment per header (taken from the first line of each file). The
following small library does just that:

------------- process_headers.awk -------------
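@namespace "process_args"
# Rewrite ARGV so that each input file is preceded by name=column
# assignments taken from that file's first (header) line.
BEGIN {
  n = split("",newargs)              # start with an empty argument list
  for(i=1;i<ARGC;i++){
    file = ARGV[i]
    # Arguments of the form var=value are assignments, not files
    if(file!~/^[A-Za-z_][A-Za-z0-9_]*=/){
      l = split("",headers)
      if((getline line < file)>0){   # read only the header line
        l = split(line,headers)
      }
      close(file)
      for(j=1;j<=l;j++){
        # Skip headers that are not valid awk variable names
        if(headers[j]~/^[A-Za-z_][A-Za-z0-9_]*$/){
          n++
          newargs[n]= headers[j] "=" j
        }
      }
    }
    n++
    newargs[n]=file                  # keep the original argument
  }
  ARGC=length(newargs)+1
  for(i=1;i<ARGC;i++){
    ARGV[i] = newargs[i]
  }
}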
--------------------------------------------------

Then, the needed variables would be automatically added to the command
line:

gawk -i process_headers.awk 'FNR > 1 { print $start }' test.txt test2.txt

The arguments will be rewritten from:

test.txt test2.txt

To something like:

chromosome=1 start=2 end=3 test.txt chromosome=1 start=2 end=3 test2.txt
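
The rewritten argument list consists only of ordinary awk variable-assignment
operands, so the effect can be checked by hand without the library. Below is
a minimal sketch, assuming a sample test.txt laid out like Ivan's example
(the file contents and the shell variable name "lengths" are illustrative).
Plain awk is used because nothing here is gawk-specific; gawk should behave
the same way.

```shell
# Create a sample input file matching the layout quoted above (illustrative data).
cat > test.txt <<'EOF'
chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562
EOF

# Each name=value operand is assigned just before the next file operand is
# opened, so $start means $2 and $end means $3 while test.txt is processed.
lengths=$(awk 'FNR > 1 { print $end - $start }' chromosome=1 start=2 end=3 test.txt)
printf '%s\n' "$lengths"
```

Because the assignments are repeated before every file, each file may have
its columns in a different order.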

This might be limited by the number of columns you have, which might
exceed the maximum number of arguments (I have no idea what the limit
is). So it might not scale as you need.
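
If the argument count ever becomes a concern, one alternative (not what the
library above does) is to keep the name-to-column mapping in an array built
from each file's header line, leaving ARGV untouched. Below is a rough
sketch, assuming a hypothetical array named col and the same illustrative
test.txt; the price is that columns are written $col["start"] rather than
$start.

```shell
# Sample input with a header line (illustrative data).
cat > test.txt <<'EOF'
chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562
EOF

widths=$(awk '
  # On the header line of each file, rebuild the name -> column number map.
  FNR == 1 {
    split("", col)                      # portable way to empty the array
    for (i = 1; i <= NF; i++) col[$i] = i
    next
  }
  # On data lines, look the column numbers up by name.
  { print $col["end"] - $col["start"] }
' test.txt)
printf '%s\n' "$widths"
```

This way header names do not need to be valid awk identifiers, since they
are only used as array subscripts.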

HTH.

-- 
Jean-Philippe Guérard
https://tigrerayé.org
