From: Jean-Philippe Guérard
Subject: Re: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 21:54:42 +0100
> Ivan Molineris <ivan.molineris@gmail.com> wrote:
> > Moreover, one of the biggest drawbacks of gawk in our field is that
> > referring to input columns by number often produces hard-to-read
> > scripts.
> > For this reason, the wrapper I commonly use makes it possible to
> > refer to columns not only by number, but also by name.
> >
> > For example, if a file is composed like this:
> >
> > chromosome start end
> > chr1 241 53521
> > chr1 363 43623
> > chr2 5243 234562
> >
> > gawk '{l=$3-$2}'
> > can also be written as
> > gawk '{l=$end-$start}'
This could be done by rewriting the command line to add the header
names as variable assignments (read from the first line of each file).
The following small library does just that:
------------- process_headers.awk -------------
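@namespace "process_args"

# Before any input is read, insert a name=index assignment for every
# header field in front of each file operand.
BEGIN {
    n = split("", newargs)              # start from an empty array
    for (i = 1; i < ARGC; i++) {
        file = ARGV[i]
        if (file !~ /=/) {              # operands with "=" pass through as-is
            l = split("", headers)
            if ((getline line < file) > 0) {
                l = split(line, headers)    # header names from the first line
            }
            close(file)
            for (j = 1; j <= l; j++) {
                n++
                newargs[n] = headers[j] "=" j
            }
        }
        n++
        newargs[n] = file
    }
    # Replace the original operands with the extended list.
    ARGC = length(newargs) + 1
    for (i = 1; i < ARGC; i++) {
        ARGV[i] = newargs[i]
    }
}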
--------------------------------------------------
Then, the needed variables would be automatically added to the command
line:

    gawk -i process_headers.awk 'FNR > 1 { print $start }' test.txt test2.txt

The arguments will be rewritten from:

    test.txt test2.txt

To something like:

    chromosome=1 start=2 end=3 test.txt chromosome=1 start=2 end=3 test2.txt

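
The rewritten argument list works because awk handles a var=value operand at the point where it would otherwise open that operand as a file, so each file is read with the assignments placed just before it. As a minimal sketch of that mechanism, the generated operands can be typed by hand without the library. The file names a.txt and b.txt, and the swapped column order in the second file, are made up for illustration:

```shell
# Two hypothetical files holding the same columns in a different order.
printf '%s\n' 'chromosome start end' 'chr1 241 53521'   > a.txt
printf '%s\n' 'chromosome end start' 'chr2 234562 5243' > b.txt

# The kind of operand list the library would generate, written by hand:
# each file is preceded by the name=index assignments for its own header.
awk 'FNR > 1 { print $end - $start }' \
    chromosome=1 start=2 end=3 a.txt \
    chromosome=1 end=2 start=3 b.txt
```

This prints 53280 and then 229319: the same script reads both files correctly because start and end are reassigned before b.txt is opened.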
This might be limited by the number of columns you have, which might
overrun the maximum number of arguments (I have no idea what the limit
is). So it might not scale as you need.
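
If the argument count does turn out to be a problem, one alternative is to build the name-to-index map inside the script from the first line of each file. This is a different technique from the library above, not a drop-in replacement: scripts then write $col["start"] rather than $start. The helper name col_lengths and the array name col are arbitrary choices for this sketch:

```shell
# col_lengths: hypothetical helper that prints end - start for each data row.
col_lengths() {
    awk '
        FNR == 1 {
            split("", col)              # forget the previous file header
            for (i = 1; i <= NF; i++)
                col[$i] = i             # map column name -> field index
            next                        # the header line is not data
        }
        { print $col["end"] - $col["start"] }
    ' "$@"
}

# Sample file with the same layout as the example quoted earlier.
printf '%s\n' 'chromosome start end' 'chr1 241 53521' \
    'chr1 363 43623' 'chr2 5243 234562' > test.txt

col_lengths test.txt
```

For the sample file this prints 53280, 43260 and 229319. Nothing is added to the command line, so the number of columns no longer matters, and header names that are not valid awk variable names still work as array keys.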
HTH.
--
Jean-Philippe Guérard
https://tigrerayé.org