bug-gawk

Re: article about gawk best practices in data science and feature proposal


From: david kerns
Subject: Re: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 07:24:17 -0700

at least one of those articles suggested setting up a dictionary:

NR==1 {split($0,dict);for(d in dict)newdict[dict[d]]=d;next} # I added the gratuitous next

then using (coming back to your example)

$newdict["end"] - $newdict["start"]
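For illustration, here is a rough sketch of that dictionary approach run end to end on the sample rows from the original message. The function name col_lengths_dict and the choice to print the chromosome next to the difference are assumptions made for the example, not something the articles prescribe.

```shell
# Sketch only: col_lengths_dict is a hypothetical helper name.
col_lengths_dict() {
  awk '
    NR == 1 {
      split($0, dict)                  # dict[1]="chromosome", dict[2]="start", ...
      for (d in dict)
        newdict[dict[d]] = d           # invert: column name -> column number
      next
    }
    # Assumed output format: chromosome, then end minus start.
    { print $newdict["chromosome"], $newdict["end"] - $newdict["start"] }
  '
}

# Sample rows copied from the quoted message.
printf '%s\n' \
  'chromosome     start        end' \
  '      chr1       241      53521' \
  '      chr1       363      43623' \
  '      chr2      5243     234562' | col_lengths_dict
```

Every field reference goes through an array lookup first, which is the extra interpreter work mentioned next.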

I've done that, but to me, that becomes as unreadable as:
$3 - $2  # (fixing your example)
and way more work (for the interpreter) even though it allows for the named
columns to be in any order...
when you know your column names, just not their order, I typically do this:

NR==1   {
        for (i = 1; i <= NF; i++) {
          if ($i == "start") start = i; # I also always use ; to make my awk read better
          if ($i == "end") end = i;
          # etc
        }
        next;
}

then you can use:
$end - $start
in the rest of your code
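As a minimal sketch of the whole pattern, assuming default whitespace field splitting: the function name col_lengths_vars, the reordered sample header, and the missing-column check are additions made for the example and are not part of the suggestion above.

```shell
# Sketch only: col_lengths_vars is a hypothetical helper name.
col_lengths_vars() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "start") start = i;
        if ($i == "end")   end = i;
      }
      # Assumption: stop early if either named column is absent.
      if (!start || !end) {
        print "missing start or end column" > "/dev/stderr";
        exit 1;
      }
      next;
    }
    { print $end - $start; }
  '
}

# Columns deliberately reordered to show the lookup does not depend on position.
printf '%s\n' \
  'end        start   chromosome' \
  '53521      241     chr1' \
  '234562     5243    chr2' | col_lengths_vars
```

The header is scanned once, and after that start and end are plain numeric variables, so $end - $start costs no more than $3 - $2.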

hth

On Thu, Feb 11, 2021 at 3:27 AM <arnold@skeeve.com> wrote:

> Google searching 'using awk for data science' pulls up at least four
> interesting looking links...
>
> HTH,
>
> Arnold
>
> Ivan Molineris <ivan.molineris@gmail.com> wrote:
>
> > Hi all,
> > I start a new thread even if it is related to the "complie with mpfr
> > support" that I recently opened.
> >
> > I'm a bioinformatician and I use gawk in everyday work.
> > I, like many other data scientists, use wrapper scripts to set some
> > options by default, like -F'\t' -v OFS='\t'.
> >
> > I recently discovered that it is fundamental, in our work, to also set -M,
> > since we often work with numbers very close to 0 and we must avoid cases
> > like this:
> > $ echo 1.8e-308 | gawk '$1<0.05 {print "true"}'
> > that do not print "true" without -M
> >
> > Is there a good article about gawk best practices in data science?
> >
> > I would like to propose to the community a simple wrapper script that
> > implements such good practices, including e.g. the setting of -M, -F'\t',
> > -v OFS='\t'.
> >
> > Moreover, one of the biggest drawbacks of gawk in our field is the fact
> > that indicating the columns of the input by numbers often produces
> > hard-to-read scripts.
> > For this reason in the wrapper I commonly use it is possible to refer to
> > columns not only by number, but also by name.
> >
> > For example, if a file is composed like this:
> >
> > chromosome     start        end
> >       chr1       241      53521
> >       chr1       363      43623
> >       chr2      5243     234562
> >
> > gawk '{l=$2-$1}'
> > can be also written as
> > gawk '{l=$end-$start}'
> >
> > I know that this syntax is not back-compatible, maybe can be improved.
> >
> > Do you know if someone has reasoned about a feature like this one in the
> > past?
> >
> > Best regards
>
>

