bug-datamash

Re: Numerical field labels and --header-in


From: Shawn Wagner
Subject: Re: Numerical field labels and --header-in
Date: Thu, 19 May 2022 22:52:57 -0700

Comparing a few different tools when the column names in a header are numeric:

$ cat foo.csv                    
2,1,3
a,b,c
$ datamash -H -t, cut 1 < foo.csv
cut(2)
a
$ mlr --csv cut -f 1 foo.csv
1
b
$ csvcut -c 1 foo.csv 
2
a

It's too bad Miller and csvkit don't agree on how to interpret the field. I'm a little concerned that changing the current behavior to favor name over position will break backwards compatibility. Maybe hold off on switching the behavior until a 2.0 release?

Maybe we can come up with a notation to say "Even if this is an integer treat it as a string that names a column instead of a column number" for fields that can be used in ambiguous situations like this.
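As an illustration only, one way such a notation could be parsed is sketched below. The leading "=" prefix and the function name parse_field_ref are hypothetical, chosen for this example; nothing like this exists in datamash today.

```python
def parse_field_ref(ref):
    """Classify a field reference as ("index", n) or ("name", s).

    ASSUMPTION: a leading "=" is a made-up escape meaning "treat the rest
    as a column name, even if it looks like an integer". It is not an
    existing datamash feature.
    """
    if ref.startswith("="):
        # Explicit escape: always a name, never a position.
        return ("name", ref[1:])
    if ref.isdecimal():
        # Bare integer: keep the backwards-compatible positional meaning.
        return ("index", int(ref))
    # Anything else can only be a name.
    return ("name", ref)


for ref in ["1", "=1", "price"]:
    print(ref, "->", parse_field_ref(ref))
```

With a scheme like this, a bare integer keeps its current positional meaning, so existing scripts would not change behavior, and only users with numeric header names would need the escape.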




On Thu, May 19, 2022 at 8:46 PM Dima Kogan <datamash@dima.secretsauce.net> wrote:
I'm patching stuff and writing emails about things I find while adding
vnlog support. Here's another.

As we know, 'datamash --header-in' will read header names from the first
record, and will accept these names in references. As I just found out
(and as I'm guessing most people reading this don't know), these named
references are optional, and the numerical field indices still work. Not
only that, the numerical field indices have precedence. So if you have
this data:

  0    1   2
  1.1 2.2 3.3

Then 'datamash --header-in sum 1' returns 1.1 and NOT 2.2. This sucks.
If header names are available, those should be the only way to reference
fields.
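
The difference between the two lookup orders can be sketched in a few lines of Python. This is a toy model of the behavior described above, not datamash's actual code; the function resolve_field and its names_first flag are invented for the example.

```python
def resolve_field(ref, header, names_first):
    """Return the 0-based column selected by `ref`.

    header      -- column names read from the first record
    names_first -- False: a numeric index wins (the behavior described above)
                   True:  a header name wins (the behavior proposed above)
    This is an illustrative model, not the datamash implementation.
    """
    by_name = header.index(ref) if ref in header else None
    by_index = None
    if ref.isdecimal() and 1 <= int(ref) <= len(header):
        by_index = int(ref) - 1  # field numbers are 1-based

    candidates = (by_name, by_index) if names_first else (by_index, by_name)
    for col in candidates:
        if col is not None:
            return col
    raise KeyError(f"no such field: {ref!r}")


header = ["0", "1", "2"]
row = [1.1, 2.2, 3.3]

print(row[resolve_field("1", header, names_first=False)])  # 1.1
print(row[resolve_field("1", header, names_first=True)])   # 2.2
```

In this sketch the name-first variant still falls back to the position when no header matches; making header names the only way to reference fields would mean dropping that fallback.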

If somebody's thinking that the above example is an error-prone way to
label fields, then I don't disagree, but people do it. I've actually
seen vnlog users do this more than once. And there are more legitimate
use cases where you could have an integer field label, anyway.

There's a fix in my tree:

  https://github.com/dkogan/datamash/commit/76080a51f2dda27734d32fbb6aae5b85f1530c5b

This isn't complete because it doesn't touch the tests, and there are
currently a lot of them that assume current behavior. What do we want to
do?

For vnlog support, the logic in that patch is a requirement, but the
patch can be adjusted to apply to vnlog only, and that won't break
anybody's existing usage.

