bug-datamash

From: Erik Auerswald
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Sun, 29 May 2022 15:17:43 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

On 29.05.22 03:41, Tim Rice wrote:

> Awesome analysis. I hadn't realized that Datamash would snip the first field when using -W, and I agree it should be consistent.

I have found an example for ignoring leading, but not trailing,
whitespace: "join".

By default, join uses whitespace as input field delimiter and a
single space as output field delimiter.  In this mode it ignores
leading whitespace, but adds a single trailing space for trailing
whitespace in the input:

Neither leading nor trailing whitespace:

$ join <(printf '1    one\n2    two\n') \
>      <(printf '1\teins\n2\tzwei\n') \
> | cat -A
1 one eins$
2 two zwei$

Leading whitespace only:

$ join <(printf '    1    one\n2    two\n') \
>      <(printf '1\teins\n\t2\tzwei\n') \
> | cat -A
1 one eins$
2 two zwei$

Both leading and trailing whitespace:

$ join <(printf '    1    one\n2    two    \n') \
>      <(printf '1\teins\n\t2\tzwei\t\n') \
> | cat -A
1 one eins$
2 two  zwei $

So the GNU Datamash --whitespace behavior is not unprecedented.

> My personal inclination is to prefer allowing whitespace to delimit both the first and last field, i.e. stop ignoring a leading space. In my mind, this is the more intuitive interpretation of what whitespace delimiting should mean.

This is one consistent interpretation, i.e., any run of whitespace
characters separates two fields.  But it makes leading and trailing
whitespace special, because they can introduce empty fields, while
all other whitespace cannot do so.
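
A minimal sketch of this interpretation, assuming a POSIX awk is available. Here awk with a regular-expression field separator stands in for a generic splitter (it is not Datamash itself), and the helper name count_fields_strict is hypothetical:

```shell
# Hypothetical helper: count fields when every run of blanks is a separator.
# A regular-expression FS turns off awk's default trimming of leading and
# trailing blanks, so whitespace at the line edges yields empty fields.
count_fields_strict() {
    printf '%s\n' "$1" | awk -F '[ \t]+' '{ print NF }'
}

count_fields_strict 'a b'       # prints 2
count_fields_strict '  a b'     # prints 3: empty field before "a"
count_fields_strict '  a b  '   # prints 4: empty fields at both ends
```

Whitespace between "a" and "b" never changes the count, however long the run is; only whitespace at the edges does.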

> On the other hand, it sounds like compatibility with other tools means both should be ignored. And I am okay with that too.

IMHO this compatibility with other tools is important.

> Anyone who does need empty fields with whitespace delimiters is probably already using a convention like "NA" or "-" instead of just leaving the field blank.

That seems likely, because that is the only way to describe an empty
field in the middle of a data line with whitespace as delimiter.
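
To illustrate why a placeholder is needed, here is a small sketch using awk's default whitespace splitting as a stand-in for any whitespace-delimited reader. The helper name count_fields is hypothetical:

```shell
# Hypothetical helper: count fields with awk's default whitespace splitting.
count_fields() {
    printf '%s\n' "$1" | awk '{ print NF }'
}

count_fields '1 NA 3'   # prints 3: the placeholder keeps the middle column
count_fields '1  3'     # prints 2: a blank middle field simply vanishes
```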

> I wonder if an extra flag (--ignore-terminal-space or so) to toggle the behavior might be justified. Probably not.

> Given that the behavior has been in Datamash since 2016/2017, I also wonder if we should just leave it alone, to avoid breaking someone's scripts.

I do not like changing long-standing behavior, because someone
might rely on it.  But I do not like using a common description
like "whitespace delimited" for a rather uncommon implementation
either.

A new option as an alternative to -W, --whitespace, e.g., -B,
--blanks, could be added that ignores both leading and trailing
whitespace when determining data fields.

That is not ideal either, because some current users might prefer
the new behavior with the new option, but may not have learned that
it was introduced.  New users of one of the options would need to
decide which one to use.

> Another option: Datamash could try to detect when the odd behavior is being used, and print a deprecation warning? Then wait until v2.1 before changing it.

That is a possibility.  I think this often works less well than
intended, e.g., when users only get the new behavior after upgrading
to a new distribution release, they might skip the deprecation
warning entirely.

> The lowest hanging fruit here is to make sure the documentation describes the current behavior. I don't think I'll get it done today, but I've added it to my todo list.

I have just added a sentence to the texinfo documentation.

If we change the behavior, we need to change the documentation
in addition to the code and tests.  I think this makes everything
more explicit and thus simpler to reason about.

We could add a new option to ignore both leading and trailing
whitespace for whitespace delimited fields, and add examples with
similar behavior to the documentation, e.g., -W is similar to join
and -B is similar to AWK.
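
For comparison with the join transcripts earlier in this message, a minimal sketch of the AWK behavior referred to here, assuming a POSIX awk. With the default field separator, awk ignores both leading and trailing blanks. The helper name show_fields is hypothetical:

```shell
# Hypothetical helper: print each field awk sees, wrapped in brackets.
show_fields() {
    printf '%s\n' "$1" |
        awk '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

show_fields '2    two'           # prints [2][two]
show_fields '    2    two    '   # prints [2][two] as well
```

Both lines yield the same two fields, so no empty field appears at either end.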

Br,
Erik


