bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Fixed incomplete and incorrect treatment of comments and tra


From: Erik Auerswald
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Sat, 28 May 2022 21:04:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

Hi all,

On 28.05.22 13:04, Erik Auerswald wrote:
Hi all,

On 28.05.22 00:08, Tim Rice wrote:

Sorry I'm late to the party. I'm now ready to give this topic some attention.

- Handle trailing whitespace correctly:
[...]
 https://github.com/dkogan/datamash/commit/19f9bff1df89f24ccc5a957f0175ef1c32559caa

I wonder what people think the correct behavior should be for data generated like so:

```
#! /bin/bash
data=testing.txt
cat > $data << EOF
bar 5
bbb
EOF
sed -i '2s/$/   /' $data
```

That is,

bar 5
bbb
The second line has trailing spaces. At the moment, Datamash handles this in a way that is arguably correct:

I am not so sure about this.  For this example without leading
whitespace, this can be seen as correct.  But it is not the only
common way to treat trailing whitespace.

```
$ ./datamash -W transpose < ~/tmp/testing.txt
bar     bbb
5
```

I think this is rather subtle, because GNU Datamash does not create
an error but accepts the data because of the invisible trailing
whitespace.  But it would not do this with a space as field delimiter
(-t' ').

I'd say that currently GNU Datamash is inconsistent in that it treats
leading and trailing whitespace differently with -W, --whitespace.

I think this has been introduced in git commit
f51b7aa79a34e245422cd55d272072b65bc366b8
"datamash: bugfix: trailing delimiter should count as an empty field"

A description can be found in:
"Re: [Bug-datamash] Regarding 'transpose'"
https://lists.gnu.org/archive/html/bug-datamash/2016-09/msg00004.html

I think the following NEWS entry describes the change:

---
* Noteworthy changes in release 1.1.1 (2017-01-19) [stable]

** Bug fixes

  'check' command correctly counts a trailing delimiter at end of lines.

  'transpose' command correctly handles missing fields on the last line.
---

It seems as if the behavior was changed for both the default mode
and for -W/--whitespace, but the description in the mailing list
post and the problematic input data pertain only to the default
mode of GNU Datamash of using a single TAB as field delimiter.

Before that change, -W would have ignored both leading and trailing
whitespace.  Without -W, only trailing whitespace would have been
ignored.  IMHO only the default (no -W) behavior should have been
changed.

The inconsistency can be demonstrated with simple input:

$ printf -- '\t1\n1\t\n' | cat -A
^I1$
1^I$

I would expect GNU datamash to always see both lines as having the
same number of fields, but with -W it does not:

$ printf -- '\t1\n1\t\n' | ./datamash check
2 lines, 2 fields
$ printf -- '\t1\n1\t\n' | ./datamash -W check
line 1 (1 fields):
        1
line 2 (2 fields):
  1
./datamash: check failed: line 2 has 2 fields (previous line had 1)

IMHO the example with -W should have worked as follows:

$ printf -- '\t1\n1\t\n' | ./datamash -W check
2 lines, 1 field

[...eliding quite a few examples...]
IMHO leading and trailing whitespace should be handled identically.

I think the above is really important.

[...]
GNU Datamash has introduced a different way by ignoring leading
whitespace, but not trailing whitespace.  This uncommon way is
not explained in the documentation.

The GNU Datamash behavior is both uncommon and not well documented.
There are tests that check the current behavior, but without giving
a rationale for why this behavior is checked.

We should at least clearly document how -W, --whitespace is supposed
to work.

I think we should also change how it treats trailing whitespace,
including adjusting the tests to check the changed behavior.

[...]
Furthermore, if trailing whitespaces are a problem for you, they can easily be removed by sed. I'm not convinced that datamash should need to handle all aspects of cleaning up messy data.

My interpretation of why GNU datamash has a -W, --whitespace option
is that this should be useful with existing data that works for other
software that uses a sequence of whitespace characters to delimit
fields.  As I see it, ignoring both leading and trailing whitespace
is common.  Ignoring only leading, but not trailing, whitespace is
uncommon.

GNU Datamash creating an additional uncommon input format with
-W, --whitespace seems not that useful to me.

Thus I do think that -W, --whitespace should ignore trailing whitespace,
just as it already ignores leading whitespace.

Thanks,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]