bug-datamash

From: Erik Auerswald
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Sat, 28 May 2022 13:04:56 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

Hi all,

On 28.05.22 00:08, Tim Rice wrote:

Sorry I'm late to the party. I'm now ready to give this topic some attention.

- Handle trailing whitespace correctly:
[...]
 https://github.com/dkogan/datamash/commit/19f9bff1df89f24ccc5a957f0175ef1c32559caa

I wonder what people think the correct behavior should be for data generated like so:

```
#! /bin/bash
data=testing.txt
cat > $data << EOF
bar 5
bbb
EOF
sed -i '2s/$/   /' $data
```

That is,

bar 5
bbb

The second line has trailing spaces. At the moment, Datamash handles this in a way that is arguably correct:

I am not so sure about this.  For this example without leading
whitespace, this can be seen as correct.  But it is not the only
common way to treat trailing whitespace.

```
$ ./datamash -W transpose < ~/tmp/testing.txt
bar     bbb
5
```

I think this is rather subtle: GNU Datamash does not report an
error, but accepts the data because of the invisible trailing
whitespace.  It would not do so with a space as field delimiter
(-t' ').

I'd say that currently GNU Datamash is inconsistent in that it treats
leading and trailing whitespace differently with -W, --whitespace.

With the default character-separated format, every delimiter has both
a preceding and a following field:

$ printf '\t1\t2\t\n' | ./datamash --output-delimiter=, reverse
,2,1,

Both leading and trailing separators add empty fields.

But with -W, leading whitespace is ignored:

$ echo '  1' | ./datamash -W --output-delimiter=, reverse
1

With -W, only trailing whitespace adds an empty field:

$ echo '1  ' | ./datamash -W --output-delimiter=, reverse
,1
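
The three observations above can be condensed into a small Python model.
This is only a sketch of the behaviour seen in these examples, not GNU
Datamash's actual parser.  The function names split_on_tab and
split_w_observed are made up for illustration, and only space and tab
are treated as whitespace here.

```python
import re

def split_on_tab(line):
    """Character-separated model: every tab has a field on both sides."""
    return line.split("\t")

def split_w_observed(line):
    """Model of the -W behaviour shown above (an assumption drawn from
    the examples): leading whitespace is dropped, but a trailing run of
    whitespace still yields an empty last field."""
    return re.split(r"[ \t]+", line.lstrip(" \t"))

print(split_on_tab("\t1\t2\t"))  # ['', '1', '2', '']
print(split_w_observed("  1"))   # ['1']
print(split_w_observed("1  "))   # ['1', '']
```

Reversing each list and joining it with "," gives the same output as the
three command lines above.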

This is different for Awk.  In the following example, Awk always
ignores both leading and trailing whitespace and sees two fields:

$ printf '1.1   1.2\n  2.1   2.2\n3.1  3.2  \n  4.1   4.2  \n' \
> | cat -A
1.1   1.2$
  2.1   2.2$
3.1  3.2  $
  4.1   4.2  $
$ printf '1.1   1.2\n  2.1   2.2\n3.1  3.2  \n  4.1   4.2  \n' \
> | awk '{print "line " NR ": " NF " field(s)"}'
line 1: 2 field(s)
line 2: 2 field(s)
line 3: 2 field(s)
line 4: 2 field(s)

This is similar when using the "read" builtin in Bash.

When the whole line is read into a single variable, both leading
and trailing whitespace are omitted:

$ printf '1.1   1.2\n  2.1   2.2\n3.1  3.2  \n  4.1   4.2  \n' \
> | while read -r THE_LINE; do printf -- '"%s"\n' "$THE_LINE"; done
"1.1   1.2"
"2.1   2.2"
"3.1  3.2"
"4.1   4.2"

When reading the two fields into separate variables, all whitespace
is omitted:

$ printf '1.1   1.2\n  2.1   2.2\n3.1  3.2  \n  4.1   4.2  \n' \
> | while read -r A B; do printf -- '"%s" "%s"\n' "$A" "$B"; done
"1.1" "1.2"
"2.1" "2.2"
"3.1" "3.2"
"4.1" "4.2"

When Bash parses a command line, it ignores both leading and
trailing whitespace, too:

$ export L='   1    2    3    '
$ num_args() { echo "got $# arguments"; }
$ printf '"%s"\n' "num_args $L"
"num_args    1    2    3    "
$ eval "num_args $L"
got 3 arguments
$ num_args $L
got 3 arguments

Python similarly ignores both leading and trailing whitespace:

$ python3 -c 'print("  1  2  3  ".split())'
['1', '2', '3']
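
For contrast, here is a short sketch of the two conventions within Python
itself.  Calling split() without an argument splits on runs of
whitespace, much like Awk's default field splitting, while an explicit
single-space separator is analogous to a character delimiter, where
every space has a field on both sides.  The variable names are only
illustrative.

```python
line = "  1  2  3  "

# No separator: runs of whitespace delimit fields, and leading and
# trailing whitespace is ignored.
fields_whitespace = line.split()

# Explicit separator: every single space delimits, so leading, trailing,
# and repeated spaces all produce empty fields.
fields_single_space = line.split(" ")

print(fields_whitespace)    # ['1', '2', '3']
print(fields_single_space)  # ['', '', '1', '', '2', '', '3', '', '']
```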

IMHO leading and trailing whitespace should be handled identically.
Either both delimit fields, i.e., leading whitespace indicates
an empty field at the start of the line and trailing whitespace
indicates an empty field at the end of the line, or neither
indicates such an empty field.

Keeping both empty fields is similar to using "tr -s '[:space:]' '\t'"
to prepare input for GNU Datamash:

$ echo '   1   2    ' | tr -s '[:space:]' '\t' \
> | ./datamash --output-delimiter=, reverse
,2,1,

Omitting both empty fields is similar to how Awk and Bash parse a
line containing whitespace-separated fields.
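
The two consistent options can be sketched in Python as follows.  This
is an illustration under assumptions, not a proposed patch: the function
names are hypothetical, only space and tab count as whitespace in
split_keep_both, and split_drop_both relies on str.split(), which also
treats other whitespace characters as separators.

```python
import re

def split_keep_both(line):
    """Squeeze each run of spaces/tabs to one tab, then split on tab.
    Leading and trailing whitespace each yield an empty field."""
    return re.sub(r"[ \t]+", "\t", line).split("\t")

def split_drop_both(line):
    """Awk-like: leading and trailing whitespace yield no fields."""
    return line.split()

samples = ["1 2", "  1 2", "1 2  ", "  1 2  "]
for sample in samples:
    print(repr(sample), split_keep_both(sample), split_drop_both(sample))
```

For the last sample, '  1 2  ', the first function returns four fields
(two of them empty) and the second returns two.  In both cases leading
and trailing whitespace are treated the same way.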

GNU Datamash has introduced a different way by ignoring leading
whitespace, but not trailing whitespace.  This uncommon way is
not explained in the documentation.

[BTW the whitespace handling of "sort" is different from any of
 the above.  I do not see it as the inspiration for -W, --whitespace
 in GNU Datamash either.]

> Furthermore, if trailing whitespaces are a problem for you, they can easily be removed by sed. I'm not convinced that datamash should need to handle all aspects of cleaning up messy data.

My interpretation of why GNU datamash has a -W, --whitespace option
is that this should be useful with existing data that works for other
software that uses a sequence of whitespace characters to delimit
fields.  As I see it, ignoring both leading and trailing whitespace
is common.  Ignoring only leading, but not trailing, whitespace is
uncommon.

An additional, uncommon input format created by GNU Datamash's
-W, --whitespace does not seem that useful to me.

Thus I do think that -W, --whitespace should ignore trailing whitespace,
just as it already ignores leading whitespace.

Kind regards,
Erik


