bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Fixed incomplete and incorrect treatment of comments and tra


From: Erik Auerswald
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Sat, 21 May 2022 15:12:25 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1

Hi Dima,

On 20.05.22 18:16, Erik Auerswald wrote:
On Thu, May 19, 2022 at 11:47:44PM -0700, Dima Kogan wrote:
Erik Auerswald <auerswal@unix-ag.uni-kl.de> writes:

 From a quick glance at the code diff in the link, this seems to allow
comments inside a field, e.g., with datamash -H -C -t',' and the
following input:

     # the next line is the header line
     one,two,three
     # the following line is the data line
     1,2#this is the 2nd field,3

in the data line the string "#this is the 2nd field" would be skipped,
and the data line would have three fields with values 1, 2, and 3.

Is that correct (I did not test it)? Is that the intended
functionality?

I haven't tested this yet, but that wasn't my intent. Comments should do
what they do in perl and python and awk and everywhere else, so the last
line should be interpreted as "1,2". I may have made a mistake in the
implementation.

I have just tested it, this is not the way you're code behaves.
Everything beginning with the comment sign is ignored, as intended.

I misread the diff.

Just to be clear: I also would expect this behavior, i.e., the comment
                   sign and everything until the end of the line is
                   ignored as a comment.

I'd still prefer this change to comment handling to be off by default
and only activated with a new option.  E.g., as an option that modifies
-C behavor by adding support for in-line comments, or as an option that
ignores only in-line comments (I'd prefer the former).  Modifying the -C
behavior could automatically enable -C if it was not given explicitly,
to avoid having to add two options for the intended behavior.

As an additional thought on comment handling:

Currently, GNU datamash only knows of comment lines and can
ignore those.

Comments that start at the comment character and extend to the
end of the line can be seen as a more general kind of comment,
because a "comment line" is turned into an "empty line".

But currently, GNU datamash does not ignore empty lines, neither
by default nor via option:

    $ printf '1\n2\n' | datamash sum 1
    3
    $ printf '1\n\n2\n' | datamash sum 1
    datamash: invalid input: field 1 requested, line 2 has only 0 fields

This leads to the question: "what is an empty line?"  I usually
expect any "blank" line (i.e., matching ^[[:space:]]*$) as "empty."

The combination of ignoring in-line comments with ignoring blank
lines would ignore comment lines as well.  As such a separate
option to only activate support for in-line comments seems to
be fine.

Some more examples to illustrate this:

    $ printf '1\n ; comment line\n\n2 # in-line comment\n' \
    > | ./datamash --no-strict reverse
    1
     ; comment line

    2 # in-line comment
    $
    $ printf '1\n ; comment line\n\n2 # in-line comment\n' \
    > | ./datamash -C --no-strict reverse
    1

    2 # in-line comment
    $
    $ printf '1\n ; comment line\n\n2 # in-line comment\n' \
    > | sed 's/[[:space:]]*[#;].*$//;/^[[:space:]]*$/d' \
    > | ./datamash --no-strict reverse
    1
    2

Then there is the additional question of where an in-line comment
starts: at the comment character or at the first whitespace at a
run of whitespace characters directly preceding the in-line comment?

With the default of -t$'\t' or with -t' ', preceding whitespace
probably should not be removed.

With -W, preceding whitespace should probably be removed.  If trailing
whitespace is ignored with -W, this happens automatically.

With a non-whitespace separator character, both removing and keeping
whitespace preceding a comment could be seen as valid.

But this all depends on the specifics of the data format and user
expectations.

When input data is given as CSV (as in "character separated values"
where the character is neither a space nor a tab) format, an option
to trim whitespace at the start and end of a field could be useful.

I do think that adding convenience options to GNU datamash to ease
use with human-edited input data (e.g., with additional whitespace
and comments) is sensible.

I do not like software that gratuitously changes how it interprets
input data, especially if there is no option to revert to the former
behavior.  IMHO new interpretations should be activated using new
options (-W vs. trailing whitespace is an edge case IMHO, changing
the current behavior is probably a bug fix, at least when comparing
GNU datamash with Awk or Bash).

Kind regards,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]