bug-datamash

From: Erik Auerswald
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Tue, 17 May 2022 09:04:53 +0200

Hi Dima,

On Mon, May 16, 2022 at 09:25:07AM -0700, Dima Kogan wrote:
> Erik Auerswald <auerswal@unix-ag.uni-kl.de> writes:
> > On Sun, May 15, 2022 at 06:06:21PM -0700, Dima Kogan wrote:
> >
> >> Addresses two related issues:
> >> 
> >> - Comments that didn't block out a whole line weren't being properly
> >>   ignored by -C. Lines such as 'bar 5#xxx' didn't ignore the '#xxx' as
> >>   they were supposed to
> >
> > I think that would be a new feature.  The --help output states:
> >
> >     -C, --skip-comments       skip comment lines (starting with '#' or ';'
> >                                 and optional whitespace)
> >
> > As far as I understand the documentation, the -C, --skip-comments option
> > was intended to skip complete lines.
> 
> Huh. The docs do indeed describe the observed behavior. But this
> behavior isn't how comments work anywhere else, and breaks everybody's
> expectations of how comments should be interpreted.

I beg to differ.  It is not obvious to me that simple tabular data has
any notion of comments inside a data row, whether running to the end of
the data line or sitting inside a data field.  Comment lines, i.e.,
complete lines that do not contain any data, are a simple extension to
simple tabular data.

> I think we should take the patch AND we should update the docs.

I'd suggest that a new option should be used to activate such extended
comment support.

> > Treating any ';' in a line as starting a comment would interfere with
> > using ';' as field separator.  But using ';' as field separator is common
> > with simple CSV-like formats when the locale's decimal separator is a ','.
> 
> Using the comment character as the field separator shouldn't work. Does
> anybody expect it to?

It does not seem likely that people using a semicolon separated values
format would expect the semicolon to act as a comment character.

They might think that '#' acts as a comment character which could be
used to add comment lines to the data.  Thus I think it could be useful
to specify which character(s) shall be interpreted as starting a comment.

The following currently works:

    $ printf -- '# shell comment\n1;2;3\n; lisp comment\n4;5;6\n7;8;9\n'
    # shell comment
    1;2;3
    ; lisp comment
    4;5;6
    7;8;9
    $ printf -- '# shell comment\n1;2;3\n; lisp comment\n4;5;6\n7;8;9\n' | datamash -C -t\; sum 1-3
    12;15;18

To illustrate why using a semicolon separated value data format can
be useful:

    $ echo $LC_NUMERIC
    de_DE.UTF-8
    $ printf -- '# shell comment\nfirst;second;third\n1,01;2,02;3,03\n; lisp comment\n4,04;5,05;6,06\n7,07;8,08;9,09\n' | datamash -C -t\; -H sum 1-3
    sum(first);sum(second);sum(third)
    12,12;15,15;18,18

Extending -C, --skip-comments to interpret both '#' and ';' inside a
data line as starting a comment would break the above use cases.
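
To sketch the effect outside datamash: the sed expression below is only
a hypothetical stand-in for such inline comment handling (it is not what
the patch implements, and strip_inline_comments is a name made up for
this illustration).  It deletes everything from the first '#' or ';' to
the end of each line.

```shell
# Hypothetical stand-in for inline comment stripping, NOT datamash code:
# remove everything from the first '#' or ';' up to the end of the line.
strip_inline_comments() {
    sed 's/[#;].*//'
}

# Semicolon separated rows lose every field after the first one.
printf -- '1;2;3\n4;5;6\n' | strip_inline_comments
```

Only the first field of each row survives, so nothing is left for -t\;
to split.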

The -C -t';' combination does not work if any data line starts with an
empty field, because such a line begins with ';' and is therefore skipped
as a comment line.  A new option to set only '#' as comment character
would make that work.
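
Until such an option exists, one possible workaround is to filter the
input before datamash sees it and to leave out -C.  This is a minimal
sketch under that assumption; strip_hash_comments is a hypothetical
helper name, not a datamash feature.

```shell
# Hypothetical helper, NOT part of datamash: drop lines whose first
# non-blank character is '#', and pass every other line through unchanged.
strip_hash_comments() {
    grep -v '^[[:space:]]*#'
}

# The row ';2;3' starts with an empty field and is kept as data.
printf -- '# shell comment\n;2;3\n4;5;6\n' | strip_hash_comments
```

The filtered rows can then be piped into, e.g., datamash -t\; sum 2-3
without -C, so a leading ';' is treated as an empty first field rather
than as the start of a comment line.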

Kind regards,
Erik


