Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as

coreutils

From:	Dragan Simic
Subject:	Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one
Date:	Thu, 28 Mar 2024 20:43:11 +0100

Hello,

Just checking, are there any further thoughts on this topic?


On 2023-08-16 09:25, Dragan Simic wrote:

On 2023-08-16 08:02, Rob Landley wrote:
On 8/15/23 06:31, Pádraig Brady wrote:
On 15/08/2023 11:22, Dragan Simic wrote:
On 2023-08-10 17:05, Dragan Simic wrote:
On 2023-08-01 20:37, Dragan Simic wrote:
On 2023-08-01 16:42, Pádraig Brady wrote:
On 01/08/2023 10:07, Dragan Simic wrote:
Add new command-line option and the required logic that allow multiple
consecutive delimiters to be treated as a single delimiter.  Of course,
this option is valid only with the cut's field mode.

This new feature should make cut much more usable in various real-world
applications, some of which are already mentioned in the gotchas.  For
example, merging the consecutive delimiters is very useful when cut is
used to process the outputs of various commands.

Add a whole battery of new cut tests, which cover this new feature, and
add more tests for the related already existing features, to make sure
no regressions are introduced.

While there, clean up the comments and the whitespace in the cut tests
a bit, to make them slightly more readable.
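
As an illustration only (this is not taken from the patch), the behaviour the commit message describes can be approximated today with existing tools. The helper name `cut_merged` is made up for this sketch, and it assumes a single-character delimiter that `tr` takes literally:

```shell
# Hypothetical helper, not part of coreutils: squeeze runs of a
# one-character delimiter with tr -s, then cut on that delimiter.
# usage: ... | cut_merged DELIM FIELDS
cut_merged() {
  tr -s "$1" | cut -d "$1" -f "$2"
}

# Plain cut treats every extra space as an empty field,
# so field 2 here is empty:
printf 'one  two   three\n' | cut -d ' ' -f 2

# After squeezing, field 2 is the second word:
printf 'one  two   three\n' | cut_merged ' ' 2
```

Note that a leading delimiter still yields an empty first field with this workaround, since `tr -s` squeezes the run but does not remove it.
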
Thanks for the patch.
I wonder whether a --empty-fields={ignore,suppress} is a more general
interface.

I wonder would it be a more complex approach, and more importantly,
less intuitive?  Quite frankly, I think it's easier to visualize the
empty space, or the delimiters as a more general approach, becoming
"squeezed".  I think that visualizing the empty fields is harder,
especially when the delimiter is a whitespace character.

This overlaps somewhat with the -w option in FreeBSD's cut,
which merges runs of whitespace, and which I was also considering
adding.
After thinking a bit about it, how about having both "-m", from the
patch I submitted, and "-w", which would behave differently than the
FreeBSD's "-w"?  Please, allow me to explain.

More specifically, our "-w" would simply "squeeze" all the whitespace
in the input without forcing the delimiter to be whitespace.  The
"squeezing" would produce a whitespace character in the input, instead
of whatever got "squeezed" there.  That would be either the whitespace
character specified as an optional value for the "-w" option, or it
may by default produce a space wherever only spaces were "squeezed",
or a tab wherever the "squeezed" whitespace contained at least one
tab.

With both "-m" and "-w" options in place we'd end up with a quite
versatile cut, which would cover what FreeBSD's cut does, and be able
to do more.  I'd be willing to implement the "-w" option as well.
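
To make the proposed default rule concrete, here is a rough sketch of it as a `sed` filter. It only illustrates the rule as described above (a space where only spaces were squeezed, a tab where the run contained at least one tab); it is not code from any patch, `squeeze_ws` is a made-up name, and treating only spaces and tabs as whitespace is an assumption of the sketch:

```shell
# A literal tab, obtained portably (sed's \t is not POSIX).
TAB=$(printf '\t')

# Hypothetical filter: squeeze each run of spaces/tabs to one character.
squeeze_ws() {
  # 1st expression: any run containing at least one tab becomes one tab.
  # 2nd expression: the remaining runs (spaces only) become one space.
  sed -e "s/[ ${TAB}]*${TAB}[ ${TAB}]*/${TAB}/g" -e 's/  */ /g'
}

# The delimiter does not have to be whitespace; here it is ':'.
printf 'x  y:z \t w\n' | squeeze_ws | cut -d : -f 2
```
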
Just checking, any further thoughts on this approach?
This feature for cut has been hoped for more than a few times, here are
a few examples:

- https://stackoverflow.com/questions/21322968/does-cut-support-multiple-spaces-as-the-delimiter
- https://stackoverflow.com/questions/7142735/how-to-specify-more-spaces-for-the-delimiter-using-cut
- https://unix.stackexchange.com/questions/109835/how-do-i-use-cut-to-separate-by-multiple-whitespace
- https://unix.stackexchange.com/questions/606639/why-does-cut-d-not-work-with-space-in-this-case
- https://unix.stackexchange.com/questions/387544/cut-with-2-character-delimiter
- https://stackoverflow.com/questions/25447324/how-to-use-cut-with-multiple-character-delimiter-in-unix

I'd really appreciate if we could discuss this further.
Yes this functionality is definitely under consideration.
The interface is the main consideration for me at present.
I need to review the existing interfaces to see how best to proceed.
Would this be instead of the -DFO stuff, or in addition to?
The way I see it, "-m" would be an additional feature, not a replacement.
It seems to me that "delimiter can be a regex, which means it can be an
arbitrary string if you don't use special characters or escape them",
covers the use case?  And the default delimiter for -F _is_ a run of
whitespace, because it's the common case when replacing
awk '{print $3,$7}'.
Sure, it would cover easily the cases when delimiters are whitespace,
but I think that having "-m" in addition would still be a good idea.
See, not engaging the regex engine is surely good performance-wise,
especially when running cut on large files, for the cases that don't
really require a regex to define the delimiters, etc.
This has worked in toybox (and busybox) for years now:

$ echo "one  two   three" | toybox cut -F 2
two
$ echo abconedefoneghi | toybox cut -F 2 -d one
def
$ echo abconeonedefoneoneoneghionejkl | toybox cut -F 2,3 -d '(one)+' -O potato
defpotatoghi
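
For readers without toybox at hand, roughly the same results can be reproduced with awk, which is the tool the -F mode is said above to replace. This is only a comparison sketch; it assumes an awk that accepts an extended regular expression as the field separator, which POSIX awk does:

```shell
# Default awk field splitting already merges runs of blanks.
printf 'one  two   three\n' | awk '{print $2}'

# A regex field separator plus a custom output separator,
# mirroring the -d '(one)+' -O potato example.
printf 'abconeonedefoneoneoneghionejkl\n' |
  awk -F '(one)+' -v OFS=potato '{print $2, $3}'
```
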

Prebuilt binaries you can play with:

https://landley.net/bin/toybox/0.8.10/
I fully support merging of the "-DFO" stuff, but I still think that
having "-m" in addition should be the way to go.
cheers,
Pádraig
Rob

Current Thread

Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one, Dragan Simic <=
- Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one, Pádraig Brady, 2024/03/28
  - Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one, Dragan Simic, 2024/03/28
