From: Dragan Simic
Subject: Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one
Date: Thu, 28 Mar 2024 20:43:11 +0100

Hello,

Just checking, are there any further thoughts on this topic?

On 2023-08-16 09:25, Dragan Simic wrote:
> On 2023-08-16 08:02, Rob Landley wrote:
>> On 8/15/23 06:31, Pádraig Brady wrote:
>>> On 15/08/2023 11:22, Dragan Simic wrote:
>>>> On 2023-08-10 17:05, Dragan Simic wrote:
>>>>> On 2023-08-01 20:37, Dragan Simic wrote:
>>>>>> On 2023-08-01 16:42, Pádraig Brady wrote:
>>>>>>> On 01/08/2023 10:07, Dragan Simic wrote:
>>>>>>>> Add new command-line option and the required logic that allow
>>>>>>>> multiple consecutive delimiters to be treated as a single
>>>>>>>> delimiter. Of course, this option is valid only with the cut's
>>>>>>>> field mode.
>>>>>>>>
>>>>>>>> This new feature should make cut much more usable in various
>>>>>>>> real-world applications, some of which are already mentioned in
>>>>>>>> the gotchas. For example, merging the consecutive delimiters is
>>>>>>>> very useful when cut is used to process the outputs of various
>>>>>>>> commands.
>>>>>>>>
>>>>>>>> Add a whole battery of new cut tests, which cover this new
>>>>>>>> feature, and add more tests for the related already existing
>>>>>>>> features, to make sure no regressions are introduced.
>>>>>>>>
>>>>>>>> While there, clean up the comments and the whitespace in the cut
>>>>>>>> tests a bit, to make them slightly more readable.
>>>>>>>
>>>>>>> Thanks for the patch.
>>>>>>>
>>>>>>> I wonder whether a --empty-fields={ignore,suppress} is a more
>>>>>>> general interface.
>>>>>>
>>>>>> I wonder would it be a more complex approach, and more importantly,
>>>>>> less intuitive? Quite frankly, I think it's easier to visualize the
>>>>>> empty space, or the delimiters as a more general approach, becoming
>>>>>> "squeezed". I think that visualizing the empty fields is harder,
>>>>>> especially when the delimiter is a whitespace character.
>>>>>>
>>>>>>> This overlaps somewhat with the -w option in FreeBSD's cut, which
>>>>>>> merges runs of whitespace, and which I was also considering adding.
>>>>>>
>>>>>> After thinking a bit about it, how about having both "-m", from the
>>>>>> patch I submitted, and "-w", which would behave differently than
>>>>>> the FreeBSD's "-w"? Please, allow me to explain.
>>>>>>
>>>>>> More specifically, our "-w" would simply "squeeze" all the
>>>>>> whitespace in the input without forcing the delimiter to be
>>>>>> whitespace. The "squeezing" would produce a whitespace character in
>>>>>> the input, instead of whatever got "squeezed" there. That would be
>>>>>> either the whitespace character specified as an optional value for
>>>>>> the "-w" option, or it may by default produce a space wherever only
>>>>>> spaces were "squeezed", or a tab wherever the "squeezed" whitespace
>>>>>> contained at least one tab.
>>>>>>
>>>>>> With both "-m" and "-w" options in place we'd end up with a quite
>>>>>> versatile cut, which would cover what FreeBSD's cut does, and be
>>>>>> able to do more. I'd be willing to implement the "-w" option as
>>>>>> well.
>>>>>
>>>>> Just checking, any further thoughts on this approach?
>>>>
>>>> This feature for cut has been hoped for more than a few times, here
>>>> are a few examples:
>>>>
>>>> - https://stackoverflow.com/questions/21322968/does-cut-support-multiple-spaces-as-the-delimiter
>>>> - https://stackoverflow.com/questions/7142735/how-to-specify-more-spaces-for-the-delimiter-using-cut
>>>> - https://unix.stackexchange.com/questions/109835/how-do-i-use-cut-to-separate-by-multiple-whitespace
>>>> - https://unix.stackexchange.com/questions/606639/why-does-cut-d-not-work-with-space-in-this-case
>>>> - https://unix.stackexchange.com/questions/387544/cut-with-2-character-delimiter
>>>> - https://stackoverflow.com/questions/25447324/how-to-use-cut-with-multiple-character-delimiter-in-unix
>>>>
>>>> I'd really appreciate if we could discuss this further.
>>>
>>> Yes this functionality is definitely under consideration.
>>> The interface is the main consideration for me at present.
>>> I need to review the existing interfaces to see how best to proceed.
>>
>> Would this be instead of the -DFO stuff, or in addition to?
>
> The way I see it, "-m" would be an additional feature, not a
> replacement.
>
>> It seems to me that "delimiter can be a regex, which means it can be an
>> arbitrary string if you don't use special characters or escape them",
>> covers the use case? And the default delimiter for -F _is_ a run of
>> whitespace, because it's the common case when replacing
>> awk '{print $3,$7}'.
>
> Sure, it would cover easily the cases when delimiters are whitespace,
> but I think that having "-m" in addition would still be a good idea.
> See, not engaging the regex engine is surely good performance-wise,
> especially when running cut on large files, for the cases that don't
> really require a regex to define the delimiters, etc.
>
>> This has worked in toybox (and busybox) for years now:
>>
>>   $ echo "one two three" | toybox cut -F 2
>>   two
>>   $ echo abconedefoneghi | toybox cut -F 2 -d one
>>   def
>>   $ echo abconeonedefoneoneoneghionejkl | toybox cut -F 2,3 -d '(one)+' -O potato
>>   defpotatoghi
>>
>> Prebuilt binaries you can play with:
>> https://landley.net/bin/toybox/0.8.10/
>
> I fully support merging of the "-DFO" stuff, but I still think that
> having "-m" in addition should be the way to go.
>
>>> cheers,
>>> Pádraig
>>
>> Rob
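
For readers following the thread, the two behaviours discussed above can be pictured with a minimal Python sketch. This is only an illustration of the semantics as described in the quoted messages, not the C code from the patch. The names cut_fields and squeeze_whitespace are hypothetical, and the handling of a leading delimiter run (it yields an empty first field) is an assumption rather than something the thread states.

```python
import re


def cut_fields(line, delim, fields, merge=False):
    """Select 1-based fields from `line`, roughly as cut -d/-f would.

    With merge=True, a run of consecutive delimiters counts as one
    delimiter (the behaviour proposed for "-m"). Hypothetical helper.
    """
    if merge:
        # A run of one or more delimiters is treated as a single delimiter.
        parts = re.split("(?:" + re.escape(delim) + ")+", line)
    else:
        parts = line.split(delim)
    if len(parts) == 1:
        # No delimiter in the line: cut prints the line unchanged (without -s).
        return line
    # cut emits the selected fields in input order, joined by the delimiter.
    wanted = sorted(set(fields))
    return delim.join(parts[i - 1] for i in wanted if 0 < i <= len(parts))


def squeeze_whitespace(line):
    """Default behaviour of the proposed "-w", as described in the thread:
    a run of spaces becomes one space, and a run containing at least one
    tab becomes one tab. Hypothetical helper."""
    return re.sub(r"[ \t]+",
                  lambda m: "\t" if "\t" in m.group() else " ",
                  line)


if __name__ == "__main__":
    print(repr(cut_fields("a  b   c", " ", [2])))              # ''
    print(repr(cut_fields("a  b   c", " ", [2], merge=True)))  # 'b'
    print(repr(squeeze_whitespace("a  b \t c")))               # 'a b\tc'
```

The first call shows the gotcha that motivates the patch: without merging, the second field of "a  b   c" is the empty string between the two spaces. The merged variant still goes through Python's regex engine purely for brevity; the performance argument made in the thread applies to a C implementation that compares bytes directly.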