coreutils

Re: coreutils feature requests?


From: Assaf Gordon
Subject: Re: coreutils feature requests?
Date: Wed, 19 Jul 2017 13:53:42 -0400

Hello Lance and all,

On Jul 19, 2017, at 13:03, Lance E Sloan <sloanlance+coreutils_gnu.org@gmail.com> wrote:

I'd appreciate it if you could explain why you're opposed
to adding new features to cut (or to comm).


If I may chime in about this:

There is also a delicate balance between adding more features, which leads to bloated software, and keeping the program lean, which provides less functionality.

Sometimes, it boils down to a judgement call of the maintainers.

And then there's also the unix philosophy of having a tool "do one thing
and do it well".

It may help if I explain my point of view.


I think that what would help the most is if you can share
an actual problem that you have (i.e. the input and desired output),
and perhaps we can find a good solution using existing tools.

My considerations for a solution:

1.  I need this feature to process several files that have millions of
lines each.  I need to do this on an ongoing, periodic basis.  I can't
afford for the process to be slow.


Here's a concrete example:

===

$ time wc -l 1.txt
18833902 1.txt

real    0m0.442s
user    0m0.232s
sys     0m0.208s

$ time cut -f1,3,5 1.txt > /dev/null

real    0m2.923s
user    0m2.736s
sys     0m0.188s

$ time mawk '{print $3,$1,$5}' 1.txt > /dev/null

real    0m4.903s
user    0m4.680s
sys     0m0.224s

====

Using existing tools ('mawk' in this case) gives you all of awk's flexibility at a slight increase in cost. The example file had 18M lines, and we're still talking about just ~4 seconds of user time.


2.  Since I have a large amount of data, I'm avoiding regular expressions
and interpreted languages, which take longer to complete the job.  That
eliminates awk and several other possible solutions.  A compiled C
application would be best.


I agree that running a regex on every line is slow, but using awk just to reorder fields does not require any regex.
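For example (a minimal illustration, with hypothetical data): a pure field-reordering awk program never evaluates a user-supplied regex; it only indexes the fields that the default whitespace splitting already produced.

```shell
# Reorder whitespace-separated columns to 3,1,5 with awk.
# No regex runs per line; the program body only indexes fields.
printf 'a b c d e\nf g h i j\n' | awk '{ print $3, $1, $5 }'
# prints:
#   c a e
#   h f j
```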

4. Part of my data processing uses jq.  I've figured out how to do this
field reordering with it, but it makes my jq filter more complex and more
difficult for my successors to maintain.  As written on
https://stedolan.github.io/jq/ , "jq is like sed for JSON data".  I don't
consider sed to be a good solution for a problem of this size, so jq
probably isn't ideal, either.


(disclaimer: I'm a maintainer of GNU sed, and I've also contributed code to
jq).

This point confuses me a bit: cut, awk, and sed are all line-based tools, meaning your logical "records" are expected to fit on one line (at least, that is the most common usage of these tools).

jq is JSON-based: it does not need records to be contained in a single line at all (though it can handle that case with optional arguments).

Is your input file "one JSON record per line" ?
Or are you using 'jq' to read non-JSON input and treat it as an array?

In any case, 'jq' implements a small virtual machine to execute your script, so I'm not sure it would be the fastest tool for the job (or much faster than sed's or awk's implementation). It is certainly an "interpreted language", which you wrote above you are trying to avoid.

Similarly, regarding "I don't consider sed to be a good solution": you haven't yet told us what your actual need is, so we can't tell whether sed is a good or bad fit.
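For what it's worth, sed *can* reorder a fixed number of fields, but only through a capture-group regex, which is exactly the per-line regex cost you said you want to avoid. A sketch for three space-separated fields (illustrative data, not from your workload):

```shell
# Reorder three space-separated fields to 3,1,2 with capture groups.
# Unlike the awk version, every input line pays for a regex match.
printf 'a b c\nd e f\n' | sed -E 's/^([^ ]*) ([^ ]*) ([^ ]*)$/\3 \1 \2/'
# prints:
#   c a b
#   f d e
```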

Since a C implementation should run the fastest and cut from GNU's
coreutils is written in C and presumably doesn't need much work to support
this, it seems like the best solution.

Even if this feature suggestion isn't approved by the GNU community, I will
implement it for my own use anyway.  I can enjoy the new functionality
(which I think should have been added to cut long ago) and keep it to
myself or I can contribute it back to the online community.  I could
distribute it as my own fork of GNU coreutils or as a patch to it.
However, if it were merged into GNU's coreutils, it would get the most
exposure and be helpful to more people.


cut's implementation is optimized for cutting columns, not for reordering them. I think that if you try to add reordering of output fields to 'cut', you'll discover that, while it's very doable, it also significantly complicates the code.

A previous message on this thread stipulated that it takes
extra effort to 'sort the columns' - that is incorrect for the current
implementation.
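You can see the current behavior directly: cut treats the field list as a set of fields to select, so output order always follows input order no matter how the list is written (a minimal demonstration):

```shell
# cut emits selected fields in input order, regardless of list order:
printf 'a\tb\tc\n' | cut -f3,1    # prints "a<TAB>c", not "c<TAB>a"
printf 'a\tb\tc\n' | cut -f1,3    # same output
```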

Regardless, if you actually implement it, please do send the patch. To be considered for inclusion, it will need to be efficient (i.e. not make 'cut' slower than it is now), handle all sorts of edge cases correctly, and come with good tests covering the new functionality. Updating the manual pages and documentation is needed as well. You'll also need to assign copyright of the patch to the FSF.
You'll also need to assign copyright of the patch to the FSF.

Good places to start:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/HACKING
http://git.savannah.gnu.org/cgit/coreutils.git/tree/README-hacking

Here's an example of a patch that added a new feature to 'comm':
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=b50a151346c42816034b5c26266eb753b7dbe737
You can see the kinds of changes that accompany a new feature.




Then again, given that the canonical way to reorder columns for many decades has been:
awk '{print $9,$2,$6}'
and that this canonical way 'just works' on *any* existing POSIX system (think: every BSD, Solaris, AIX, and systems such as Alpine Linux, which use BusyBox instead of GNU coreutils), there is a very high barrier to adding such a non-standard feature.

regards,
- assaf

