Re: suggestion for new option: --block-break
From: Ole Tange
Subject: Re: suggestion for new option: --block-break
Date: Fri, 3 May 2019 22:10:48 +0200
On Fri, May 3, 2019 at 6:30 AM Cook, Malcolm <MEC@stowers.org> wrote:
> > From: Ole Tange <ole@tange.dk>
> > On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> > wrote:
:
> > Parsing each line into columns will be even slower. Probably similar to
> > --shard.
>
> Unfamiliar with "shard" in this context.
See --shard (available in version 20190222 and later).
> > With perl expression it could be something like:
> >
> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3
> > $_=substr($_,-2,2)'
>
> I'm not sure what that "3" is doing there - some character transliteration
> problem in our email?
3 is column 3. So $_ will contain the value in column 3. If no number
is given, then $_ is the full line.
This will make it slightly harder to distinguish between a named
column and some perl code. But I think it is OK to assume:
* --block-breaks value contains only [a-z0-9_] and --header : is set
=> Named column
* perl code otherwise
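A minimal sketch of that rule as a shell function, just to make the
assumption concrete. The name classify_block_breaks and its two arguments
are hypothetical; nothing like it exists in GNU Parallel, and it only
implements the two cases listed above, read literally.

```shell
# Hypothetical helper, not part of GNU Parallel.
#   $1 = the value given to the proposed --block-breaks option
#   $2 = "yes" if --header : is set, anything else otherwise
# Prints "column" for a named column and "perl" for perl code.
# The body runs in a subshell so that LC_ALL=C (byte-wise ranges for
# [a-z]) does not leak into the caller.
classify_block_breaks() (
  LC_ALL=C
  case $1 in
    '' | *[!a-z0-9_]*)
      # Empty, or contains a character outside [a-z0-9_]
      echo perl
      ;;
    *)
      if [ "$2" = yes ]; then
        echo column
      else
        echo perl
      fi
      ;;
  esac
)
```

With this rule a value such as '3 $_=substr($_,-2,2)' is always treated
as perl code, because it contains characters outside [a-z0-9_].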
> > You are basically asking for an option so you do not have to write:
> >
> > cat foo.tsv |
> > perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2); if($_ ne
> > $last) { print "rEcOrDsEp" } $last=$_' |
> > parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
> >
> > (with -F = --colsep, [3] = the name/number of the
> > column,$_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
> > randomly generated string that hopefully will not occur in your input).
> >
> > Is that correctly understood?
>
> I think you've got it. That is pretty much what I wound up doing.
Good.
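For readers who want to try the upstream step on its own, here is a rough
sketch of the same idea using awk instead of perl. The function name
insert_group_breaks is made up for illustration, the column number is
1-based (unlike $F[3] in the perl version), and the key is hard-coded to
the last two characters of the column, mirroring substr($_,-2,2).

```shell
# Hypothetical helper, not part of GNU Parallel.
#   $1 = 1-based number of the tab-separated column holding the key
#   $2 = separator string that is assumed not to occur in the input
# Writes the separator (without a newline) in front of the first record
# and in front of every record whose key differs from the previous one.
insert_group_breaks() {
  awk -F '\t' -v col="$1" -v sep="$2" '
    { key = substr($col, length($col) - 1) }       # last two characters
    NR == 1 || key != last { printf "%s", sep }    # start of a new group
    { print; last = key }
  '
}

# Possible use, with the same GNU Parallel options as in the quoted pipeline:
#   insert_group_breaks 4 rEcOrDsEp < foo.tsv |
#     parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
```

Like the perl version, this only keeps consecutive records with the same
key together; it does not sort the input.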
> And I appreciate your observations about performance above, but, truth be
> told, the performance hit has to be taken somewhere, either in the upstream
> perl process or interwoven with `parallel`'s logic.
That is a valid argument.
Also GNU Parallel is known for having options that are simply
activating wrapper scripts, so it is not completely new territory.
> BTW: Another possible "metaphor" that might be useful in documenting such an
> option, should you care to implement it, is that of "keeping selective
> consecutive records together that have some property in common".
Yeah, I really do not like the name --block-breaks. I like --group-by
a little better, but I am not 100% happy with that either.
So dear mailing list: Please come up with better names. A description
for the man page would also be nice.
/Ole