coreutils
Re: decorate - new sorting-helper program (experimental)


From: Assaf Gordon
Subject: Re: decorate - new sorting-helper program (experimental)
Date: Fri, 24 Apr 2020 20:54:04 -0600

Hello,

Just a quick note that the "decorate" program (explained below)
was just released as part of GNU datamash 1.7:

   https://lists.gnu.org/archive/html/info-gnu/2020-04/msg00011.html

Comments, suggestions and feedback are very welcome.



On 2020-04-13 1:14 p.m., Assaf Gordon wrote:
Hello,

I'm happy to announce the first experimental release of the "decorate"
program.

'decorate' works in tandem with coreutils' sort(1) to allow new sorting
methods (e.g. IP addresses, roman numerals, string lengths).

This is a new program but an old idea, suggested by Pádraig here:
  https://lists.gnu.org/r/bug-coreutils/2015-06/msg00076.html

---

The program is part of the "datamash" package, and available here:
   https://alpha.gnu.org/gnu/datamash/datamash-1.5.17-735b.tar.gz

"./configure && make" should give you the "decorate" executable.

The rest of this (long) email shows usage information and examples.

This is an experimental version, and everything could still change.

Comments, suggestions and feedback are *very* welcome.

regards,
  - assaf

----------------------------------------------------


#### General Usage ####

The general idea is:
1. Convert a field of the input file to a format that sort(1) can
    sort easily, e.g., convert roman numerals to their decimal
    equivalent, or IPv4 addresses to 32-bit hex values.
2. Pass the converted (=decorated) input to sort(1).
3. Remove (=undecorate) the converted fields.

Example 1:

   ### convert roman-numerals, add new field
   $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman --decorate
   000000000000000000100 C
   000000000000000000005 V
   000000000000000000003 III
   000000000000000000009 IX
   000000000000000000011 XI

   ### combine decorate-sort-undecorate
   $ printf "%s\n" C V III IX XI \
        | ./decorate -k1,1:roman --decorate \
        | sort -k1,1  \
        | ./decorate --undecorate 1
   III
   V
   IX
   XI
   C
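
The three steps above can also be sketched outside the shell. The
following Python snippet is only an illustration of the pattern, not the
actual 'decorate' code: the function names and the 21-digit padding width
are my assumptions, picked to resemble the sample output.

```python
# Illustrative sketch of decorate-sort-undecorate (not the real 'decorate' code).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
WIDTH = 21  # assumed padding width, chosen to resemble the sample output


def roman_to_int(numeral):
    """Convert a roman numeral to an integer, honoring subtractive notation."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller value placed before a larger one is subtracted (IX = 9).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def decorate(line):
    """Step 1: prepend a zero-padded, string-sortable key to the line."""
    key = line.split()[0]
    return f"{roman_to_int(key):0{WIDTH}d} {line}"


def undecorate(line):
    """Step 3: drop the first (decorated) field."""
    return line.split(" ", 1)[1]


lines = ["C", "V", "III", "IX", "XI"]
decorated = sorted(decorate(line) for line in lines)  # step 2: plain string sort
result = [undecorate(line) for line in decorated]
print(result)  # ['III', 'V', 'IX', 'XI', 'C']
```

The zero padding is what makes a plain string sort agree with numeric order.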


#### Easy/automatic 'decorate-sort-undecorate' method ####

Since the decorate-sort-undecorate pattern is repetitive, the
"decorate" program can run 'decorate + sort + undecorate'
automatically (forking and piping to sort(1) and back).

This happens when neither "--decorate" nor "--undecorate" is
specified (i.e., decorate is used as a 'sort' wrapper):

   $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman
   III
   V
   IX
   XI
   C
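
To illustrate the wrapper idea, here is a rough Python sketch that
decorates the input, runs sort(1) as a child process and undecorates its
output. It is not how 'decorate' is written: the function name is
hypothetical, it uses a string-length key for brevity, and it buffers all
input instead of streaming through pipes.

```python
import os
import subprocess


def sort_by_length(lines):
    """Decorate with the byte length, sort via sort(1), then undecorate."""
    # Decorate: zero-padded length, a space, then the original line.
    decorated = "".join(f"{len(line.encode()):021d} {line}\n" for line in lines)
    # Run sort(1) as a child process; LC_ALL=C forces plain byte order.
    proc = subprocess.run(
        ["sort", "-k1,1"],
        input=decorated,
        capture_output=True,
        text=True,
        check=True,
        env={**os.environ, "LC_ALL": "C"},
    )
    # Undecorate: drop everything up to and including the first space.
    return [line.split(" ", 1)[1] for line in proc.stdout.splitlines()]


print(sort_by_length(["ccc", "a", "bb"]))  # ['a', 'bb', 'ccc']
```

This assumes a "sort" executable is available on PATH.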


#### Conversion Syntax ####

The -k/--key specification follows sort(1), with one addition:
a conversion function name may follow the key, separated by a colon (":").

Examples:

   $ printf "MMXX III\n" | ./decorate --decorate -k1,1:roman
   000000000000000002020 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1.2,1:roman
   000000000000000001020 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1,1:strlen
   000000000000000000004 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1:strlen
   000000000000000000008 MMXX III
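
To make the four results above easier to follow, this Python sketch shows
which part of the line each key selects. It is a simplification and not
the real parser: it assumes fields separated by single spaces, ignores
sort(1)'s blank-handling rules and ordering flags, and the names KEY_RE
and extract_key are made up for this example.

```python
import re

# F[.C][flags][,F[.C][flags]]:CONVERSION  (flags are accepted but ignored here)
KEY_RE = re.compile(
    r"^(\d+)(?:\.(\d+))?[a-zA-Z]*(?:,(\d+)(?:\.(\d+))?[a-zA-Z]*)?:([\w-]+)$"
)


def extract_key(line, spec):
    """Return (selected text, conversion name) for a simplified key spec."""
    start_f, start_c, end_f, end_c, conv = KEY_RE.match(spec).groups()
    fields = line.split(" ")
    # Character offset at which each field starts within the line.
    starts, pos = [], 0
    for field in fields:
        starts.append(pos)
        pos += len(field) + 1
    begin = starts[int(start_f) - 1] + int(start_c or 1) - 1
    if end_f is None:
        end = len(line)  # no end position: the key runs to the end of the line
    elif end_c is None:
        end = starts[int(end_f) - 1] + len(fields[int(end_f) - 1])
    else:
        end = starts[int(end_f) - 1] + int(end_c)
    return line[begin:end], conv


line = "MMXX III"
print(extract_key(line, "1,1:roman"))    # ('MMXX', 'roman')
print(extract_key(line, "1.2,1:roman"))  # ('MXX', 'roman')
print(extract_key(line, "1:strlen"))     # ('MMXX III', 'strlen')
```

So "-k1.2,1" converts "MXX" (1020), and "-k1" without an end position
measures the whole line (8 bytes).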

The "r" (=reverse) flag can also be used:

   $ printf "%s\n" X I IV IX VI | ./decorate -k1,1:roman
   I
   IV
   VI
   IX
   X

   $ printf "%s\n" X I IV IX VI | ./decorate -k1,1r:roman
   X
   IX
   VI
   IV
   I

Available conversion methods:
   as-is        copy as-is
   roman        roman numerals
   strlen       length (in bytes) of the specified field
   ipv4         dotted-decimal IPv4 addresses
   ipv6         IPv6 addresses
   ipv4inet     number-and-dots IPv4 addresses (incl. octal, hex values)

Examples:

   $ printf "%s\n" 10.2.3.4  8.9.7.3 | ./decorate --decorate -k1,1:ipv4
   0A020304 10.2.3.4
   08090703 8.9.7.3

   $ printf "%s\n" 10.010.0x10.10 192.168 \
         | ./decorate --decorate   -k1,1:ipv4inet
   0A08100A 10.010.0x10.10
   C00000A8 192.168

   $ printf "%s\n" :: 2000::1234 ::ffff:192.168.1.42 \
       | ./decorate --decorate -k1,1:ipv6
   0000:0000:0000:0000:0000:0000:0000:0000 ::
   2000:0000:0000:0000:0000:0000:0000:1234 2000::1234
   0000:0000:0000:0000:0000:FFFF:C0A8:012A ::ffff:192.168.1.42
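
For readers who want to see what these conversions amount to, here is a
small Python approximation using only the standard library. The function
names are hypothetical and this is not the C code from datamash; the
ipv4inet variant relies on the platform's inet_aton(3), so its results
depend on the C library.

```python
import ipaddress
import socket


def conv_ipv4(text):
    """Strict dotted-decimal IPv4 address as 8 uppercase hex digits."""
    return f"{int(ipaddress.IPv4Address(text)):08X}"


def conv_ipv4inet(text):
    """Number-and-dots IPv4 (octal/hex parts allowed), parsed by inet_aton(3)."""
    return socket.inet_aton(text).hex().upper()


def conv_ipv6(text):
    """IPv6 address expanded to eight zero-padded uppercase hex groups."""
    digits = f"{int(ipaddress.IPv6Address(text)):032X}"
    return ":".join(digits[i:i + 4] for i in range(0, 32, 4))


print(conv_ipv4("10.2.3.4"))             # 0A020304
print(conv_ipv6("::ffff:192.168.1.42"))  # 0000:0000:0000:0000:0000:FFFF:C0A8:012A
```

All three produce fixed-width, uppercase output, so plain string
comparison orders the addresses numerically.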


#### Mixing -k/--key for decorating and sorting ####

When 'decorate' automatically runs sort(1), any keys
that are not used for decoration are passed to 'sort'
(after being adjusted for the right column).

Example:

   $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
       | ./decorate -k1,1:roman -k2nr,2
   I  107
   I  19
   II 4
   IC 1
   C  31
   C  4


   $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
     | ./decorate -k2n,2 -k1,1:roman
   IC 1
   II 4
   C  4
   I  19
   C  31
   I  107

To see which parameters are passed to sort(1), use
"--print-sort-args". It only prints the arguments that would be
used with sort(1); it does not decorate or sort the input.

Here, "decorate" knows that a new field will be added
(the converted roman numerals), and so the "-k2nr,2"
is adjusted to be "-k3,3nr":

   $ ./decorate --print-sort-args -k1,1:roman -k2nr,2
   sort -k1,1 -k3,3nr

Here, "decorate" will add two fields (first ipv4 from field 2,
and roman numerals from field 3). The "-k5,5V" is adjusted
to be "-k7,7V":

   $ ./decorate --print-sort-args -k5,5V -k2,2:ipv4 -k3,3:roman
   sort -k7,7V -k1,1 -k2,2
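
The adjustment rule can be written down in a few lines. The Python sketch
below is inferred only from the two outputs above and is not taken from
the 'decorate' sources: the i-th decorated key becomes "-kI,I", and every
other key has its field numbers shifted by the number of decorated keys.
The names are hypothetical, and keeping ordering flags on decorated keys
is my assumption.

```python
import re

KEY_RE = re.compile(
    r"^(\d+(?:\.\d+)?)([a-zA-Z]*)"       # start position and its flags
    r"(?:,(\d+(?:\.\d+)?)([a-zA-Z]*))?"  # optional end position and flags
    r"(?::([\w-]+))?$"                   # optional conversion name
)


def shift(position, count):
    """Add count to the field number of an 'F' or 'F.C' position."""
    field, dot, char = position.partition(".")
    return f"{int(field) + count}{dot}{char}"


def sort_args(keys):
    """Build the sort(1) command line for a list of -k values."""
    parsed = [KEY_RE.match(key).groups() for key in keys]
    added = sum(1 for groups in parsed if groups[4])  # number of new fields
    args = ["sort"]
    next_decorated = 1
    for start, start_flags, end, end_flags, conversion in parsed:
        flags = start_flags + (end_flags or "")
        if conversion:
            # Decorated keys occupy the first columns, in order of appearance.
            args.append(f"-k{next_decorated},{next_decorated}{flags}")
            next_decorated += 1
        elif end:
            args.append(f"-k{shift(start, added)},{shift(end, added)}{flags}")
        else:
            args.append(f"-k{shift(start, added)}{flags}")
    return " ".join(args)


print(sort_args(["1,1:roman", "2nr,2"]))               # sort -k1,1 -k3,3nr
print(sort_args(["5,5V", "2,2:ipv4", "3,3:roman"]))    # sort -k7,7V -k1,1 -k2,2
```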


#### Other sort(1) parameters ####

When 'decorate' automatically runs sort(1), several common sort(1)
options are accepted and passed as-is to sort.

Example:

     $ ./decorate --print-sort-args -k2,2:ipv4 \
                      --stable \
                      -T /foo/bar \
                      -S 2G \
                      -t: \
                      --parallel 32
     sort -k1,1 -s -T /foo/bar -S 2G -t : --parallel 32

The above example only prints the arguments, but the same
arguments are passed to sort(1) when "--print-sort-args" is
not used.


#### Future improvements ####

I also plan to add a "--header" option, something that has
been requested many times for sort(1).
Since we're not worried about bloat here, and we're already
manipulating the input and output for sort as a child-process,
it will be easy to implement.
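
As a purely hypothetical illustration of what such an option could do
(nothing here is implemented or decided), the header line would bypass the
sort and be written first:

```python
def sort_with_header(lines, key=None):
    """Keep the first line in place and sort only the remaining lines."""
    if not lines:
        return []
    header, body = lines[0], lines[1:]
    return [header] + sorted(body, key=key)


rows = ["name", "ccc", "a", "bb"]
print(sort_with_header(rows, key=len))  # ['name', 'a', 'bb', 'ccc']
```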


There is also a plan to add an option to specify an external program
as the conversion filter, e.g.:

    -k1,1@/foo/bar/filter.sh

This will send the keys to the script.
The argument parser already supports this syntax, but the actual
implementation is missing.


#### Adding conversions ####

The file 'src/decorate-functions.c' contains the built-in conversion
functions.  The implementation is very simple: each function accepts a
"const char*" and prints the converted (decorated) representation
to STDOUT.

It will be easy to add more conversions (assuming the conversion
rules are solid and will 'just work' with regular sort(1) alphabetic
order).
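
A quick way to check whether a candidate conversion 'just works' with
alphabetic order is to compare a string sort of the converted keys with
the intended order. The Python sketch below is only a testing aid with
made-up names, unrelated to the C interface described above; it shows why
a fixed width with zero padding matters.

```python
def strlen_unpadded(text):
    """Length as a plain decimal string: NOT safe for alphabetic sorting."""
    return str(len(text.encode()))


def strlen_padded(text):
    """Length zero-padded to a fixed width: safe for alphabetic sorting."""
    return f"{len(text.encode()):021d}"


def preserves_order(convert, samples, intended_key):
    """True if sorting by converted strings matches the intended order."""
    by_string = sorted(samples, key=convert)
    by_intent = sorted(samples, key=intended_key)
    return by_string == by_intent


samples = ["a" * 9, "b" * 10, "c" * 2]
print(preserves_order(strlen_unpadded, samples, len))  # False ("10" < "2" < "9")
print(preserves_order(strlen_padded, samples, len))    # True
```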




