
uniq: field separator for -c option


From: Roni Kallio
Subject: uniq: field separator for -c option
Date: Thu, 08 Oct 2020 02:10:18 +0300
User-agent: mu4e 1.4.10; emacs 27.1

Hi all,

Recently I've been doing a lot of data processing.  Usually this
involves first transforming the source material into some
semi-structured format like CSV or TSV, after which I can start
gathering information about the data.  Both of these steps can be
achieved quite effectively using the programs packaged in coreutils.
When more complex transformations are required, I usually end up using
Python and the plethora of third-party libraries available.
Lastly I might plot the information into charts using gnuplot or
matplotlib.

Working with delimiter separated records is easy in most cases, as most
programs in coreutils allow the user to define the delimiter character.
One exception to this rule is the program `uniq' with its -c option,
which prefixes each line with its number of occurrences.  It offers no
way to define the output delimiter written between the count and the
line itself; a space is always used.

Take this excerpt from a file as an example.  Each record contains
fields for an address, a country, and a continent:

[...]
112.85.42.89,China,Asia
111.229.139.95,China,Asia
111.202.211.10,China,Asia
143.137.9.165,Brazil,South America
110.43.50.229,China,Asia
13.67.33.9,Singapore,Asia
79.8.196.108,Italy,Europe
104.248.244.119,Germany,Europe
106.12.31.186,United States,North America
156.67.217.63,Singapore,Asia
[...]

We might want to get the count of occurrences of each country,continent
pair.  To achieve this we filter out the addresses with cut, sort the
lines and pass the result to uniq, which then counts the occurrences:

< data.csv cut -d, -f2,3 | sort | uniq -c

Resulting in the following output:

      1 Brazil,South America
      4 China,Asia
      1 Germany,Europe
      1 Italy,Europe
      2 Singapore,Asia
      1 United States,North America

This of course needs post-processing, since the occurrence count and
the country are now effectively part of the same field.  For that we
use the sed command:

sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/' **

to get them into separate fields:

      1,Brazil,South America
      4,China,Asia
      1,Germany,Europe
      1,Italy,Europe
      2,Singapore,Asia
      1,United States,North America

The result is valid comma-separated records, which we can then run
additional transformations on.  Counting occurrences is quite common, at
least for me, so this kind of post-processing has to be done quite often
as well.
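
Since this step comes up so often, it can be wrapped in a small shell
function.  The following is only a sketch: the name count_csv is
hypothetical, and unlike the sed command shown above it also drops the
leading padding that uniq -c puts in front of the count.  It assumes
the input on stdin is already sorted.

```shell
# count_csv: hypothetical helper, not part of coreutils.
# Reads sorted lines on stdin and prints "count,line" for each
# distinct line, with the leading padding from uniq -c removed.
count_csv() {
    # ^ *            leading spaces emitted by uniq -c
    # [0-9][0-9]*    the occurrence count (portable BRE, no \+)
    # the single space after the count becomes a comma
    uniq -c | sed 's/^ *\([0-9][0-9]*\) /\1,/'
}
```

It would be used in place of the uniq and sed stages of the pipeline:

< data.csv cut -d, -f2,3 | sort | count_csv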

I propose the addition of an option named `--count-separator' or similar
to the uniq command, to allow setting a user-defined separator
character. This separator character would be inserted between the
occurrence count and the line, and it would require the `-c' option to
also be used. Here is an example of its usage:

< data.csv cut -d, -f2,3 | sort | uniq -c --count-separator=,

The output would be equal to the output of the following program, which
I demonstrated earlier:

< data.csv cut -d, -f2,3 | sort | uniq -c | \
        sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/'

This would remove the need for an additional post-processing step
entirely, and bring the uniq program more in line with other
text-processing utilities in coreutils that already allow setting custom
field separators.
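
Until such an option exists, its intended behaviour can be approximated
with a wrapper.  This is a rough sketch under my own assumptions, not
the attached patch: the function name uniq_count_sep is hypothetical,
the separator is passed as the first argument, and awk is used instead
of sed so that the separator does not need escaping inside a
substitution.  The padding before the count is kept as uniq -c emits
it.

```shell
# uniq_count_sep SEP: hypothetical emulation of the proposed
# "uniq -c --count-separator=SEP"; not part of coreutils.
# Reads sorted lines on stdin.
uniq_count_sep() {
    # Note: awk interprets escape sequences in -v values,
    # so a SEP of '\t' is turned into a tab character.
    uniq -c | awk -v sep="$1" '{
        # Match the padded count plus the single space after it.
        match($0, /^ *[0-9]+ /)
        # Print padding+count, then SEP, then the original line.
        print substr($0, 1, RLENGTH - 1) sep substr($0, RLENGTH + 1)
    }'
}
```

Usage would mirror the proposed option:

< data.csv cut -d, -f2,3 | sort | uniq_count_sep ,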

I have already hacked on an implementation, and attached a patch for
you to experiment with and give feedback on.

--
Roni Kallio

** The sed command I use also removes leading spaces, but that isn't a
   necessary step so I left it out.

Attachment: 0001-uniq-count-separator-option.patch
Description: Text Data

