coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Extend uniq to support unsorted list based on hashtable


From: Yair Lenga
Subject: Extend uniq to support unsorted list based on hashtable
Date: Sat, 30 May 2020 07:16:47 +0300

Greetings,

Wanted to suggest that the team will look (again) at implementing
--unsorted option for 'uniq'.

The idea was proposed (and rejected) about 10 years ago
(https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
Lot of things have changed from the past.

First the justification, which was clearly stated in the original post::

'uniq' currently relies on 'sort'. When the input file is small, this
is OK. But when the input file is large, this seems to be a waste (the
complexity is O(n log(n)), if uniq handles a hash table its self the
complexity is only O(n)). I'm wondering if it is better to relax the
requirement of 'sort' when 'uniq' is used.

I think it is worth reconsidering based on the following:

Based on my experience, a very common use case is to count frequency
of small number of buckets from a large input set (for example, count
number of requests for each IP IP addresses from very large web server
log file). Currently, this is handle by using a pipe ilke 'awk { print
...}' input-file | sort | uniq -c'.

To count Z buckets from input size of N, the performance of the above
pipe is N log(N), with memory requirement to hold N input in memory
(otherwise actual performance for large data set is extremely slow).
With the proposed `uniq --unsorted`, performance is O(n), memory
requirement to hold only Z input lines in memory (not to mention much
better constant). The difference is significant consider the input
size of today log files: could be a factor of 20, when input has
millions of lines.

Other common use case will be to use this as a "filter", where the
input arrive from "tailing" the file. For example "tail -f logfile |
awk '{ print ... }' | uniq --unsorted" will look for new unique lines
in the file. Not possible with the sort approach.

The main objection in previous discussion was that implementation is
complex seems a little bit outdated. The creutils have en extended
over time with advanced features, beyond what could implemented by few
simple lines of code on top of the standard library. There are many
good implementation of hashing that can be be implemented, which will
not increase code footprint by more than few tens of lines.

Secondary objection was that it will be impossible to handle the case
of infinite input site. Well, this argument is true for many problem -
but given today hardware, this is unlikely to be an issue for all
practical usage.

Can you advise/provide feedback. I'm sure that there will be many
volunteers (me included) to contribute to such important improvement.

Yair



reply via email to

[Prev in Thread] Current Thread [Next in Thread]