coreutils

Re: Extend uniq to support unsorted list based on hashtable


From: Assaf Gordon
Subject: Re: Extend uniq to support unsorted list based on hashtable
Date: Fri, 29 May 2020 22:47:26 -0600

Hello,

On 2020-05-29 10:16 p.m., Yair Lenga wrote:
> Wanted to suggest that the team will look (again) at implementing
> --unsorted option for 'uniq'.
>
> The idea was proposed (and rejected) about 10 years ago
> (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
> Lot of things have changed from the past.

[...]

> Can you advise/provide feedback. I'm sure that there will be many
> volunteers (me included) to contribute to such important improvement.

"uniq" is standardized by POSIX to work by "comparing adjacent lines"
(from: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html ),
hence the requirement to pre-sort the input.

While it could be extended with a completely different hash-based
implementation, I don't think this is likely to happen.
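
For illustration only, the hash-based idea can already be sketched with
standard tools. The snippet below is a minimal sketch, not a proposed
'uniq' implementation; the function name "uniq_unsorted" is hypothetical,
and it assumes the set of distinct lines fits in memory:

```shell
# Print each distinct line once, in first-seen order, without sorting.
# "uniq_unsorted" is a hypothetical helper name, not a coreutils command.
uniq_unsorted() {
  # seen[] is an awk associative array (a hash table) keyed by the whole
  # line; a line is printed only the first time its key is encountered.
  awk '!seen[$0]++' "$@"
}

printf '%s\n' b a b c a | uniq_unsorted
```

Unlike "sort | uniq", this keeps the original order of first occurrences,
at the cost of holding every distinct line in memory.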

As an alternative (and a shameless plug), allow me to point to
GNU Datamash ( https://www.gnu.org/software/datamash/ ).
On one hand, it already has a hash-based implementation that
removes lines with a duplicated key field (called "rmdup").
Consider the following contrived example:

  $ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
  9     B
  3     A
  1     2
  5     6
  7     8
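
For comparison, the same key-based de-duplication can be approximated
with awk. This is a rough sketch under stated assumptions, not how
datamash implements "rmdup": the helper name "rmdup_by_field" is
hypothetical, and input is assumed to be tab-separated:

```shell
# Keep only the first line seen for each distinct value of one field.
# "rmdup_by_field" is a hypothetical helper; input is assumed tab-separated.
rmdup_by_field() {
  field=$1   # 1-based number of the key field
  shift      # any remaining arguments are input files
  # seen[] is a hash table keyed by the chosen field's value.
  awk -F '\t' -v k="$field" '!seen[$k]++' "$@"
}

(printf '%s\t%s\n' 9 B 3 A ; seq 10 | paste - -) | rmdup_by_field 1
```

Here the lines "3 4" and "9 10" are dropped because their first field
was already seen.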

And on the other hand, because 'datamash' is non-standard,
there's less of a problem in adding new functionality (i.e. "bloat" is
not as big a concern as it is for coreutils).

Hope this helps.

regards,
 - assaf




