bug-coreutils

bug#10287: [wishlist] uniq can remove non adjacent lines


From: Bob Proulx
Subject: bug#10287: [wishlist] uniq can remove non adjacent lines
Date: Tue, 13 Dec 2011 11:06:50 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

Jim Meyering wrote:
> Bob Proulx wrote:
> > If you want to print only the first of a unique line then this perl
> > one-liner will do it.
> >
> >   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
> 
> Thanks, but with large files, isn't it better to store not
> the full line, but rather a constant?
> 
>   perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

Good point!  I hadn't given it much thought since it usually runs so
quickly in my usage that I never worried about it.

> (actually, using "1" could be seen as misleading, since 0 or even undef
> would also work)
> 
> I think you can drop the "l".
> I have a slight preference for this:
> 
>   perl -ne 'defined $seen{$_} or print; $seen{$_}=1'

Referring to "print" v. "print $_" here: I have never liked the implicit
use of $_ and so I tend to avoid it.  At one time there was a push in the
perl community to make all uses explicit.  And whether to use the
'if (expr) { stmt }', 'stmt if expr', or 'expr or stmt' form is
a matter of taste.  Might as well discuss the one true indentation and
brace styles.  :-)  For one-liners I do tend to use short variable names
to keep the line length down.  In order to compact a line I also
sacrifice whitespace when required.

But you have me thinking about conserving memory.  If the file was
large due to long lines then memory use would be proportionately large
due to the key storage needs.  This could be reduced by using a digest
of the line as the storage key instead of the entire line.  But the
savings would be relative to the average line size.  If the average
line were shorter than the digest (16 bytes for a raw MD5) then this
would increase memory use.

  perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

If you are ever going to debug and print out the md5 value then
substitute md5_hex for md5 to get a printable result.

Bob
