bug-gnu-utils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[gawk] hashing


From: Bonzini
Subject: [gawk] hashing
Date: Fri, 7 Jun 2002 23:28:52 +0200

Bruno Haible and I recently played with hash functions in gettext, and he
observed that gawk's function performs terribly on keys like 'ABCD' or
'1234'.  Hashing into a 2k table the 1296 strings [A-F][A-F][A-F][A-F] only
used 228 buckets.

If you are interested I can plug into gawk a different hash function from
Bob Jenkins that performs extremely well.  As a complementary advantage,
this hash function supports hash table sizes that are powers of two because
it scrambles the bits very well (no need for a slow modulo-prime operation
to do further scrambling).

Here are some statistics for this hash function (in this case it was used in
a double hashing scheme, but that does not matter much).  First I hashed the
4602 msgid strings from gcc.pot, using a table size of 8192 for Jenkins'
hash and 8089 for others.

Strings with no collisions: your hash 3277, gawk 3912, Jenkins 3922
Extra lookups needed: your hash 3605, gawk 1535, Jenkins 1518

(re. the second figure: 1 is added when there is a collision on the primary
hash only, 2 when there is also one on the secondary hash, 3 when there are
two collisions on the secondary hash, etc.)

gawk's hash made a good job on English text.

The 4-letter words from {A,B,C,D,E,F} (e.g. "ABAF") are 6^4=1296 strings.  I
used 2048 for Jenkins hash and 2029 for others (2027 is prime).  This is
indeed a worst case for gawk's hash, which used only 10% of the available
buckets:

Strings with no collisions: your hash 931, gawk 228, Jenkins 1180
Extra lookups needed: your hash 1199, gawk 3605, Jenkins 262

Are you interested in me making a patch for this?

Paolo Bonzini






reply via email to

[Prev in Thread] Current Thread [Next in Thread]