guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scm_string_hash suboptimal?


From: Florian Weimer
Subject: Re: scm_string_hash suboptimal?
Date: 29 May 2001 23:25:40 +0200
User-agent: Gnus/5.090004 (Oort Gnus v0.04) Emacs/20.7

Dirk Herrmann <address@hidden> writes:

> Might be interesting to compare it with the current one.  I see one
> problem, though:  Guile's current hash table does not go through all
> characters of a long string or symbol.  Instead, it only checks a limited
> number (5 I think) of chars.  The algorithm above goes through the whole
> length of a string, which can take a lot longer for long strings.

Due to the expensive division operation, this effect is noticeable
only with very long strings.  The other hash functions are comparable
in speed even if the strings are 20 characters long.

For a special-purpose hash function, there's nothing wrong if it looks
only at select portions of the argument, but I don't think so in the
case of a general-purpose hash function.  In any case, the current
function has a severe flaw: it is very likely that long strings of the
same length which are not very random are hashed to the same hash
value because the characters looked at by the function are likely the
same each time, if the length of the string is constant.  (If you need
numbers, I can provide them, but I consider that so obvious that I'm
unwilling to do the math.)

> Does the current string handling function actually produce a lot of
> identical hash values for non-identical strings in practice?

Yes, some bit patterns are extremly common.  For lowercase words and
the explanation, see my original message. For upper-case English
words, its the 0x50505050 pattern which is most common, all words
starting with 'P' which are eight characters long hash to it, and for
the same reason, all eight-character words starting with 'H' hash to
0x48484848.  All in all, this results in many, many collisions on
almost any non-random input data.  The other hash functions I tested
are much better in this regard.

> I guess that most of the problem (if there is any) is rather due to
> the static table size.  IIRC Han-Wen had reported a performance gain
> of 5% by just increasing the hash table size (and some time ago the
> default size has actually been increased in cvs).

If the hash function is bad, performance drops much more rapidly when
the table is filled, in comparison to a sensible hash function.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]