Re: scm_string_hash suboptimal?


From: Dirk Herrmann
Subject: Re: scm_string_hash suboptimal?
Date: Tue, 29 May 2001 12:39:10 +0200 (MEST)

On 29 May 2001, Rob Browning wrote:

> Florian Weimer <address@hidden> writes:
> 
> > Looking for a good hash function for strings, I stumbled across
> > scm_string_hash in guile-core/libguile, but it doesn't look too good:
> 
> Actually, the current hash in symbols.c downcases all the
> characters, so it's even less discriminating.  (Shouldn't we have a
> separate string hasher that's case sensitive?)

Yes, I'd vote for having separate scm_symbol_hash and scm_string_hash
functions.  I don't know whether these should be placed in symbols.c and
strings.c, though.
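
Just to make the distinction concrete, something like the following
sketch (invented names, not Guile's actual code) would keep the two
apart: the string hasher looks at characters as they are, while the
symbol hasher downcases them first.

    #include <ctype.h>
    #include <stddef.h>

    /* Sketch only: a simple multiplicative hash that keeps case,
       as a case-sensitive string hasher would. */
    static unsigned long
    string_hash_example (const char *str, size_t len, unsigned long modulus)
    {
      unsigned long h = 0;
      size_t i;
      for (i = 0; i < len; i++)
        h = h * 37 + (unsigned char) str[i];
      return h % modulus;
    }

    /* Sketch only: the same hash, but downcasing each character first,
       which is what the case-insensitive symbol hasher currently does. */
    static unsigned long
    symbol_hash_example (const char *str, size_t len, unsigned long modulus)
    {
      unsigned long h = 0;
      size_t i;
      for (i = 0; i < len; i++)
        h = h * 37 + (unsigned char) tolower ((unsigned char) str[i]);
      return h % modulus;
    }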

> So is anyone here well versed in the "state of the art" hash-table
> wise?  If there is a "right thing", I'd be happy to read up on it and
> perhaps implement it.  It would also be nice if I could add
> dynamically resizing tables, if anyone knows of a good documented
> algorithm.

Jay Glascoe once implemented dynamically resizing tables.  I have a
copy of the code here.  However, his hash tables are intended as abstract
data types for use in applications.  For the symbols hash table, a
different implementation might be better: use a weak vector and put the
SCM values directly into it.  If the slot hash(obj) is taken, try
hash(obj)+1 and so on.  This would be faster than the current scheme,
which chains the objects in lists and therefore pays for an extra level
of indirection and loses the data locality that the weak-vector approach
would provide.
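
A rough sketch of that probing scheme (the weak-vector and SCM details
are left out here; a plain pointer array and a made-up fixed size stand
in for them, and hash_fn is whatever string/symbol hash we end up with):

    #include <stddef.h>

    #define TABLE_SIZE 1024          /* made-up fixed size */

    static void *table[TABLE_SIZE];  /* NULL means "slot free" */

    /* Open addressing with linear probing: if slot hash(obj) is
       taken, try the next slot, wrapping around at the end. */
    static size_t
    probe_insert (void *obj, size_t (*hash_fn) (void *))
    {
      size_t i = hash_fn (obj) % TABLE_SIZE;
      size_t start = i;

      while (table[i] != NULL)
        {
          i = (i + 1) % TABLE_SIZE;
          if (i == start)
            return (size_t) -1;      /* table full; a real table would resize */
        }
      table[i] = obj;
      return i;
    }

Lookup would walk the same sequence of slots until it either finds the
object or hits an empty slot.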

> Just poking around randomly with google, I came across a somewhat
> interesting algorithm, but it's a lot more thorough than what we're
> doing, and I have no idea if it's actually any good -- hasn't been
> debunked, or whatever:
> 
>   http://burtleburtle.net/bob/hash/evahash.html

It might be interesting to compare it with the current one.  I see one
problem, though: Guile's current hash function does not go through all
characters of a long string or symbol.  Instead, it only samples a limited
number of characters (five, I think).  The algorithm above walks the whole
string, which can take a lot longer for long strings.
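
For comparison, a sampling hash along the lines of the current behaviour
might look like this (sketch only; the constants and names are invented,
and Guile's real code differs in detail):

    #include <stddef.h>

    #define SAMPLE_CHARS 5

    /* Sketch only: sample at most SAMPLE_CHARS characters spread over
       the string instead of walking all of it, so hashing a very long
       string stays cheap. */
    static unsigned long
    sampling_hash_example (const char *str, size_t len)
    {
      unsigned long h = len;                 /* mix in the length itself */
      size_t step = len / SAMPLE_CHARS + 1;  /* stride so samples cover the string */
      size_t i;

      for (i = 0; i < len; i += step)
        h = h * 37 + (unsigned char) str[i];
      return h;
    }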

Does the current string hashing function actually produce a lot of
identical hash values for non-identical strings in practice?  I guess that
most of the problem (if there is any) is rather due to the static table
size.  IIRC Han-Wen reported a performance gain of 5% just from
increasing the hash table size (and the default size was actually
increased in CVS some time ago).

Best regards,
Dirk Herrmann
