guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Proposal for a new hash interface


From: Stefan Israelsson Tampe
Subject: Proposal for a new hash interface
Date: Thu, 24 Feb 2022 01:07:45 +0100

High all,

I have taken the task to study a possible hash table implementation mostly in scheme but still speedy. I am still experimenting with different approaches. With managing code and iteration higher order functions in scheme and lower order implemented in C. I can match current hash implementations but with hash-fold and hash-for-each 10-20X faster and better semantics as it is not going via C.

I suppose the implementation is more memory intensive than the current implementation. But much much safer and featurefull.

So what are the features:
* goops based organisation

* allows one to type a hash table, e.g. disallow the use of different versions
  of the assessors for the same hash.

* surprisingly fast although we define goops methods for the accessors

* A flag that indicates that (hash i) = i, for integers, which is used in python

* Possible to freeze has hash e.g. it will not grow or shrink, when cleared only
  all the values will be cleared.

* A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if a slot 
  is equal, we compare only the long hash.

* A proper separation between general code that can be pushed to C for speed    and scheme for nice integration and power

Code:

scheme:
https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm

C:
https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c

C/Scheme integration
Each handle is of the form
(hash key val)

So when we store the key value pair we also store the hash values which are
60bit wide. This means that we can dispatch the search for an item using only the hash value and many operations can be done in a general way that can be pushed to C (if we want). So typically you first try to use C and when it finds a match you need to do the final check in scheme using the correct equal predicate. This means that we can achieve a very extensible system that you can extend without losing much performance. These primitives make the scheme code beautiful and well organized so we should probably keep this semantic even if we decide to do everything in guile scheme.

Example, consider that we have designed a new hash table semantics. The easiest is to supply a hash function and a equal predicate like so

(use-modules (ice-9 hashish))
(use-modules (oop goops))

(define (myhash    k s) ...)
(define (myequal? x y) ...)

(define-class <my-hash-map> (<stis-hash-map>))
(stis-new-hash-type <my-hash-map> make-my-hash-table myhash myequal?
     my-hash-ref  my-hash-set!  my-hash-remove!)

Preferably when we do not need to know the type of the hashtable we can
either use the methods (slow) stis-hash-ref stis-hash-set! stis-hash-remove!

The fast way is to take advantage of the following construct,
(define H (make-my-hash-table))
(stis-with-hash-accessors (H : ref set remove)
    ... 
    (set k1 (ref k2 #f))
    ...
    (remove k3)
    ...)

Both these methods are safe.

There are some extra tools available.

> stis-hash-resize!
One can resize a hashtable as one likes

> stis-hash-define-world!
One can fill the hash table and then specify that this is the world. Then 
If all long hashes that are stored are unique, no equal method will be used
for identity and the hashmap will be freezed

> stis-hash-unworld-it
Unset the world property (not yet coded)

> stis-hash-copy
Create a new hashtable (this is not yet coded)

IMPLEMENTATION
The implementation is not yet defined, we need to test it and try different versions. But One thing that one can notice are that by reducing the
size if the hashtable one can speed up all operations. So the idea is to store
a list of hash/index pairs in the hash and use varying sized cells for this. The
smallest cell is 2 bytes and we can use more bytes for the index (size of hash table) and fewer for the hash itself) this means that we could use a size of 64 for the hash value and allow 1024 items in the hash table. Similarly as the hash  grows we will use 32bit cells and split the maximal size of the hashtable between the hash value and the hash table size. The next size is 64bit cells and the last one are 128 bit cells. There will be a fallback hash table which is the usual a-list bin table But that one will be smaller memory wise and not in the hot path. 

I am in the process of doing the implementation and will see if I can find the optimal setting for all sizes which I can test. (basically to pin down the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the size of the hash can only be in those sizes.

Oh also, for small hashes the hash is essentially a single association list and superfast.

Maybe the dispatching overhead is too much, donough, perhaps trimming down the features might be needed.

I will see if we can find some use of special assembler operations, especially for small hashtables.

PRIMITIVES,
There are essentially three primitives
1. search        =  find a matching hash, return index or handle or #f
                           possible starting from a known position

2. add             =  add a new element that we now is not included 
                           in the hashtable

3. search-list   = fast alist search in case of a small assoc list case
                          returns the key-val handle or #f possibly restarts
                          from a known index.

Possibly one may define a search-and-set-some primitive, e.g. one can
in hot cases not need to fhe the add afterwards and add is only used in the
slow path in a set operation. (For other operations it is in the fast path though)

Happy Hacking!

Regards
Stefan





reply via email to

[Prev in Thread] Current Thread [Next in Thread]