bug-gperf
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-gperf] One usability and 4 fine tuning performance suggestions (cac


From: Mark Stankus
Subject: [bug-gperf] One usability and 4 fine tuning performance suggestions (cache lines and array size)
Date: Wed, 4 Sep 2019 10:49:56 -0700

  I love gperf! I have used it and wish to use it more (see the
usability suggestion below).

I had one suggestion about usability and four suggestions about performance.
The performance suggestions could be implemented as options.
The usability suggestion is probably difficult.
Two of the performance suggestion are pretty easy to implement.
The third performance suggestion is more difficult to implement.
The fourth performance suggestions is also more difficult to implement.
The performance suggestions come from recently haven read parts of the
excellent paper
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

USABILITY SUGGESTION ( VERY INVOLVED and/or DIFFICULT )

  I would like to have a limited version of gperf which could be invoked
while running C or C++. I don't know how to do this,
but I understand what possible syntax might be.

   struct gperf
   {
      unsigned long MAX_WORD_LENGTH;
      unsigned long MIN_WORD_LENGTH;
      unsigned long MAX_HASH_VALUE:
      unsigned long MIN_HASH_VALUE;
      unsigned char * alloc_array;
      unsigned long * shifts; // 0 and 1 in the the perm2.gperf in
examples directory
      char ** words; // to borrow, but not own, an array.
  };

 char *the[] = { "hi", "there",""}; // I don't know if this is
syntactically correct.
// The "" acts as a "end of list" identifier
 gperf x;
  initialize_gperf(&x,the);
  unsigned int n1 = hash( "there", 5, x);
  unsigned int n2 = hash("dude",4, x);
  assert( n1 ); // found
  assert( ! n2 ); // not found
  destroy_gperf(x); // freeing alloc_array  and shifts, but not words


Instead of

   return len + asso_values[(unsigned char) str[1] + 1] +
assoc_values[(unsigned char)[str[0]];

in the hash function, we would have (in effect),

   return len + asso_values[(unsigned char) str[1]+x.shifts[1],
                                            (unsigned char) str[0] +
x.shifts[0]);

Of course, in general there would be a loop to create the value to return.
Here we are speaking of actual code in gperf, not code written to a file.
It would be slower than running all of gperf, but
sometimes I want to have a perfect hash based
on values which I create during the program.

SECOND SUGGESTION ( EASY )

  With 64 bit machines, "unsigned long" is probably more efficient to pass
around than "unsigned int". An option   --return_type = "unsigned int"
with --return_type = "unsigned long" could be helpful.

THIRD SUGGESTION ( EASY )

 The short version: when len is too large, use the value of len in the function
"hash" to avoid accessing the array.

  An if statement uses far fewer CPU cycles than loading a cache
line (which is usually a 64 byte chunk of memory) into the CPU.
Loading a cache line can take 100 CPU cycles. When a cache line
is unloaded by another part of the program or the operating system, the
cache line would need to be reloaded to be used again.

  Using the permut2.gperf example where the strings have a
maximal length of 2 and the largest valid hash value is 5,  instead of

return len + asso_values[(unsigned char) str[1] + 1] +
assoc_values[(unsigned char)[str[0]];

there could be

  return len > 2 ? 6 :
                           len + asso_values[(unsigned char) str[1] + 1]
                                 +  assoc_values[(unsigned char)[str[0]];

Perhaps invoked with an option like --use-length-to-avoid-array-reference
or something less verbose.


FOURTH SUGGESTION ( MEDIUM DIFFICULTY )

  I was wondering if it would be worthwhile to use an "if" statement
and a shorter static array asso_values in the "hash" function in some
cases. The reason would be that the array is 256 bytes and loading
different parts of the array could use up to 4 cache lines. Loading
caches lines is very expensive from the point of view of the CPU
(sometimes 100 cycles) as opposed to an if statement (which is a lot less,
but is blocking.

  I looked at permut2.gperf in the examples directory of the gperf
distribution. I will talk about this and next suggestion via examples
related to that file.

  In our case, the largest hash value is 5 and all but four entries in
the array are 6.
Recall that in the perumt2.gperf example, only the letters x, y and z are used.

  Instead of

  return len + asso_values[(unsigned char) str[1] + 1] +
assoc_values[(unsigned char)[str[0]];

in the hash function, there could be an optimized(?) version of the following

  unsigned int returned_value = 6;
  if ( len <= 2 && 'x' <= str[1] && str[1] <= 'z' &&  'x' <= str[0] &&
str[0] <= 'z' )
 {
    unsigned char first = (unsigned char) str[1] - 'x' + 1;
    unsigned char second = (unsigned char) str[0] - 'x';
    returned_value len + modified_asso_values[first] +
modified_asso_values[second];
  }
  return  returned_value;


If the length is 3 or more, the array is not accessed at all.
If either letter is out of range, the array is not accessed at all.
Also the array modified_asso_values would have only 4 unsigned chars.

-- OR --

unsigned int returned_value = 6;
if ( len <= 2 && 'x' <= str[1] && str[1] <= 'z' &&  'x' <= str[0] &&
str[0] <= 'z' )
 {
    unsigned char first = (unsigned char) str[1] - 'x' + 1;
    if ( len + first  < 6 )
    {
      unsigned char second = (unsigned char) str[0] - 'x';
      returned_value= len + modified_asso_values[first] +
modified_asso_values[second];
    }
  }
  return  returned_value;


Perhaps invoked with an option like
--use-length-and-first-letter-to-avoid-array-references
or something less verbose.

FIFTH SUGGESTION ( MORE THAN MEDIUM DIFFICULT )

  This suggestion is motivated by examples where the word_list is
somewhat short. Short means that the log base 2 of the length of the
word list is small.

  Let's say the maximum hash key is 5 like in example permut2.gperf.
If we round up to the next power of 2, we get 8. Recall 8 = 2^3.
So we only need 3 bits to store a number from 0 to 7. To store
256 three bit chunks, we would need a minimum of 768 bits.
That is, 96 bytes. That is, 1.5 cache lines. This is probably no
better than using 2 cache lines.

   If an unsigned long long (which on my machine is 8 bytes
which is 64 bits), we could easily store 21 three bit chunks and
leave one bit unused. If we have stored these 21 three bit
chunks into the variable x, then we could do

  unsigned long long chunk1 = (x << (8*8-3)) >> (8*8-3);
  unsigned long long chunk2 = (x << (8*8-6)) >> (8*8-3);
  unsigned long long chunk3 = (x << (8*8-9)) >> (8*8-3);

The << forces there to be 3 possibly nonzero bits as the
most significant bits and the remaining bits to be zero.
The >> forces the 3 possibly nonzero bits to be relocated as the least
significant bits. I don't know if there is a BIG/LITTLE ENDIAN
concerns here. I use unix and have not looked into that whole thing.
One of the 8s is sizeof(unsigned long) and the other 8 is CHAR_BIT
from limits.h.

  If you had an array of unsigned longs and want the n-th
chunk of 3 bits, you could compute  (n/21) to get the
index into the array and choose the (n % 21)-th three bit
chunk.

  Unfortunately, n/21 and n%21 are not super fast to compute because
21 is not powers of 2. If, instead, we stored 16 three bit
chunks into an unsigned long long (with 64-16*3 ignored bits),
then you would use n/16 and n%16 instead. Now,
n/16 is implemented as a bit shift (which is really fast)
and n%16 is implemented as a bit and (which is also really fast).
This is because 16 is a numerical literal  which is a power of 2.

  More generally, if  lenchunk = 3 and numchunksperentry = 16 = 2^4
(16 is a power of 2 and less than 21), then you would
use (n >> 4) for the index into the array
and (n & ( (static_cast<unsigned long long>(1) << 4) -1 );
to describe which chunk you want.
See https://graphics.stanford.edu/~seander/bithacks.html for example.


Perhaps invoked with an option like --use-bit-packing-to-reduce-cache-line-hits
or something less verbose.

--------------------------

  I could help with all of this if someone points me in the right direction
and there is an interest in seeing some or all of it done.


Mark Stankus



reply via email to

[Prev in Thread] Current Thread [Next in Thread]