From: Bruce Lilly
Subject: Re: [Gperf-bugs] Patch to gperf 2.7.2 for case-independent keyword matching
Date: Fri, 15 Feb 2002 11:44:17 -0500

Bruno Haible wrote:
> 
> Bruce Lilly writes:
> > In several important situations, keywords must be matched
> > independent of case (email header field names, domain names,
> > etc.).
> 
> The recommended way to do this is to convert the keywords to upper
> case (or lower case, as you like) before passing them to gperf.

The patch has been developed after careful consideration of the
alternatives and their implications.
 
Case conversion of keywords precludes storing them in a canonical
form, e.g. "MIME-Version" (or, at minimum, requires double the
space and additional programmer work to store canonical forms in
a structure along with the case-converted form).
 
It also requires conversion of keys for each lookup, and the case
conversion at run time must be the same as that used for the keyword
list at compile time (including locale issues for 8-bit characters).
It requires extra space (for the converted copy of the string)
and additional function calls, not to mention additional work for
the programmer. One purpose of gperf is to simplify program
maintenance.
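
To illustrate the canonical-form problem (the struct, field names, and
keywords below are invented for this sketch, not taken from the patch
or from gperf's output): with pre-lowercased keywords the hash key can
no longer double as the canonical spelling, so the canonical form has
to be carried in a second field, e.g. via gperf's -t struct support,
roughly doubling the string data and adding a field the programmer
must keep in sync by hand.

  /* Sketch only: with the "recommended way" the lookup key must be the
     lower-cased form, so the canonical spelling has to be stored in a
     separate, hand-maintained field.                                   */
  struct header_field
  {
      const char *name;        /* lower-cased key fed to gperf          */
      const char *canonical;   /* canonical form, e.g. "MIME-Version"   */
  };

  static const struct header_field fields[] =
  {
      { "mime-version", "MIME-Version" },
      { "content-type", "Content-Type" },
      { "received",     "Received"     },
  };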

For a single hash function per compilation unit, the application
interface (the function the application calls, which does the case
conversion and then calls the gperf-generated function) has to
provide a buffer for
the conversion (since the first argument points to a constant
string which cannot be modified). Depending on gperf flags, it
may be possible to determine the maximum size required. However,
since the key and length passed by the application may be
arbitrary, it will be necessary to check the length on each
character conversion to avoid buffer overflow and a segmentation
fault, or replicate the length check vs. MAX_WORD_LENGTH in the
application interface function. Alternatively, one could allocate
a buffer of the size given by the length argument using malloc,
but that is hardly likely to result in spectacular performance.
Moreover, both the gperf-generated case-dependent lookup
function and the application interface are externally visible
functions, which may result in namespace collisions or the use of
the wrong function, and it is poor programming practice (it would
help if there were a gperf option to make the gperf-generated
function static).
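
To make that concrete, here is a minimal sketch of such an
application-interface wrapper, assuming it lives in the same
compilation unit as the gperf output.  The wrapper name is invented;
in_word_set and MAX_WORD_LENGTH are the identifiers gperf normally
emits, repeated here (with an illustrative value) only so the sketch
stands alone.

  #include <ctype.h>

  /* Normally provided by the gperf-generated code.                     */
  #define MAX_WORD_LENGTH 31             /* illustrative value only     */
  extern const char *in_word_set (const char *str, unsigned int len);

  /* Hypothetical wrapper showing the copy, the conversion, and the
     length check that duplicates gperf's own MAX_WORD_LENGTH test.     */
  const char *
  lookup_field (const char *key, unsigned int len)
  {
      char buf[MAX_WORD_LENGTH + 1];
      unsigned int i;

      if (len > MAX_WORD_LENGTH)         /* avoid overflowing buf       */
          return 0;
      for (i = 0; i < len; i++)          /* copy + convert every lookup */
          buf[i] = (char) tolower ((unsigned char) key[i]);
      buf[i] = '\0';
      return in_word_set (buf, len);
  }

All of this happens on every lookup, before the generated function
even runs.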

For multiple hash functions per compilation unit, the table sizes
must be declared in an enum (--enum) to avoid compilation errors
or extraneous warnings, and the application interface must either use
an arbitrarily chosen buffer size or resort to malloc.  The visibility
and namespace problems described above remain for each of the multiple
lookup functions.

> > The attached patch to gperf 2.7.2 provides the option to
> > generate output which does case-independent matching and
> > documents that option.
> 
> gperf is a tool for maximum speed lookup. Your comparison function
> uses strcasecmp or strncasecmp at run time. This will not only
> case-convert the string being looked up more than once (namely, once
> in each comparison),

You are making unwarranted assumptions about how strcasecmp works.
Moreover, there are only multiple comparisons when there are hash
collisions. There is no case conversion involved in computing the
hash value, and if the hash value computed for a key is outside of
the range for the compiled keywords, there is no strcasecmp call.
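
For illustration, the shape of the patched lookup is roughly as
follows.  This is a hand-written toy in the style of gperf's output,
not the actual generated code: the table, the constants and the
length-based hash are stand-ins, but the control flow is the point.

  #include <strings.h>   /* strcasecmp (POSIX) */

  #define MIN_WORD_LENGTH  8
  #define MAX_WORD_LENGTH 12
  #define MAX_HASH_VALUE  12

  /* Toy stand-in for the table gperf would generate.                   */
  static const char *const wordlist[MAX_HASH_VALUE + 1] =
  {
      "", "", "", "", "", "", "", "",
      "Received",        /* hash value  8 */
      "", "", "",
      "MIME-Version"     /* hash value 12 */
  };

  /* Toy hash: a real gperf hash combines selected character positions,
     but note that no case conversion of the key happens here.          */
  static unsigned int
  hash (const char *str, unsigned int len)
  {
      (void) str;
      return len;
  }

  const char *
  in_word_set (const char *str, unsigned int len)
  {
      if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
      {
          unsigned int key = hash (str, len);

          if (key <= MAX_HASH_VALUE)   /* real code checks both bounds  */
          {
              /* One case-insensitive compare; strcasecmp returns at the
                 first mismatch, and immediately if wordlist[key] is "". */
              const char *s = wordlist[key];

              if (!strcasecmp (str, s))
                  return s;
          }
      }
      return 0;
  }

With this toy table, in_word_set("mime-version", 12) returns the
stored "MIME-Version" entry, while any key whose length falls outside
8..12 never reaches strcasecmp at all.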

> but also case-convert the fixed keywords (which
> you already should have case-converted before running gperf).

Again, unwarranted assumptions about the strcasecmp implementation.
And the issues mentioned above w.r.t. canonical forms, compile-time
vs. run-time conversion, namespace visibility, etc. still apply.

> In summary, this patch runs counter to the general objective of gperf
> of producing fast code.

I strongly suspect that you are making off-the-cuff remarks rather
than comments resulting from actual performance tests.

Let's look at the relevant cases (a rough micro-benchmark sketch
follows this list):
1. unsuccessful lookup, hash value out of range.
   Your "recommended way" is slower than the patch as yours requires
   copying and conversion of the key and additional function calls.
   The patch results in no copying, no conversions and no strcasecmp
   calls.
2. unsuccessful lookup, hash value in range but results in empty
   wordlist string.
   Your "recommended way" has the same overhead as above.  My patch
   results in one call to strcasecmp which returns immediately as
   one of the strings is empty.  The patch is still faster than your
   method.
3. unsuccessful lookup, key doesn't match keyword.
   Your "recommended way" has the same overhead as above; N.B. every
   character in the key must be examined, copied, and converted.
   strcasecmp returns early at the first mismatch and therefore need
   not examine every character (and in proper implementations, there
   is no copying and table lookup is used rather than conversion).
4. successful lookup, no hash collisions.
   The one successful strcasecmp call is still cheaper than your
   "recommended way" of copying and converting the string since
   strcasecmp does no copying and requires no additional function
   call overhead.
5. successful lookup, hash collisions.
   In this least likely scenario (if there are many hash collisions,
   an alternative to hash lookup, such as binary search, may be more
   efficient), there may be multiple strcasecmp calls. Each call will
   return at the first mismatch. Unlike your "recommended way", there
   is no string copying and no additional function call overhead. The
   patch is likely to provide better performance than your "recommended
   way" except in case of quite long hash chains on long keywords with
   identical prefixes (in which case, hash lookup is not a good method).
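
Rather than argue the relative costs in the abstract, the core
difference (the hash computation is identical under both schemes) can
be checked with a self-contained micro-benchmark along the following
lines.  The keyword, key, buffer size and iteration count are
placeholders; the absolute numbers are meaningless, only the relative
cost of the copy-and-convert step is of interest.

  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>
  #include <strings.h>
  #include <time.h>

  #define MAXLEN 31
  #define ITERS  10000000UL

  int
  main (void)
  {
      static const char keyword[] = "mime-version";  /* pre-lowercased */
      static const char key[]     = "MIME-Version";  /* as received    */
      unsigned long i, matches = 0;
      clock_t t0, t1, t2;

      t0 = clock ();
      for (i = 0; i < ITERS; i++)          /* patched scheme            */
          if (strcasecmp (key, keyword) == 0)
              matches++;
      t1 = clock ();
      for (i = 0; i < ITERS; i++)          /* copy-and-convert scheme   */
      {
          char buf[MAXLEN + 1];
          size_t j, len = strlen (key);

          for (j = 0; j < len && j < MAXLEN; j++)
              buf[j] = (char) tolower ((unsigned char) key[j]);
          buf[j] = '\0';
          if (strcmp (buf, keyword) == 0)
              matches++;
      }
      t2 = clock ();

      printf ("strcasecmp:   %.2fs\ncopy+convert: %.2fs\n(%lu matches)\n",
              (double) (t1 - t0) / CLOCKS_PER_SEC,
              (double) (t2 - t1) / CLOCKS_PER_SEC, matches);
      return 0;
  }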
   
> Other than that, the use of strcasecmp and strncasecmp may not work if
> the keyword set contains 8-bit characters and the locale at run time
> is set differently than at compile time.

For the specific applications mentioned, characters outside of the
US-ASCII range are forbidden and therefore irrelevant.  As mentioned
above, there are compile-time vs. run time issues for your "recommended
way" also, and they are no different in one method vs. the other. In
fact, it may well be impossible to use hash techniques for case-
independent table lookup where the table contains non-ASCII characters,
unless the hash table is built at run-time in the same locale as the
lookup function (i.e. gperf is not applicable in such a case since the
hash table is pre-built).
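
If one wanted to remove even the theoretical locale dependence for
these US-ASCII-only applications, a fixed ASCII-only comparison would
do.  A minimal sketch (the function name is invented; this is not part
of the patch):

  /* ASCII-only, locale-independent comparison: for keyword sets limited
     to US-ASCII, such as mail header field names and domain labels, the
     case folding is fixed, so compile-time and run-time behaviour cannot
     diverge.                                                            */
  static int
  ascii_casecmp (const char *a, const char *b)
  {
      for (;; a++, b++)
      {
          unsigned char ca = (unsigned char) *a;
          unsigned char cb = (unsigned char) *b;

          if (ca >= 'A' && ca <= 'Z') ca += 'a' - 'A';
          if (cb >= 'A' && cb <= 'Z') cb += 'a' - 'A';
          if (ca != cb || ca == '\0')
              return (int) ca - (int) cb;
      }
  }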

> Bruno

Best regards,
  Bruce Lilly


