guile-devel

Re: string-map arg order


From: Dirk Herrmann
Subject: Re: string-map arg order
Date: Fri, 7 Sep 2001 00:52:33 +0200 (MEST)

On 6 Sep 2001, Alex Shinn wrote:

>     Dirk> Again: Both string representations can be used to perform
>     Dirk> the same kind of tasks.  Fixed width encodings can use up
>     Dirk> more memory and may need additional effort converting from
>     Dirk> and to them (given that the rest of the world uses a
>     Dirk> different format).  Iteration and a lot of other operations
>     Dirk> can, however, be performed in a comparably efficient way.
>     Dirk> Variable width encodings can be more space efficient, but
>     Dirk> many (and many very common) operations are worse (by a
>     Dirk> complexity factor of O(n)) with respect to execution time.
>     Dirk> You could, however, save some conversion overhead if the
>     Dirk> rest of the world uses the same format.
> 
> All true.  I'd like to add that conversions are not only forced by
> external considerations, but internal ones (e.g. appending different
> byte-sized strings).  Also, performance is not the only concern.
> There's the simplicity of the API (Guile used to have multiple string
> representations, but no one used them), and the ease with which people
> can upgrade.  There's also the complexity of the coding involved, and
> the fact that everything using strings (symbols, ports, regexps, etc.)
> will have to be able to handle all forms of strings.

I have been working with guile for some years now, and I have heard that
argument before.  The multiple string representations cited in these
discussions were never part of guile during the time I have worked with
it, and I have never taken the time to look for them in old archives.
But I claim that such an interface can also be defined in such a way that
people write _one_ piece of code for all representations.

Guile is today much cleaner than it was some years ago:  We now have
SCM_STRING_CHARS and SCM_SYMBOL_CHARS where before we only had SCM_CHARS,
and in some places even SCM_VELTS had been used for strings (ugh!).  We
have SCM_STRING_LENGTH, SCM_SYMBOL_LENGTH, SCM_VECTOR_LENGTH and some
more, which before were all merged into SCM_LENGTH.  We now have only one
symbol type, where before we had ssymbols and msymbols.  And so on...
Thus, switching to a different representation for most of guile's internal
data types is much easier than before.  In this respect I don't want to
base today's design decisions on claims about historical attempts to
implement some other string representation.


Probably the best solution would be to give both approaches a try, measure
the performance and memory implications and then, again, go into a new
round of discussions.


And, btw., there is an even more general solution:  Virtualizing the
string interface to allow for _multiple_ string representations behind a
common interface.  The virtual function table could hold the following
entries:

  - read character #n, the result would be a scheme character object
  - write a given scheme character object into character position #n

This should be sufficient to implement almost everything, but for the sake
of performance, there could certainly be more functions that do longer
operations in one go, like computing a substring, filling a string,
comparing strings, capitalizing, downcasing, upcasing, copying, appending,
converting into a predefined set of standard representations like UTF-8,
ASCII, ISO Latin-1, ...
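
To make the idea concrete, here is a minimal sketch of such a table with
just the two mandatory entries.  All names (sketch_string, sketch_char,
the vtable layout) are hypothetical and not existing guile API, and a
plain uint32_t stands in for a scheme character object.  Two fixed width
representations are shown; generic code such as the fill function only
goes through the table and so works for both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is existing guile API. */
typedef uint32_t sketch_char;   /* stands in for a scheme character object */

struct sketch_string;

/* The two mandatory entries of the virtual function table. */
struct sketch_string_vtable {
  sketch_char (*ref) (const struct sketch_string *s, size_t n);
  void (*set) (struct sketch_string *s, size_t n, sketch_char c);
};

struct sketch_string {
  const struct sketch_string_vtable *vt;
  size_t length;                /* length in characters */
  void *data;                   /* representation-specific storage */
};

/* Representation 1: one byte per character. */
static sketch_char
narrow_ref (const struct sketch_string *s, size_t n)
{
  assert (n < s->length);
  return ((const unsigned char *) s->data)[n];
}

static void
narrow_set (struct sketch_string *s, size_t n, sketch_char c)
{
  assert (n < s->length && c <= 0xFF);
  ((unsigned char *) s->data)[n] = (unsigned char) c;
}

/* Representation 2: four bytes per character. */
static sketch_char
wide_ref (const struct sketch_string *s, size_t n)
{
  assert (n < s->length);
  return ((const uint32_t *) s->data)[n];
}

static void
wide_set (struct sketch_string *s, size_t n, sketch_char c)
{
  assert (n < s->length);
  ((uint32_t *) s->data)[n] = c;
}

static const struct sketch_string_vtable narrow_vtable = { narrow_ref, narrow_set };
static const struct sketch_string_vtable wide_vtable = { wide_ref, wide_set };

/* Generic code: written once, works for every representation. */
static void
sketch_string_fill (struct sketch_string *s, sketch_char c)
{
  size_t i;
  for (i = 0; i < s->length; ++i)
    s->vt->set (s, i, c);
}
```

A representation that can fill its storage faster than one virtual call
per character would add its own fill entry to the table; the generic
version above then serves as the fallback.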

Comparing strings, for example, would work as follows:

/* This function is called if s1 is known to be a utf8 string.  Nothing
 * is known about the string type of s2.  It is just known to be a string.
 * Generic checks, like whether the two strings have the same length, have
 * been performed before the type dispatch.  */
SCM
string_equal_utf8_unknown_p (SCM s1, SCM s2)
{
  if (SCM_STRING_ENCODING_UTF8_P (s2))
    return string_equal_utf8_utf8 (s1, s2);  /* this should be fast! */
  else {
    const char *utf_ptr = SCM_STRING_CHARS (s1);
    size_t i;
    for (i = 0; i != SCM_STRING_LENGTH (s1); ++i) {
      /* here, we can access the utf8 characters of s1 with maximum
       * performance (by advancing utf_ptr), but may have to use the
       * virtual character access function for each character in s2  */
      ...
    }
    /* assuming the loop body returns SCM_BOOL_F on the first mismatch */
    return SCM_BOOL_T;
  }
}
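
For illustration, the slow path of the loop could look roughly like the
following self-contained sketch.  It uses plain C types instead of SCM,
and all names (sk_string, utf8_next, equal_utf8_unknown) are made up for
this example.  The decoder assumes valid UTF-8 input, and the length
check is repeated inside the function only to keep the sketch standalone.
The point is that s1 is walked sequentially with a pointer, which costs
O(1) per character, while s2 is only touched through its virtual access
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object; only "read character #n" is virtual here. */
struct sk_string {
  uint32_t (*ref) (const struct sk_string *s, size_t n);
  size_t length;                /* length in characters */
  const void *data;             /* representation-specific storage */
};

/* Decode one character from valid UTF-8 at *p and advance *p past it. */
static uint32_t
utf8_next (const unsigned char **p)
{
  const unsigned char *s = *p;
  uint32_t c = *s++;
  int extra = 0;                /* number of continuation bytes */

  if (c >= 0xF0) { c &= 0x07; extra = 3; }
  else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
  else if (c >= 0xC0) { c &= 0x1F; extra = 1; }
  while (extra-- > 0)
    c = (c << 6) | (*s++ & 0x3F);
  *p = s;
  return c;
}

/* Virtual access for UTF-8: has to scan from the start, hence O(n). */
static uint32_t
utf8_ref (const struct sk_string *s, size_t n)
{
  const unsigned char *p = s->data;
  uint32_t c = 0;
  size_t i;

  assert (n < s->length);
  for (i = 0; i <= n; ++i)
    c = utf8_next (&p);
  return c;
}

/* Virtual access for a four-bytes-per-character string: O(1). */
static uint32_t
wide_ref (const struct sk_string *s, size_t n)
{
  assert (n < s->length);
  return ((const uint32_t *) s->data)[n];
}

/* s1 is known to be UTF-8, s2 is any string.  Returns 1 if equal. */
static int
equal_utf8_unknown (const struct sk_string *s1, const struct sk_string *s2)
{
  const unsigned char *p = s1->data;
  size_t i;

  if (s1->length != s2->length)
    return 0;
  for (i = 0; i != s1->length; ++i)
    if (utf8_next (&p) != s2->ref (s2, i))  /* direct on s1, virtual on s2 */
      return 0;
  return 1;
}
```

If s2 happens to be UTF-8 as well, every virtual access rescans s2 from
the start, which is exactly the case the dedicated utf8/utf8 fast path
is meant to avoid.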


Best regards
Dirk Herrmann



