guile-devel

Re: about strings, symbols and chars.


From: Dirk Herrmann
Subject: Re: about strings, symbols and chars.
Date: Thu, 30 Nov 2000 12:54:42 +0100 (MET)

On 29 Nov 2000, Jim Blandy wrote:

> In August 1999 I set up a plan to add multilingual text support to
> Guile.  The plan was intended to allow Guile to arbitrarily mix text
> from different languages in strings.

For the curious:  the proposal can be found under:
  guile-doc, ref/mbapi.texi

BTW:  I don't quite see the problem with using several different
fixed-width encodings instead of a multibyte encoding.  It is claimed that
"with n different fixed-string encodings, users would have to write n
versions of any code that manipulates strings directly".  I don't
understand this claim, or, stated differently, I don't see why it
shouldn't be possible to have a generic API working on different
fixed-width encodings.  In the following example I assume that the
encoding width is stored in the most significant bits of the string
object's type cell and that only widths up to 4 are possible.  The strings
in the example have a maximum length of 2^22 characters.  This is just an
example.  A different string object layout can be chosen, where there are
no such restrictions.

#define SCM_STRING_LENGTH(s) ((SCM_CELL_WORD_0 (s) >> 8) & 0x3fffff)
#define SCM_STRING_ENCODING_WIDTH(s) (((SCM_CELL_WORD_0 (s) >> 30) & 3) + 1)
#define SCM_STRING_BASE(s) ((unsigned char *) SCM_CELL_WORD_1 (s))
#define SCM_CHAR_GET(p, w) \
  ((w) == 1 ? (unsigned long int) (* (unsigned char *) (p)) \
            : ((w) == 2 ? (unsigned long int) (* (unsigned short *) (p)) \
                        : (* (unsigned long int *) (p))))

SCM
compare (SCM str1, SCM str2)
{
    unsigned long int length1 = SCM_STRING_LENGTH (str1);
    unsigned long int length2 = SCM_STRING_LENGTH (str2);
    unsigned char *base1 = SCM_STRING_BASE (str1);
    unsigned char *base2 = SCM_STRING_BASE (str2);
    unsigned int width1 = SCM_STRING_ENCODING_WIDTH (str1);
    unsigned int width2 = SCM_STRING_ENCODING_WIDTH (str2);
    unsigned long int i;

    if (length1 != length2) return SCM_BOOL_F;
    for (i = 0; i != length1; ++i)
      {
        scm_char_t c1 = SCM_CHAR_GET (base1, width1);
        scm_char_t c2 = SCM_CHAR_GET (base2, width2);
        if (c1 != c2) return SCM_BOOL_F;
        base1 += width1;
        base2 += width2;
      }
    return SCM_BOOL_T;
}
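
For anyone who wants to try the idea outside of Guile, here is a minimal
standalone sketch.  The names fw_string, fw_char_get and fw_equal are
hypothetical and belong neither to Guile nor to the proposal.  The struct
merely stands in for the cell layout assumed above, characters are assumed
to be stored in host byte order, and only the widths 1, 2 and 4 are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a string object; not Guile's real layout. */
typedef struct {
    size_t length;              /* number of characters, not bytes */
    unsigned int width;         /* bytes per character: 1, 2 or 4 */
    const unsigned char *base;  /* length * width bytes of character data */
} fw_string;

/* Fetch the character stored at p in `width` bytes (host byte order).
   memcpy is used so that unaligned data is read safely. */
static uint32_t
fw_char_get (const unsigned char *p, unsigned int width)
{
    uint16_t c16;
    uint32_t c32;

    switch (width) {
    case 1:
        return *p;
    case 2:
        memcpy (&c16, p, sizeof c16);
        return c16;
    default:                    /* assumed to be width 4 */
        memcpy (&c32, p, sizeof c32);
        return c32;
    }
}

/* Return 1 if a and b hold the same character sequence, 0 otherwise.
   The two strings may use different widths. */
static int
fw_equal (fw_string a, fw_string b)
{
    const unsigned char *pa = a.base;
    const unsigned char *pb = b.base;
    size_t i;

    if (a.length != b.length)
        return 0;
    for (i = 0; i != a.length; ++i) {
        if (fw_char_get (pa, a.width) != fw_char_get (pb, b.width))
            return 0;
        pa += a.width;          /* each pointer advances by its own width */
        pb += b.width;
    }
    return 1;
}
```

A single fw_equal covers every combination of widths, because the width is
only ever consulted inside fw_char_get and in the pointer increments.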

Maybe there are situations where things are not that simple.  And, maybe
there is some performance overhead due to the implementation of
SCM_CHAR_GET, but I doubt that an implementation of scm_mb_get (as
described in the cited proposal) can be any cheaper.  Can anybody
enlighten me about situations where implementors would really have to
provide n different implementations, and where that can't be solved by
providing a carefully chosen set of macros as shown in the example?

(Note:  The code above can be sped up if both strings that are to be
compared have the same width and use the same encoding.  For this special
case it may make sense to have a second implementation that does not use
SCM_CHAR_GET, but instead just does a memcmp.  However, these are just
_two_ implementations, not _n_, and the second one is an optional
performance improvement.)
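
The optional fast path from the note could look roughly like the following
standalone sketch.  The names fw_buf and fw_equal_same_width are
hypothetical.  The sketch assumes that equal width implies the same
encoding, which the note lists as a separate condition, and it leaves the
mixed-width case to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-width string: `length` characters of `width` bytes. */
typedef struct {
    size_t length;
    unsigned int width;
    const unsigned char *base;
} fw_buf;

/* Fast path for strings of equal width.
   Returns 1 (equal) or 0 (different) when the widths match, and -1 when
   they do not, in which case the caller has to fall back to a
   character-by-character comparison. */
static int
fw_equal_same_width (fw_buf a, fw_buf b)
{
    if (a.width != b.width)
        return -1;
    if (a.length != b.length)
        return 0;
    /* Same width and same length: the byte images can be compared directly. */
    return memcmp (a.base, b.base, a.length * a.width) == 0;
}
```

Unlike strcmp, memcmp takes an explicit byte count, so embedded null
characters do not cut the comparison short.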

> There are two possible interesting encodings:
[...]
> But the really fascinating thing is, you don't actually need to
> choose.  Both these encodings have all the nice properties you need to
> manipulate them easily in C code:
> 
> - You can use them both in null-terminated strings.  strcpy and strlen
>   work.
> - You can use strchr and memchr to scan them for ASCII characters.
> - You can use strstr to search them for arbitrary substrings, even
>   substrings containing multi-byte characters.
> - You never need to maintain "state" while scanning a string; you can
>   always find character boundaries in finite time, just given a
>   pointer into the middle of a string.
> - You can tell how many bytes a character's encoding takes by looking
>   at the first byte alone.

If guile switches to a copy-on-write string representation with implicitly
shared substrings, then a guile string is not necessarily null-terminated,
because any string may point into the middle of some other string.  There
will always be a null at some point, namely where the superstring ends, but
that null does not mark the end of the substring, so we would still have to
be careful when using standard C library string handling functions.
However, this is already true today, since scheme strings can contain null
characters.
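
To make the pitfall concrete, here is a small standalone sketch.  The type
substr_view and the function view_find_byte are hypothetical.  They model a
shared substring as a pointer plus an explicit length.  strlen applied to
such a view runs on into the superstring, whereas the length-taking
functions such as memchr stay inside the view and also cope with embedded
nulls.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of a substring that shares storage with a superstring. */
typedef struct {
    const char *start;  /* may point into the middle of another string */
    size_t length;      /* number of bytes that belong to this view */
} substr_view;

/* Return the offset of the first byte equal to c inside the view,
   or -1 if there is none.  Bytes beyond the view are never examined. */
static ptrdiff_t
view_find_byte (substr_view v, int c)
{
    const char *hit = memchr (v.start, c, v.length);
    return hit ? hit - v.start : -1;
}
```

With a view covering only the first 5 bytes of "hello world", strlen on
the start pointer still reports 11, while view_find_byte does not find the
'w' that lies outside the view.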

Dirk Herrmann



