guile-devel

Re: string-map arg order


From: Alex Shinn
Subject: Re: string-map arg order
Date: 04 Sep 2001 20:55:05 -0400
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/21.0.104

>>>>> "Dirk" == Dirk Herrmann <address@hidden> writes:

    Dirk> Further, you would not start by making everything utf-32.
    Dirk> Rather, you would start with a 1-byte width and only
    Dirk> increase width as necessary, which is at most 2 times:
    Dirk> 1->2->4.  With a variable width encoding, you may have to
    Dirk> increase the size (n * (m-1)) times, n being the string
    Dirk> length, m being the maximum character width.  Further,

In the context of multi-threading, I'm not sure resizing is even an
option.  For whatever API we choose, ultimately external C library
functions will be given a pointer to the characters (char or wchar) of
a string.  If we reallocate the string from another thread, that
pointer will then be invalid.

An alternative implementation is to always allocate 4 bytes per
character (with *either* a fixed- or variable-width encoding), and
expand in place as needed.  Why work with single-byte strings in 4x
the space?  So that you don't have to convert when passing to C
functions.

What steered me away most from fixed-width encodings is coming up
with a decent API.  The rest of the world (other languages, GTK,
FreeType, Linux itself) is moving to utf8 - if we choose another
encoding, we'll have to convert data types back and forth constantly.
And the possibility of different string types or wide strings means
all current extensions would have to update to the new API right
away - with utf8 they're safe so long as they stick to ASCII, and
could upgrade at their leisure.
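
To illustrate the "safe so long as they stick to ASCII" point, here is a
minimal sketch in Python (not Guile code; the helper name is_ascii_only is
hypothetical).  It relies only on two properties of UTF-8: ASCII text
encodes to the same bytes it always had, and every byte of a multi-byte
sequence has its high bit set, so it cannot be mistaken for an ASCII
character.

```python
# Sketch only: shows why byte-oriented code that handles pure ASCII
# keeps working unchanged when the string representation is UTF-8.

def is_ascii_only(data: bytes) -> bool:
    """True if every byte is a plain 7-bit ASCII character."""
    return all(b < 0x80 for b in data)

ascii_text = "string-map"
mixed_text = "na\u00efve \u6587\u5b57"   # contains 2-byte and 3-byte characters

ascii_bytes = ascii_text.encode("utf-8")
mixed_bytes = mixed_text.encode("utf-8")

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = ascii_bytes == ascii_text.encode("ascii")

# Every byte belonging to a non-ASCII character is >= 0x80.
non_ascii_bytes_are_high = all(
    b >= 0x80
    for ch in mixed_text if ord(ch) > 0x7F
    for b in ch.encode("utf-8")
)
```

An extension that only ever sees ASCII data would observe exactly the
bytes it sees today; only code that indexes into non-ASCII text by
character would need to change.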

One of the major problems seems to be that efficient handling of
unicode and an easy API are conflicting goals.

    Dirk> with a variable width encoding, the function make-string
    Dirk> _requires_ that you initialize the allocated string with
    Dirk> characters

Yes, my current implementation already does this.  I don't really
think it's significantly slower (just a memset), nor is make-string
likely to be something we need to make optimal.
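
As a rough illustration of why initialization is required, here is a
sketch in Python (the name make_string_utf8 is hypothetical and is not
the actual Guile implementation).  With a variable-width encoding the
buffer has to contain valid encoded characters for the string to have a
well-defined length, so the fill character is encoded once and repeated.
For a single-byte fill character this amounts to a memset.

```python
# Sketch only: a make-string analogue for a UTF-8 backed string.

def make_string_utf8(length: int, fill: str = " ") -> bytearray:
    """Return a UTF-8 buffer holding `length` copies of `fill`."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    unit = fill.encode("utf-8")        # 1 to 4 bytes
    # For a 1-byte fill this is equivalent to memset(buf, fill, length).
    return bytearray(unit * length)

blank = make_string_utf8(5)            # 5 characters, 5 bytes
accented = make_string_utf8(3, "\u00e9")  # 3 characters, 6 bytes
```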

    [...]

    Dirk> I don't quite understand your point about a 'dirty bit':
    Dirk> This does not work without using mutexes anyway, so what is
    Dirk> the difference?

Agreed.  All of my suggestions about "more room for optimization" were
very hand-wavy and not thought out at all.  Please disregard them, and
until someone comes up with concrete algorithms to the contrary, let's
assume that with a variable width encoding indexed string access is
O(n) and an index-based string-for-each is O(n^2).  There are a lot of
other factors to take into account, such as the price of the frequent
conversions that would be caused when going from a non-utf8
representation to any external library and/or Internet protocol.
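
To make the complexity assumption concrete, here is a sketch in Python
(all function names are hypothetical, and it assumes string-for-each is
built naively on top of indexed access).  Finding character k in a UTF-8
buffer means skipping k characters from the start, so looping over every
index skips 0 + 1 + ... + (n-1) characters in total, while a sequential
walk touches each character once.

```python
# Sketch only: indexed access versus sequential traversal over UTF-8.

def utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence starting with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def utf8_ref(buf: bytes, k: int):
    """Return (character k, number of characters skipped to reach it)."""
    pos = 0
    for _ in range(k):                 # O(k) scan from the start
        pos += utf8_char_len(buf[pos])
    if pos >= len(buf):
        raise IndexError("character index out of range")
    n = utf8_char_len(buf[pos])
    return buf[pos:pos + n].decode("utf-8"), k

def for_each_by_index(buf: bytes, length: int, proc) -> int:
    """Call proc on each character via utf8_ref; return total skips."""
    total_skipped = 0
    for k in range(length):
        ch, skipped = utf8_ref(buf, k)
        total_skipped += skipped       # sums to length*(length-1)/2
        proc(ch)
    return total_skipped

def for_each_sequential(buf: bytes, proc) -> int:
    """Call proc on each character in one pass; return characters visited."""
    pos = 0
    visited = 0
    while pos < len(buf):
        n = utf8_char_len(buf[pos])
        proc(buf[pos:pos + n].decode("utf-8"))
        pos += n
        visited += 1
    return visited
```

The sequential version suggests that string-for-each itself need not be
quadratic if it keeps a byte position between iterations; the O(n^2)
figure applies to code that goes through the index-based accessor.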

But at this point I think I'll round out the utf8 implementation and
put together some benchmarks (lies, I know, but sometimes closer to
the truth than theory is).

-- 
Alex Shinn <address@hidden>


