Re: virtualizing vectors and strings
From: Tom Lord
Subject: Re: virtualizing vectors and strings
Date: Mon, 24 Sep 2001 20:27:02 -0700 (PDT)
    > We've had a lot of discussions about what kind of string
    > representation would be most appropriate. However, we did not come
    > to an agreement, although many favor the UTF-8 representation since
    > it is used in a lot of other projects.
That's very limiting. Network traffic and file names will typically
be UTF-8, for example. Some really string-intensive apps will want to
use UTF-32. A lot of files and other data will be in (8-bit) ISO
8859-*. UTF-16 makes a good space/time trade-off in many
circumstances, and can speed up data exchange with Java.
Encoding form changes are far from free. Therefore, you'll want
several low-level string representations and the ability to mix and
match them within a single application.
This is close to your suggestion:
    > Using an approach similar to smobs will allow loading optimized
    > implementations for certain string representations on demand, as in
    > my case for the ISO Latin-1 representation. Users can then,
    > depending on their locale, choose for themselves which
    > representations they will use internally.
I think that's essentially the right idea, but I would do it below the
Scheme interpreter, since the need is not unique to Scheme.
Dynamically loading implementations is very likely to be overkill:
they won't be *that* large; if you have specialized embedded
applications in mind, I'd worry more about clean static linking of
implementation subsets. (Is that clear or too terse?)
The run-time system project for which I sent you an announcement and
call for support and participation includes fully general low-level
string ops for all popular encoding forms (and for mixing and matching
encoding forms). A clean API and implementation are needed for C as
well as Scheme. I have a spec somewhere for the lower-level API, and
a rough (not yet tested, not yet complete) implementation. I have
some clean (tested, probably stable) code in release for the core of
a Unicode character properties database.
Unicode complicates strings a lot (e.g. `string-ref' in the face of
surrogates in UTF-16). Strings have gone from being slightly
specialized arrays to being something much trickier. I think it
is worth dealing with the complexity since the Unicode consortium
genuinely seems to be solving the universal character set problem
well, for the long-term. It does take some effort to puzzle out
the clean-as-possible core of their design from the parts that
you can safely ignore.
See? You really want a libc replacement under your Scheme
implementation. (And Unicode strings are just one reason, of many.)
As a point of curiosity, do any of the core Guile hackers have a copy
of the Unicode standard (v 3.0 or later)? Is the Guile project or any
core hacker a consortium member? Are you all familiar with the
resources available at www.unicode.org?
-t