
Re: virtualizing vectors and strings


From: Tom Lord
Subject: Re: virtualizing vectors and strings
Date: Mon, 24 Sep 2001 20:27:02 -0700 (PDT)

       We've had a lot of discussions what kind of string representation
       would be most appropriate.  However, we did not come to an
       agreement, although many favor the utf-8 representation since it is
       used in a lot of other projects.


That's very limiting.  Network traffic and file names will typically
be UTF-8, for example.  Some really string-intensive apps will want to
use UTF-32.  A lot of files and other data will be in (8-bit) ISO
8859-*.  UTF-16 makes a good space/time trade-off in many
circumstances, and can speed up data exchange with Java.

Encoding form changes are far from free.  Therefore, you'll want
several low-level string representations and the ability to mix and
match them within a single application.
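To make "mix and match" concrete, here is a minimal sketch in C of a string
value that carries its encoding form as a tag, so that one operation can
dispatch on it. Every name here (`str_encoding`, `vstring`,
`vstring_codepoints`) is hypothetical; this is not Guile's API, nor the API
of the run-time system mentioned below. It also assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag naming the encoding form of a string's storage. */
enum str_encoding { ENC_LATIN1, ENC_UTF8, ENC_UTF16, ENC_UTF32 };

/* Hypothetical string value: a tag, the code units, and their count. */
struct vstring {
    enum str_encoding enc;
    const void *units;   /* uint8_t, uint16_t or uint32_t units, per enc */
    size_t n_units;      /* length in code units, not bytes */
};

/* Count code points, dispatching on the encoding form.
   Assumes well-formed input (no stray continuation bytes or
   unpaired surrogates). */
size_t vstring_codepoints(const struct vstring *s)
{
    size_t i, count = 0;

    assert(s != NULL);
    switch (s->enc) {
    case ENC_LATIN1:
    case ENC_UTF32:
        /* Fixed width: one code unit per code point. */
        return s->n_units;
    case ENC_UTF8: {
        const uint8_t *p = s->units;
        /* Count every byte that is not a continuation byte (10xxxxxx). */
        for (i = 0; i < s->n_units; i++)
            if ((p[i] & 0xC0) != 0x80)
                count++;
        return count;
    }
    case ENC_UTF16: {
        const uint16_t *p = s->units;
        /* Count every unit that is not a low (trailing) surrogate. */
        for (i = 0; i < s->n_units; i++)
            if (p[i] < 0xDC00 || p[i] > 0xDFFF)
                count++;
        return count;
    }
    }
    return 0;
}
```

The fixed-width forms answer in constant time, while UTF-8 and UTF-16 need a
scan; that difference is one reason an application might want to choose its
representation per string rather than globally.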

This is close to your suggestion:

  Using a similar approach as with smobs will allow to load optimized
  implementations for certain string representations on demand, like in my
  case for the isolatin-1 representation.  Users can then, depending on
  their location, choose for themselves which representations they will
  use internally.

I think that's essentially the right idea, but I would do it below the
Scheme interpreter, since the need is not unique to Scheme.
Dynamically loading implementations is very likely to be overkill:
they won't be *that* large; if you have specialized embedded
applications in mind, I'd worry more about clean static linking of
implementation subsets.  (Is that clear or too terse?)

The run-time system project for which I sent you an announcement and
call-for-support/participation includes fully general low-level string
ops for all popular encoding forms (and for mixing and matching
encoding forms).  A clean API and implementation are needed for C as
well as Scheme.  I have a spec somewhere for the lower level API, and
some junk (not yet tested, not yet complete) implementation.  I have
some clean (tested, probably stable) code in release for the core of
a Unicode character properties database.

Unicode complicates strings a lot (e.g. `string-ref' in the face of
surrogates in UTF-16).  Strings have gone from being slightly
specialized arrays to being something much trickier.  I think it
is worth dealing with the complexity since the Unicode consortium
genuinely seems to be solving the universal character set problem
well, for the long term.  It does take some effort to puzzle out
the clean-as-possible core of their design from the parts that
you can safely ignore.
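
As an illustration of the `string-ref' problem mentioned above, here is a
minimal sketch in C of indexing a UTF-16 buffer by code point rather than by
code unit. The function name `utf16_ref` and its conventions (returning -1
when out of range, passing unpaired surrogates through unchanged) are
assumptions made for this example, not part of any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the code point at code-point position `index` in a UTF-16
   buffer of `n_units` code units, or -1 if `index` is out of range.
   An unpaired surrogate is returned as-is (an assumption of this sketch). */
long utf16_ref(const uint16_t *s, size_t n_units, size_t index)
{
    size_t i = 0;

    assert(s != NULL || n_units == 0);
    while (i < n_units) {
        uint32_t cp = s[i++];

        /* A high surrogate followed by a low surrogate encodes one
           code point in the range U+10000..U+10FFFF. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n_units
            && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i] - 0xDC00);
            i++;
        }
        if (index == 0)
            return (long)cp;
        index--;
    }
    return -1;
}
```

The scan is linear in the index, so a `string-ref' built this way is no
longer the constant-time array access it is for a fixed-width representation.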

See?  You really want a libc replacement under your Scheme
implementation.  (And Unicode strings are just one reason, of many.)

As a point of curiosity, do any of the core Guile hackers have a copy
of the Unicode standard (v 3.0 or later)?  Is the Guile project or any
core hacker a consortium member?  Are you all familiar with the
resources available at www.unicode.org?

-t




