guile-devel

Re: Unicode and Guile


From: Marius Vollmer
Subject: Re: Unicode and Guile
Date: Wed, 12 Nov 2003 01:06:39 +0100
User-agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux)

Please allow me to randomly dump my thoughts on Guile and Unicode:

- The principal tension that I see is between having a
  memory-efficient representation (UTF-8) and one that is simple and
  concept-compatible with the old way (fixed-width, maybe UTF-32).

- But is there a fixed-width Unicode representation?  I.e., is UTF-32
  just like ASCII only with more bits or is there more to it?  Are
  there combining characters in UTF-32?  If there are, then there is
  no reason to go looking for a fixed-width, old-style text
  representation.

- If we go with a variable width encoding, we can just as well use
  UTF-8 and replace strings/chars with something new, like Tom's
  texts/graphemes.

- What kind of data type are strings anyway?  Vectors or lists?
  Traditionally, they have been mutable vectors, but variable-width
  encoding of 'characters' might force us to rethink this, in general.
  People expect constant time accesses for vector-like things, but we
  will probably not want to guarantee them for a variable-width
  encoding (with integers as indices).

- So the text/grapheme API should maybe be more abstract, and not be
  using integers to refer to graphemes contained in texts but some
  opaque 'iterator', 'subtext' or 'grapheme range' thing.
  
- Shared subtexts or grapheme ranges are easy to do for read-only
  texts, but harder for mutable text.  So texts should maybe be
  immutable by default.  Mutable texts and pointers into them might
  use a more expensive data structure, like a gap buffer.

- For Guile specifically, the problematic thing is the C API.  Right
  now, strings are pretty much fixed to be vectors of unsigned bytes.
  We can't do much about this without breaking code.  So from that
  point of view, a new API for Unicode stuff looks like a good thing
  as well, provided we can convince ourselves that people are willing
  to move over to that new API.

- The representation of texts should be determined by what is most
  natural for existing C code.  E.g., I think that Gtk+ uses UTF-8,
  and if we find that most libraries that we want to access from
  Guile use UTF-8 as well, we should make our text representation
  UTF-8.

- Old code can be supported by allowing string-*, char-*, etc. to work
  on UTF-8 encoded texts that use only ASCII code points.  That will
  cause problems for the 8-bit users (like latin-1, etc.), though.  C
  code must avoid storing non-ASCII characters into such strings, and
  I'm not sure right now whether we can keep it from doing that in a
  compatible way.

- ... :)
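
As a rough illustration of the indexing and ASCII-compatibility points
above, here is a minimal sketch in C.  None of these names exist in
Guile; text_pos, text_pos_next, text_index, text_length and
text_is_ascii are all hypothetical, and the "opaque position" is
assumed to be nothing more than a byte offset.  The sketch works on
code points only, does not validate continuation bytes or overlong
forms, and makes no attempt at grapheme segmentation.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
   or 0 if b cannot start a sequence (e.g. a continuation byte). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical opaque position into a text: here just a byte offset. */
typedef struct { size_t byte_offset; } text_pos;

/* Step pos forward by one code point.  Returns 1 on success, 0 at the
   end of the text or on input this sketch cannot decode. */
int text_pos_next(const char *text, size_t len, text_pos *pos)
{
    size_t n;
    if (pos->byte_offset >= len) return 0;
    n = utf8_seq_len((unsigned char)text[pos->byte_offset]);
    if (n == 0 || pos->byte_offset + n > len) return 0;
    pos->byte_offset += n;
    return 1;
}

/* Integer indexing: find the index-th code point (0-based).  This has
   to walk from the start, so it is linear in index, not constant. */
int text_index(const char *text, size_t len, size_t index, text_pos *out)
{
    text_pos p = { 0 };
    while (index > 0) {
        if (!text_pos_next(text, len, &p)) return 0;
        index--;
    }
    if (p.byte_offset >= len) return 0;
    *out = p;
    return 1;
}

/* Number of code points; stops counting early on undecodable input. */
size_t text_length(const char *text, size_t len)
{
    text_pos p = { 0 };
    size_t count = 0;
    while (text_pos_next(text, len, &p)) count++;
    return count;
}

/* 1 if every byte is ASCII, i.e. the text could still be treated as
   an old-style vector of bytes with one byte per character. */
int text_is_ascii(const char *text, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if ((unsigned char)text[i] >= 0x80) return 0;
    return 1;
}
```

Stepping with text_pos_next costs one small decode per step, while
text_index has to rescan from the beginning on every call, which is
the cost that integer indices would hide.  Note also that
"e\xCC\x81" (e followed by U+0301 COMBINING ACUTE ACCENT) counts as
two code points here even though it displays as one character, and
that would stay true in a fixed-width encoding of code points.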

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405



