guile-devel
Re: about strings, symbols and chars.


From: Michael Livshin
Subject: Re: about strings, symbols and chars.
Date: 24 Dec 2000 15:06:38 +0200
User-agent: Gnus/5.0807 (Gnus v5.8.7) XEmacs/21.1 (20 Minutes to Nikko)

"Jorgen 'forcer' Schaefer" <address@hidden> writes:

> I think that the whole problem of multi-byte vs. fixed-byte
> encoding is not much of a performance issue.  Fixed-byte strings
> are "simpler", and can be accessed randomly without performance
> overhead (you could provide a macro which extracts the width of a
> given string), but have problems regarding memory usage.  A
> single non-latin-1 character in a 4k string would make the whole
> string take up 8k (8bit to 16bit expansion), while in multi-byte
> it requires 4k+1 bytes (long strings are rather uncommon in
> usage, though).  Multi-byte strings have problems when it comes
> to setting the value of characters -- you might have to copy the
> rest of the string if its size differs from the previous
> character size.  Fixed-width strings need only be copied if you
> put in a character which needs a "bigger" encoding than you had
> available before.
>
> [...]
> 
> Concluding, there's not much difference between the two
> representations.  I know this is a long mail just to say "hey,
> it's not much of a difference", but i guess i had to write it.
> Maybe someone can show me where i overlooked something?

I don't think you overlooked anything, per se.  but.

there was a very enlightening flame wa^W^Wdiscussion in the
emacs-related newsgroups recently, which provided (to me, at least) a
different way to look at the whole wide character/Unicode/MULE mess.

basically, in Emacs you have three kinds of objects dealing with
characters.  they are characters themselves, strings and buffers.

* characters themselves need no discussion: there's nothing
  to compress.

* strings, at least in Emacs, don't tend to be very big (there are
  buffers, after all), so they can be just vectors of fixed-width wide 
  characters.

* buffers are the most interesting case.  they can be *big*.  so there 
  is a variety of choices.

** dumb fixed-width: enormous space overhead, but random access is of
   constant complexity.

** variable-width (like UTF-8 or somesuch): modest (none for pure
   ASCII text) space overhead, random access is of linear complexity
   (optimizations are possible).  this is actually close to MULE,
   performance-wise.  I didn't see anyone complaining about the
   performance of Emacs20.

** smart fixed-width.  if you look at your Emacs buffers, you'll see
   that most of them contain no more than 256 different character
   codes (128, even more likely, but that doesn't save us anything).
   so you can have per-buffer code maps, such that for most buffers
   you'll need only 1 byte per character.  if that's not enough, then
   2 bytes, and probably never more than that.
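
To make the linear cost of random access in the variable-width case concrete, here is a minimal Python sketch, assuming the buffer is held as raw UTF-8 bytes. The function names `utf8_char_offset` and `utf8_char_at` are hypothetical and not taken from Emacs, MULE or Guile. Finding the Nth character means scanning from the start and counting lead bytes:

```python
def utf8_char_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the index-th character in UTF-8 data.

    Cost is linear in the offset: every byte before the target is inspected.
    """
    if index < 0:
        raise IndexError(index)
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


def utf8_char_at(data: bytes, index: int) -> str:
    """Decode and return the index-th character of UTF-8 data."""
    start = utf8_char_offset(data, index)
    end = start + 1
    # Extend over the continuation bytes that belong to this character.
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[start:end].decode("utf-8")
```

One possible optimization (an assumption here, not something described in the thread) would be to remember the byte offset of the most recently accessed character, so that nearby accesses scan only a short distance.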

now, since the vast majority of strings are very small, having
per-string fixed-width encodings is probably too much hair, but for
buffers it seems very reasonable.
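
The per-buffer code map idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `CodeMappedBuffer` and its methods are hypothetical, the buffer is append-only, and it widens from 1 to 2 bytes per character when the 257th distinct character shows up. It stops at 2 bytes, in line with the "probably never more than that" guess above.

```python
from array import array


class CodeMappedBuffer:
    """Hypothetical buffer storing fixed-width indices into a per-buffer code map."""

    def __init__(self, text=""):
        self.codes = []          # code map: index -> character
        self.index_of = {}       # reverse map: character -> index
        self.cells = array("B")  # start with 1 byte per character
        for ch in text:
            self.append(ch)

    def _intern(self, ch):
        # Assign the next free index to a character not seen in this buffer yet.
        idx = self.index_of.get(ch)
        if idx is None:
            idx = len(self.codes)
            self.codes.append(ch)
            self.index_of[ch] = idx
        return idx

    def append(self, ch):
        idx = self._intern(ch)
        if idx > 255 and self.cells.typecode == "B":
            # More than 256 distinct characters: copy every cell into a
            # wider array ("H" is 2 bytes on common platforms).
            self.cells = array("H", self.cells.tolist())
        self.cells.append(idx)

    def __getitem__(self, i):
        # Constant-time random access: one array lookup, one map lookup.
        return self.codes[self.cells[i]]

    def __len__(self):
        return len(self.cells)

    def bytes_per_char(self):
        return self.cells.itemsize
```

A buffer built from `"hello שלום"` holds 9 distinct characters, so it stays at 1 byte per character even though the text mixes scripts; the width grows only once the buffer has seen more than 256 distinct characters.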

so the question is, really, what we want from Guile i18n.  if the
emphasis is mostly on the future use in Emacs, then we have one set of
tradeoffs.  if the emphasis is on purely Scheme programming support
(no buffers, potentially enormous strings), then we have another set
of tradeoffs.  on the other hand, large strings surely deserve a
different representation than small strings, so the buffer-style
considerations may not be out of place for them, too.

we should also look at the new Common Lisp internationalization spec
(Allegro CL documentation includes it).

-- 
REALITY is an illusion that stays put.



