Re: about strings, symbols and chars.
From: Michael Livshin
Subject: Re: about strings, symbols and chars.
Date: 24 Dec 2000 15:06:38 +0200
User-agent: Gnus/5.0807 (Gnus v5.8.7) XEmacs/21.1 (20 Minutes to Nikko)

"Jorgen 'forcer' Schaefer" <address@hidden> writes:
> I think that the whole problem of multi-byte vs. fixed-byte
> encoding is not much of a performance issue. Fixed-byte strings
> are "simpler", and can be accessed randomly without performance
> overhead (you could provide a macro which extracts the width of a
> given string), but have problems regarding memory usage. A
> single non-latin-1 character in a 4k string would make the whole
> string take up 8k (8bit to 16bit expansion), while in multi-byte
> it requires 4k+1 bytes (long strings are rather uncommon in
> usage, though). Multi-byte strings have problems when it comes
> to setting the value of characters -- you might have to copy the
> rest of the string if its size differs from the previous
> character size. Fixed-width strings need only be copied if you
> put in a character which needs a "bigger" encoding than you had
> available before.
>
> [...]
>
> Concluding, there's not much difference between the two
> representations. I know this is a long mail just to say "hey,
> it's not much of a difference", but i guess i had to write it.
> Maybe someone can show me where i overlooked something?

I don't think you overlooked anything, per se. but.

there was a very enlightening flame wa^W^Wdiscussion in the
emacs-related newsgroups recently, which provided (to me, at least) a
different way to look at the whole wide character/Unicode/MULE mess.

basically, in Emacs you have three kinds of objects dealing with
characters. they are characters themselves, strings and buffers.

* characters themselves are not worth any discussion, there's nothing
to compress.
* strings, at least in Emacs, don't tend to be very big (there are
buffers, after all), so they can be just vectors of fixed-width wide
characters.
* buffers are the most interesting case. they can be *big*. so there
is a variety of choices.
** dumb fixed-width: enormous space overhead, but random access is of
constant complexity.
** variable-width (like UTF-8 or somesuch): modest (none for US/UK)
space overhead, random access is of linear complexity
(optimizations are possible). this is actually close to MULE,
performance-wise. I didn't see anyone complaining about the
performance of Emacs20.
** smart fixed-width. if you look at your Emacs buffers, you'll see
that most of them contain no more than 256 different character
codes (128, even more likely, but that doesn't save us anything).
so you can have per-buffer code maps, such that for most buffers
you'll need only 1 byte per character. if that's not enough, then
2 bytes, and probably never more than that.
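
As a rough illustration of the first two buffer choices above (this is not from the original discussion), the Python sketch below compares the storage needed for a 4096-character text containing a single non-Latin-1 character, and shows why fetching the Nth character from UTF-8 data needs a scan. The helper names `utf8_seq_len` and `utf8_char_at` are hypothetical, and the scan assumes well-formed UTF-8 with no optimizations such as cached positions.

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at character position `index`.

    Every earlier character has to be skipped one by one, so the cost
    grows linearly with `index`.
    """
    pos = 0
    for _ in range(index):
        pos += utf8_seq_len(data[pos])
    end = pos + utf8_seq_len(data[pos])
    return data[pos:end].decode("utf-8")


def fixed32_char_at(data: bytes, index: int) -> str:
    """Constant-time access: the byte offset is simply 4 * index."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")


# 4095 ASCII characters followed by one euro sign (U+20AC, not in Latin-1).
text = "a" * 4095 + "\u20ac"
fixed32 = text.encode("utf-32-le")  # "dumb fixed-width", 4 bytes per character
utf8 = text.encode("utf-8")         # variable-width

print(len(fixed32), len(utf8))      # 16384 4098
```

Both accessors return the same character for the same index; only the amount of work needed to find it differs.
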
now, since the vast majority of strings are very small, having
per-string fixed-width encodings is probably too much hair, but for
buffers it seems very reasonable.
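
One way to picture the "smart fixed-width" idea is the minimal Python sketch below. It is only an assumption about how a per-buffer code map might look, not how Emacs or Guile actually store text; the class name `CodeMappedBuffer` and all of its methods are made up for illustration. Each distinct character gets a small per-buffer code, the buffer stores codes at 1 byte each, and the whole buffer is widened to 2 bytes per code once a 257th distinct character shows up.

```python
from array import array


class CodeMappedBuffer:
    """Hypothetical buffer that stores per-buffer codes, not code points."""

    def __init__(self, text: str = "") -> None:
        self._chars = []          # code -> character (the per-buffer code map)
        self._codes = {}          # character -> code
        self._data = array("B")   # start with 1 byte per character
        for ch in text:
            self.append(ch)

    def _code_for(self, ch: str) -> int:
        code = self._codes.get(ch)
        if code is None:
            code = len(self._chars)
            self._chars.append(ch)
            self._codes[ch] = code
            if code == 256:
                # 257th distinct character: copy the buffer into 2-byte cells.
                # This sketch stops there; a code that does not fit in a
                # 2-byte cell makes the array raise OverflowError.
                self._data = array("H", list(self._data))
        return code

    def append(self, ch: str) -> None:
        code = self._code_for(ch)   # may replace self._data with a wider array
        self._data.append(code)

    def __getitem__(self, index: int) -> str:
        # Constant-time random access: one array lookup, one map lookup.
        return self._chars[self._data[index]]

    def __setitem__(self, index: int, ch: str) -> None:
        code = self._code_for(ch)
        self._data[index] = code

    def __len__(self) -> int:
        return len(self._data)

    @property
    def bytes_per_char(self) -> int:
        return self._data.itemsize


buf = CodeMappedBuffer("hello w\u00f6rld")
print(buf.bytes_per_char, buf[7])   # one byte per character so far
```

Replacing a character never shifts the rest of the buffer; the only expensive operation is the one-time widening copy, which is the tradeoff the paragraph above considers reasonable for buffers but too much machinery for small strings.
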
so the question is, really, what we want from Guile i18n. if the
emphasis is mostly on the future use in Emacs, then we have one set of
tradeoffs. if the emphasis is on purely Scheme programming support
(no buffers, potentially enormous strings), then we have another set
of tradeoffs. on the other hand, large strings surely deserve a
different representation than small strings, so the buffer-style
considerations may not be out of place for them, too.

we should also look at the new Common Lisp internationalization spec
(Allegro CL documentation includes it).

--
REALITY is an illusion that stays put.