guile-user

Re: GNU Guile 1.9.2 released (alpha)


From: Mike Gran
Subject: Re: GNU Guile 1.9.2 released (alpha)
Date: Mon, 17 Aug 2009 12:19:53 -0700 (PDT)

> From: Linas Vepstas <address@hidden>
> 
> 2009/8/15 Ludovic Courtès :
> 
> >  ** Incomplete support for Unicode characters and strings
> >
> >  Internally, strings are now represented either in the `latin-1'
> >  encoding, one byte per character, or in UTF-32, with four bytes per
> >  character.
> 
> Will this eventually move to UTF8? European languages typically
> use only a small handful of non-latin symbols, typically just misc
> punctuation.  A recent dump of Voice of America radio broadcasts
> I ran through guile used misc UTF8 punctuation ... backwards-facing
> double-quotes, ellipsis, etc.  I'd hate to see this common case
> blow up to 32-bits per char just to accommodate stray punctuation.
> 

We've gone back and forth on this.  I think, for the near term, in
terms of my personal involvement with Guile, the answer is 'no'.  

The problem lies in the R6RS requirement that functions like 
string-ref run in 'constant time'.  To implement that with UTF-8
requires that a string internally hold both the character storage
and an index that points to the start of each character in that
storage.
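
To make the "storage plus index" idea concrete, here is a minimal
sketch in C.  It is not Guile's implementation: the indexed_utf8
struct and both function names are hypothetical, and the input is
assumed to be valid, NUL-terminated UTF-8.  It keeps one byte offset
per character, so a string-ref style lookup is a single table access
followed by decoding at most four bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout (not Guile's): UTF-8 bytes plus one byte
   offset per character. */
typedef struct {
    const unsigned char *bytes;  /* UTF-8 storage, assumed valid */
    size_t *offsets;             /* offsets[i] = first byte of character i */
    size_t length;               /* number of characters */
} indexed_utf8;

/* Build the offset table with one O(n) scan for non-continuation bytes.
   Returns 0 on success, -1 if the table cannot be allocated. */
static int indexed_utf8_init(indexed_utf8 *s, const char *utf8)
{
    size_t nbytes = strlen(utf8), n = 0, i;

    s->bytes = (const unsigned char *) utf8;
    s->offsets = malloc((nbytes + 1) * sizeof *s->offsets);
    if (s->offsets == NULL)
        return -1;
    for (i = 0; i < nbytes; i++)
        if ((s->bytes[i] & 0xC0) != 0x80)  /* 10xxxxxx marks a continuation */
            s->offsets[n++] = i;
    s->length = n;
    return 0;
}

/* Return the code point at character index k (caller ensures k < length).
   Constant time: one table lookup, then one to four bytes decoded. */
static uint32_t indexed_utf8_ref(const indexed_utf8 *s, size_t k)
{
    const unsigned char *p = s->bytes + s->offsets[k];

    if (p[0] < 0x80)
        return p[0];
    if (p[0] < 0xE0)
        return ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    if (p[0] < 0xF0)
        return ((uint32_t) (p[0] & 0x0F) << 12)
             | ((uint32_t) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    return ((uint32_t) (p[0] & 0x07) << 18)
         | ((uint32_t) (p[1] & 0x3F) << 12)
         | ((uint32_t) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}
```

Note that this naive table costs sizeof(size_t) bytes per character
on top of the UTF-8 bytes, so by itself it uses more memory than
UTF-32, not less.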

If it weren't for the 'constant time' requirement, UTF-8 would
have been the winner hands down.

It is likely true that a UTF-8 character store and indexing scheme 
could be created that would be less memory-intensive than UTF-32
alone and still allow for constant time accesses.  But, in my 
uninformed opinion (since I haven't tried to implement such a 
thing myself), I don't think the memory savings over UTF-32 would
be large enough to justify the complexity.
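
As a rough illustration of that trade-off, the following C sketch
compares storage costs.  The function names and the 32-bit offset
size are assumptions made for the example, not anything in Guile.
The index stores one offset every `stride' characters; a stride of 1
is a full index, while a larger stride saves memory but makes each
lookup scan up to stride - 1 characters, which is still bounded by
a constant.

```c
#include <stddef.h>

/* UTF-32 storage: four bytes per character. */
static size_t utf32_bytes(size_t nchars)
{
    return 4 * nchars;
}

/* UTF-8 storage plus one 32-bit offset every `stride` characters
   (hypothetical scheme; stride must be at least 1). */
static size_t indexed_utf8_bytes(size_t utf8_len, size_t nchars, size_t stride)
{
    size_t entries = (nchars + stride - 1) / stride;  /* round up */

    return utf8_len + 4 * entries;
}
```

Take 1000 characters of mostly ASCII text containing 10 three-byte
punctuation marks, which is 1020 bytes of UTF-8.  UTF-32 needs 4000
bytes, a full index needs 5020, and an index with a stride of 16
needs 1272.  So the savings only appear once the index is sparse,
which is where the extra complexity comes in.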

However, in terms of 2.0, the Unicode patches that I'll be asking
Ludo, Andy, and Neil to review will leave Guile in a better 
position to swap out internal string representations, so 
a future switch to UTF-8 wouldn't be as painful as this switch has
been so far.

> --linas

-Mike



