guile-devel

Re: Wide strings


From: Mike Gran
Subject: Re: Wide strings
Date: Mon, 26 Jan 2009 21:38:42 -0800 (PST)

Hello,

> Ludo' sez

>> Mike Gran <address@hidden> writes:

> BTW, Gnulib has a wealth of modules that could be helpful here:

>  http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode

> I used a few of them in Guile-R6RS-Libs to implement `string->utf8'
> and such like.

The Gnulib routines seem perfectly complete.  I was unaware of them.
It wasn't clear to me at first glance if wide regex is supported, but,
otherwise, they are fine.

>> There could then be a separate iterator and function set that does
>> (likely O(n)) operations on the grapheme clusters of strings.  A
>> grapheme cluster is a single written symbol which may be made up of
>> several codepoints.  Unicode Standard Annex #29 describes how to
>> partition a string into a set of graphemes.[1]

> Hmm, that seems like a difficult topic.  It's not even mentioned in
> SRFI-13.  I suppose it can be addressed at a later stage, possibly
> by providing a specific API.

Fair enough.  With wide strings in place, this could all be done in
pure Scheme anyway, and end up in some library.  I brought it up
really just to note the codepoint / grapheme problem.
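
To make the codepoint / grapheme distinction concrete, here is a rough
sketch in C.  It is a deliberate simplification and not the UAX #29
algorithm: it assumes that only the combining diacritical marks
U+0300..U+036F attach to the preceding codepoint, and it ignores the
other rules (CR LF, Hangul syllables, ZWJ sequences, and so on).  The
function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: only U+0300..U+036F (Combining Diacritical Marks) extend
   the previous codepoint.  Real UAX #29 segmentation has many more
   classes and rules. */
int is_combining_mark(uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Approximate number of grapheme clusters in a UTF-32 buffer of n
   codepoints.  A cluster starts at index 0 and at every codepoint
   that is not a combining mark.  This is an O(n) scan. */
size_t count_clusters_approx(const uint32_t *cps, size_t n)
{
  size_t count = 0;

  assert(cps != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (i == 0 || !is_combining_mark(cps[i]))
      count++;
  return count;
}
```

With the buffer { U+0065, U+0301, U+0078 } ("e", combining acute, "x"),
the codepoint length is 3 while count_clusters_approx reports 2 written
symbols.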

> [...] We need to look more closely at what Gnulib has to offer, IMO.

Gnulib works for me.  Bruno is the maintainer of those funcs, so I'm
sure they work great.

So really the first questions to answer are the encoding question and
whether the R6RS string API is the goal.  

For the latter, I think the R6RS / SRFI-13 API is simple enough.  I
like it as a goal.

For the former, I rather like the idea that a string will be encoded
internally either as 4-byte chars of UTF-32 or as 1-byte chars of
ISO-8859-1.  Since the first 256 Unicode codepoints are exactly
ISO-8859-1, it becomes trivial for string-ref/set to work with
codepoints.
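
A minimal sketch in C of what such a dual representation might look
like.  The xstring type and its functions are hypothetical names for
illustration only, not Guile's actual layout or API.  The point it
shows is that widening a narrow string is plain zero-extension of each
byte, so ref and set stay indexed by codepoint in both forms.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dual-representation string; not Guile's real layout. */
typedef struct {
  size_t len;     /* length in codepoints */
  int is_wide;    /* 0: 1 byte per char (ISO-8859-1); 1: 4 bytes (UTF-32) */
  union {
    uint8_t *latin1;
    uint32_t *utf32;
  } buf;
} xstring;

/* Build a narrow string from a NUL-terminated ISO-8859-1 C string.
   Returns NULL on allocation failure. */
xstring *xstring_from_latin1(const char *s)
{
  size_t len = strlen(s);
  xstring *x = malloc(sizeof *x);

  if (x == NULL)
    return NULL;
  x->len = len;
  x->is_wide = 0;
  x->buf.latin1 = malloc(len ? len : 1);
  if (x->buf.latin1 == NULL) {
    free(x);
    return NULL;
  }
  memcpy(x->buf.latin1, s, len);
  return x;
}

/* Like string-ref: O(1) in either representation. */
uint32_t xstring_ref(const xstring *x, size_t i)
{
  assert(i < x->len);
  return x->is_wide ? x->buf.utf32[i] : x->buf.latin1[i];
}

/* Like string-set!: a narrow string is widened the first time a
   codepoint above U+00FF is stored.  Returns 0 on success, -1 if the
   wide buffer could not be allocated. */
int xstring_set(xstring *x, size_t i, uint32_t cp)
{
  assert(i < x->len);
  if (!x->is_wide && cp > 0xFF) {
    uint32_t *w = malloc(x->len * sizeof *w);

    if (w == NULL)
      return -1;
    /* ISO-8859-1 byte values equal their Unicode codepoints, so
       widening is just zero-extension. */
    for (size_t k = 0; k < x->len; k++)
      w[k] = x->buf.latin1[k];
    free(x->buf.latin1);
    x->buf.utf32 = w;
    x->is_wide = 1;
  }
  if (x->is_wide)
    x->buf.utf32[i] = cp;
  else
    x->buf.latin1[i] = (uint8_t) cp;
  return 0;
}

void xstring_free(xstring *x)
{
  if (x == NULL)
    return;
  if (x->is_wide)
    free(x->buf.utf32);
  else
    free(x->buf.latin1);
  free(x);
}
```

In this sketch a string stays at one byte per char until something
outside ISO-8859-1 is stored in it.  The widening step is O(n), but it
happens at most once per string.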

(Though, such a scheme would force scm_take_locale_string to become
scm_take_iso88591_string.)

> Thanks,
> Ludo'.

Thanks,
Mike



