Re: Wide strings

guile-devel

From:	Ludovic Courtès
Subject:	Re: Wide strings
Date:	Mon, 26 Jan 2009 22:40:12 +0100
User-agent:	Gnus/5.11 (Gnus v5.11) Emacs/22.3 (gnu/linux)

Hello,

Mike Gran writes:

> There are 3 good, actively developed solutions of which I am aware.
>
> 1.  Use GNU libc functionality.  Encode wide strings as wchar_t.

That'd be POSIX functionality, actually.

> 2.  Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
> give up on O(1).  Possibly add indexing information to string to allow
> O(1), which might negate the space advantage of UTF-8.

Technically, depending on GLib would seem unreasonable to me.  :-)
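
To make the O(1) concern with UTF-8 concrete, here is a minimal sketch of what an indexed access looks like when the string is stored as UTF-8 with no extra indexing information. The names `hyp_utf8_seq_len` and `hyp_utf8_ref` are hypothetical and are not taken from GLib or Guile. The sketch assumes reasonably well-formed input and does not fully validate continuation bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence introduced
   by LEAD, or 0 if LEAD cannot start a sequence.  */
size_t
hyp_utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;
  if ((lead & 0xe0) == 0xc0)
    return 2;
  if ((lead & 0xf0) == 0xe0)
    return 3;
  if ((lead & 0xf8) == 0xf0)
    return 4;
  return 0;
}

/* Hypothetical accessor: store code point number INDEX of the LEN-byte
   buffer S in *OUT.  Returns 0 on success, -1 if INDEX is out of range
   or a sequence is truncated.  The loop has to step over every earlier
   character, so the cost is O(INDEX) rather than O(1).  */
int
hyp_utf8_ref (const unsigned char *s, size_t len, size_t index,
              uint32_t *out)
{
  static const unsigned char lead_mask[] = { 0, 0x7f, 0x1f, 0x0f, 0x07 };
  size_t pos = 0;

  for (;;)
    {
      size_t n;

      if (pos >= len)
        return -1;
      n = hyp_utf8_seq_len (s[pos]);
      if (n == 0 || n > len - pos)
        return -1;

      if (index == 0)
        {
          /* Decode the sequence that starts at POS.  */
          uint32_t cp = s[pos] & lead_mask[n];
          size_t k;

          for (k = 1; k < n; k++)
            cp = (cp << 6) | (s[pos + k] & 0x3f);
          *out = cp;
          return 0;
        }

      pos += n;
      index--;
    }
}
```

A side table of byte offsets would bring the lookup back to O(1), but at the cost of extra memory per string, which is the trade-off mentioned in the quoted text.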

BTW, Gnulib has a wealth of modules that could be helpful here:

  http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode

I used a few of them in Guile-R6RS-Libs to implement `string->utf8' and
the like.

> 3.  Use IBM's ICU4c.  Encode wide strings as UTF-16.  Thus, add an
> obscure dependency.
>
> Option 3 is likely a non-starter, because it seems that Guile has
> tried to avoid adding new non-GNU dependencies.  It is technologically
> a great solution, IMHO.

At first sight, I'd rather avoid it as a dependency, if that's possible,
but that's mostly subjective.

> Let's say that a string is a union of either an ASCII char vector or a
> wchar_t vector.  A "character" then is just a Unicode codepoint.
> String-ref returns a wchar_t.  This is all in line with R6RS as I
> understand it.

Yes, that seems easily doable.
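
For illustration, the union representation described in the quoted text could be sketched roughly as follows. This is only a sketch under stated assumptions: the `hyp_` names are hypothetical and do not correspond to Guile's actual internals, and widening on the first non-ASCII store is one possible policy, not something decided in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical representation: a string is either narrow or wide.  */
enum hyp_string_kind { HYP_NARROW, HYP_WIDE };

struct hyp_string
{
  enum hyp_string_kind kind;
  size_t length;              /* number of characters (code points) */
  union
  {
    unsigned char *narrow;    /* one ASCII character per byte */
    wchar_t *wide;            /* one code point per wchar_t */
  } chars;
};

/* O(1) in both representations, since the element size is fixed for a
   given string.  Always hands back a wchar_t.  */
wchar_t
hyp_string_ref (const struct hyp_string *s, size_t i)
{
  assert (i < s->length);
  return s->kind == HYP_NARROW
    ? (wchar_t) s->chars.narrow[i]
    : s->chars.wide[i];
}

/* Store C at index I, converting the storage to wchar_t first if C is
   not ASCII.  Returns 0 on success, -1 if allocation fails.  Assumes
   the character buffer was obtained from malloc.  */
int
hyp_string_set (struct hyp_string *s, size_t i, wchar_t c)
{
  assert (i < s->length);

  if (s->kind == HYP_NARROW && (unsigned long) c > 0x7f)
    {
      wchar_t *wide = malloc (s->length * sizeof *wide);
      size_t k;

      if (wide == NULL)
        return -1;
      for (k = 0; k < s->length; k++)
        wide[k] = (wchar_t) s->chars.narrow[k];
      free (s->chars.narrow);
      s->chars.wide = wide;
      s->kind = HYP_WIDE;
    }

  if (s->kind == HYP_NARROW)
    s->chars.narrow[i] = (unsigned char) c;
  else
    s->chars.wide[i] = c;
  return 0;
}
```

Pure-ASCII strings keep the one-byte-per-character footprint, and the O(n) widening cost is paid once, only by strings that actually receive a non-ASCII character.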

> There could then be a separate iterator and function set that does
> (likely O(n)) operations on the grapheme clusters of strings.  A
> grapheme cluster is a single written symbol which may be made up of
> several codepoints.  Unicode Standard Annex #29 describes how to
> partition a string into a set of graphemes.[1]

Hmm, that seems like a difficult topic.  It's not even mentioned in
SRFI-13.  I suppose it can be addressed at a later stage, possibly by
providing a specific API.

> There is the problem of systems where wchar_t is 2 bytes instead of 4
> bytes, like Cygwin.  For those systems, I'd recommend
> restricting functionality to 16-bit characters instead of trying to
> add an extra UTF-16 encoding/decoding step.  I think there should
> always be a complete codepoint in each wchar_t.

Agreed.  The GNU libc doc concurs (info "(libc) Extended Char Intro").

However, given this limitation, and other potential portability issues,
it's still unclear to me whether this would be a good choice.  We need
to look more closely at what Gnulib has to offer, IMO.
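
As a rough illustration of the "one complete codepoint per wchar_t" restriction, a check along these lines could reject characters that do not fit on a given platform. The function name `hyp_fits_in_wchar` is hypothetical, and treating surrogate values as invalid is an assumption of this sketch. `WCHAR_MAX` is the standard C limit macro.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical check: return 1 if code point CP can be stored in a
   single wchar_t on this platform, 0 otherwise.  */
int
hyp_fits_in_wchar (uint32_t cp)
{
  if (cp > 0x10ffff)
    return 0;                   /* beyond the Unicode code space */
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* surrogates are not characters */

  /* Where wchar_t is 16 bits wide, WCHAR_MAX is 0xffff and anything
     above the Basic Multilingual Plane is rejected here instead of
     being split into a UTF-16 surrogate pair.  */
  return (uintmax_t) cp <= (uintmax_t) WCHAR_MAX;
}
```

With a 32-bit wchar_t this accepts every valid code point. With a 16-bit wchar_t it restricts strings to 16-bit characters, as suggested in the quoted text.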

Thanks,
Ludo'.
