Re: [Chicken-users] string-translate and utf-8


From: Alex Shinn
Subject: Re: [Chicken-users] string-translate and utf-8
Date: Sat, 08 Nov 2008 11:10:43 +0900
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

Hi, sorry for the late reply.

Sunnan <address@hidden> writes:

> I'm updating old code that used to work:
>
> (require-extension syntax-case utf8 srfi-1 utf8-srfi-13 miscmacros)
>
> ;(import utf8)
> ;(import utf8-srfi-13) ;(commented out since they're not needed anymore?)
> (use utf8-srfi-13)  ;(I've tried with and without this line)
>
> (string-translate " i " "ö " "o_") ;; this should eval to "_i_"
>
>
> but i get Error: (vector-ref) out of range, I guess because it reads the
> multi-byte characters (i.e. #\ö) as multiple entries in the vector.

I can't reproduce this.  The utf8 extension (not
utf8-srfi-13) does provide a STRING-TRANSLATE replacement
which handles multi-byte characters (verified on a Chinese
example in the test suite).

The only thing I can think of that might be going wrong is
the normalization form.  If the ö you typed is not the
single Unicode character U+00F6 (LATIN SMALL LETTER O WITH
DIAERESIS), but rather U+006F (LATIN SMALL LETTER O)
followed by U+0308 (COMBINING DIAERESIS), then you have not
only multi-byte characters, but multi-*codepoint*
characters.  STRING-TRANSLATE, as with all Unicode
utilities, works at the codepoint level, not the extended
grapheme level.  Thus the FROM string "ö " becomes a vector
of 3 codepoints against the 2 codepoints of the TO string
"o_", and the range error occurs.

Modifying STRING-TRANSLATE to work at the extended grapheme
level rather than the codepoint level would be a lot of
work, and possibly not what people expect.

As a workaround, if you have no control over the
normalization forms, you can always use STRING-TRANSLATE*.
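
A minimal sketch of that workaround, assuming the stock
STRING-TRANSLATE* that takes a string and an alist of
(MATCH . REPLACEMENT) string pairs.  It replaces substrings
rather than single characters, so an ö that is really two
codepoints is just a slightly longer match string:

```scheme
;; Sketch only: the same mapping as the original call
;; (string-translate " i " "ö " "o_"), written as
;; substring replacements instead of two parallel
;; character vectors.
(define (sanitize str)
  (string-translate* str
                     '(("ö" . "o")     ; whichever form of ö the source file contains
                       (" " . "_"))))

(sanitize " i ")   ; => "_i_"
```

SANITIZE is only an illustrative name.  If the input may
arrive in either normalization form, the alist would need
one entry for each form of ö.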

-- 
Alex
