Re: [Gcl-devel] utf8 and emacs text/string multibyte representation


From: Raymond Toy
Subject: Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 11:23:31 -0700
User-agent: Gnus/5.101 (Gnus v5.10.10) XEmacs/21.5-b34 (darwin)

>>>>> "Matt" == Matt Kaufmann <address@hidden> writes:

    Matt> Hi --
    Matt> I think you and Camm know more about this than I do, but to answer
    Matt> your question, below is what I get in GCL 2.6.12.  Except, I don't
    Matt> know how mailers handle high characters of the sort GCL printed in the
    Matt> output from (string (code-char 232)) below, so although that string
    Matt> was printed using a single character, here I show it as four
    Matt> characters (that visually appear just like the one-character version).

    >> (code-char 232)

    Matt> #\350

    >> (string (code-char 232))

    Matt> "\350"

I think this is really what we're trying to figure out. What you show
is what gcl does today.  The question is what would happen if Unicode
support were added to gcl using 8-bit characters with utf-8 strings.

I think that once Unicode is added, gcl will do pretty much the same as
above, but the string will be utf-8 encoded, so a string consisting of a
single octet with value 232 is not a valid utf-8 string.  You need
two octets to encode code point 232 (U+00E8).

To make a utf-8 string, you would have to do something like

(let ((s (make-string 2)))
  ;; 195 168 (#xC3 #xA8) is the utf-8 encoding of code point 232
  (setf (aref s 0) (code-char 195))
  (setf (aref s 1) (code-char 168))
  s)

Or maybe a utility function codepoints-to-string that takes a vector
of codepoints and creates a utf-8 string out of them.
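
A rough sketch of what such a utility might look like, assuming 8-bit
characters where each character of the string holds one utf-8 octet.
The helper name codepoint-to-octets is made up for illustration, and
nothing here is existing gcl code.  It does no validation (surrogates
and values above #x10FFFF are not rejected).

```lisp
;; Hypothetical helper: not part of gcl.
(defun codepoint-to-octets (cp)
  "Return a list of the utf-8 octets that encode the code point CP."
  (cond ((< cp #x80)                    ; 1 octet:  0xxxxxxx
         (list cp))
        ((< cp #x800)                   ; 2 octets: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ash cp -6))
               (logior #x80 (logand cp #x3F))))
        ((< cp #x10000)                 ; 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ash cp -12))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))
        (t                              ; 4 octets: 11110xxx followed by three 10xxxxxx
         (list (logior #xF0 (ash cp -18))
               (logior #x80 (logand (ash cp -12) #x3F))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))))

;; Sketch of the utility suggested above; assumes CODEPOINTS is a vector.
(defun codepoints-to-string (codepoints)
  "Return a string whose characters are the utf-8 octets of CODEPOINTS."
  (let ((octets (loop for cp across codepoints
                      append (codepoint-to-octets cp))))
    (map 'string #'code-char octets)))
```

With this, (codepoints-to-string (vector 232)) gives a string of two
characters with codes 195 and 168, the same as the let form above.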


    Matt> Interestingly, your (count nil (loop ...)) form also evaluates to 63
    Matt> in CCL, CLISP, and SBCL, but it evaluates to 32 in Allegro CL and 66
    Matt> in LispWorks.  It seems to me that the HyperSpec documentation allows
    Matt> for these differences.

Interesting. I don't have a copy of acl or lispworks, but cmucl
determines if a character is alpha-char-p using the unicode-category
of the codepoint.  I wonder how they differ....

--
Ray



