Re: utf8 and emacs text/string multibyte representation


From: Eli Zaretskii
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 11:01:33 +0200

> From: Camm Maguire <address@hidden>
> Cc: address@hidden,  address@hidden
> Date: Fri, 31 Oct 2014 14:05:20 -0400
> 
> Been discussing this elsewhere, and it's come to my attention that not
> only do all unicode code-points not fit into UTF-16, but all unicode
> characters don't fit into unicode code-points :-).  Presumably this is
> why emacs expanded to 22bits?

Not sure what you mean here.  All Unicode characters do fit into the
Unicode codepoint space.  Emacs extends that codepoint space to 22
bits, beyond the Unicode range, because it needs to support charsets
whose users don't want unification yet.

> If this is indeed the case, all these encodings have the same problems
> though varying in degree, and UTF-8 is clearly the smallest and most
> ascii compatible.  The question then arises as to whether lisp
> characters, which by definition do offer random access in strings, need
> be the same as or close to unicode characters.  

In Emacs, they are the same, yes.  Anything else means considerable
complications, AFAIR.

Random access to strings on the Lisp level is implemented as a
function on the C level, which simply walks the UTF-8 representation
one character at a time.  UTF-8 makes it easy to determine the number
of bytes in a sequence from its first byte, so the code computes that
and advances by that many bytes.
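
To make the first-byte trick concrete, here is a minimal C sketch of
the idea.  It is not the code Emacs actually uses; the real decoder
also handles Emacs's extended sequences for raw bytes and codepoints
above the Unicode range, which this sketch ignores.

  #include <stddef.h>

  /* How many bytes a UTF-8 sequence occupies, judged from its first
     byte alone.  */
  static size_t
  utf8_sequence_length (unsigned char head)
  {
    if (head < 0x80)            return 1;  /* 0xxxxxxx: ASCII */
    if ((head & 0xE0) == 0xC0)  return 2;  /* 110xxxxx */
    if ((head & 0xF0) == 0xE0)  return 3;  /* 1110xxxx */
    if ((head & 0xF8) == 0xF0)  return 4;  /* 11110xxx */
    return 1;                              /* invalid head; skip one byte */
  }

  /* Reaching character N is then a loop that advances N times by
     that many bytes.  */
  static const unsigned char *
  nth_char_pointer (const unsigned char *s, size_t n)
  {
    while (n-- > 0)
      s += utf8_sequence_length (*s);
    return s;
  }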

Emacs includes optimizations for the common case where each character
is a single byte (as in pure ASCII strings).  It also records the
last string used in aref, together with the last character index and
the corresponding byte offset accessed in that string.  So if a Lisp
program accesses several characters of the same string that are close
to each other, the 2nd and subsequent calls to aref are much cheaper,
because they start scanning from a nearby point.
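
That caching scheme looks roughly like the C sketch below.  The names
are invented and the logic is simplified; the real code lives in
string_char_to_byte in the Emacs C sources and is smarter, e.g. it
can also scan backward from the end of the string when that is
closer.

  #include <stddef.h>

  /* Same first-byte trick as above, condensed.  */
  static size_t
  bytes_by_head (unsigned char c)
  {
    return c < 0x80 ? 1
         : (c & 0xE0) == 0xC0 ? 2
         : (c & 0xF0) == 0xE0 ? 3
         : 4;
  }

  struct charpos_cache
  {
    const unsigned char *string;  /* UTF-8 data of the last string */
    size_t charpos;               /* last character index looked up */
    size_t bytepos;               /* byte offset of that character */
  };

  static struct charpos_cache cache;

  static size_t
  char_to_byte (const unsigned char *s, size_t target_char)
  {
    size_t cp = 0, bp = 0;

    /* Resume from the cached position if it is for the same string
       and does not overshoot the target.  */
    if (cache.string == s && cache.charpos <= target_char)
      {
        cp = cache.charpos;
        bp = cache.bytepos;
      }

    while (cp < target_char)
      {
        bp += bytes_by_head (s[bp]);
        cp++;
      }

    cache.string = s;
    cache.charpos = cp;
    cache.bytepos = bp;
    return bp;
  }

With such a cache, a loop that walks a string forward one character
at a time pays an amortized O(1) cost per access after the first,
instead of rescanning from the beginning of the string every time.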

> Did you consider leaving aref, char-code and code-char alone and writing
> unicode functions on top of these, i.e. unicode-length!=length, as
> opposed to making aref itself do this translation under the hood,
> thereby violating the expectation of O(1) access, (which is certainly
> offered in other kinds of arrays, though it is questionable whether real
> users actually expect this for strings)?

What would be the benefit of having such byte-oriented aref?  Lisp
code needs to manipulate characters, not bytes.  A byte-oriented
aref would just push the translation to characters up to the Lisp
level, something no Lisp application wants or should want to do.

Internally, on the C level, Emacs does have access to individual
bytes, of course.  On that level, each string is indeed
byte-addressable at O(1) complexity.

> In doing so, one would then know that aref is random-access, and
> unicode-??? is sequential only.

As explained above, access to characters is not really sequential in
Emacs, except for the first access to a string that was not accessed
yet.


