Re: Unibyte characters, strings, and buffers

emacs-devel

From:	Eli Zaretskii
Subject:	Re: Unibyte characters, strings, and buffers
Date:	Fri, 28 Mar 2014 11:51:53 +0300

> From: "Stephen J. Turnbull" <address@hidden>
> Date: Fri, 28 Mar 2014 12:38:10 +0900
> Cc: Stefan Monnier <address@hidden>, address@hidden
> 
> Eli Zaretskii writes:
> 
>  > Paul seemed to say something more broad: that _all_ behaviors specific
>  > to unibyte buffers should go away.  Do you agree?
> 
> Yes, please.  XEmacs has never had the unibyte hack with Mule, and
> never has had much trouble with that.  It also has never had an
> instance of the \201 bug since Mule was declared stable -- where Emacs
> has had *many* regressions.

Let's not talk about Emacs 20 vintage problems; that is not useful.
The same goes for examples from XEmacs: the differences in this area
between Emacs and XEmacs are substantial, and that precludes useful
comparison.

> It's arguable that there are performance implications, but simply
> aliasing the binary codec to latin1-unix has *never* caused a bug in
> handling binary files -- all bugs are due to autodetection errors,
> not the buffer representation.

Forget about performance; there are real problems unrelated to that
which need to be solved, and I don't see how you can avoid them by
treating raw bytes as Latin-1 characters.  Let me explain.

First, we must have a way to have buffer "text" that represents a
stream of bytes, not some human-readable text.  (Just as a random
example, a buffer visiting an mbox file, from which you decode
portions into another buffer for display.)  Agreed?

In such unibyte buffers, we need a way to represent raw bytes, which
are parts of as yet un-decoded byte sequences that represent encoded
characters.  We cannot represent each such byte as a Latin-1
character, because non-ASCII Latin-1 characters are stored inside
Emacs as 2-byte sequences of their UTF-8 encoding.  If you interpret
bytes as Latin-1 characters, functions like string-bytes will return
wrong results for those raw bytes.  Agreed?
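
As a rough illustration of the size mismatch described above, here is
a small sketch in Python rather than Emacs Lisp.  It models only the
byte arithmetic and assumes nothing about Emacs internals beyond the
UTF-8 storage mentioned in the text; the variable names are made up
for the example.

```python
# Illustration only: Python stands in for Emacs here.
raw = bytes([0xE9])                       # one raw byte from undecoded data
as_latin1_char = raw.decode("latin-1")    # reinterpret it as character U+00E9
stored = as_latin1_char.encode("utf-8")   # UTF-8 storage of that character

print(len(raw))      # size of the raw data
print(len(stored))   # size once the byte is treated as a Latin-1 character
print(stored.hex())  # the bytes actually stored
```

The raw datum is one byte long, while its Latin-1 reinterpretation
needs two bytes of storage, so any byte count taken from the latter
no longer matches the original data.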

So here you have already at least 2 valid reasons why Emacs must be
able to support raw bytes that are distinguishable from Latin-1
characters that have the same byte values, and why we must have
buffers that hold such raw bytes.  If we want to get rid of unibyte,
Someone(TM) should present a complete practical solution to those two
problems (and a few others), otherwise, this whole discussion leads
nowhere.  ("Practical" means that suggestions to introduce a character
data type are out of scope, or at least belong to an entirely
different discussion.)

> OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple,
> and it should be backed up against a wall and immersed in
> insecticide.

I might even agree with you about the toggle.  But eliminating the
toggle doesn't solve the bigger issue, see above.

> If you stick to the interpretation that bytes contain non-negative
> integers less than 256, you won't have a problem in practice if you
> think of them as the first 256 Unicode characters, but choose not to use
> functions that make sense only with characters.

What do you mean by "choose"?  Lisp code is used by many programmers
out there; sometimes, they aren't even aware whether the buffer they
work on is unibyte, or what that means.  Even when they are aware,
they just want Emacs to DTRT, for their own value of "RT".  Unless
each one of those programmers "chooses" not to use the problematic
functions, we are back at square one.

And what does "choose not to use" mean, anyway?  How do you choose not
to use 'insert', for example?  What do you use instead?

The issue at hand is how you pull off the trick, in practice, of doing
TRT with the legitimate use cases where Emacs needs to manipulate raw
bytes.

> Python actually implements many polymorphic functions (ie, they can
> be interpreted as bytes->bytes or characters->characters, etc) by
> converting bytes to characters as Latin-1, then using the character
> implementation of the function.

As long as Emacs exposes the character values to Lisp programs as
simple integers, I don't think we can take this path.
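
A minimal sketch of the bytes-as-Latin-1 technique mentioned in the
quoted text, and of why integer character codes make it awkward.  The
helpers to_text and from_text are hypothetical names invented for this
example; they are not part of any Python or Emacs API, and the sketch
does not claim to show how Python itself is implemented.

```python
# Sketch only: to_text/from_text are hypothetical helper names.
def to_text(data: bytes) -> str:
    """Map each byte 0..255 to the Unicode character with the same code."""
    return data.decode("latin-1")

def from_text(text: str) -> bytes:
    """Inverse mapping; raises if text holds a character above U+00FF."""
    return text.encode("latin-1")

# Every byte value survives the round trip.
all_bytes = bytes(range(256))
round_trip_ok = from_text(to_text(all_bytes)) == all_bytes

# Reuse a character-level operation (reversal) on byte data.
reversed_bytes = from_text(to_text(b"\x01\xe9\xff")[::-1])

# The catch: the code of a converted raw byte 0xE9 is the same
# integer as the code of a genuine U+00E9 character.
from_raw_byte = ord(to_text(b"\xe9"))
from_real_char = ord("\u00e9")
indistinguishable = from_raw_byte == from_real_char

print(round_trip_ok, reversed_bytes, indistinguishable)
```

Once both values are plain integers, code that receives 233 cannot
tell whether it came from a raw byte or from decoded text, which is
the difficulty raised above.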
