emacs-devel

Re: utf-8.el


From: Stefan Monnier
Subject: Re: utf-8.el
Date: Tue, 18 Jan 2005 23:37:10 -0500
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)

>> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
>> saying that invalid utf-8 sequences are not always correctly preserved?
>> Why is that?  Can't we fix it?

> I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
> invalid utf-8 sequences as far as possible.  So perhaps the
> current version preserves even invalid sequences correctly.

That's also what I remembered, which is why I asked.

>> Also, could anyone explain to me why `utf-8-compose' needs to look up the
>> hash table (get 'utf-subst-table-for-decode 'translation-hash-table), since
>> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
>> chars that are in this table?

> subst-tables are not preloaded.  They are automatically
> loaded in utf-8-post-read-conversion but it runs after
> ccl-decode-mule-utf-8 is executed.  And the arg hash-table
> becomes non-nil only when subst-tables are loaded.

Oh, so the elisp code indeed does the same thing.  And that means it's only
really used at most once per Emacs session (since after it's executed, the
hash-table will be active directly in ccl-decode-mule-utf-8).  Right?
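
A minimal sketch of the check being discussed, assuming the table is
published only through the `translation-hash-table' property quoted above.
The helper name `my-utf-subst-table-loaded-p' is hypothetical and is not
part of utf-8.el:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-utf-subst-table-loaded-p ()
  "Return t if the decode substitution hash table has been installed.
Until the subst-tables are loaded the property is nil, so only the
elisp fallback could consult the table at that point."
  (hash-table-p (get 'utf-subst-table-for-decode 'translation-hash-table)))
```

In an Emacs where the subst-tables have never been loaded, the call
returns nil.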

>> I also don't understand the following part of
>> the code:

>> (if (= l 2)
>>     (put-text-property (point) (min (point-max) (+ l (point)))
>>                        'display (format "\\%03o" ch))
>>   (compose-region (point) (+ l (point)) ?�))

>> what does it mean for l (the number of bytes) to be equal to 2?

> The docstring of ccl-untranslated-to-ucs is not clear.  In
> "Set r1 to the byte length", the byte length means how many
> of r0, r1, r2, r3 (each of them contains a byte) contribute
> to a unicode character (or an invalid byte).

So it's the number of bytes used in the buffer's internal representation
(i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

> If l is 2, that means an invalid byte was converted to a
> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
> eight-bit-control/graphic.

And that's because any other utf-8 char either maps to a 3-byte sequence
(in a mule-unicode-NNNN-MMMM charset) or, if it maps to a 2-byte sequence
(like latin-1), won't pass through this code anyway?

> In that case, it is better to
> display that sequence by octal instead of showing ?�.

Yes, I understand this part.  I just have a hard time following the
reasoning that gets us to the point where we know that (= l 2) implies that
it's a single eight-bit-control or eight-bit-graphic char.
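
For reference, a small sketch of what the octal `display' string in the
quoted code looks like for one invalid byte.  The function name
`my-octal-display-string' is hypothetical; only the `format' call is taken
from the code above:

```elisp
;; Hypothetical wrapper around the `format' call from the quoted code.
(defun my-octal-display-string (byte)
  "Return BYTE as a backslash followed by three octal digits."
  (format "\\%03o" byte))

;; #xC0 is 192 decimal, which is 300 in octal.
(my-octal-display-string #xC0)          ; => "\\300"
```

The result is a four-character string: a backslash followed by 300.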

>> -      ;; Can't do eval-when-compile to insert a multibyte constant
>> -      ;; version of the string in the loop, since it's always loaded as
>> -      ;; unibyte from a byte-compiled file.
>> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

> This change is not good because range is set to a unibyte
> string and regexp search converts it to a multibyte
> string by `make-multibyte-string'.  Here what we need is a
> multibyte string that contains eight-bit-graphic/control
> chars.

I know that's what the comment says, but my tests lead me to believe that
the comment is not correct and that the string's multibyteness is
correctly preserved.

> Anyway it is better to change string-as-multibyte to string-to-multibyte.

Indeed.
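
A minimal sketch of the string-to-multibyte behaviour, assuming a recent
Emacs that provides `unibyte-string' (used here only to make the unibyte
input explicit).  The function name `my-raw-bytes-demo' is hypothetical:

```elisp
;; Hypothetical demo function, for illustration only.
(defun my-raw-bytes-demo ()
  "Convert two raw bytes to a multibyte string and report the result.
Return a list: whether the input is multibyte, whether the output is
multibyte, and the output length in characters."
  (let* ((raw (unibyte-string #xc0 #xe1))    ; explicitly unibyte input
         (multi (string-to-multibyte raw)))  ; each byte becomes one char
    (list (multibyte-string-p raw)
          (multibyte-string-p multi)
          (length multi))))

(my-raw-bytes-demo)                     ; => (nil t 2)
```

Each raw byte stays a single character in the multibyte result, which is
the property the `range' string needs.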


        Stefan



