Re: accents
From: Chet Ramey
Subject: Re: accents
Date: Sun, 15 May 2011 18:16:52 -0400
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15) Gecko/20110303 Lightning/1.0b2 Thunderbird/3.1.9
On 5/10/11 9:17 AM, Greg Wooledge wrote:
>>> Is the accented character
>>> a single-byte character, or a multi-byte character, in your locale?
>>
>> a multi-byte character, i think
>> How to confirm that ?
(Keep in mind as you read my answers that I know very little more than
anyone else about Unicode combining characters and character composition.)
>
>> $ echo /Users/thomas/Downloads/réz | h
>> + echo $'/Users/thomas/Downloads/re?\201z'
>> + hexdump -C
>> 00000000 2f 55 73 65 72 73 2f 74 68 6f 6d 61 73 2f 44 6f |/Users/thomas/Do|
>> 00000010 77 6e 6c 6f 61 64 73 2f 72 65 cc 81 7a 0a |wnloads/re..z.|
>> 0000001e
>
> Oh... now this is interesting. In my locale (not the one I'm writing this
> email from, but the one I tested in), an é is 0xc3 0xa9 which is the UTF-8
> encoding of the Unicode character U+00E9, LATIN SMALL LETTER E WITH ACUTE.
>
> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
That's valid UTF-8, but it is the decomposed form: the shortest-sequence
rule applies to how each individual code point is encoded, not to the
choice between U+00E9 and U+0065 U+0301. The general problem with
combining characters still exists (the one in the message I referenced
in an earlier reply), but this case has more to do with Mac OS X and its
use of both precomposed and decomposed UTF-8 than anything.
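The two byte sequences quoted above can be inspected directly. The
following is a minimal sketch in Python (chosen only for illustration;
nothing in this thread uses it) that decodes each sequence and lists the
code points it contains. The variable names are made up for the example.

```python
import unicodedata

precomposed = b"r\xc3\xa9z"    # the bytes Greg saw for "réz"
decomposed = b"re\xcc\x81z"    # the bytes from the hexdump above

for raw in (precomposed, decomposed):
    # Strict decoding: raises UnicodeDecodeError on malformed input.
    text = raw.decode("utf-8")
    print(len(text), [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text])
```

The first sequence decodes to three code points and the second to four,
even though a terminal renders both the same way.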
> I'm not intimately familiar with this stuff myself, but it looks like
> a real bastard to me... I thought the point of UTF-8 was that you could
> read it a byte at a time, and know when you encountered a byte that
> signified the start of a multi-byte character. But apparently not!
> If I'm interpreting this COMBINING ACUTE ACCENT thing properly, the
> only indicator that you are in a multi-byte character comes with the
> *second* byte, so you have to backtrack. What idiot thought this up?
It's a way to provide a general mechanism for combining characters.
Unicode defines precomposed characters for the most common accented
letters (e.g., U+00E9), and combining characters such as U+0301 are a way
to add accents to less common letters without assigning each combination
its own code point. It is going to be a bitch to handle.
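To illustrate the difference between a common and a less common
combination, here is a small sketch (Python, for illustration only; the
helper name `compose` is hypothetical). It asks the Unicode NFC
normalization to compose a base letter with U+0301. It assumes, as I
believe is the case, that Unicode has no precomposed "q with acute".

```python
import unicodedata

def compose(text):
    """Return the NFC (composed where possible) form of text."""
    return unicodedata.normalize("NFC", text)

common = compose("e\u0301")   # e + COMBINING ACUTE ACCENT
rare = compose("q\u0301")     # q + COMBINING ACUTE ACCENT

print(len(common))  # 1: collapses to the single code point U+00E9
print(len(rare))    # 2: no precomposed form, so the pair stays as it is
```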
> With that in mind, let's see if I can reproduce some of this problem.
> Please bear in mind that as I paste this from the test environment
> terminal into the email-writing terminal, I have to make some manual
> adjustments to preserve the observed output.
I doubt you would be able to reproduce this on any system but Mac OS X.
Mac OS X keeps filenames in decomposed Unicode and keyboard input in
precomposed Unicode. Dragging and dropping filenames doesn't do the
decomposed-precomposed conversion.
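A program that wants to treat a typed pathname and a dropped pathname as
the same name has to normalize both before comparing them. A minimal
sketch of that idea, in Python and for illustration only (the helper
`same_name` is hypothetical, and this is not what bash or readline
does):

```python
import unicodedata

def same_name(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

typed = "/Users/thomas/Downloads/r\u00e9z"      # precomposed, as from the keyboard
dropped = "/Users/thomas/Downloads/re\u0301z"   # decomposed, as in the hexdump

print(typed == dropped)           # False: the code point sequences differ
print(same_name(typed, dropped))  # True: equal once both are in NFC
```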
> wooledg@wooledg:~$ touch $'re\xcc\x81z'
> wooledg@wooledg:~$ echo r?z
> r?z
> wooledg@wooledg:~$ echo r*z
> réz
> wooledg@wooledg:~$ ls -b r*z
> réz
>
> The terminal, when presented with the string of bytes that is the filename,
> renders it as réz. However, Bash's globbing does NOT recognize this as
> a three-character filename beginning with 'r' and ending with 'z', as
> the r?z glob was not expanded. ls -b also doesn't think there is anything
> particularly noteworthy about this filename, which is slightly annoying.
>
> (Bash's failure to glob this might be a second bug, or possibly another
> manifestation of the same bug you're pursuing.)
It's not a bug; that really is two characters. Just because U+00E9 and
the two-character combination U+0065 U+0301 look the same (I think the
Unicode term is canonically equivalent) doesn't mean they are identical
character sequences.
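The same behavior shows up in any matcher where `?` means one character.
As a rough stand-in for the shell's globbing (this uses Python's
`fnmatch` module for illustration, not bash's own matching code):

```python
import fnmatch
import unicodedata

decomposed = "re\u0301z"                                 # four code points
precomposed = unicodedata.normalize("NFC", decomposed)   # three code points

print(fnmatch.fnmatchcase(decomposed, "r?z"))    # False: '?' covers one code point
print(fnmatch.fnmatchcase(decomposed, "r??z"))   # True
print(fnmatch.fnmatchcase(decomposed, "r*z"))    # True
print(fnmatch.fnmatchcase(precomposed, "r?z"))   # True
```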
On RHEL 5 and Debian 9, at least, the file system stores filenames using
the same characters as used to create them. You were able to recreate how
Mac OS X stores filenames, but:
> When I double-click and then middle-click to select and paste the filename
> as rendered by the terminal back into the terminal, however, I do not
> get re\xcc\x81z any more; rather, I get r\xc3\xa9z. So my attempts
> to reproduce your reported problem in this way fail.
Because something does the decomposed-precomposed conversion.
> The next obvious way to reproduce the problem would be to get bash to
> produce the filename itself through tab completion, rather than pasting.
> With that in mind, I'll try to move the file to a different name that
> will be tab-completable.
The other difference is that drag-and-drop on Mac OS X (at least dropping
from the Finder) produces full pathnames. I was able to reproduce display
problems (which I haven't yet investigated) using that, but not using
tab completion in the way you did.
(And Mac OS X does seem to have a problem with wcwidth: wcwidth on U+0301
returns 1 instead of 0).
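For reference, here is a simplified sketch of the width calculation a
display routine needs, in which combining marks occupy no columns. It is
written in Python for illustration; `display_width` is a hypothetical
helper, not the C library's wcwidth, and it ignores other zero-width
characters that a real implementation would have to handle.

```python
import unicodedata

def display_width(text):
    """Rough column count: combining marks take no columns."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # e.g. U+0301 has a non-zero combining class
        # Wide and fullwidth East Asian characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("re\u0301z"))  # 3
print(display_width("r\u00e9z"))   # 3
```

If U+0301 is counted as one column instead, the decomposed name comes
out as four columns wide and the cursor position is off by one.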
Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU chet@case.edu http://cnswww.cns.cwru.edu/~chet/