Re: accents

bug-bash

From:	Chet Ramey
Subject:	Re: accents
Date:	Sun, 15 May 2011 18:16:52 -0400
User-agent:	Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15) Gecko/20110303 Lightning/1.0b2 Thunderbird/3.1.9

On 5/10/11 9:17 AM, Greg Wooledge wrote:

>>> Is the accented character
>>> a single-byte character, or a multi-byte character, in your locale?
>>
>> a multi-byte character, i think
>> How to confirm that ?

(Keep in mind as you read my answers that I know very little more than
anyone else about Unicode combining characters and character composition.)

> 
>> $ echo /Users/thomas/Downloads/réz | h
>> + echo $'/Users/thomas/Downloads/re?\201z'
>> + hexdump -C
>> 00000000  2f 55 73 65 72 73 2f 74  68 6f 6d 61 73 2f 44 6f  |/Users/thomas/Do|
>> 00000010  77 6e 6c 6f 61 64 73 2f  72 65 cc 81 7a 0a        |wnloads/re..z.|
>> 0000001e
> 
> Oh... now this is interesting.  In my locale (not the one I'm writing this
> email from, but the one I tested in), an é is 0xc3 0xa9 which is the UTF-8
> encoding of the Unicode character U+00E9, LATIN SMALL LETTER E WITH ACUTE.
> 
> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.

That's valid UTF-8 (the shortest-form rule applies to the bytes that
encode a single code point, not to the choice between precomposed and
decomposed forms); it's just the decomposed form.  The general problem
with combining characters still exists (the one in the message I
referenced in an earlier reply), but this case has more to do with Mac
OS X and its use of both precomposed and decomposed UTF-8 than anything.
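
To make the two encodings concrete, here is a minimal sketch in Python
(not part of the original exchange; the variable and function names are
illustrative).  It only prints the UTF-8 bytes of the precomposed and
decomposed spellings of the same visible string:

```python
# Precomposed: one code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "r\u00e9z"
# Decomposed: U+0065 followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "re\u0301z"

def hex_bytes(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

print(hex_bytes(precomposed))              # 72 c3 a9 7a
print(hex_bytes(decomposed))               # 72 65 cc 81 7a
print(len(precomposed), len(decomposed))   # 3 4 (code points, not bytes)
print(precomposed == decomposed)           # False
```

The second byte string ends with the same `65 cc 81 7a` that appears in
the hexdump quoted above.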

> I'm not intimately familiar with this stuff myself, but it looks like
> a real bastard to me... I thought the point of UTF-8 was that you could
> read it a byte at a time, and know when you encountered a byte that
> signified the start of a multi-byte character.  But apparently not!
> If I'm interpreting this COMBINING ACUTE ACCENT thing properly, the
> only indicator that you are in a multi-byte character comes with the
> *second* byte, so you have to backtrack.  What idiot thought this up?

It's a way to provide a general mechanism for combining characters.
Unicode defines precomposed code points for the most common accented
characters (e.g., U+00E9), and the U+0301 stuff is a way to add accents
to less common characters without using up a code point for each one.
It is going to be a bitch to handle.
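
A rough sketch of what "handling" means, in Python and under a loud
assumption: it attaches each combining mark to the preceding base
character using `unicodedata.combining`, which is a simplification and
not full grapheme-cluster segmentation.  The function name
`group_combining` is hypothetical.

```python
import unicodedata

def group_combining(text):
    """Group each base character with the combining marks that follow it.

    Simplified: a character with a nonzero canonical combining class is
    appended to the previous group; anything else starts a new group.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = "re\u0301z"
clusters = group_combining(decomposed)
print(len(decomposed), len(clusters))  # 4 3
```

Each code point is still decoded on its own; the backtracking happens one
level up, when deciding which code points belong to the same displayed
character.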

> With that in mind, let's see if I can reproduce some of this problem.
> Please bear in mind that as I paste this from the test environment
> terminal into the email-writing terminal, I have to make some manual
> adjustments to preserve the observed output.

I doubt you would be able to reproduce this on any system but Mac OS X.
Mac OS X keeps filenames in decomposed Unicode and keyboard input in
precomposed Unicode.  Dragging and dropping filenames doesn't do the
decomposed-precomposed conversion.
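
The missing conversion can be sketched in Python with the standard
`unicodedata.normalize` call.  This is an illustration only: the helper
`same_name` is hypothetical, and I am assuming plain NFC/NFD here, while
the decomposition Mac OS X actually applies to filenames may differ from
NFD in some details.

```python
import unicodedata

def same_name(a, b):
    """Compare two names after normalizing both to NFC (precomposed)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

typed = "r\u00e9z"     # precomposed, as typical keyboard input
stored = "re\u0301z"   # decomposed, as a decomposing file system returns it

print(typed == stored)                                # False
print(same_name(typed, stored))                       # True
print(unicodedata.normalize("NFD", typed) == stored)  # True
```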

> wooledg@wooledg:~$ touch $'re\xcc\x81z'
> wooledg@wooledg:~$ echo r?z
> r?z
> wooledg@wooledg:~$ echo r*z
> réz
> wooledg@wooledg:~$ ls -b r*z
> réz
> 
> The terminal, when presented with the string of bytes that is the filename,
> renders it as réz.  However, Bash's globbing does NOT recognize this as
> a three-character filename beginning with 'r' and ending with 'z', as
> the r?z glob was not expanded.  ls -b also doesn't think there is anything
> particularly noteworthy about this filename, which is slightly annoying.
> 
> (Bash's failure to glob this might be a second bug, or possibly another
> manifestation of the same bug you're pursuing.)

It's not a bug; that really is two characters. Just because U+00E9 and
the two-character combination U+0065 U+0301 look the same (I think the
Unicode term is canonically equivalent) doesn't mean they are the same
sequence of characters.
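
The same effect shows up in any matcher that works on code points.  A
small sketch using Python's `fnmatch` module, which is not bash's
globbing code but treats `?` as "exactly one character" in the same way:

```python
from fnmatch import fnmatchcase

decomposed = "re\u0301z"   # r, e, COMBINING ACUTE ACCENT, z
precomposed = "r\u00e9z"   # r, LATIN SMALL LETTER E WITH ACUTE, z

print(fnmatchcase(decomposed, "r?z"))    # False: two characters between r and z
print(fnmatchcase(decomposed, "r??z"))   # True
print(fnmatchcase(decomposed, "r*z"))    # True
print(fnmatchcase(precomposed, "r?z"))   # True
```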

On RHEL 5 and Debian 9, at least, the file system stores filenames using
the same characters as used to create them.  You were able to recreate how
Mac OS X stores filenames, but:

> When I double-click and then middle-click to select and paste the filename
> as rendered by the terminal back into the terminal, however, I do not
> get re\xcc\x81z any more; rather, I get r\xc3\xa9z.  So my attempts
> to reproduce your reported problem in this way fail.

Because something does the decomposed-precomposed conversion.

> The next obvious way to reproduce the problem would be to get bash to
> produce the filename itself through tab completion, rather than pasting.
> With that in mind, I'll try to move the file to a different name that
> will be tab-completable.

The other difference is that drag-and-drop on Mac OS X (at least dropping
from the Finder) produces full pathnames.  I was able to reproduce display
problems (which I haven't yet investigated) using that, but not using
tab completion in the way you did.

(And Mac OS X does seem to have a problem with wcwidth: wcwidth on U+0301
returns 1 instead of 0.)
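
For reference, a sketch of the width a combining mark would be expected
to have.  This is an approximation written in Python, not the C library's
wcwidth: the functions `char_width` and `display_width` are hypothetical,
and control characters are ignored.

```python
import unicodedata

def char_width(ch):
    """Approximate terminal column width of a single character."""
    # Nonspacing and enclosing marks sit on the previous character.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian wide and fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def display_width(text):
    """Sum the approximate column widths of all characters in text."""
    return sum(char_width(ch) for ch in text)

print(char_width("\u0301"))         # 0
print(display_width("re\u0301z"))   # 3
print(display_width("r\u00e9z"))    # 3
```

If the mark is counted as one column instead, the decomposed name comes
out one column wider than what the terminal draws, which is the kind of
mismatch that would explain cursor and redisplay problems.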

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/
