bug-gnu-utils
Re: uuencode: multi-bytes char in remote file name contains bytes >0x80


From: Bruce Korb
Subject: Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
Date: Fri, 08 Jul 2011 16:11:38 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110616 SUSE/3.1.11 Thunderbird/3.1.11

Hi Eric(s),

This mojibake stuff is mumbo jumbo to me.

I looked into the iconv(3p) function a bit and it seems to depend
upon character set strings that are different from what one might
put in the LANG or LC_ALL or LC_NAME environment variables.  Those guys
take things like EN_us, for example, not character set specifications.
So how am I to know what the current character set is if all I know is
CN_hk, for example?  I also didn't find a "this is how you do it" cookbook
or tutorial.  I'd have this wired just as soon as I could figure out what
string to pass to iconv_open(3p).  Pointers certainly appreciated!
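
The best guess I've turned up so far is nl_langinfo(CODESET): after
setlocale(3) has been applied with the user's environment, it is supposed
to hand back a charset name that iconv_open(3p) accepts.  A rough sketch
of how I imagine it plugging in -- the "UTF-8" target and the error
handling are only guesses on my part, not anything in sharutils:

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int
main (void)
{
  /* Honor the user's LANG / LC_ALL / LC_CTYPE settings.  */
  setlocale (LC_ALL, "");

  /* nl_langinfo(CODESET) names the charset the current locale implies,
     e.g. "UTF-8", "ISO-8859-1" or "Big5" -- a name that iconv_open(3p)
     should accept.  */
  const char *codeset = nl_langinfo (CODESET);
  printf ("current codeset: %s\n", codeset);

  /* Guessing that uuencode wants to go from the user's charset to
     UTF-8 for file names.  */
  iconv_t cd = iconv_open ("UTF-8", codeset);
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return 1;
    }
  iconv_close (cd);
  return 0;
}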

Regards, Bruce

>>> Where and how will the charset conversion of the filenames be handled?
>>
>> Yes, it will be.
> 
> The only sane approach is to assume that the current locale of the user
> running uuencode normally sees sane filenames, and transliterate from
> the user's locale into UTF-8.  Either the filename is a character string
> in the user's current locale (and therefore, every character can be
> transliterated into UTF-8; perhaps trivially if the user's locale is
> already UTF-8), or the filename is already random bytes that the user
> cannot see as characters in their current locale.  In the latter case,
> you can still do a 1:1 mapping, where each invalid byte is mapped to the
> 2nd half of a UTF-8 surrogate pair.
> 
> Then, take that UTF-8 multibyte sequence (including 2nd-half surrogate
> pair mappings for all invalid bytes that were not characters), and
> flatten it into something that is just ascii.
> 
> On the uudecode side, take the ascii and convert it back to UTF-8, then
> transliterate into the user's current locale.  Here, the transliteration
> might be lossy (if the user's charset doesn't support all the characters
> that were in the input); I'm not sure whether best practice is to
> transliterate the unrepresentable character to '?' or to leave it as raw
> Unicode bytes (the latter is what leads to mojibake).  But if the
> receiver's current locale is UTF-8, lossy transliteration is not an
> issue.  Meanwhile, if the encoded string contained any unmatched 2nd-half
> surrogate pairs, you can unambiguously recover the raw byte that was not
> a character, and use that byte as-is.
> 
> The nice part about this algorithm is that if both sender and receiver
> only use a subset of characters that exist in both charsets, then they
> both see the same filename, even if the two locations are using
> different charsets.  If the receiver is using UTF-8 (which is more and
> more common these days), they will see whatever name the sender saw
> regardless of the sender's charset.  The only place where mojibake still
> happens is if the sender uses characters that are not in the receiver's
> charset - and that's not entirely a real loss, since it was already the
> case that the sender is doing non-portable things by sending
> non-portable filename characters in the first place.
> 
>>>        or
>>>           =?utf-8?Q?j=C3=B6rg?=
> 
> You want some sort of utf-8 encoding, and preferably one that encodes
> only the non-portable characters.  This type of naming looks best to me.
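
A sketch of how I read the invalid-byte escaping described above: each
byte that does not decode in the sender's charset is mapped to the code
point U+DC00+byte (the "2nd-half surrogate" range, U+DC80..U+DCFF),
emitted as a deliberately ill-formed three-byte UTF-8 sequence, and the
decoder turns exactly those code points back into raw bytes -- essentially
what Python calls surrogateescape.  The helper names here are mine, not
anything in sharutils:

#include <stddef.h>

/* Escape one raw byte (0x80..0xFF) that did not decode in the sender's
   charset: map it to code point U+DC00+byte and emit that code point's
   three-byte UTF-8 form (intentionally ill-formed, so it cannot collide
   with a real character).  Returns the number of bytes written.  */
static size_t
escape_raw_byte (unsigned char b, unsigned char out[3])
{
  unsigned int cp = 0xDC00u + b;          /* lands in U+DC80..U+DCFF */
  out[0] = 0xE0 | (cp >> 12);             /* always 0xED */
  out[1] = 0x80 | ((cp >> 6) & 0x3F);
  out[2] = 0x80 | (cp & 0x3F);
  return 3;
}

/* Reverse step for uudecode: if the three bytes at p encode
   U+DC80..U+DCFF, recover the original raw byte; otherwise return -1
   and let normal UTF-8 handling take over.  */
static int
unescape_raw_byte (const unsigned char *p)
{
  if (p[0] != 0xED
      || (p[1] & 0xC0) != 0x80
      || (p[2] & 0xC0) != 0x80)
    return -1;
  unsigned int cp = ((p[0] & 0x0Fu) << 12)
                    | ((unsigned int) (p[1] & 0x3Fu) << 6)
                    | (p[2] & 0x3Fu);
  if (cp < 0xDC80u || cp > 0xDCFFu)
    return -1;
  return (int) (cp - 0xDC00u);
}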
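
And a sketch of the "flatten it into something that is just ascii" step,
following the =?utf-8?Q?j=C3=B6rg?= example quoted above: printable ascii
passes through, every other byte becomes =XX.  The exact set of
characters left unencoded, and the buffer handling, are guesses rather
than any agreed format:

#include <ctype.h>
#include <stdio.h>

/* Flatten a UTF-8 file name into plain ascii in the =?utf-8?Q?...?=
   style: printable ascii passes through, everything else (plus the
   '=' and '?' metacharacters) becomes =XX.  */
static void
q_encode_name (const char *utf8, char *out, size_t outsz)
{
  size_t n = 0;
  n += (size_t) snprintf (out + n, outsz - n, "=?utf-8?Q?");
  for (const unsigned char *p = (const unsigned char *) utf8; *p; p++)
    {
      if (*p < 0x80 && isgraph (*p) && *p != '=' && *p != '?')
        n += (size_t) snprintf (out + n, outsz - n, "%c", *p);
      else
        n += (size_t) snprintf (out + n, outsz - n, "=%02X", *p);
    }
  snprintf (out + n, outsz - n, "?=");
}

int
main (void)
{
  char buf[256];
  q_encode_name ("j\xC3\xB6rg", buf, sizeof buf);   /* "jörg" in UTF-8 */
  printf ("%s\n", buf);                             /* =?utf-8?Q?j=C3=B6rg?= */
  return 0;
}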
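
Finally, on the uudecode side, the lossy transliteration into the
receiver's locale can be asked of iconv directly: glibc accepts a
"//TRANSLIT" suffix on the target charset name, substituting an
approximation (or '?') for characters the charset cannot represent
instead of failing outright.  A minimal sketch, with the direction and
names assumed rather than taken from sharutils:

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  setlocale (LC_ALL, "");

  /* Receiver side: UTF-8 name from the wire -> receiver's charset,
     with //TRANSLIT (a glibc extension) so unrepresentable characters
     degrade gracefully.  */
  char target[64];
  snprintf (target, sizeof target, "%s//TRANSLIT", nl_langinfo (CODESET));

  iconv_t cd = iconv_open (target, "UTF-8");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return 1;
    }

  char name[] = "j\xC3\xB6rg";               /* "jörg" in UTF-8 */
  char out[64];
  char *inp = name, *outp = out;
  size_t inleft = strlen (name), outleft = sizeof out - 1;
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    perror ("iconv");
  *outp = '\0';
  printf ("local name: %s\n", out);

  iconv_close (cd);
  return 0;
}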


