Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
From: Bruce Korb
Subject: Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
Date: Fri, 08 Jul 2011 16:11:38 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110616 SUSE/3.1.11 Thunderbird/3.1.11
Hi Eric(s),
This mojibake stuff is mumbo jumbo to me.
I looked into the iconv(3p) function a bit and it seems to depend
upon character strings that are different from what one might
put in the LANG or LC_ALL or LC_NAME environment variables. Those
take things like en_US, for example, not character set specifications.
So how am I to know what the current character set is if all I know is
zh_HK, for example? I also didn't find a "this is how you do it" cookbook
or tutorial. I'd have this wired just as soon as I could figure out what
string to pass to iconv_open(3p). Pointers certainly appreciated!
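
For reference, a minimal sketch of one common answer, assuming a POSIX system: after adopting the environment's locale, nl_langinfo(CODESET) reports the character set name that goes with it. In C that is setlocale(LC_ALL, "") followed by nl_langinfo(CODESET) from <langinfo.h>, and on many systems (glibc in particular) the returned string is accepted by iconv_open(3p), though that is not guaranteed everywhere. The Python below illustrates the same two calls; current_codeset is a hypothetical helper name, not something from the thread.

```python
import locale


def current_codeset() -> str:
    """Hypothetical helper: name of the charset used by the current locale."""
    try:
        # Adopt whatever LANG / LC_ALL / LC_CTYPE select in the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not have;
        # stay in the default locale instead of failing.
        pass
    # POSIX query for the locale's character set name.
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(current_codeset())
```

The printed name depends entirely on the environment, so no particular output is implied here.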
Regards, Bruce
>>> Where and how will the charset conversion of the filenames be handled?
>>
>> Yes, it will be.
>
> The only sane approach is to assume that the current locale of the user
> running uuencode normally sees sane filenames, and transliterate from
> the user's locale into UTF-8. Either the filename is a character string
> in the user's current locale (and therefore, every character can be
> transliterated into UTF-8; perhaps trivially if the user's locale is
> already UTF-8), or the filename is already random bytes that the user
> cannot see as characters in their current locale. In the latter case,
> you can still do a 1:1 mapping, where all invalid bytes are mapped to a
> 2nd-half of a UTF-8 surrogate pair.
>
> Then, take that UTF-8 multibyte sequence (including 2nd-half surrogate
> pair mappings for all invalid bytes that were not characters), and
> flatten it into something that is just ascii.
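
The two encoding steps described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's "surrogateescape" error handler (PEP 383), which maps each undecodable byte to a lone low surrogate in U+DC80..U+DCFF, and it flattens to ASCII with quoted-printable-style =XX escapes. The helper name flatten_name, the PORTABLE set and the choice of =XX escaping are assumptions made for the example, not anything the thread settled on.

```python
# Bytes treated as portable in a file name (an assumption for this sketch).
PORTABLE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._-"
)


def flatten_name(raw: bytes, charset: str) -> str:
    """Hypothetical helper: file name bytes in `charset` -> ASCII-only string."""
    # Bytes that are not valid characters in `charset` become lone low
    # surrogates U+DC80..U+DCFF instead of raising an error.
    text = raw.decode(charset, errors="surrogateescape")
    # "surrogatepass" lets those lone surrogates through the UTF-8 encoder.
    utf8 = text.encode("utf-8", errors="surrogatepass")
    # Keep portable bytes as-is; write every other byte as =XX.
    return "".join(chr(b) if b in PORTABLE else "=%02X" % b for b in utf8)


if __name__ == "__main__":
    print(flatten_name(b"j\xf6rg", "latin-1"))  # valid Latin-1 name
    print(flatten_name(b"j\xf6rg", "utf-8"))    # 0xF6 is not valid UTF-8
```

In the first call the byte 0xF6 is a real character (o-umlaut) and ends up as its UTF-8 encoding; in the second it is an invalid byte and ends up as the UTF-8 form of the escape surrogate U+DCF6.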
>
> On the uudecode side, take the ascii and convert it back to UTF-8, then
> transliterate into the user's current locale. Here, the transliteration
> might be lossy (if the user's charset doesn't support all the characters
> that were in the input) - here, I'm not sure whether best practice is to
> transliterate from the unrepresentable character to '?' or to leave the
> unrepresentable character as raw Unicode bytes (the latter is what leads
> to mojibake). But if the receiver's current locale is UTF-8, lossy
> transliteration is not an issue. Meanwhile, if the encoded string
> contained any unmatched 2nd-half surrogate pairs, you can unambiguously
> recover the raw byte that was not a character, and use that byte as-is.
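
A minimal sketch of the receiving side, assuming the ASCII form has already been turned back into a Unicode string in which invalid bytes appear as lone surrogates U+DC80..U+DCFF (the convention of Python's "surrogateescape" handler). It picks the '?' option for unrepresentable characters purely for illustration; to_local_bytes is a hypothetical helper name. Encoding one character at a time is fine for stateless charsets but would not suit stateful ones such as ISO-2022-JP.

```python
def to_local_bytes(name: str, charset: str) -> bytes:
    """Hypothetical helper: Unicode name -> bytes in the receiver's charset."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Unmatched low surrogate: recover the original raw byte.
            out.append(cp - 0xDC00)
            continue
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Character does not exist in the receiver's charset: lossy '?'.
            out += b"?"
    return bytes(out)


if __name__ == "__main__":
    print(to_local_bytes("j\u00f6rg", "latin-1"))
    print(to_local_bytes("j\u00f6rg", "ascii"))
    print(to_local_bytes("j\udcf6rg", "utf-8"))
```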
>
> The nice part about this algorithm is that if both sender and receiver
> only use a subset of characters that exist in both charsets, then they
> both see the same filename, even if the two locations are using
> different charsets. If the receiver is using UTF-8 (which is more and
> more common these days), they will see whatever name the sender saw
> regardless of the sender's charset. The only place where mojibake still
> happens is if the sender uses characters that are not in the receiver's
> charset - and that's not entirely a real loss, since it was already the
> case that the sender is doing non-portable things by sending
> non-portable filename characters in the first place.
>
>>> or
>>> =?utf-8?Q?j=C3=B6rg?=
>
> You want some sort of utf-8 encoding, and preferably one that encodes
> only the non-portable characters. This type of naming looks best to me.
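
The quoted name uses the RFC 2047 "encoded-word" syntax (charset, then Q for quoted-printable-style encoding, then the text). As a small illustration, and not a claim about how uuencode or uudecode would implement it, Python's standard email.header.decode_header can read such a string; decode_encoded_name is a hypothetical wrapper name.

```python
from email.header import decode_header


def decode_encoded_name(word: str) -> str:
    """Hypothetical wrapper: decode an RFC 2047 encoded-word into text."""
    parts = []
    for data, charset in decode_header(word):
        if isinstance(data, bytes):
            # Encoded parts come back as bytes plus their declared charset;
            # unencoded parts may come back with charset None.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_encoded_name("=?utf-8?Q?j=C3=B6rg?="))
```

Only the two non-ASCII bytes are escaped in this form, which matches the preference for encoding just the non-portable characters.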