bug-gnu-utils

Re: uuencode: multi-bytes char in remote file name contains bytes >0x80


From: Bruno Haible
Subject: Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
Date: Wed, 6 Jul 2011 22:56:00 +0200
User-agent: KMail/1.9.9

Hi Bruce,

> I pick the way that is most robust and prone to the fewest problems.
> You tell me, please. :)

OK :)

> >    a) Do the charset conversion on the receiver's side, and on the sender's
> >       side only embed the charset. The most well-known encoding of this
> >       kind is probably the way subject lines are encoded in MIME:
> >       "jörg" would become
> >          =?iso-8859-1?Q?j=F6rg?=
> >       or
> >          =?utf-8?Q?j=C3=B6rg?=
> >       or
> >          hex-encode:3F69736F2D383835392D313F513F6A3D463672673F

This approach was preferred between ca. 1995 and 1999, because at that time,
it was not clear that Unicode would succeed in the way it did.
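
A minimal Python sketch of approach a), to make the quoted examples concrete.
The helpers q_encode, q_decode, encode_labeled and decode_labeled are
hypothetical names chosen for illustration; they are not part of uuencode, and
the Q-style escaping here is deliberately simplified (it escapes everything
except ASCII letters and digits).

```python
# Approach a): the sender only labels the bytes with their charset;
# the receiver does the charset conversion. All names are illustrative.

def q_encode(data: bytes) -> str:
    """Keep ASCII letters and digits, escape every other byte as =XX."""
    return "".join(
        chr(b) if chr(b).isascii() and chr(b).isalnum() else f"={b:02X}"
        for b in data
    )


def q_decode(text: str) -> bytes:
    """Inverse of q_encode."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def encode_labeled(raw_name: bytes, charset: str) -> str:
    """Sender side: no conversion, just embed the charset label."""
    return f"=?{charset}?Q?{q_encode(raw_name)}?="


def decode_labeled(field: str) -> str:
    """Receiver side: trust the label and decode with it."""
    charset, _, payload = field[2:-2].split("?", 2)
    return q_decode(payload).decode(charset)


latin1 = encode_labeled("j\u00f6rg".encode("iso-8859-1"), "iso-8859-1")
utf8 = encode_labeled("j\u00f6rg".encode("utf-8"), "utf-8")
print(latin1)  # =?iso-8859-1?Q?j=F6rg?=
print(utf8)    # =?utf-8?Q?j=C3=B6rg?=
```

Note that decode_labeled can only be as correct as the label it is given.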

> >    b) Do the charset conversion both on the sender's side and on the
> >       receiver's side, and always send filenames converted to UTF-8.
> >       Example:
> >          j=C3=B6rg
> >       or
> >          hex-encode:6AC3B67267

Approach b), by contrast, has been the preferred one since ca. 2001.
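
A minimal Python sketch of approach b), assuming the "hex-encode:" notation
from the quoted proposal. The function names sender_encode and
receiver_decode are hypothetical, and each side is assumed to know only its
own local charset.

```python
# Approach b): both sides convert, and only UTF-8 travels on the wire.
# Function names and the "hex-encode:" prefix handling are illustrative.

PREFIX = "hex-encode:"


def sender_encode(raw_name: bytes, sender_charset: str) -> str:
    """Convert the local file name to UTF-8, then hex-encode it."""
    text = raw_name.decode(sender_charset)
    return PREFIX + text.encode("utf-8").hex().upper()


def receiver_decode(field: str, receiver_charset: str) -> bytes:
    """Undo the hex encoding, then convert UTF-8 to the local charset."""
    if not field.startswith(PREFIX):
        raise ValueError("not a hex-encoded name")
    text = bytes.fromhex(field[len(PREFIX):]).decode("utf-8")
    return text.encode(receiver_charset)


# Sender works in ISO-8859-1, receiver in UTF-8; neither needs to know
# the other's charset.
wire = sender_encode(b"j\xf6rg", "iso-8859-1")
print(wire)                            # hex-encode:6AC3B67267
print(receiver_decode(wire, "utf-8"))  # b'j\xc3\xb6rg'
```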

> I'll do what you suggest and run the result
> past both you and our new friend, =?GB2312?B?j4jI/g==?=

You are presenting a good argument for b) and against a): the charset label
is often wrong. Your example claims to be GB2312 but is in fact CP936, an
extension of GB2312 [1].

  $ echo -n j4jI/g== | base64 -d | iconv -f GB2312 -t UTF-8
  iconv: (stdin):1:0: cannot convert
  $ echo -n j4jI/g== | base64 -d | iconv -f CP936 -t UTF-8
  張叁

Such mislabeling is common in email and HTML, for historical reasons. It is
better to use approach b), because it does not require the sender and the
receiver to agree on what they mean by "GB2312" (or worse: by "Big5").
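
The same check as the iconv transcript above can be sketched in Python,
assuming Python's built-in "gb2312" and "cp936" codecs behave like the
corresponding iconv converters for these bytes. try_decode is a hypothetical
helper.

```python
import base64

# The payload of the encoded word that was labeled "GB2312".
raw = base64.b64decode("j4jI/g==")


def try_decode(data: bytes, charset: str):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


as_gb2312 = try_decode(raw, "gb2312")  # the charset the label claims
as_cp936 = try_decode(raw, "cp936")    # the charset actually used

print(raw.hex())
print("gb2312:", as_gb2312)
# Print code points rather than glyphs, so the output is terminal-independent.
print("cp936:", [f"U+{ord(c):04X}" for c in as_cp936] if as_cp936 else None)
```

A receiver that trusts the label fails here; one that guesses CP936 succeeds,
but only because it ignored the label.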

Additionally, approach b) usually yields shorter strings than approach a).
That matters too, given that uuencode's output should fit in 80 columns.
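
To make the length difference concrete, here is a small Python sketch that
measures the example strings quoted earlier in this message. The dictionary
keys are just labels invented for this illustration.

```python
# Lengths of the "jörg" encodings quoted in this thread.
candidates = {
    "a) Q, iso-8859-1": "=?iso-8859-1?Q?j=F6rg?=",
    "a) Q, utf-8": "=?utf-8?Q?j=C3=B6rg?=",
    "a) hex": "hex-encode:3F69736F2D383835392D313F513F6A3D463672673F",
    "b) Q, utf-8": "j=C3=B6rg",
    "b) hex": "hex-encode:6AC3B67267",
}

for label, field in candidates.items():
    print(f"{len(field):3d}  {label}")
```

The saving in b) comes from not having to transmit the charset label at all.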

Bruno

[1] http://www.haible.de/bruno/charsets/conversion-tables/GB2312.html
-- 
In memoriam Jan Hus <http://en.wikipedia.org/wiki/Jan_Hus>


