Re: supporting obscure languages

bug-gnu-utils

From:	Albert Cahalan
Subject:	Re: supporting obscure languages
Date:	Sat, 28 Nov 2009 08:02:40 -0500

On Sat, Nov 28, 2009 at 5:34 AM, Bruno Haible <address@hidden> wrote:
> Albert Cahalan wrote:

>> I don't need month names, time display rules, telephone formats...
>>
>> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized
>
> Then your workaround of doing
>  LANGUAGE=zam LC_ALL=fr_FR.UTF-8
> is just fine.

Don't you think that is terribly gross? (French with
different words!)

Don't you think it's doubly gross to have a program
calling setenv() to control a library via environment
variables intended for users instead of a proper API?

>> BTW, we'd like fallback to similar translations in case something
>> is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
>> next place to look.
>
> That's a built-in feature in GNU gettext: just set the LANGUAGE variable to
>  zh_TW:zh_CN
> and you're done.

I guess we'll probably do that. Still, setenv as an API
is really disturbing. I greatly prefer to treat the environment
as read-only.

The library doesn't even get immediate notice that there
has been a change unless you have evil hooks into the
setenv and getenv functions. You'd have to either do a
slow getenv each time, or cache the value and hope the
program doesn't try to change things later.
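
A minimal sketch of the environment-driven workaround discussed above, assuming a glibc-style gettext. The helper name `select_message_languages` and its parameters are hypothetical; `setenv`, `setlocale`, `bindtextdomain`, `bind_textdomain_codeset` and `textdomain` are the standard calls. It also shows the objection: the only way to hand the library a fallback list such as "zh_TW:zh_CN" is to write to the process environment.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: choose message catalogs by writing LANGUAGE.
 * priority_list is a colon-separated list such as "zh_TW:zh_CN".
 * catalog_dir may be NULL to keep the default catalog directory.
 * Returns 0 on success, -1 if the environment could not be changed. */
static int select_message_languages(const char *priority_list,
                                    const char *domain,
                                    const char *catalog_dir)
{
    assert(priority_list != NULL && domain != NULL);

    /* The "API" is the environment: overwrite any existing value. */
    if (setenv("LANGUAGE", priority_list, 1) != 0)
        return -1;

    /* Take LC_* and LANG from the environment. If nothing usable is
     * set, this can leave the process in the "C" locale. */
    setlocale(LC_ALL, "");

    if (catalog_dir != NULL)
        bindtextdomain(domain, catalog_dir);

    /* Ask for translated strings to be returned as UTF-8. */
    bind_textdomain_codeset(domain, "UTF-8");
    textdomain(domain);
    return 0;
}
```

Whether a later `setenv("LANGUAGE", ...)` is noticed depends on how the library reads the variable, which is the caching concern raised above.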

>> setlocale(LC_ALL, loc);  // loc="" or loc="zam"
>> ctype_utf8();  // setlocale(LC_CTYPE,x) for many x until iswprint works
>
> Yes, you have no guarantee that a particular locale is installed on the user's
> system. You have to try some. setlocale(LC_ALL, "") is a good first guess.

That guess is just "C" on my system.
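
For illustration, the ctype_utf8() idea quoted above might look roughly like this. The candidate list is supplied by the caller and is purely an assumption; none of the names are guaranteed to be installed. The probe uses iswprint() on 0xf7, the same test character used later in this message.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: try each candidate as LC_CTYPE until one
 * classifies 0xf7 as printable. Returns the name that worked, or
 * NULL after resetting LC_CTYPE to "C". */
static const char *ctype_utf8(const char *const *candidates, size_t count)
{
    assert(candidates != NULL);

    for (size_t i = 0; i < count; i++) {
        /* setlocale() returns NULL when the locale is unavailable. */
        if (setlocale(LC_CTYPE, candidates[i]) == NULL)
            continue;
        /* U+00F7 DIVISION SIGN, if wchar_t holds Unicode code points. */
        if (iswprint((wint_t)0xf7))
            return candidates[i];
    }

    setlocale(LC_CTYPE, "C");
    return NULL;
}
```

A caller might pass something like {"", "C.UTF-8", "en_US.UTF-8", "fr_FR.UTF-8"}, which is exactly the "random unrelated locale" guessing game.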

>> My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8
>>
>> Yep, I'm telling gettext that this is French. That's disgusting.
>
> No, you are telling the system to use a UTF-8 encoding for strings,
> French rules for time, sorting, numbers etc, and Zapotec for messages.
> If it fits well with your program, all fine.

Eh, the Zapotec dialect of French. It does work, as long as
the user happens to have fr_FR.UTF-8 installed.

That's trouble. I'm depending on some random unrelated locale
just to get normal UTF-8 behavior.

>> There are quite a few design bugs here, none of which would cause
>> huge problems all by itself. Together, they are a disaster.
>>
>> a. The implementation-specific "" locale is "C". (it need not be)
>
> No, when you call setlocale(LC_ALL,"") it uses the locale that the
> user has set, not "C".

I mean when the user has done nothing either. The "" doesn't
get filled in by some environment variable. You make it all the
way to the lowest-priority environment variable ("LANG") and
still have "". At that point, the implementation-specific locale
is chosen... and it is "C".
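
The lookup order described here can be sketched as follows, assuming the usual POSIX precedence (LC_ALL, then the category's own variable, then LANG, with empty values treated as unset). The function name is hypothetical, and the final "C" stands in for the implementation-defined default, which is "C" on the system described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: report which locale name setlocale(cat, "")
 * would be asked to use. category_var is the environment variable
 * for the category, e.g. "LC_MESSAGES" or "LC_CTYPE". */
static const char *effective_locale_name(const char *category_var)
{
    const char *vars[3] = { "LC_ALL", category_var, "LANG" };

    assert(category_var != NULL);

    for (int i = 0; i < 3; i++) {
        const char *value = getenv(vars[i]);
        /* An empty value counts as "not set". */
        if (value != NULL && value[0] != '\0')
            return value;
    }

    /* Nothing set anywhere: the implementation-defined default. */
    return "C";
}
```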

>> b. The "C" locale is not UTF-8. (this need not be the case)
>
> The "C" locale was defined at a time when there was no UTF-8. This
> choice accommodates output devices that cannot display arbitrary
> Unicode characters (think of ssh into an older Unix system).

I can sort of understand this. I own a real VT510 terminal.

It's not a working protection though. Linux distributions often
set a UTF-8 locale, then fail to translate or otherwise protect
logins on the serial tty devices. This happens to be why procps
replaces UTF-8 characters containing the 0x9b byte. (but of
course that is potentially hostile data, not translations, and
Red Hat patches out the protection anyway)
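
To make the 0x9b concern concrete: on an 8-bit terminal that byte is the C1 CSI control, and it can appear as a continuation byte inside an ordinary UTF-8 character (U+00DB encodes as 0xC3 0x9B). The sketch below is not the procps code; it is a simplified illustration with a hypothetical function name, and it treats every run of continuation bytes after a lead byte as one character without validating the sequence length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter: copy `len` bytes from `in` to `out`, replacing
 * any UTF-8 character that contains the byte 0x9b with a single '?'.
 * `out` must have room for `len` bytes. Returns the bytes written. */
static size_t mask_csi_bytes(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        size_t end = i + 1;
        int has_csi = (in[i] == 0x9b);   /* stray 0x9b on its own */

        if (in[i] >= 0xc0) {
            /* Lead byte: absorb the 10xxxxxx continuation bytes. */
            while (end < len && (in[end] & 0xc0) == 0x80) {
                if (in[end] == 0x9b)
                    has_csi = 1;
                end++;
            }
        }

        if (has_csi) {
            out[n++] = '?';              /* drop the whole character */
        } else {
            for (size_t k = i; k < end; k++)
                out[n++] = in[k];        /* copy unchanged */
        }
        i = end;
    }
    return n;
}
```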

Having "C" not be i18n-friendly (that is, not serving up UTF-8
messages and full Unicode in wchar_t) wouldn't be a big deal
except for the fact that the locale so easily ends up being "C"
(when unspecified, when a locale is broken/unknown, etc.).

>> c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)
>
> I agree with you that wide characters are a mess in ISO C, because the
> meaning of (wchar_t)0xf7 depends on locales: in some locale it may be
> a DIVISION SIGN, in another one a CYRILLIC SMALL LETTER YI, in another
> one a LATIN SMALL LETTER S WITH ACUTE, and in another one it's invalid.

Locales with non-Unicode wchar_t are far worse than locales
with non-UTF-8 char. Lots of software breaks, and nobody will
fix it. There comes a time to deprecate dysfunctional locales.

>> d. The "C" locale ignores LC_MESSAGES, even if not "C".
>
> What do you expect the system to do when you set LC_ALL to "C" and
> then LC_MESSAGES to "zh_CN"? All characters are US-ASCII but messages
> should be in Chinese? In earlier versions of glibc, the Chinese strings
> were converted to "?????? ??? ?????? 32 ?????" before being displayed.
> This was not really helpful; so now the translations are ignored
> entirely in this case.

Just be binary-clean. Remember why UTF-8 was invented.
If glibc were binary clean, messages would normally just work.
They would certainly work for typical GUI stuff using Pango,
and would even work in many terminal situations.

>> e. The locale reverts to "C" if some portion is missing/unknown.
>
> What's wrong with having a fallback if some portion is missing?

Nothing. The problem is how this interacts with the other stuff.
If the fallback were something like "C.UTF-8" or the "C" locale
weren't severely limited, there would be no problem.

It's only the combination of all these design issues that results in
a problem. Individually, no one design issue is really a problem.

>> The result is that none of these work:
>>
>> a. setlocale(LC_ALL,"zam");
>> b. setlocale(LC_MESSAGES,"zam");
>> c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");
>
> None of these work because you don't have a "zam" locale installed in
> the first place. setlocale is about designating locales to use.

I have a piece of a locale installed. (my "zam.mo" file)
To use that, I mainly just need a binary-clean library.
Getting iswprint() and towupper() would be nice too, but
it's not a huge problem for me to write my own.

Basically: use what is there, and assume something close
to "C.UTF-8" for anything missing/broken. Maybe you could
find choices that are more generic than "C", like 24-hour time
and PA4 paper size. Maybe round-trip the case for U+1E9E,
avoiding expansion troubles. You could call it "default.UTF-8".

The details aren't terribly critical; the main thing is to let a
random loose UTF-8 *.mo file work without hacks or fuss,
along with the wchar_t functions working beyond ASCII.

>> There just doesn't seem to be any reasonable way to kick gettext into
>> UTF-8 mode and feed it a *.mo file.
>
> You found the way and showed it to us.

Trying random unrelated locales and calling putenv() is
pretty far from reasonable IMHO.

Current Thread

supporting obscure languages, Albert Cahalan, 2009/11/27
- Re: supporting obscure languages, Bruno Haible, 2009/11/27
  - Re: supporting obscure languages, John Cowan, 2009/11/27
    - Re: supporting obscure languages, Bruno Haible, 2009/11/28
  - Re: supporting obscure languages, Albert Cahalan, 2009/11/27
    - Re: supporting obscure languages, Bruno Haible, 2009/11/28
    - Re: supporting obscure languages, Eric Blake, 2009/11/28
    - Re: supporting obscure languages, Bruno Haible, 2009/11/28
    - Re: supporting obscure languages, Albert Cahalan <=
    - Re: supporting obscure languages, Bruno Haible, 2009/11/28
    - Re: supporting obscure languages, Albert Cahalan, 2009/11/28
    - Re: German uppercasing rules (was: supporting obscure languages), Bruno Haible, 2009/11/28
    - Re: German uppercasing rules (was: supporting obscure languages), Albert Cahalan, 2009/11/28
    - Re: German uppercasing rules (was: supporting obscure languages), Bruno Haible, 2009/11/28
    - Re: German uppercasing rules (was: supporting obscure languages), John Cowan, 2009/11/28
