bug-gnu-utils

Re: supporting obscure languages


From: Bruno Haible
Subject: Re: supporting obscure languages
Date: Sat, 28 Nov 2009 11:34:55 +0100
User-agent: KMail/1.9.9

Albert Cahalan wrote:
> Well, that's the language I'm currently using for testing.
> I'm sure it's not the only thing failing. I have:
> 
> af.po   de.po     fi.po   id.po  nb.po     shs.po  tlh.po
> ar.po   el.po     fo.po   is.po  nl.po     sk.po   tr.po
> ast.po  en_AU.po  fr.po   it.po  nn.po     sl.po   twi.po
> az.po   en_CA.po  ga.po   ja.po  nr.po     son.po  uk.po
> be.po   en_GB.po  gd.po   ka.po  oc.po     sq.po   ve.po
> bg.po   en_ZA.po  gl.po   km.po  oj.po     sr.po   vi.po
> bo.po   eo.po     gos.po  ko.po  pl.po     sv.po   wa.po
> br.po   es.po     gu.po   ku.po  pt.po     sw.po   wo.po
> ca.po   es_MX.po  he.po   lt.po  pt_BR.po  ta.po   xh.po
> cs.po   et.po     hi.po   lv.po  ro.po     te.po   zam.po
> cy.po   eu.po     hr.po   mk.po  ru.po     th.po   zh_CN.po
> da.po   fa.po     hu.po   ms.po  rw.po     tl.po   zh_TW.po

84 languages! That is impressive. The largest number of translations
of a package in the Translation Project is currently 58 languages.

> > 2) You may need to define a glibc locale. This is necessary for a
> >   distinct language and optional for a variant (need it only if you
> >   want to override some localizations). You need it because things
> >   like month name, time display rules and the like are not defined
> >   by .po files but through a locale definition.
> 
> ... Tux Paint sure doesn't need any of that.
> I don't need month name, time display rules, telephone formats...
> 
> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized

Then your workaround of doing
  LANGUAGE=zam LC_ALL=fr_FR.UTF-8
is just fine.

> I need two ways to make this happen. First, via the environment.
> Second, via function calls so that I can have the --locale=zam
> and --lang=zapotec options work.

For the first way, you can refer the user to the GNU gettext documentation
  http://www.gnu.org/software/gettext/manual/html_node/Users.html
or tell them to set LANGUAGE, if you prefer that.

For the second way, you can call
  setenv ("LC_ALL", "fr_FR.UTF-8", 1);
  setenv ("LANGUAGE", "zam", 1);
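
A minimal sketch of how a --lang option handler could wrap these two
calls. The helper name use_messages_language is hypothetical (it is not
part of Tux Paint or gettext), and the sketch assumes that setlocale is
called again afterwards so that the new LC_ALL value is picked up:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose setenv() under strict -std= modes */
#endif
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: export LC_ALL and LANGUAGE, then re-read the
 * environment. base_locale should name an installed UTF-8 locale and
 * language is the message catalog to prefer (for example "zam").
 * Returns 0 on success, -1 if setenv fails or the locale is unavailable. */
int use_messages_language(const char *base_locale, const char *language)
{
    if (setenv("LC_ALL", base_locale, 1) != 0)
        return -1;
    if (setenv("LANGUAGE", language, 1) != 0)
        return -1;
    /* setlocale(LC_ALL, "") consults the environment just modified. */
    return setlocale(LC_ALL, "") != NULL ? 0 : -1;
}
```

A caller would use it as use_messages_language("fr_FR.UTF-8", "zam")
before the bindtextdomain and textdomain calls, and try another base
locale if it returns -1.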

> BTW, we'd like fallback to similar translations in case something
> is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
> next place to look.

That's a built-in feature in GNU gettext: just set the LANGUAGE variable to
  zh_TW:zh_CN
and you're done.
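
If the list has to be built inside the program, something like the
following sketch may do. The helper set_language_priority and its fixed
256-byte buffer are assumptions made for illustration; only the
colon-separated LANGUAGE format comes from GNU gettext:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose setenv() under strict -std= modes */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: join language codes with ':' in priority order
 * and export the result as LANGUAGE. Returns 0 on success, -1 if the
 * list does not fit in the buffer or setenv fails. */
int set_language_priority(const char *const *langs, size_t count)
{
    char buf[256];
    size_t used = 0;
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < count; i++) {
        int n = snprintf(buf + used, sizeof buf - used, "%s%s",
                         i ? ":" : "", langs[i]);
        if (n < 0 || (size_t)n >= sizeof buf - used)
            return -1;  /* output error or list too long */
        used += (size_t)n;
    }
    return setenv("LANGUAGE", buf, 1);
}
```

Note that GNU gettext ignores LANGUAGE while the locale is "C", so this
only has an effect together with a non-"C" locale such as fr_FR.UTF-8.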

> I end up with glibc's broken "C" locale.
> Tux Paint's code does this now:
> 
> setlocale(LC_ALL, loc);  // loc="" or loc="zam"
> ctype_utf8();  // setlocale(LC_CTYPE,x) for many x until iswprint works

Yes, you have no guarantee that a particular locale is installed on the user's
system. You have to try some. setlocale(LC_ALL, "") is a good first guess.
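
A sketch of such a "try some" loop for LC_CTYPE. The function name
pick_utf8_ctype and the candidate list are assumptions, and the check
uses nl_langinfo(CODESET) instead of the iswprint probe in the quoted
code, which is a substitution on my part, not what Tux Paint does:

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try LC_CTYPE locales until one reports a UTF-8
 * codeset. Returns the name that worked ("" means the user's own
 * setting), or NULL after falling back to "C". */
const char *pick_utf8_ctype(void)
{
    /* Candidate names are assumptions; installed locales vary by system. */
    static const char *const candidates[] = {
        "", "C.UTF-8", "en_US.UTF-8", "fr_FR.UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL
            && strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
            return candidates[i];
    }
    setlocale(LC_CTYPE, "C");  /* nothing suitable is installed */
    return NULL;
}
```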

> bindtextdomain("tuxpaint", LOCALEDIR);
> bind_textdomain_codeset("tuxpaint", "UTF-8");
> textdomain("tuxpaint");

Right.

> The i18n source is here:
> http://tuxpaint.cvs.sf.net/viewvc/tuxpaint/tuxpaint/src/i18n.c?revision=1.72
> 
> The interesting stuff starts in the set_current_locale(char *locale)
> function, with the requested locale being "" or from the command line.

Looks reasonable.

> My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8
> 
> Yep, I'm telling gettext that this is French. That's disgusting.

No, you are telling the system to use a UTF-8 encoding for strings,
French rules for time, sorting, numbers, etc., and Zapotec for messages.
If that fits your program, it is fine.

> There are quite a few design bugs here, none of which would cause
> huge problems all by itself. Together, they are a disaster.
> 
> a. The implementation-specific "" locale is "C". (it need not be)

No, when you call setlocale(LC_ALL,"") it uses the locale that the
user has set, not "C".

> b. The "C" locale is not UTF-8. (this need not be the case)

The "C" locale was defined at a time when there was no UTF-8. This
choice accommodates output devices that cannot display arbitrary
Unicode characters (think of ssh into an older Unix system).

> c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)

I agree with you that wide characters are a mess in ISO C, because the
meaning of (wchar_t)0xf7 depends on locales: in some locale it may be
a DIVISION SIGN, in another one a CYRILLIC SMALL LETTER YI, in another
one a LATIN SMALL LETTER S WITH ACUTE, and in another one it's invalid.

> d. The "C" locale ignores LC_MESSAGES, even if not "C".

What do you expect the system to do when you set LC_ALL to "C" and
then LC_MESSAGES to "zh_CN"? All characters are US-ASCII but messages
should be in Chinese? In earlier versions of glibc, the Chinese strings
were converted to "?????? ??? ?????? 32 ?????" before being displayed.
This was not really helpful; so now the translations are ignored
entirely in this case.

> e. The locale reverts to "C" if some portion is missing/unknown.

What's wrong with having a fallback if some portion is missing?

> The result is that none of these work:
> 
> a. setlocale(LC_ALL,"zam");
> b. setlocale(LC_MESSAGES,"zam");
> c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");

None of these work because you don't have a "zam" locale installed in
the first place. setlocale is about designating locales to use.
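
Whether a name designates an installed locale can be checked through
the return value of setlocale, which is NULL when the request cannot be
honored. The helper below is a hypothetical sketch; it saves and
restores the current locale so that the probe has no lasting effect.
Some C libraries accept names that glibc would reject, so the result is
system-dependent:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if setlocale accepts `name` for
 * LC_ALL, 0 otherwise. The previous locale is restored afterwards. */
int locale_is_installed(const char *name)
{
    const char *current = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    int ok;

    /* Copy the current name: a later setlocale call may overwrite it. */
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    ok = setlocale(LC_ALL, name) != NULL;

    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return ok;
}
```

On a system without a "zam" locale definition, locale_is_installed("zam")
would be expected to return 0 with glibc, which matches the failures
listed above.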

> There just doesn't seem to be any reasonable way to kick gettext into
> UTF-8 mode and feed it a *.mo file.

You found the way and showed it to us.

Bruno



