Re: supporting obscure languages

bug-gnu-utils

From:	Albert Cahalan
Subject:	Re: supporting obscure languages
Date:	Fri, 27 Nov 2009 18:09:12 -0500

On Fri, Nov 27, 2009 at 7:42 AM, Bruno Haible <address@hidden> wrote:

> You did not state what you are trying to do. I understand it like this:
> "How do I add support for a specific, rarely used language to my system
>  in such a way that I can localize programs for this language?"

I'm interested in that, and I think it should be trivial, but I'm
actually dealing with this from the view of a software developer
with existing *.mo files. I'm working on Tux Paint.

(think of the children)

At this point I'm seriously considering ripping out the gettext
stuff because it is fighting me every step of the way. It looks
like less trouble to write my own; we already do this for audio
and fonts. I hope you wish for gettext to be easy to work with.

> 1) You need to define a locale identifier for it. This is important,
>   because the users and all translators must agree on it - if a
>   translator uses a different identifier than the user, her
>   translations will not be found. The standardized identifiers
>   are those in ISO 639-1 and ISO 639-2, and also found in glibc's
>   glibc/locale/iso-639.def.

Done. It's some Zapotec thing that I know very little about.
I'm not the translator. The translator(s) decided, and I'm
certainly not about to argue.

Well, that's the language I'm currently using for testing.
I'm sure it's not the only thing failing. I have:

af.po   de.po     fi.po   id.po  nb.po     shs.po  tlh.po
ar.po   el.po     fo.po   is.po  nl.po     sk.po   tr.po
ast.po  en_AU.po  fr.po   it.po  nn.po     sl.po   twi.po
az.po   en_CA.po  ga.po   ja.po  nr.po     son.po  uk.po
be.po   en_GB.po  gd.po   ka.po  oc.po     sq.po   ve.po
bg.po   en_ZA.po  gl.po   km.po  oj.po     sr.po   vi.po
bo.po   eo.po     gos.po  ko.po  pl.po     sv.po   wa.po
br.po   es.po     gu.po   ku.po  pt.po     sw.po   wo.po
ca.po   es_MX.po  he.po   lt.po  pt_BR.po  ta.po   xh.po
cs.po   et.po     hi.po   lv.po  ro.po     te.po   zam.po
cy.po   eu.po     hr.po   mk.po  ru.po     th.po   zh_CN.po
da.po   fa.po     hu.po   ms.po  rw.po     tl.po   zh_TW.po

(ever see that many at once before?)

> 2) You may need to define a glibc locale. This is necessary for a
>   distinct language and optional for a variant (need it only if you
>   want to override some localizations). You need it because things
>   like month name, time display rules and the like are not defined
>   by .po files but through a locale definition.

Frankly, I don't give a shit. If somebody decides they care, they
can define these things. Tux Paint sure doesn't need any of that.
I don't need month name, time display rules, telephone formats...

All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized

I need two ways to make this happen. First, via the environment.
Second, via function calls so that I can have the --locale=zam
and --lang=zapotec options work.
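
To make the second route concrete, here is a minimal sketch of how a
--locale=NAME option could be mapped onto setlocale(). The helper names
locale_from_args and apply_locale are made up for illustration and are
not Tux Paint's actual code. When the option is absent the sketch passes
"", which asks libc to consult the environment, so both routes go
through the same call.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the value of a "--locale=NAME" argument,
 * or "" if none is given. "" tells setlocale() to use the environment
 * (LC_ALL, LC_*, LANG). */
const char *locale_from_args(int argc, char **argv)
{
    static const char prefix[] = "--locale=";
    const size_t n = sizeof prefix - 1;

    for (int i = 1; i < argc; i++) {
        if (strncmp(argv[i], prefix, n) == 0)
            return argv[i] + n;
    }
    return "";
}

/* Hypothetical helper: apply the chosen locale and report whether libc
 * accepted it (1) or rejected it (0). On rejection the process locale
 * is left unchanged. */
int apply_locale(int argc, char **argv)
{
    return setlocale(LC_ALL, locale_from_args(argc, argv)) != NULL;
}
```

A zero return from apply_locale is the case complained about in this
thread: libc has no locale by that name, even though a *.mo file for it
exists.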

> To create a locale, use the
>   "localedef" command together with a locale definition file. There
>   are dozens of examples of these locale definition files in a
>   directory mentioned in the output of "localedef --help".

That would be complicated sysadmin work. These machines probably
run es_MX.UTF-8 most of the time, or maybe C. Nobody wants to wait
for Fedora or Debian to take their sweet time adding "zam".
This also, somehow, needs to work for Windows and Mac OS X. It will
be a cold day in Hell before Microsoft or Apple supports Zapotec.

> 3) Then you can create .mo files from .po files for that language,
>   as described in the GNU gettext documentation.

Done. Tux Paint includes 84 translations.

BTW, we'd like fallback to similar translations in case something
is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
next place to look.
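
One existing mechanism that may cover part of this is GNU gettext's
LANGUAGE variable, which takes a colon-separated priority list. The
sketch below only builds and exports such a list; the helper name
set_language_fallback is hypothetical, setenv() is POSIX and so not
available as-is on Windows, and whether the result is honoured depends
on the rest of the locale setup.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std modes */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: export LANGUAGE="primary:fallback".
 * Returns 0 on success, -1 on error or if the names do not fit. */
int set_language_fallback(const char *primary, const char *fallback)
{
    char value[64];
    int n = snprintf(value, sizeof value, "%s:%s", primary, fallback);

    if (n < 0 || (size_t)n >= sizeof value)
        return -1;                       /* formatting error or truncated */
    return setenv("LANGUAGE", value, 1); /* 1 = overwrite existing value */
}
```

Calling set_language_fallback("zh_TW", "zh_CN") before the first
gettext() lookup would express the preference described above.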

>> How can a program offer a non-environment way to override the source
>> of messages? The obvious setlocale(LC_ALL,"zam") does not work, nor
>> does the troublesome (because other locales need more) substitution
>> of setlocale(LC_MESSAGES,"zam").
>
> There is setlocale, and there is bindtextdomain. But you should have a
> locale first.

It doesn't work. I end up with glibc's broken "C" locale.
Tux Paint's code does this now:

setlocale(LC_ALL, loc);  // loc="" or loc="zam"
ctype_utf8();  // setlocale(LC_CTYPE,x) for many x until iswprint works
bindtextdomain("tuxpaint", LOCALEDIR);
bind_textdomain_codeset("tuxpaint", "UTF-8");
textdomain("tuxpaint");

The i18n source is here:
http://tuxpaint.cvs.sf.net/viewvc/tuxpaint/tuxpaint/src/i18n.c?revision=1.72

The interesting stuff starts in the set_current_locale(char *locale)
function, with the requested locale being "" or from the command line.

>> BTW, please consider it a bug that that doesn't just work.
>
> No, not a bug. This is the way locales are designed.

That makes it a design bug.

My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8

Yep, I'm telling gettext that this is French. That's disgusting.
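
The reason the hack needs a real locale at all: the GNU gettext manual
says LANGUAGE is ignored while the locale is "C". A small check along
these lines can tell a program whether it is in that state. The function
name is made up for this sketch, and LC_MESSAGES is a POSIX category, so
this assumes a POSIX-like libc.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical check: returns 1 if the current LC_MESSAGES locale is
 * something other than "C"/"POSIX", i.e. the situation in which GNU
 * gettext is documented to honour the LANGUAGE variable. */
int language_override_possible(void)
{
    const char *cur = setlocale(LC_MESSAGES, NULL); /* query, no change */

    if (cur == NULL)
        return 0;
    return strcmp(cur, "C") != 0 && strcmp(cur, "POSIX") != 0;
}
```

If this returns 0 after setlocale(LC_ALL, ""), then LANGUAGE=zam alone
will not select the zam catalog, which matches what I am seeing.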

There are quite a few design bugs here, none of which would cause
huge problems all by itself. Together, they are a disaster.

a. The implementation-specific "" locale is "C". (it need not be)
b. The "C" locale is not UTF-8. (this need not be the case)
c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)
d. The "C" locale ignores LC_MESSAGES, even if LC_MESSAGES is not "C".
e. The locale reverts to "C" if some portion is missing/unknown.
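
Item c can be probed with something like the following sketch. It
switches LC_CTYPE to a given locale, asks iswprint() about one
character, and restores the previous setting. The name printable_under
is hypothetical, and results depend on the libc and on which locales are
installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical probe: returns 1 or 0 for iswprint(ch) under the given
 * LC_CTYPE locale, or -1 if that locale cannot be selected (or memory
 * runs out). The previous LC_CTYPE setting is restored before return. */
int printable_under(const char *ctype_locale, wint_t ch)
{
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved = cur ? malloc(strlen(cur) + 1) : NULL;
    int result = -1;

    if (saved == NULL)
        return -1;
    strcpy(saved, cur);  /* copy: later setlocale calls may reuse the buffer */

    if (setlocale(LC_CTYPE, ctype_locale) != NULL)
        result = iswprint(ch) ? 1 : 0;

    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

printable_under("C", 0xf7) is the case item c is about; comparing it
with a UTF-8 locale that happens to be installed shows the difference.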

The result is that none of these work:

a. setlocale(LC_ALL,"zam");
b. setlocale(LC_MESSAGES,"zam");
c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");

All should do the job, using any info that is available and picking
generic modern choices for the rest.
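
What I mean by "using any info that is available" is roughly the
fallback below, done by hand. It tries LC_ALL first and, failing that,
sets LC_MESSAGES and LC_CTYPE independently, walking a caller-supplied
list of LC_CTYPE candidates. The function name, the flag names and the
candidate list are all assumptions for this sketch. Note that this does
not by itself make gettext find a catalog; it only shows the behaviour I
would expect setlocale() to approximate.

```c
#include <locale.h>
#include <stddef.h>

enum { GOT_ALL = 1, GOT_MESSAGES = 2, GOT_CTYPE = 4 };

/* Hypothetical best-effort setup. Returns a bitmask of what succeeded.
 * ctype_candidates is a NULL-terminated list of locale names to try for
 * LC_CTYPE, in order of preference. */
int best_effort_locale(const char *name, const char *const *ctype_candidates)
{
    int got = 0;

    /* Whole locale known to libc: nothing more to do. */
    if (setlocale(LC_ALL, name) != NULL)
        return GOT_ALL | GOT_MESSAGES | GOT_CTYPE;

    /* Otherwise take whatever pieces can be had. A failed setlocale()
     * leaves the current locale unchanged. */
    if (setlocale(LC_MESSAGES, name) != NULL)
        got |= GOT_MESSAGES;

    for (size_t i = 0; ctype_candidates[i] != NULL; i++) {
        if (setlocale(LC_CTYPE, ctype_candidates[i]) != NULL) {
            got |= GOT_CTYPE;
            break;
        }
    }
    return got;
}
```

A caller might pass candidates such as "en_US.UTF-8" followed by "C",
though which names exist on a given machine is exactly the problem.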

There just doesn't seem to be any reasonable way to kick gettext into
UTF-8 mode and feed it a *.mo file. This should be more than easy; it
should be what you tend to end up with when things aren't consistent.

I could see "C" being Latin-1 by default instead of UTF-8 (though wide
character functions should still support full Unicode), and I could see
having message lookup disabled if **nothing** non-C is enabled. Once I
call bind_textdomain_codeset("tuxpaint","UTF-8") or setlocale(foo,"zam")
though, it should be obvious what gettext needs to do.
