guile-devel

Re: SRFI-14 and locale settings


From: Ludovic Courtès
Subject: Re: SRFI-14 and locale settings
Date: Thu, 14 Sep 2006 15:22:48 +0200
User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Hi,

Kevin Ryde <address@hidden> writes:

> address@hidden (Ludovic Courtès) writes:
>>
>> An example to illustrate what
>> I was trying to say: Both French and Castellano can be written using
>> Latin-1; however, letter `ñ' (`n' with tilde) is not a French letter
>> (thus, `isalpha ()' would return false with a Latin-1 `fr_FR' locale)
>
> In glibc fr_FR and es_ES have the same isalpha for all chars 0 to 255,
> it appears to be a property of the charset, not the language or
> location.

Indeed: I ran the same test yesterday evening and found the same thing.
So my whole theory seems to be falling apart!  ;-)
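
A minimal sketch of that kind of check, in C: it records the result of
`isalpha ()' for codes 0 to 255 under two locales and counts the codes
on which they disagree.  The function names are made up for this
illustration; they do not come from glibc or Guile.

```c
#include <ctype.h>
#include <locale.h>

/* Fill TABLE[0..255] with 1 or 0 according to isalpha () under
   LOCALE_NAME.  Return 0 on success, -1 if the locale cannot be
   selected (e.g., because it is not installed).  Note that this
   changes the process-wide LC_CTYPE setting.  */
int
alpha_table (const char *locale_name, unsigned char table[256])
{
  int c;

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return -1;

  for (c = 0; c < 256; c++)
    table[c] = isalpha (c) ? 1 : 0;

  return 0;
}

/* Return the number of codes in 0..255 for which isalpha () differs
   between LOCALE_A and LOCALE_B, or -1 if either locale is
   unavailable.  LC_CTYPE is reset to "C" before returning.  */
int
count_isalpha_differences (const char *locale_a, const char *locale_b)
{
  unsigned char a[256], b[256];
  int status_a, status_b, c, differences = 0;

  status_a = alpha_table (locale_a, a);
  status_b = alpha_table (locale_b, b);
  setlocale (LC_CTYPE, "C");

  if (status_a != 0 || status_b != 0)
    return -1;

  for (c = 0; c < 256; c++)
    if (a[c] != b[c])
      differences++;

  return differences;
}
```

For instance, `count_isalpha_differences ("fr_FR.ISO-8859-1",
"es_ES.ISO-8859-1")' would be expected to return 0 on a glibc system if
the observation above holds.  Those locale names are an assumption; the
names actually available depend on the system (see `locale -a').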

I did some research to try to understand whether this is a
glibc-specific behavior, or whether this is made mandatory by some
standard.  Since I am not very knowledgeable about all these issues, I
made a whole lot of discoveries.


SUSv2 [0] explains that the `LC_CTYPE' category defines various
character classes (Section 7.3.1), notably the `alpha' class, that are
dependent on the "locale", without specifying whether they are dependent
specifically on the language.

On Debian GNU/Linux, the glibc-provided locale definition files are
available under `/usr/share/i18n/locales'.  Both the `fr_FR' and `es_ES'
files contain a line, in the `LC_CTYPE' section, that reads:

  copy "i18n"

Actually, running the following command shows that a large number of
locales (those for western languages) contain this line:

  $ grep -A1 '^LC_CTYPE' /usr/share/i18n/locales/*_*

This "i18n" file contains a character classification definition
(`LC_CTYPE' section) whose contents are defined in ISO 14652 [1] as part
of a "generic" FDCC-set (Set of Formal Definitions of Cultural
Conventions).  The introduction to Section 4 of ISO 14652 reads:

  This Technical Report also defines an FDCC-set named "i18n" with
  values for some of the above categories in order to simplify FDCC-set
  descriptions for a number of cultures.  The contents of "i18n"
  categories should not necessarily be considered as the most commonly
  accepted values, while in many cases it could be the recommended
  values.

The "i18n" character classification (listed in Section 4.3.2) is
actually very broad: it considers at least all Latin, Greek and Cyrillic
letters as part of the `alpha' character class.

My understanding (take it with a grain of salt...) of the above
quotation is that including "i18n" in various locales can be thought of
as a good way to get things "roughly working" first; however, actual
locale definitions could be refined to reflect more "commonly accepted
values".  So, for instance, one could refine the `LC_CTYPE' section of
glibc's `fr_FR' locale definition to make sure it only includes French
letters.


To summarize, using `isalpha ()' to determine the contents of
`char-set:letter' will probably yield correct results on most platforms,
at least on current glibc-based systems.  However, it seems to be
"theoretically" incorrect, in that a locale is allowed to define
character classes that depend on the language.

Therefore, explicitly listing all Latin-1 letters in `srfi-14.c' as Neil
suggested might be the best way.
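
A minimal sketch of what such an explicit, locale-independent listing
could look like.  This is not the actual `srfi-14.c' code: the function
names are hypothetical, and the set of letters is my reading of the
Latin-1 definition of `char-set:letter' in SRFI-14 (the ASCII letters,
the two ordinal indicators, the micro sign, and the accented letters,
excluding the multiplication and division signs).

```c
/* Return 1 if C, interpreted as an ISO-8859-1 code, is a letter,
   0 otherwise.  The result does not depend on the current locale.  */
int
latin1_is_letter (unsigned char c)
{
  return (c >= 0x41 && c <= 0x5A)      /* A-Z */
    || (c >= 0x61 && c <= 0x7A)        /* a-z */
    || c == 0xAA                       /* feminine ordinal indicator */
    || c == 0xB5                       /* micro sign */
    || c == 0xBA                       /* masculine ordinal indicator */
    || (c >= 0xC0 && c <= 0xD6)        /* A-grave ... O-diaeresis */
    || (c >= 0xD8 && c <= 0xF6)        /* O-slash ... o-diaeresis */
    || (c >= 0xF8);                    /* o-slash ... y-diaeresis */
}

/* Return the number of Latin-1 codes classified as letters.  */
int
latin1_letter_count (void)
{
  int c, count = 0;

  for (c = 0; c < 256; c++)
    if (latin1_is_letter ((unsigned char) c))
      count++;

  return count;
}
```

With this table, `ñ' (code 0xF1) is a letter whatever the locale, while
`×' (0xD7) and `÷' (0xF7) are not.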

Thanks,
Ludovic.


[0] http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
[1] http://www.open-std.org/jtc1/sc22/wg20/docs/projects#14652



