Re: multibyte locale patches for GNU utils available

bug-gnu-utils

From:	Pablo Saratxaga
Subject:	Re: multibyte locale patches for GNU utils available
Date:	Tue, 8 May 2001 21:48:11 +0200

Kaixo!

On Tue, May 08, 2001 at 11:05:58AM -0700, Ulrich Drepper wrote:

> There is no problem.  Even the current behavior is 100% correct.  If
> somebody wants to the C locale meaning of range expressions than make
> damned sure you use the C locale.  The arrogance of Unix users
> declaring their way of sorting to be the only true one is unbearable.
> Everybody except those used to the pre-i18n era Unix tools expects
> collation the way it is done now.  Get used to it.

I have nothing against collation; on the contrary, I think it is correct,
and I think strcoll() does the right thing.

My point is not about collation, but about the fact that an interval
[a-z] used in shell globbing or in a regexp includes uppercase letters.
strcoll() (at least in bash, I didn't look at others) is used internally
to decide whether a given char is part of the interval or not; but what
is used internally isn't very important, it is the external effect I'm
talking about.
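
To illustrate the external effect, here is a minimal Python sketch. It does
not call strcoll(); instead it assumes a toy dictionary-style ordering
(a < A < b < B < ... < z < Z), which only approximates what a non-'C' locale
might use, and compares it with plain code point order. The names
toy_collation_key, in_range_collated and in_range_codepoint are made up for
this illustration.

```python
import string

def toy_collation_key(ch):
    # ASSUMPTION: a dictionary-style order a < A < b < B < ... < z < Z.
    # Real locale tables are more elaborate; this only mimics the effect.
    return (ch.lower(), ch.isupper())

def in_range_collated(ch, lo, hi):
    # Membership decided by comparing collation keys, roughly what a
    # strcoll()-based implementation would do.
    return toy_collation_key(lo) <= toy_collation_key(ch) <= toy_collation_key(hi)

def in_range_codepoint(ch, lo, hi):
    # Membership decided by numeric character value, as in the 'C' locale.
    return lo <= ch <= hi

uppercase_in_az_collated = [c for c in string.ascii_uppercase
                            if in_range_collated(c, "a", "z")]
uppercase_in_az_codepoint = [c for c in string.ascii_uppercase
                             if in_range_codepoint(c, "a", "z")]

print("collated :", "".join(uppercase_in_az_collated))
print("codepoint:", "".join(uppercase_in_az_codepoint))
```

Under the toy ordering every uppercase letter except 'Z' falls inside [a-z],
while code point order admits none of them.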

I think the big difference is that you consider the [a-z] notation as being
linked to normal collation rules. I don't; I consider it not a normal
collation range but an interval limited to the lowercase letters between
a and z. The influence of the locale is on which letters are inside that
interval; for example, I expect a French locale to match 'ç' (lowercase
c cedilla) with the above '[a-z]' interval.

I know about [[:lower:]], but it is not the same.
[[:lower:]] will (at least I expect it to) match lowercase Cyrillic,
while [a-z] won't.
Then there are intervals that exclude some letters, e.g. [b-df-hj-z].
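
The interpretation described above can be sketched in Python. This is only
an approximation: instead of consulting a locale's alphabet, it assumes that
the first character of the Unicode NFD decomposition is the base letter, and
all helper names are hypothetical.

```python
import unicodedata

def base_letter(ch):
    # ASSUMPTION: the first code point of the NFD decomposition stands in
    # for the "base" letter (c cedilla -> c); a real implementation would
    # ask the locale which letters belong to its alphabet.
    return unicodedata.normalize("NFD", ch)[0]

def in_lowercase_interval(ch, lo, hi):
    # True only for a single lowercase character whose base letter
    # lies between lo and hi.
    return ch.islower() and lo <= base_letter(ch) <= hi

def in_any_interval(ch, intervals):
    # Handles expressions such as [b-df-hj-z], given as (lo, hi) pairs.
    return any(in_lowercase_interval(ch, lo, hi) for lo, hi in intervals)

consonant_ranges = [("b", "d"), ("f", "h"), ("j", "z")]  # [b-df-hj-z]

# Map each sample character to (matches [a-z], matches [b-df-hj-z]).
results = {
    ch: (in_lowercase_interval(ch, "a", "z"),
         in_any_interval(ch, consonant_ranges))
    for ch in ["c", "ç", "C", "я", "e", "g"]
}
```

With these definitions 'ç' is accepted by [a-z], while 'C' and the Cyrillic
'я' are rejected, and 'e' is rejected by [b-df-hj-z].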

And I think having case sensitivity in shell globbing is important
because the filesystem itself is case sensitive.

On DOS or MS-NT I wouldn't mind the shell globbing being case insensitive,
as the filesystem itself is case insensitive.
But on a case-sensitive filesystem, how are you supposed to easily
select files? For example, you have the following files:

abc efg Abc ABC txt readme rz

and you want to delete all the files starting with a lowercase letter,
except those starting with 'r'.

rm [a-qs-z]*

will do in the 'C' locale; but in any other locale it will also delete 'Abc'
and 'ABC'.
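
The 'C' locale result can be reproduced with Python's fnmatch.fnmatchcase(),
which compares characters by code point and ignores the locale, so it stands
in here for 'C'-locale globbing rather than for any particular shell. The
sketch also builds a pattern that lists every wanted letter explicitly
instead of using ranges, so that no range ordering has to be interpreted.
The variable names are illustrative.

```python
import fnmatch
import string

files = ["abc", "efg", "Abc", "ABC", "txt", "readme", "rz"]

# Range pattern from the example; fnmatchcase() uses code point order,
# which mirrors what the 'C' locale does.
range_pattern = "[a-qs-z]*"
by_range = [f for f in files if fnmatch.fnmatchcase(f, range_pattern)]

# Same selection with every lowercase letter except 'r' listed explicitly,
# so the bracket expression contains no range at all.
wanted = "".join(c for c in string.ascii_lowercase if c != "r")
explicit_pattern = "[" + wanted + "]*"
by_list = [f for f in files if fnmatch.fnmatchcase(f, explicit_pattern)]

print(by_range)  # ['abc', 'efg', 'txt']
print(by_list)   # ['abc', 'efg', 'txt']
```

Both patterns leave 'Abc', 'ABC', 'readme' and 'rz' untouched.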

The problem is that this behaviour appeared all of a sudden, and started
being noticed because of the problems it caused, like loss of files due to
globbing starting to match files the user did not intend.

Changing to the 'C' locale before doing any shell manipulation is quite
cumbersome (and has bad effects too, like improper sorting of the
output of 'ls'), and it was surely not the goal of the people who
implemented i18n support in bash (otherwise they would have kept ASCII
only for globbing).

I think that treating globbing as case sensitive will solve
the problem.
Maybe I'm wrong, and I haven't seen some of the problems involved, but
then please point me to where I'm wrong; don't tell me I don't understand
collating, because I do understand it; I simply don't think normal collating
should apply here. It should apply to the output of 'sort', to the sorting
of the output of 'ls', etc., but not to globbing.
Maybe using case sensitivity will break something or some standard (but
then in the 'C' locale it is case sensitive; well, not exactly, in 'C' there
isn't the notion of letter at all, only bytes and numeric values, but the
effect the user sees is case-sensitive globbing). Then my question is what
exactly would be broken that way, and whether it is more important than
trying to implement behaviour coherent with past common usage.
And, if those two views (globbing as case sensitive or as case
insensitive) really are irreconcilable, then why not provide a way
to let the user choose how he wants it?

Thanks.

PS: the sample case-sensitive strcoll() I posted has a serious bug: it
only works with 8-bit locales (I wrote it when GNU libc only properly
supported 8-bit locales, but I know that is not an excuse); I'm ashamed of it.
I didn't intend to say that it is what must be used, nor that it is right
(it was buggy); I meant it as an example, as sometimes an example can help
understand something better than a long description.
Indeed I agree with you that interpretation of [x-y] intervals should
be done through glob(), but the question remains: is it case sensitive or not?

> 
> -- 
> ---------------.                          ,-.   1325 Chesapeake Terrace
> Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
> Red Hat          `--' drepper at redhat.com   `------------------------

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975
