grep, UTF-8, character classes
From: Lebens-Lust
Subject: grep, UTF-8, character classes
Date: Sun, 8 Jan 2006 20:26:32 +0100
User-agent: KMail/1.7.2
Hello,
I'm using grep (GNU grep) 2.5.1.
Two problems occurred concerning character classes and UTF-8,
and I could not find a solution for either.
Subject 1: [[:alnum:]] = [0-9A-Za-z]; \w = synonym for [[:alnum:]] ...
Subject 2: Problem with range expressions like [A-Z] or [^a-z] and UTF-8
The strange thing about Subject 2 is: When you look at the UTF-8 examples
listed for Subject 1, you can see [0-9A-Za-z] worked fine. But [A-Z] and
[^a-z] in the examples for Subject 2 did not work ...
Is there perhaps a rule behind this, something that would help me
see things clearly again? Or is it a bug that needs to be fixed?
Subjects and examples are listed below.
Thank you,
Conny
-----------------------------------------------------------------------------
SUBJECT 1:
----------
According to "man grep", [[:alnum:]] = [0-9A-Za-z], \w is a synonym for
[[:alnum:]], and \W is a synonym for [^[:alnum:]].
My tests under Debian stable and SuSE 9.2 showed that this is
- TRUE when the locale (LC_CTYPE, LANG) is ISO-8859-x or POSIX.
- FALSE when the locale is UTF-8. Then
- \w = wrong output
- [[:alnum:]] = correct result
- [0-9A-Za-z] = correct result
- and the same when negated.
==> \w != [[:alnum:]] and \W != [^[:alnum:]]
Examples:
---------
bla="a ä á 1 - §"
- LANG and LC_CTYPE=de_DE.utf8 (also tested: en_US.UTF-8)
============================
for i in $bla ; do egrep '\w' <<<$i ; done
--> a 1
for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
--> a ä á 1
for i in $bla ; do egrep '\W' <<<$i ; done
--> ä á - §
for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
--> - §
==> [[:alnum:]] = [0-9A-Za-z] != \w
[^[:alnum:]] = [^0-9A-Za-z] != \W
---------------------------------
\w and \W = WRONG RESULT!!
- address@hidden (also tested: de_DE, en_US)
===================
for i in $bla ; do egrep '\w' <<<$i ; done
# OR: for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
--> a ä á 1
for i in $bla ; do egrep '\W' <<<$i ; done
# OR: for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
--> - §
==> CORRECT: [[:alnum:]] = [0-9A-Za-z] = \w
[^[:alnum:]] = [^0-9A-Za-z] = \W
-----------------------------------------
- LC_CTYPE=POSIX (= all LC_-Settings and LANG unset)
==============
for i in $bla ; do egrep '\w' <<<$i ; done
# OR: for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
--> a 1
for i in $bla ; do egrep '\W' <<<$i ; done
# OR: for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
# OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
--> ä á - §
==> CORRECT: [[:alnum:]] = [0-9A-Za-z] = \w
[^[:alnum:]] = [^0-9A-Za-z] = \W
-----------------------------------------
As expected, the umlauts and accented chars are not matched as
alphanumeric here, since the POSIX locale recognises only ASCII
letters and digits.
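
The comparisons above can be rerun with a small helper instead of retyping
each loop. This is only a sketch, assuming a POSIX shell and a grep that
supports -E, -q and -e; the names matching_words and same_matches are
hypothetical and not part of grep. Check "locale -a" first: if the named
locale is not installed, grep quietly runs in the C locale instead. Current
GNU grep documentation also describes \w as matching the underscore, so the
test words should not contain one when comparing \w with [[:alnum:]].

```shell
#!/bin/sh
# matching_words LOCALE PATTERN WORD...
# Print, space-separated on one line, every WORD that contains a match
# for the extended regular expression PATTERN when grep runs in LOCALE.
# (matching_words is a hypothetical helper, not part of grep.)
matching_words() {
    mw_locale=$1
    mw_pattern=$2
    shift 2
    mw_out=
    for mw_word in "$@"; do
        # -q: report only through the exit status
        # -e: keeps a pattern that starts with '-' from being read as an option
        if printf '%s\n' "$mw_word" |
            LC_ALL=$mw_locale grep -E -q -e "$mw_pattern"; then
            mw_out="${mw_out:+$mw_out }$mw_word"
        fi
    done
    printf '%s\n' "$mw_out"
}

# same_matches LOCALE PATTERN1 PATTERN2 WORD...
# Succeed if both patterns select exactly the same words in LOCALE.
# (same_matches is a hypothetical helper as well.)
same_matches() {
    sm_locale=$1
    sm_p1=$2
    sm_p2=$3
    shift 3
    [ "$(matching_words "$sm_locale" "$sm_p1" "$@")" = \
      "$(matching_words "$sm_locale" "$sm_p2" "$@")" ]
}
```

For the report above one would call, for example,
  same_matches de_DE.utf8 '\w' '[[:alnum:]]' a ä á 1 - § || echo "differs"
and repeat the call with each locale of interest.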
-----------------------------------------------------------------------------
SUBJECT 2:
----------
Range expressions like [A-Z] or [^a-z] work
- CORRECTLY when the locale (LC_CTYPE, LANG, LC_COLLATE) is ISO-8859-1(5),
- INCORRECTLY when the locale is UTF-8.
POSIX character classes like [[:upper:]] or [^[:lower:]] work fine with
both (ISO-8859-x and UTF-8).
Examples:
---------
var="A O Ö Ó a o ö ó"
- LANG and LC_CTYPE=de_DE.utf8
============================
for i in $var ; do egrep '[A-Z]' <<<"$i" ; done
--> A O Ö Ó o ö ó
*** WRONG ***
for i in $var ; do egrep '[^a-z]' <<<"$i" ; done
--> NO RESULT
*** WRONG ***
for i in $var ; do egrep '[[:upper:]]' <<<"$i" ; done
# for i in $var ; do egrep '[^[:lower:]]' <<<"$i" ; done
--> A O Ö Ó
*** CORRECT ***
==> As you can see, it doesn't make much sense to set LC_COLLATE to C or
POSIX: I would then lose umlauts and accented chars.
- address@hidden
===================
for i in $var ; do egrep '[A-Z]' <<<"$i" ; done
# for i in $var ; do egrep '[^a-z]' <<<"$i" ; done
# for i in $var ; do egrep '[[:upper:]]' <<<"$i" ; done
# for i in $var ; do egrep '[^[:lower:]]' <<<"$i" ; done
--> A O Ö Ó
*** CORRECT ***
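
The UTF-8 results for [A-Z] (everything matches except "a") are consistent
with the range being evaluated in the locale's collation order, which in
many locales runs a A b B ... z Z, so that [A-Z] covers every letter except
"a" and [^a-z] leaves almost nothing. That is a likely explanation, not a
confirmed diagnosis. A quick way to see which behaviour a given
locale/grep combination has is to test whether a lowercase ASCII letter
matches [A-Z]. The sketch below assumes a POSIX shell and grep with -E, -q
and -e; range_is_ascii_upper and report_range are hypothetical names.

```shell
#!/bin/sh
# range_is_ascii_upper LOCALE
# Succeed if, in LOCALE, the range [A-Z] rejects the lowercase letter "o",
# i.e. the range is not interpreted in a mixed-case collation order.
# (range_is_ascii_upper is a hypothetical helper name.)
range_is_ascii_upper() {
    if printf 'o\n' | LC_ALL=$1 grep -E -q -e '[A-Z]'; then
        return 1    # lowercase "o" matched: the range follows collation order
    fi
    return 0
}

# report_range LOCALE
# Print a one-line summary of the check for LOCALE.
report_range() {
    if range_is_ascii_upper "$1"; then
        printf '%s: [A-Z] rejects lowercase o\n' "$1"
    else
        printf '%s: [A-Z] matches lowercase o\n' "$1"
    fi
}
```

Running "report_range de_DE.utf8" next to "report_range C" shows at a glance
whether a range is safe in that setup. Where it is not, [[:upper:]] and
[[:lower:]] remain the portable choice, as the examples above show.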
-----------------------------------------------------------------------------