grep, UTF-8, character classes

bug-gnu-utils

From:	Lebens-Lust
Subject:	grep, UTF-8, character classes
Date:	Sun, 8 Jan 2006 20:26:32 +0100
User-agent:	KMail/1.7.2

Hello,

I'm using grep (GNU grep) 2.5.1.
Two problems occurred concerning character classes and UTF-8.
I didn't find any solution.

Subject 1: [[:alnum:]] = [0-9A-Za-z]; \w = synonym for [[:alnum:]] ...
Subject 2: Problem with range expressions like [A-Z] or [^a-z] and UTF-8

The strange thing about Subject 2 is: When you look at the UTF-8 examples
listed for Subject 1, you can see [0-9A-Za-z] worked fine. But [A-Z] and
[^a-z] in the examples for Subject 2 did not work ...

Is there a rule behind this behaviour that would explain it?
Or is it a bug that needs to be fixed?

Subjects and examples are listed below.

Thank you,

Conny


-----------------------------------------------------------------------------
SUBJECT 1:
----------

According to "man grep", [[:alnum:]] = [0-9A-Za-z], \w is a synonym for
[[:alnum:]], and \W is a synonym for [^[:alnum:]].

My tests under Debian stable and SuSE 9.2 showed that this is
- TRUE  when the locale (LC_CTYPE, LANG) is ISO-8859-x or POSIX.
- FALSE when the locale is UTF-8. Then
        - \w          = wrong output
        - [[:alnum:]] = correct result
        - [0-9A-Za-z] = correct result
        - and the same when negated.

==> \w != [[:alnum:]] and \W != [^[:alnum:]]
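
Since the results depend on which locale is in effect for LC_CTYPE, it can help to print that before each test run. The following is a minimal sketch, assuming a POSIX-style shell that supports `local` (bash, dash); `effective_locale` is a hypothetical helper name, not part of grep. It applies the standard precedence LC_ALL, then the category variable, then LANG, and treats empty variables as unset.

```shell
# Hypothetical helper: name the locale setting that applies to one
# category (e.g. LC_CTYPE or LC_COLLATE).
# Precedence: LC_ALL > LC_<category> > LANG > implementation default.
effective_locale() {
    local name=$1 value
    # Read the variable whose name is stored in $name (e.g. $LC_CTYPE).
    eval "value=\${$name:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' 'implementation default'
    fi
}

effective_locale LC_CTYPE
effective_locale LC_COLLATE
```

The helper only inspects environment variables; it does not check whether the named locale is actually installed on the system.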


Examples:
---------
  
  bla="a ä á 1 - §"

  - LANG and LC_CTYPE=de_DE.utf8 (also tested: en_US.UTF-8)
    ============================
      
      for i in $bla ; do egrep '\w' <<<$i ; done
      --> a 1
      
      for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
      --> a ä á 1
      
      for i in $bla ; do egrep '\W' <<<$i ; done
      --> ä á - §
      
      for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
      --> - §
      
      ==> [[:alnum:]] = [0-9A-Za-z] != \w
          [^[:alnum:]] = [^0-9A-Za-z] != \W
          ---------------------------------
          \w and \W = WRONG RESULT!!
  
  
  - address@hidden (also tested: de_DE, en_US)
    ===================
      
      for i in $bla ; do egrep '\w' <<<$i ; done
      # OR: for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
      --> a ä á 1
      
      for i in $bla ; do egrep '\W' <<<$i ; done
      # OR: for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
      --> - §
      
      ==> CORRECT: [[:alnum:]] = [0-9A-Za-z] = \w
                   [^[:alnum:]] = [^0-9A-Za-z] = \W
          -----------------------------------------
      
  
  - LC_CTYPE=POSIX (= all LC_-Settings and LANG unset)
    ==============
    
      for i in $bla ; do egrep '\w' <<<$i ; done
      # OR: for i in $bla ; do egrep '[[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[0-9A-Za-z]' <<<$i ; done
      --> a 1
      
      for i in $bla ; do egrep '\W' <<<$i ; done
      # OR: for i in $bla ; do egrep '[^[:alnum:]]' <<<$i ; done
      # OR: for i in $bla ; do egrep '[^0-9A-Za-z]' <<<$i ; done
      --> ä á - §
      
      ==> CORRECT: [[:alnum:]] = [0-9A-Za-z] = \w
                   [^[:alnum:]] = [^0-9A-Za-z] = \W
          -----------------------------------------
          As expected, the umlauts and accented chars were not matched,
          since the POSIX locale covers only ASCII, which contains none
          of them.
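
The comparisons in Subject 1 can be scripted so that each pattern is run under an explicitly chosen locale. This is a minimal sketch, assuming GNU grep and a shell that supports `local`; `matches` and `same_matches` are hypothetical helper names, and any locale other than C or POSIX passed as the first argument must be installed on the system.

```shell
# Hypothetical helper: print the tokens (arguments 3 and up) that match
# pattern $2 when grep runs under locale $1.
# LC_ALL is set for the grep command only, so the caller's locale is kept.
matches() {
    local loc=$1 pat=$2 out= tok
    shift 2
    for tok in "$@"; do
        # -E: extended regular expressions (as egrep); -q: exit status only
        if printf '%s\n' "$tok" | LC_ALL=$loc grep -Eq -- "$pat"; then
            out="$out${out:+ }$tok"
        fi
    done
    printf '%s\n' "$out"
}

# Succeeds when two patterns select the same tokens under locale $1.
same_matches() {
    local loc=$1 a=$2 b=$3
    shift 3
    [ "$(matches "$loc" "$a" "$@")" = "$(matches "$loc" "$b" "$@")" ]
}

matches C '\w' a ä á 1 - §             # prints: a 1
matches C '[[:alnum:]]' a ä á 1 - §    # prints: a 1
```

To repeat the UTF-8 test, replace C with a locale name such as de_DE.utf8; `same_matches de_DE.utf8 '\w' '[[:alnum:]]' a ä á 1 - §` then reports through its exit status whether the two patterns agree on that system.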

-----------------------------------------------------------------------------
SUBJECT 2:
----------

Range expressions like [A-Z] or [^a-z] work
- CORRECTLY   when the locale (LC_CTYPE, LANG, LC_COLLATE) is ISO-8859-1(5),
- INCORRECTLY when the locale is UTF-8.

POSIX character classes like [[:upper:]] or [^[:lower:]] work fine with
both (ISO-8859-x and UTF-8).

Examples:
---------
    
  var="A O Ö Ó a o ö ó"

  - LANG and LC_CTYPE=de_DE.utf8
    ============================
      
      for i in $var ; do egrep '[A-Z]' <<<"$i" ; done
      --> A O Ö Ó o ö ó
      *** WRONG ***
      
      for i in $var ; do egrep '[^a-z]' <<<"$i" ; done
      --> NO RESULT
      *** WRONG ***
      
      for i in $var ; do egrep '[[:upper:]]' <<<"$i" ; done
      # OR: for i in $var ; do egrep '[^[:lower:]]' <<<"$i" ; done
      --> A O Ö Ó
      *** CORRECT ***
      
      ==> You see, it doesn't make much sense to set LC_COLLATE to C or
          POSIX: then I'd lose umlauts and accented chars.

  - address@hidden
    ===================
       
      for i in $var ; do egrep '[A-Z]' <<<"$i" ; done
      # OR: for i in $var ; do egrep '[^a-z]' <<<"$i" ; done
      # OR: for i in $var ; do egrep '[[:upper:]]' <<<"$i" ; done
      # OR: for i in $var ; do egrep '[^[:lower:]]' <<<"$i" ; done
      --> A O Ö Ó
      *** CORRECT ***
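
The difference between a range expression and a POSIX class in Subject 2 can be shown side by side, one token per line. This is a minimal sketch, assuming GNU grep and a shell that supports `local`; `range_vs_class` is a hypothetical helper name, and the locale given as the first argument must be installed.

```shell
# Hypothetical helper: for each token, report whether it matches the range
# expression ($2) and whether it matches the POSIX class ($3) when grep
# runs under locale $1.
range_vs_class() {
    local loc=$1 range=$2 class=$3 tok r c
    shift 3
    for tok in "$@"; do
        r=no c=no
        printf '%s\n' "$tok" | LC_ALL=$loc grep -Eq -- "$range" && r=yes
        printf '%s\n' "$tok" | LC_ALL=$loc grep -Eq -- "$class" && c=yes
        printf '%s range=%s class=%s\n' "$tok" "$r" "$c"
    done
}

# In the C locale only the ASCII capitals A and O match either pattern.
range_vs_class C '[A-Z]' '[[:upper:]]' A O Ö Ó a o ö ó
```

Running the same call with a UTF-8 locale instead of C shows where the two columns diverge on a given system. As far as I know, POSIX defines range expressions only for the POSIX locale, so the range column may legitimately vary between locales and grep versions, whereas the class column follows LC_CTYPE.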

-----------------------------------------------------------------------------
