i18n: special letter(s?) cause regular expression error in match() and w

bug-gnu-utils

From:	R. Bijlsma
Subject:	i18n: special letter(s?) cause regular expression error in match() and wrong length()
Date:	Sat, 03 Jan 2009 18:44:27 +0100
User-agent:	Icedove 1.5.0.14eol (X11/20080724)


Dear gawk writers,

The attached script bug_special_letters.awk demonstrates and analyzes a bug in gawk 3.1.6, namely that the special letter é (='e) is treated badly by match() and by length().


System: The precompiled package of Ubuntu 8.10, Intrepid.

The script is its own test input and also contains all explanations. Here is a summary:


In match(), the /./ expression does not match the special letter é (='e),
while it is matched by ~.

Furthermore, match() may match a line containing an é, but set RLENGTH to -2. length() counts up to the first occurrence of é and ignores the rest of the line.

It seems that the problem is related to the chosen language and character encoding.

I have LANG=en_US.UTF-8.

  Best regards,
                            Rita Bylsma

#
# NAME
#    bug_special_letters.awk :: Demonstrates and analyzes a bug in gawk 3.1.6 (the precompiled package of Ubuntu 8.10, Intrepid),
#                               namely that the special letter é (='e) is treated badly by match() and by length().
#
# SYNOPSIS     
#    grep '^# ' bug_special_letters.awk                        To read the proof as found on my system.
#    gawk -f bug_special_letters.awk bug_special_letters.awk   To test on your own system.
#
#
# DESCRIPTION
#
#  In match(), the /./ expression does not match the special letter é (='e), while it is matched by ~.
#  Furthermore, match() may match a line containing an é, but set RLENGTH to -2.
#  length() counts up to the first occurrence of é and ignores the rest of the line.
#
#  Other special letters were not tested.
#
#  It seems that the problem is related to the chosen language and character encoding.
#  I have LANG=en_US.UTF-8.
#
#  Below is the output of my test, as found on my system. There are color codes in it to
#  see immediately where the errors are, if read on the console, for example with grep.
#
#  Note that some errors can generally be detected by this script, but in some cases
#  gawk has no way of detecting that it is doing something wrong.
#  For an example of such cases the color codes are hardcoded.
#  
#
# ________________________________________________________________________
# 'e'
# Length: 1
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# ~  /.+/                 with both ~ and match(), RSTART=1, RLENGTH=1
# ________________________________________________________________________
# 'é'
# [1;31mLength: 0 (should be: 1)[0m
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# [1;31m~  /.+/                 with ~
# !~ /.+/                 with match(), RSTART=0, RLENGTH=-1
# [0m________________________________________________________________________
# 'Na deze regel één lege regel. After this line one empty line.'
# [1;31mLength: 14 (should be: 61)[0m
# ~  /(.+de)(.+)/         with both ~ and match(), RSTART=1, [1;31mRLENGTH=14[0m
#                         1 (.+de): 'Na de', length = Length: 5
#                         2 (.+): '[1;31mze regel [0m', length = Length: 9
# ~  /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=4, [1;31mRLENGTH=-2[0m
#                         1 (de[^\"]+): 'deze regel één lege regel', length = [1;31mLength: 11 (should be: 25)[0m
#                         2 (.+): ' After this line one empty line.', length = Length: 32
# ~  /.+/                 with both ~ and match(), RSTART=1, [1;31mRLENGTH=14[0m
# ________________________________________________________________________
# ''
# Length: 0
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /.+/                 with both ~ and match(), RSTART=0, RLENGTH=-1
# ________________________________________________________________________
# 'Voor deze regel een lege regel. Before this line an empty line.'
# Length: 63
# ~  /(.+de)(.+)/         with both ~ and match(), RSTART=1, RLENGTH=63
#                         1 (.+de): 'Voor de', length = Length: 7
#                         2 (.+): 'ze regel een lege regel. Before this line an empty line.', length = Length: 56
# ~  /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=6, RLENGTH=58
#                         1 (de[^\"]+): 'deze regel een lege regel', length = Length: 25
#                         2 (.+): ' Before this line an empty line.', length = Length: 32
# ~  /.+/                 with both ~ and match(), RSTART=1, RLENGTH=63
#
#
#

## Test input for the program. All test lines start with '#: ' (note the space).
#: e
#: é
#: Na deze regel één lege regel. After this line one empty line.
#: 
#: Voor deze regel een lege regel. Before this line an empty line.


BEGIN {

#regex_ar[ "[^\\\"]" ]
#regex_ar[ "" ]
 regex_ar[ ".+" ]
 regex_ar[ "(.+de)(.+)" ]
 regex_ar[ "(de[^\\\"]+)\\.(.+)" ]

 afkap_example="Na deze regel één lege regel. After this line one empty line."
 afkap_regex="(.+de)(.+)"
 full_regex=".+" 
 
}

 function op_error( if_error, string, explanation ) {
 
   if ( if_error )
     return ( "\033[1;31m" string explanation "\033[0m" )
   else
     return string 
}

function getlength( string     ,len,after,replacement,lenrepl) {

   len=length(string)
      
   replacement=string
   gsub(/[^\"]/, "x", replacement)
   lenrepl=length(replacement)    

   return op_error( lenrepl != len, "Length: " len, " (should be: " lenrepl ")")

}

 /^\#: / {
  
   example=$0
   sub(/^\#: /, "", example)

   printf "________________________________________________________________________\n"
   printf "'%s'\n", example  

   print getlength(example)

   for ( regex in regex_ar )
   {  if ( example ~ regex ) 
        opmatchstr="~"
      else 
        opmatchstr="!~"

      if ( match( example, regex, matchAr )) 
        funcmatchstr="~"
      else 
        funcmatchstr="!~"

      if ( funcmatchstr == "~" && RLENGTH < 0 || ( ( regex == afkap_regex || regex == full_regex ) && example == afkap_example) )
        rlength_error=1
      else
        rlength_error=0

      if ( opmatchstr == funcmatchstr )
        printf "%-3s%-20s with both ~ and match(), RSTART=%s, %s\n",
               funcmatchstr, ("/" regex "/"), RSTART, op_error( rlength_error, "RLENGTH=" RLENGTH )
      else
      { printf "\033[1;31m"
        printf "%-3s%-20s with ~\n", opmatchstr, ("/" regex "/")
        printf "%-3s%-20s with match(), RSTART=%s, %s\n",
               funcmatchstr, ("/" regex "/"), RSTART, op_error(rlength_error, "RLENGTH=" RLENGTH)
        printf "\033[0m"
      }
      if ( funcmatchstr == "~" )
        { subar["nr"]=split(regex, subar, "(")
          for (i=2; i<=subar["nr"]; i++)
          { sub(/\).*/, "", subar[i])
            printf "%25d %s: '%s', length = %s\n", i-1, "(" subar[i] ")",
                   op_error(regex == afkap_regex && example == afkap_example && i==3, matchAr[i-1]),
                   getlength(matchAr[i-1])
          }
        }
   }
  
}
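
The symptoms described in this report can also be checked one line at a time, without the full script. The following is a minimal sketch and not part of the attached script: it assumes a POSIX shell with some awk on the PATH, and check_line is a hypothetical helper name. It prints length(), whether the line matches /.+/ with ~ and with match(), and the resulting RSTART and RLENGTH. What it prints for é depends on the awk version and on the locale (for example LANG=en_US.UTF-8 versus LC_ALL=C), so no expected output is given for that line.

```shell
#!/bin/sh
# Minimal reproduction sketch. check_line is a hypothetical helper,
# not something taken from bug_special_letters.awk.
# It feeds one line to awk and reports length(), the result of ~ /.+/,
# the result of match() with the same regex, and RSTART / RLENGTH.
check_line() {
    printf '%s\n' "$1" | awk '{
        n  = length($0)                      # character count as awk sees it
        op = ($0 ~ /.+/) ? "~" : "!~"        # match operator
        fn = match($0, /.+/) ? "~" : "!~"    # match() function, sets RSTART and RLENGTH
        printf "length=%d op=%s match=%s RSTART=%d RLENGTH=%d\n", n, op, fn, RSTART, RLENGTH
    }'
}

check_line 'e'    # plain ASCII baseline
check_line 'é'    # the letter from the report; result depends on awk version and locale
```

On an unaffected awk, the op and match columns agree for every input, and RLENGTH is never negative when match() succeeds.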
