bug-gnu-utils



From: R. Bijlsma
Subject: i18n: special letter(s?) cause regular expression error in match() and wrong length()
Date: Sat, 03 Jan 2009 18:44:27 +0100
User-agent: Icedove 1.5.0.14eol (X11/20080724)


Dear gawk writers,

The attached script bug_special_letters.awk demonstrates and analyzes a bug in gawk 3.1.6, namely that the special letter é (='e) is handled incorrectly by match() and by length().

System: The precompiled package of Ubuntu 8.10, Intrepid.

The script is its own test input and also contains all the explanations. Here is a summary:

In match(), the /./ expression does not match the special letter é (='e),
while it is matched by ~.
Furthermore, match() may match a line containing an é, but set RLENGTH to -2.
length() counts up to the first occurrence of é and ignores the rest of the line.

It seems that the problem is related to the chosen language and character encoding.
I have LANG=en_US.UTF-8.
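
The length() part of the problem can probably be checked without the attached script. The sketch below is only an illustration and is not taken from bug_special_letters.awk: the helper name char_len and the AWK variable are hypothetical, and it assumes that the en_US.UTF-8 locale is installed. It feeds the single line "é" to awk once under the C locale and once under the UTF-8 locale and prints what length() returns.

```sh
#!/bin/sh
# Hypothetical minimal check, not part of bug_special_letters.awk.
# '\303\251' is the UTF-8 byte sequence for é, written in octal so that
# the snippet does not depend on the encoding of this file.
# Set AWK=gawk to make sure that gawk is the interpreter under test.

char_len() {
    # $1 = locale to run awk under; prints length() of the line "é".
    printf '\303\251\n' | LC_ALL="$1" "${AWK:-awk}" '{ print length($0) }'
}

echo "C locale:     $(char_len C)"
echo "UTF-8 locale: $(char_len en_US.UTF-8)"
```

Under the C locale length() counts bytes, so the first line should report 2. Under the UTF-8 locale a gawk that handles multibyte characters correctly should report 1, whereas on my system the attached script reports 0 for this input.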

  Best regards,
                            Rita Bylsma
#
# NAME
#    bug_special_letters.awk :: Demonstrates and analyzes a bug in gawk 3.1.6 (the precompiled package of Ubuntu 8.10, Intrepid),
#                               namely that the special letter é (='e) is handled incorrectly by match() and by length().
#
# SYNOPSIS     
#    grep '^# ' bug_special_letters.awk                        To read the proof as found on my system.
#    gawk -f bug_special_letters.awk bug_special_letters.awk   To test on your own system.
#
#
# DESCRIPTION
#
#  In match(), the /./ expression does not match the special letter é (='e),
#  while it is matched by ~.
#  Furthermore, match() may match a line containing an é, but set RLENGTH to -2.
#  length() counts up to the first occurrence of é and ignores the rest of the line.
#
#  Other special letters were not tested.
#
#  It seems that the problem is related to the chosen language and character encoding.
#  I have LANG=en_US.UTF-8.
#
#  Below is the output of my test, as found on my system. It contains color codes, so that
#  the errors stand out immediately when it is read on the console, for example with grep.
#
#  Note that this script can detect some of the errors itself, but in other cases
#  there is no way to detect from within gawk that the result is wrong.
#  For an example of such cases the color codes are hardcoded.
#  
#
# ________________________________________________________________________
# 'e'
# Length: 1
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# ~  /.+/                 with both ~ and match(), RSTART=1, RLENGTH=1
# ________________________________________________________________________
# 'é'
# Length: 0 (should be: 1)
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# ~  /.+/                 with ~
# !~ /.+/                 with match(), RSTART=0, RLENGTH=-1
# ________________________________________________________________________
# 'Na deze regel één lege regel. After this line one empty line.'
# Length: 14 (should be: 61)
# ~  /(.+de)(.+)/         with both ~ and match(), RSTART=1, RLENGTH=14
#                         1 (.+de): 'Na de', length = Length: 5
#                         2 (.+): 'ze regel ', length = Length: 9
# ~  /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=4, RLENGTH=-2
#                         1 (de[^\"]+): 'deze regel één lege regel', length = Length: 11 (should be: 25)
#                         2 (.+): ' After this line one empty line.', length = Length: 32
# ~  /.+/                 with both ~ and match(), RSTART=1, RLENGTH=14
# ________________________________________________________________________
# ''
# Length: 0
# !~ /(.+de)(.+)/         with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /.+/                 with both ~ and match(), RSTART=0, RLENGTH=-1
# ________________________________________________________________________
# 'Voor deze regel een lege regel. Before this line an empty line.'
# Length: 63
# ~  /(.+de)(.+)/         with both ~ and match(), RSTART=1, RLENGTH=63
#                         1 (.+de): 'Voor de', length = Length: 7
#                         2 (.+): 'ze regel een lege regel. Before this line an empty line.', length = Length: 56
# ~  /(de[^\"]+)\.(.+)/   with both ~ and match(), RSTART=6, RLENGTH=58
#                         1 (de[^\"]+): 'deze regel een lege regel', length = Length: 25
#                         2 (.+): ' Before this line an empty line.', length = Length: 32
# ~  /.+/                 with both ~ and match(), RSTART=1, RLENGTH=63
#
#
#

## Test input for the program. All test lines start with '#: ' (note the space).
#: e
#: é
#: Na deze regel één lege regel. After this line one empty line.
#: 
#: Voor deze regel een lege regel. Before this line an empty line.


BEGIN {

#regex_ar[ "[^\\\"]" ]
#regex_ar[ "" ]
 regex_ar[ ".+" ]
 regex_ar[ "(.+de)(.+)" ]
 regex_ar[ "(de[^\\\"]+)\\.(.+)" ]

 afkap_example="Na deze regel één lege regel. After this line one empty line."
 afkap_regex="(.+de)(.+)"
 full_regex=".+" 
 
}

 function op_error( if_error, string, explanation ) {
 
   if ( if_error )
     return ( "\033[1;31m" string explanation "\033[0m" )
   else
     return string 
}

function getlength( string     ,len,after,replacement,lenrepl) {

   len=length(string)
      
   replacement=string
   gsub(/[^\"]/, "x", replacement)
   lenrepl=length(replacement)    

   return op_error( lenrepl != len, "Length: " len, " (should be: " lenrepl ")")

}

 /^\#: / {
  
   example=$0
   sub(/^\#: /, "", example)

   printf "________________________________________________________________________\n"
   printf "'%s'\n", example  

   print getlength(example)

   for ( regex in regex_ar )
   {  if ( example ~ regex ) 
        opmatchstr="~"
      else 
        opmatchstr="!~"

      if ( match( example, regex, matchAr )) 
        funcmatchstr="~"
      else 
        funcmatchstr="!~"

      if ( funcmatchstr == "~" && RLENGTH < 0 || ( ( regex == afkap_regex || regex == full_regex ) && example == afkap_example) )
        rlength_error=1
      else
        rlength_error=0

      if ( opmatchstr == funcmatchstr )
        printf "%-3s%-20s with both ~ and match(), RSTART=%s, %s\n",
               funcmatchstr, ("/" regex "/"), RSTART,
               op_error( rlength_error, "RLENGTH=" RLENGTH)
      else
      { printf "\033[1;31m"
        printf "%-3s%-20s with ~\n", opmatchstr, ("/" regex "/")
        printf "%-3s%-20s with match(), RSTART=%s, %s\n",
               funcmatchstr, ("/" regex "/"), RSTART,
               op_error(rlength_error, "RLENGTH=" RLENGTH)
        printf "\033[0m"
      }
      if ( funcmatchstr == "~" )
        { subar["nr"]=split(regex, subar, "(")
          for (i=2; i<=subar["nr"]; i++)
          { sub(/\).*/, "", subar[i])
            printf "%25d %s: '%s', length = %s\n",
                   i-1, "(" subar[i] ")",
                   op_error(regex == afkap_regex && example == afkap_example && i==3, matchAr[i-1] ),
                   getlength(matchAr[i-1])
          }
        }
   }
  
}


