UTF-8 locale and \n in regexps
From: Pekka Pessi
Subject: UTF-8 locale and \n in regexps
Date: Thu, 19 Apr 2007 17:09:02 +0300
Hello,
It looks like a regexp with \n inside [^] behaves badly if the locale has
a UTF-8 ctype.
Specifically, if the regexp contains \n and a negated range such as [^x\n],
as in /\n[^x\n]foo/, and the first \n ends an even-numbered line within the
string, the regexp does not match.
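To make the odd/even distinction concrete, here is a minimal sketch of the
two cases. It is only an illustration under stated assumptions:
check_newline_class is a hypothetical helper name, the AWK variable and its
fallback to plain awk are conveniences of this sketch, and the test string is
built inside awk rather than read from input. Whether the second check fails
depends on the gawk version and on the UTF-8 locale being installed.

```sh
#! /bin/sh
# Sketch only: check_newline_class is a hypothetical helper, not a gawk feature.
# Assumes gawk (or, failing that, awk) is on PATH.
AWK=${AWK:-gawk}
command -v "$AWK" >/dev/null 2>&1 || AWK=awk

check_newline_class() {
    # $1 = locale to run under; an unavailable locale typically behaves like C.
    LC_ALL=$1 "$AWK" '
    BEGIN {
        s = "line1\nline2\nline3\nline4\n"
        # Leading \n ends line 1 (odd-numbered).
        print "2:", (match(s, /\n[^2\n]*2/) ? "match" : "no match")
        # Leading \n ends line 2 (even-numbered), the case reported to fail.
        print "3:", (match(s, /\n[^3\n]*3/) ? "match" : "no match")
    }'
}

for loc in C en_US.UTF-8
do
    echo "LC_ALL=$loc"
    check_newline_class "$loc"
done
```

Both checks should print "match" in the C locale; a difference between the
two locales in the "3:" line would be consistent with the problem described
above.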
Please see the attached script for a demonstration.
--Pekka Pessi
#! /bin/sh
# Run the same gawk checks under several locales; stop at the first failure.
for LC_ALL in C UNKNOWN POSIX en_US.ISO-8859-1 en_US.UTF-8
do
    export LC_ALL
    cat <<EOF |
line1
line2
line3
line4
line5
line6
line7
line8
line9
EOF
    gawk '
    # Read the whole input as a single record.
    BEGIN { RS="\0"; }
    {
        # Tests 5, 7 and 9 use a bracket expression or pattern without \n in it.
        if (match($0, /\n[^2\n]*2/)) { got2=1; } else { print "no match 2"; }
        if (match($0, /\n[^3\n]*3/)) { got3=1; } else { print "no match 3"; }
        if (match($0, /\n[^4\n]*4/)) { got4=1; } else { print "no match 4"; }
        if (match($0, /\n[^5\t]*5/)) { got5=1; } else { print "no match 5"; }
        if (match($0, /\n[^6\n]*6/)) { got6=1; } else { print "no match 6"; }
        if (match($0, /\n[a-z]*7\n/)) { got7=1; } else { print "no match 7"; }
        if (match($0, /\n[^8\n]*8/)) { got8=1; } else { print "no match 8"; }
        if (match($0, /8.[^9\n]+9/)) { got9=1; } else { print "no match 9"; }
    }
    # Exit status is 0 only if every pattern matched.
    END { exit(!(got2 && got3 && got4 && got5 && got6 && got7 && got8 && got9)); }
    ' || {
        echo "LC_ALL=$LC_ALL FAILED"
        exit 1
    }
    echo "LC_ALL=$LC_ALL passed"
done