Re: UTF-8 locale and \n in regexps
From: Aharon Robbins
Subject: Re: UTF-8 locale and \n in regexps
Date: Tue, 24 Apr 2007 22:02:18 +0300
Greetings. Re this:
> Date: Thu, 19 Apr 2007 17:09:02 +0300
> From: Pekka Pessi <address@hidden>
> Subject: UTF-8 locale and \n in regexps
> To: address@hidden
> Cc: address@hidden
>
> Hello,
>
> It looks like a regexp with \n in [^] behaves badly if the locale has
> a UTF-8 ctype.
>
> It looks like if there is a \n and a range without \n, like /\n[^x\n]foo/,
> and the first \n ends an even-numbered line within the string, the regexp
> does not match.
>
> Please see the attached script for a demonstration.
>
> --Pekka Pessi
>
> [ test case removed ]
As I mentioned in my earlier mail, the match function should be using the
full matcher. Gawk was relying on the dfa matcher to say if there really
is a match or not, and the dfa matcher is (unfortunately) lying. With
the following workaround, gawk behaves correctly.
This will make its way to the CVS archive soon.
I will be adding your program to the test suite, if you don't mind.
Thanks,
Arnold
------------------------------------------------------
Tue Apr 24 21:55:36 2007 Arnold D. Robbins <address@hidden>
* re.c (research): In the multibyte case, fall back to the full
matcher if need_start, since there are bugs in the dfa matcher
in some obscure cases. Sigh.
===================================================================
RCS file: /d/mongo/cvsrep/gawk-stable/re.c,v
retrieving revision 1.2
diff -u -r1.2 re.c
--- re.c 6 Apr 2007 12:49:08 -0000 1.2
+++ re.c 24 Apr 2007 18:55:21 -0000
@@ -225,8 +225,15 @@
*
* The dfa matcher doesn't have a no_bol flag, so don't bother
* trying it in that case.
+ *
+ * 4/2007: Grrrr. The dfa matcher has bugs in certain multibyte
+ * cases that are just too deeply buried to ferret out. Don't
+ * let this kill us if we need_start. (This may be too narrowly
+ * focused, perhaps we should relegate the DFA matcher to the
+ * single byte case all the time. OTOH, the speed difference
+ * between the matchers is non-trivial... Sigh.)
*/
- if (rp->dfa && ! no_bol) {
+ if (rp->dfa && ! no_bol && (gawk_mb_cur_max == 1 || ! need_start)) {
char save;
int count = 0;
/*
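
For readers who want to see the effect of the new condition in isolation,
here is a minimal sketch in C. The function name should_try_dfa and its
parameters are hypothetical and do not exist in gawk; they only stand in for
rp->dfa, no_bol, gawk_mb_cur_max and need_start from the patched test in
research(), on the assumption that each can be treated as a plain value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of gawk. It mirrors the patched test
 *
 *     rp->dfa && ! no_bol && (gawk_mb_cur_max == 1 || ! need_start)
 *
 * with the inputs passed as plain arguments.
 */
bool
should_try_dfa(bool have_dfa, bool no_bol, size_t mb_cur_max, bool need_start)
{
	/* Conditions that were already there before the patch. */
	if (! have_dfa || no_bol)
		return false;

	/*
	 * New in the patch: in a multibyte locale, skip the dfa matcher
	 * whenever the caller needs the start of the match.
	 */
	return mb_cur_max == 1 || ! need_start;
}

/* Small self-check covering the case that the patch changes. */
void
check_should_try_dfa(void)
{
	assert(should_try_dfa(true, false, 1, true));    /* single byte: unchanged */
	assert(! should_try_dfa(true, false, 6, true));  /* multibyte + need_start */
	assert(should_try_dfa(true, false, 6, false));   /* multibyte, no need_start */
}
```

When the helper returns false, the caller would go straight to the full
matcher, which is what the workaround does for the multibyte need_start case.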