[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....
From: |
David Bateman |
Subject: |
[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after .... |
Date: |
Tue, 09 Sep 2008 18:10:49 +0200 |
User-agent: |
Thunderbird 2.0.0.16 (X11/20080725) |
Ben Abbott wrote:
On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <address@hidden>
wrote:
Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
but "(?<[a-z]*)" isn't. I'd hoped to replace this with
"(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
not ok either. What I'd have to do is replace it with
((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
which used the alternate operator and MALENGTH+1 copies of the
lookbehind expression to get the effect. This seems to be a ridiculous
amount of extra crap in the pattern space to get this functionality. Is
it worth supporting arbitrary length lookbehind expressions like
"(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
it worth supporting it but limits max_length, and print a warning? If so
what value should be the limit?
Frankly I wonder how mathworks got this to work as they appear to be
using the Boost regex library which also doesn't support arbitrary
length lookbehind expressions....
D.
David,
Have you tried the example in Matlab?
Using 2007b, It does *not* work for me. My 2008a/b is busy running some
simulations, so I can't try it there until later.
g='x^(-1)+y(-1)+z(-1)=0';
regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0
If I understand correctly the result should be
ans =
x^(-1)+y_minus1+z_minus1=0
Correct?
Ben
The message
http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff
seems to imply that mathworks have their own regexp engine and that
lookbehind is inefficient. I therefore don't consider it that much of an
issue to duplicate the lookbehind pattern in the pattern space and so
propose the attached changeset that replaces "(?>=[a-z]*)" with
"((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on
it. It also issues a warning about the maximum length string if the
lookbehind might be an issue. So the limitation is that "+" then
represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
expression. This limitation doesn't apply to lookaheads, etc.
D.
--
David Bateman address@hidden
Motorola Labs - Paris +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE +33 1 69 35 77 01 (Fax)
The information contained in this communication has been classified as:
[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary
# HG changeset patch
# User David Bateman <address@hidden>
# Date 1220976632 -7200
# Node ID efb313d0d98265789218594c6d3dd5d0e19f9ce1
# Parent 917a977250d994aceb97d2bd299e9a6048ca750f
Treat PCRE lookbehind operators in a manner that is approximately correct
diff --git a/src/ChangeLog b/src/ChangeLog
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@ 2008-09-02 Michael Goffioul <michael.g
+2008-09-09 David Bateman <address@hidden>
+
+ * DLD-FUNCTIONS/regexp.cc (octregexp_list): Distinguish between
+ matlab named tokens and perl lookbehind expressions. For
+ lookbehind expression replace "*" and "+" with a limited number of
+ fixed length expressions to simulate arbitrary length look behind.
+
2008-09-02 Michael Goffioul <address@hidden>
* graphics.cc (hggroup::update_axis_limits): Also reacts on
diff --git a/src/DLD-FUNCTIONS/regexp.cc b/src/DLD-FUNCTIONS/regexp.cc
--- a/src/DLD-FUNCTIONS/regexp.cc
+++ b/src/DLD-FUNCTIONS/regexp.cc
@@ -80,6 +80,9 @@ public:
typedef std::list<regexp_elem>::const_iterator const_iterator;
+#define MAXLOOKBEHIND 10
+static bool lookbehind_warned = false;
+
static int
octregexp_list (const octave_value_list &args, const std::string &nm,
bool case_insensitive, std::list<regexp_elem> &lst,
@@ -96,6 +99,9 @@ octregexp_list (const octave_value_list
once = false;
std::string buffer = args(0).string_value ();
+ size_t max_length = (buffer.length () > MAXLOOKBEHIND ?
+ MAXLOOKBEHIND: buffer.length ());
+
if (error_state)
{
gripe_wrong_type_arg (nm.c_str(), args(0));
@@ -190,12 +196,6 @@ octregexp_list (const octave_value_list
// named tokens "(?<name>...)" are only treated with PCRE not regex.
#if HAVE_PCRE
- // The syntax of named tokens in pcre is "(?P<name>...)" while we need
- // a syntax "(?<name>...)", so fix that here. Also an expression like
- // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)" should
- // be perfectly legal, while pcre does not allow the same named token
- // name on both sides of the alternative. Also fix that here by replacing
- // name tokens by dummy names, and dealing with the dummy names later.
size_t pos = 0;
size_t new_pos;
@@ -204,44 +204,131 @@ octregexp_list (const octave_value_list
std::ostringstream buf;
Array<int> named_idx;
- while ((new_pos = pattern.find ("(?<",pos)) != std::string::npos)
- {
- size_t tmp_pos = pattern.find_first_of ('>',new_pos);
-
- if (tmp_pos == std::string::npos)
+ while ((new_pos = pattern.find ("(?",pos)) != std::string::npos)
+ {
+ if (pattern.at (new_pos + 2) == '<' &&
+ !(pattern.at (new_pos + 3) == '=' ||
+ pattern.at (new_pos + 3) == '!'))
{
- error ("syntax error in pattern");
- break;
+ // The syntax of named tokens in pcre is "(?P<name>...)" while
+ // we need a syntax "(?<name>...)", so fix that here. Also an
+ // expression like
+ // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)"
+ // should be perfectly legal, while pcre does not allow the same
+ // named token name on both sides of the alternative. Also fix
+ // that here by replacing name tokens by dummy names, and dealing
+ // with the dummy names later.
+
+ size_t tmp_pos = pattern.find_first_of ('>',new_pos);
+
+ if (tmp_pos == std::string::npos)
+ {
+ error ("syntax error in pattern");
+ break;
+ }
+
+ std::string tmp_name =
+ pattern.substr(new_pos+3,tmp_pos-new_pos-3);
+ bool found = false;
+
+ for (int i = 0; i < nnames; i++)
+ if (named(i) == tmp_name)
+ {
+ named_idx.resize(inames+1);
+ named_idx(inames) = i;
+ found = true;
+ break;
+ }
+ if (! found)
+ {
+ named_idx.resize(inames+1);
+ named_idx(inames) = nnames;
+ named.append(tmp_name);
+ nnames++;
+ }
+
+ if (new_pos - pos > 0)
+ buf << pattern.substr(pos,new_pos-pos);
+ if (inames < 10)
+ buf << "(?P<n00" << inames++;
+ else if (inames < 100)
+ buf << "(?P<n0" << inames++;
+ else
+ buf << "(?P<n" << inames++;
+ pos = tmp_pos;
}
-
- std::string tmp_name = pattern.substr(new_pos+3,tmp_pos-new_pos-3);
- bool found = false;
-
- for (int i = 0; i < nnames; i++)
- if (named(i) == tmp_name)
- {
- named_idx.resize(inames+1);
- named_idx(inames) = i;
- found = true;
- break;
- }
- if (! found)
+ else if (pattern.at (new_pos + 2) == '<')
{
- named_idx.resize(inames+1);
- named_idx(inames) = nnames;
- named.append(tmp_name);
- nnames++;
+ // Find lookbehind operators of arbitrary length (ie like
+ // "(?<=[a-z]*)") and replace with a maximum length operator
+ // as PCRE can not yet handle arbitrary length lookahead
+ // operators. Use the string length as the maximum length to
+ // avoid issues.
+
+ int brackets = 1;
+ size_t tmp_pos1 = new_pos + 2;
+ size_t tmp_pos2 = tmp_pos1;
+ while (tmp_pos1 <= pattern.length () && brackets > 0)
+ {
+ char ch = pattern.at (tmp_pos1);
+ if (ch == '(')
+ brackets++;
+ else if (ch == ')')
+ {
+ if (brackets > 1)
+ tmp_pos2 = tmp_pos1;
+
+ brackets--;
+ }
+ tmp_pos1++;
+ }
+
+ if (brackets != 0)
+ {
+ buf << pattern.substr (pos, new_pos - pos) << "(?";
+ pos = new_pos + 2;
+ }
+ else
+ {
+ size_t tmp_pos3 = pattern.find_first_of ("*+", tmp_pos2);
+ if (tmp_pos3 != std::string::npos && tmp_pos3 < tmp_pos1)
+ {
+ if (!lookbehind_warned)
+ {
+ lookbehind_warned = true;
+ warning ("%s: arbitrary length lookbehind patterns
are only support up to length %d", nm.c_str(), MAXLOOKBEHIND);
+ }
+
+ buf << pattern.substr (pos, new_pos - pos) << "(";
+
+ size_t i;
+ if (pattern.at (tmp_pos3) == '*')
+ i = 0;
+ else
+ i = 1;
+
+ for (; i < max_length + 1; i++)
+ {
+ buf <<pattern.substr(new_pos, tmp_pos3 - new_pos)
+ << "{" << i << "}";
+ buf << pattern.substr(tmp_pos3 + 1,
+ tmp_pos1 - tmp_pos3 - 1);
+ if (i != max_length)
+ buf << "|";
+ }
+ buf << ")";
+ }
+ else
+ buf << pattern.substr (pos, tmp_pos1 - pos);
+ pos = tmp_pos1;
+ }
}
-
- if (new_pos - pos > 0)
- buf << pattern.substr(pos,new_pos-pos);
- if (inames < 10)
- buf << "(?P<n00" << inames++;
- else if (inames < 100)
- buf << "(?P<n0" << inames++;
else
- buf << "(?P<n" << inames++;
- pos = tmp_pos;
+ {
+ buf << pattern.substr (pos, new_pos - pos) << "(?";
+ pos = new_pos + 2;
+ }
+
}
buf << pattern.substr(pos);
@@ -1041,6 +1128,16 @@ The pattern is taken literally.\n\
%!assert(regexp({'asdfg-dfd','-dfd-dfd-','qasfdfdaq'},'-'),{6,[1,5,9],zeros(1,0)})
%!assert(regexp({'asdfg-dfd';'-dfd-dfd-';'qasfdfdaq'},{'-';'f';'q'}),{6;[3,7];[1,9]})
%!assert(regexp('Strings',{'t','s'}),{2,7})
+
+## Test case for lookaround operators
+%!assert(regexp('Iraq','q(?!u)'),4)
+%!assert(regexp('quit','q(?!u)'), zeros(1,0))
+%!assert(regexp('quit','q(?=u)','match'), {'q'})
+%!assert(regexp("quit",'q(?=u+)','match'), {'q'})
+%!assert(regexp("qit",'q(?=u+)','match'), cell(1,0))
+%!assert(regexp("qit",'q(?=u*)','match'), {'q'})
+
+%!assert(regexp('thingamabob','(?<=a)b'), 9)
*/
@@ -1569,6 +1666,9 @@ Alternatively, use (?x) or (?-x) in the
%!assert(regexprep({"abc","cba"},"b","?"),{"a?c","c?a"})
%!assert(regexprep({"abc","cba"},{"b","a"},{"?","!"}),{"!?c","c?!"})
+# Nasty lookbehind expression
+%!assert(regexprep('x^(-1)+y(-1)+z(-1)=0','(?<=[a-z]+)\(\-[1-9]*\)','_minus1'),'x^(-1)+y_minus1+z_minus1=0')
+
*/
/*
- Aw: Re: regexp: matching expressions b4 and after ...., giovanni . lombardo, 2008/09/09
- Re: Aw: Re: regexp: matching expressions b4 and after ...., David Bateman, 2008/09/09
- [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ...., David Bateman, 2008/09/09
- Re: Aw: Re: regexp: matching expressions b4 and after ...., David Bateman, 2008/09/09
- Re: Aw: Re: regexp: matching expressions b4 and after ...., Ben Abbott, 2008/09/09
- [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....,
David Bateman <=
- [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ...., John W. Eaton, 2008/09/09
- Re: [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ...., Jaroslav Hajek, 2008/09/10
- Re: [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ...., David Bateman, 2008/09/10
Re: Aw: Re: regexp: matching expressions b4 and after ...., Ben Abbott, 2008/09/09