bug-gnu-utils

gensub RE problem?


From: Jim Hart
Subject: gensub RE problem?
Date: Fri, 6 Sep 2002 09:37:07 -0400

GNU Awk 3.1.0, compiled from source on Darwin.

I believe I've found a problem in gensub's handling of regular expression matching. The string to be matched against is:

<div class="bodytext">                                   <a href="/2/hi/science/nature/2212629.stm"><img height="120" hspace="5" vspace="0" width="100" border="0" src="/media/images/38213000/jpg/_38213850_websmall.jpg" align="left"></a> <a href= "/2/hi/science/nature/2212629.stm"><span class="h1">Global body needed to fight poverty</span></a><br> A new global body for economic development is needed, says The Lancet in the lead-up to the Sustainable Development Summit.<br clear="ALL">     </div>

The gensub command:

gensub(/.*<br>([^<]*)<br.*|.*/,"\\1",1,itemString)

returns:

                      A new global body for economic development is needed, says The Lancet in the lead-up to the Sustainable Development Summit.

as one would expect. Whereas:

gensub(/.*<br>(.*)<br.*|.*/,"\\1",1,itemString)

returns the null (empty) string. And:

gensub(/.*<br>(  *)([^<]*)<br.*|.*/,"\\1",1,itemString)

returns only one space, not the many that follow the <br>. And:

gensub(/.*<br>(  *)([^<]*)<br.*|.*/,"\\2",1,itemString)

returns the same thing as the first example, with all the leading spaces included.


The man page re_format says:

"In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later."

Note that last part: subexpressions that start earlier in the RE should extend to their maximum length. Yet gensub appears to return the shortest possible match for (  *) and (.*), not the longest.
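
For comparison, the same patterns can be run through a different regex engine. The following is a minimal sketch in Python, not a statement about how gawk works internally. Python's re module is a backtracking matcher rather than a POSIX leftmost-longest one, so it is only a cross-check of what greedy subexpressions capture on this kind of input. The string item is a hypothetical, shortened stand-in for itemString, the helper gensub_like is invented for illustration, and the patterns assume that the delimiter after the captured text is the <br clear="ALL"> tag.

```python
import re

# Hypothetical, shortened stand-in for itemString; not the original data.
item = 'x</a><br>     A new global body.<br clear="ALL">     </div>'


def gensub_like(pattern, group, text):
    """Replace the first match of `pattern` in `text` with capture group `group`.

    Illustrative helper only: a rough analogue of gensub(re, "\\N", 1, text).
    """
    def repl(m):
        # An unmatched group (e.g. if the bare `.*` alternative wins) yields "".
        return m.group(group) or ""
    # DOTALL lets `.` match newlines too (assumed here to mirror gawk's `.`).
    return re.sub(pattern, repl, text, count=1, flags=re.DOTALL)


first = gensub_like(r'.*<br>([^<]*)<br.*|.*', 1, item)
second = gensub_like(r'.*<br>(.*)<br.*|.*', 1, item)
spaces = gensub_like(r'.*<br>(  *)([^<]*)<br.*|.*', 1, item)
rest = gensub_like(r'.*<br>(  *)([^<]*)<br.*|.*', 2, item)

for name, value in [("first", first), ("second", second),
                    ("spaces", spaces), ("rest", rest)]:
    print(name, repr(value))
```

On this stand-in string, Python gives the same text for the ([^<]*) and (.*) patterns, captures all five spaces in the first group of the third pattern, and leaves only the sentence in the second group. That agrees with the reading of re_format above, but it says nothing about whether the difference in gawk comes from gawk itself or from the regex routines it is linked against.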

Comments? Opinions? Did I miss something? Is gawk just calling OS routines so the problem is actually in Darwin?




