Re: bug in gawk 3.1.1?

bug-gnu-utils

From:	Aharon Robbins
Subject:	Re: bug in gawk 3.1.1?
Date:	Wed, 4 Sep 2002 14:06:56 +0300
Greetings. Re this:

In article <address@hidden>,
LorenzAtWork  <address@hidden> wrote:
>hello all,
>
>I'm using the following script
>
>BEGIN {RS="ti1\n(dwv,)?"; s=0; i=0}
>{if ($1 != "") s = $1; print ++i, s}
>
>to extract values from a file of the form
>
>ti1
>dwv,98.22
>ti1
>dwv,103.08
>ti1
>ti1
>dwv,196.25
>ti1
>dwv,210.62
>ti1
>dwv,223.53
>
>The desired result for this example looks like
>
>1 0
>2 98.22
>3 103.08
>4 103.08
>5 196.25
>6 210.62
>7 223.53
>
>The script work fine the most time, but when run on the attached file
>(sorry for the size, but the error would not appear with less data) I
>get some (three with the attached file) lines that look like
>
>1262 dwv,212.97
>1277 dwv,174.33
>1279 dwv,151.79
>
>I can't think of a other reason for this than a bug in gawk!
>
>I'm running gawk 3.1.1 on winnt 4.0
>
>best regards
>     Lorenz

First and foremost, let me emphasize, reiterate, and remind:

        PLEASE SEND BUG REPORTS TO address@hidden !!!!!!!!!

Posting and discussing bugs in comp.lang.awk is a lot of fun, but a
terribly unreliable way to get anything done about them.  My attention
to the group varies, to say the least.

Now, on to the problem.  I was able to reproduce the problem with the
posted data file on my GNU/Linux system.  Interestingly, mawk didn't
have a problem, although others reported mawk showing the same thing on
different systems.

I just finished well over 4 hours of work tracing this down.  The fix
is below, and at least solves the problem for this case.  I'm not sure
there's a general solution.

The bug is an issue of regular expression matching across buffer
boundaries.  This is why different platforms got different results.
Look at the regex involved:

        RS = "ti1\n(dwv,)?"

Now, consider the case where the input file sits in the input buffer
like so:

        ---+---+---+---+----+---+
        ...| t | i | 1 | \n | d | <--- last character in buffer
        ---+---+---+---+----+---+

with the "wv," still waiting to be read in from the file.

In this case, the regex matcher *successfully* matches the RS regex,
since the "dwv," part is optional.
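
To see the effect outside of gawk, here is a minimal sketch using the
POSIX regcomp()/regexec() interface rather than gawk's own matcher.
The helper name match_len_in_prefix and the sample text are made up
for illustration; only the RS regex comes from the report.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of gawk): return the length of the
 * first match of the extended regex `pattern` within the first `len`
 * bytes of `text`, or -1 if there is no match or an error occurs.
 * Limiting the search to `len` bytes stands in for a buffer that
 * has only been partially filled from the file.
 */
static int match_len_in_prefix(const char *pattern, const char *text,
                               size_t len)
{
    regex_t re;
    regmatch_t m;
    char *buf;
    int result = -1;

    assert(len <= strlen(text));

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* regexec() wants a NUL-terminated string, so copy the prefix. */
    buf = malloc(len + 1);
    if (buf != NULL) {
        memcpy(buf, text, len);
        buf[len] = '\0';
        if (regexec(&re, buf, 1, &m, 0) == 0)
            result = (int) (m.rm_eo - m.rm_so);
        free(buf);
    }

    regfree(&re);
    return result;
}
```

Called with the pattern "ti1\n(dwv,)?" and the text "ti1\ndwv,212.97\n",
a prefix length of 5 (the buffer ends right after the "d") should give
a match covering only "ti1\n", while the full text should give a match
that also swallows "dwv,".  The short match is what leaves "dwv," at
the front of the next record.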

I fixed this basically by using a heuristic.  If the RS regex
ends in ?, *, or +, and the end of the regex match is within
a few bytes of the end of the buffer, then read in some more
text and try again.  This solves the problem for the particular
program and test data, but as I said, isn't necessarily a
complete and general fix.
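
Stripped of gawk's IOBUF bookkeeping, the test amounts to roughly the
following.  This is only a sketch: the function name needs_more_input
and its parameters are invented for illustration, and "a few bytes" is
taken to mean "fewer than the length of the RS string", which is what
the patch below uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the "case 4" test in the patch.
 *
 *   rs        - the RS regex as a string
 *   match_end - offset in the buffer just past the end of the match
 *   buf_len   - number of valid bytes currently in the buffer
 *   at_eof    - nonzero if there is no more input to read
 *
 * Returns nonzero if the match should not be trusted yet and more
 * input should be read before trying again.
 */
static int needs_more_input(const char *rs, size_t match_end,
                            size_t buf_len, int at_eof)
{
    size_t rs_len = strlen(rs);

    assert(match_end <= buf_len);

    /* Nothing more to read, or nothing to inspect: accept the match. */
    if (at_eof || rs_len == 0)
        return 0;

    /* Only regexes ending in an optional or repeated part are suspect. */
    if (strchr("+*?", rs[rs_len - 1]) == NULL)
        return 0;

    /* Suspect only if the match ends close to the end of the buffer. */
    return (buf_len - match_end) < rs_len;
}
```

For the buffer pictured above, the match ends one byte before the end
of the buffer, so the function asks for more input.  A match that ends
well inside the buffer, or an RS without a trailing ?, *, or +, is
accepted as is.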

(I think the only complete fix would be to suck the entire file into
memory.  This I'm unwilling to do.  And don't suggest using mmap; I've
been down that road already.)

In any case, the following patch fixes this on my system, without
breaking anything else in the test suite.

Enjoy,

Arnold
------------------------------------------------------------------
*** io.c.save   Wed Aug 21 15:21:05 2002
--- io.c        Wed Sep  4 13:17:37 2002
***************
*** 355,367 ****
--- 355,383 ----
  {
        IOBUF *iop;
        extern int exiting;
+       int rval1, rval2, rval3;
  
        (void) setjmp(filebuf); /* for `nextfile' */
  
        while ((iop = nextfile(FALSE)) != NULL) {
+               /*
+                * This was:
                if (inrec(iop) == 0)
                        while (interpret(expression_value) && inrec(iop) == 0)
                                continue;
+                * Now expand it out for ease of debugging.
+                */
+               rval1 = inrec(iop);
+               if (rval1 == 0) {
+                       for (;;) {
+                               rval2 = rval3 = -1;     /* for debugging */
+                               rval2 = interpret(expression_value);
+                               if (rval2 != 0)
+                                       rval3 = inrec(iop);
+                               if (rval2 == 0 || rval3 != 0)
+                                       break;
+                       }
+               }
                if (exiting)
                        break;
        }
***************
*** 2447,2452 ****
--- 2463,2469 ----
                                bufend = iop->buf + iop->size + iop->secsiz;
                                *bufend = rs;
                        }
+ 
                        if (len > 0) {
                                char *newsplit = iop->buf + iop->secsiz;
  
***************
*** 2551,2579 ****
                 */
                continuing = FALSE;
                if (rsre != NULL) {
!               again:
!                       /* cases 1 and 2 are simple, just keep going */
!                       if (research(rsre, start, 0, iop->end - start, TRUE) == -1
!                           || RESTART(rsre, start) == REEND(rsre, start)) {
                                /*
!                                * Leading newlines at the beginning of the file
!                                * should be ignored. Whew!
                                 */
!                               if (RS_is_null && *start == '\n'
!                                               && start < iop->end) {
!                                       /*
!                                        * have to catch the case of a
!                                        * single newline at the front of
!                                        * the record, which the regex
!                                        * doesn't. gurr.
!                                        */
!                                       while (*start == '\n' && start < iop->end)
!                                               start++;
!                                       goto again;
                                }
                                bp = iop->end;
                                continue;
                        }
                        /* case 3, regex match at exact end */
                        if (start + REEND(rsre, start) >= iop->end) {
                                if (iop->cnt != EOF) {
--- 2568,2606 ----
                 */
                continuing = FALSE;
                if (rsre != NULL) {
!                       /*
!                        * Leading newlines at the beginning of the file
!                        * should be ignored. Whew!
!                        */
!                       if (RS_is_null && *start == '\n'
!                                       && start < iop->end) {
                                /*
!                                * have to catch the case of a
!                                * single newline at the front of
!                                * the record, which the regex
!                                * doesn't. gurr.
                                 */
!                               while (*start == '\n' && start < iop->end)
!                                       start++;
!                               if (start == iop->end) {
!                                       bp = iop->end;
!                                       continue;
                                }
+                       }
+ 
+               again:
+                       /* case 1 is simple, just keep going */
+                       if (research(rsre, start, 0, iop->end - start, TRUE) == -1) {
+                               bp = iop->end;
+                               continue;
+                       }
+ 
+                       /* case 2 is simple, just keep going */
+                       if (RESTART(rsre, start) == REEND(rsre, start)) {
                                bp = iop->end;
                                continue;
                        }
+ 
                        /* case 3, regex match at exact end */
                        if (start + REEND(rsre, start) >= iop->end) {
                                if (iop->cnt != EOF) {
***************
*** 2591,2596 ****
--- 2618,2657 ----
                                        }
                                }
                        }
+ 
+                       /*
+                        * case 4, match succeeded, but there may be more in
+                        * the next input buffer.
+                        *
+                        * Consider an RS of   xyz(abc)?   where the
+                        * exact end of the buffer is   xyza  and the
+                        * next two, unread characters, are ab.
+ 
+                        * This matches the "xyz" and ends up putting the
+                        * "abc" into the front of the next record. Ooops.
+                        *
+                        * The test for a *, +, or ? at the end of the RE
+                        * is a heuristic (spelled k l u d g e).
+                        */
+                       /* succession of tests is easier to trace in GDB. */
+                       if (iop->cnt != EOF) {
+                               if (strchr("+*?", RS->stptr[RS->stlen-1]) != NULL) {
+                                       if ((iop->end - (start+REEND(rsre,start))) < RS->stlen) {
+                                               bp = iop->end;
+                                               continuing = TRUE;
+                                               continue;
+                                       }
+                               }
+                       }
+ 
+                       /*
+                        * iop->cnt could be set to EOF from extra scanning but
+                        * there may still be characters left in the buffer.
+                        * Ugh.
+                        */
+                       if (iop->cnt == EOF && iop->end > iop->off)
+                               iop->cnt = iop->end - iop->off;
+ 
                        /* got a match! */
                        /*
                         * Leading newlines at the beginning of the file
-- 
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.     address@hidden
P.O. Box 354            Home Phone: +972  8 979-0381    Fax: +1 928 569 9018
Nof Ayalon              Cell Phone: +972 51  297-545
D.N. Shimshon 99785     ISRAEL