bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: csplit corrupts files with long lines


From: Jim Meyering
Subject: Re: csplit corrupts files with long lines
Date: Sat, 10 Sep 2005 16:33:40 +0200

Tristan Miller <address@hidden> wrote:
> I'm reporting a bug with csplit coreutils 5.2.1, compiled from sources on a
> SuSE 9.3 system.  It seems this bug was previously reported over a year
> ago (see
> <http://lists.gnu.org/archive/html/bug-coreutils/2004-08/msg00112.html>)
> but it was never squashed.
>
> In short, csplit produces corrupt output when the input file contains very
> long lines.  An example file is at
> <http://www.dfki.uni-kl.de/~miller/tmp/wikipedia.xml>, an XML file
> containing three articles from Wikipedia.  The second article was
> vandalized by a spammer who inserted a ridiculously long line (42280
> characters) full of links.
>
> If I try to split this file with
>
> $ csplit wikipedia.xml '/<page>/' '{*}'
>
> then the file with the second article, xx02, is garbled at the beginning of
> the long line.  See <http://www.dfki.uni-kl.de/~miller/tmp/xx02>.

Thanks for reporting that!
Here's the fix I've just checked in:
[there's still a minor memory leak, but we'll deal with that separately]

2005-09-10  Jim Meyering  <address@hidden>

        csplit could produce corrupt output, given input lines longer than 8KB

        * src/csplit.c (load_buffer): Don't read from free'd memory
        when handling lines longer than the initial buffer length.
        (save_to_hold_area): Don't leak the previous hold_area buffer.
        Reported by Tristan Miller and Luke Kendall.
        * NEWS: Mention this.
        * tests/misc/csplit: New test for this.

        * src/csplit.c (load_buffer): Avoid integer overflow in buffer
        size calculations for very long lines.

Index: NEWS
===================================================================
RCS file: /fetish/cu/NEWS,v
retrieving revision 1.307
retrieving revision 1.308
diff -u -p -u -r1.307 -r1.308
--- NEWS        10 Sep 2005 00:08:28 -0000      1.307
+++ NEWS        10 Sep 2005 14:07:59 -0000      1.308
@@ -136,6 +136,8 @@ GNU coreutils NEWS
   permissions like =xX and =u, and did not properly diagnose some invalid
   strings like g+gr, ug,+x, and +1.  These bugs have been fixed.
 
+  csplit could produce corrupt output, given input lines longer than 8KB
+
   dd now computes statistics using a realtime clock (if available)
   rather than the time-of-day clock, to avoid glitches if the
   time-of-day is changed while dd is running.  Also, it avoids
Index: src/csplit.c
===================================================================
RCS file: /fetish/cu/src/csplit.c,v
retrieving revision 1.144
retrieving revision 1.145
diff -u -p -u -r1.144 -r1.145
--- src/csplit.c        9 Sep 2005 21:08:19 -0000       1.144
+++ src/csplit.c        10 Sep 2005 13:56:45 -0000      1.145
@@ -251,16 +251,12 @@ interrupt_handler (int sig)
 }
 
 /* Keep track of NUM bytes of a partial line in buffer START.
-   These bytes will be retrieved later when another large buffer is read.
-   It is not necessary to create a new buffer for these bytes; instead,
-   we keep a pointer to the existing buffer.  This buffer *is* on the
-   free list, and when the next buffer is obtained from this list
-   (even if it is this one), these bytes will be placed at the
-   start of the new buffer. */
+   These bytes will be retrieved later when another large buffer is read.  */
 
 static void
 save_to_hold_area (char *start, size_t num)
 {
+  free (hold_area);
   hold_area = start;
   hold_count = num;
 }
@@ -386,7 +382,7 @@ record_line_starts (struct buffer_record
          lines++;
        }
       else
-       save_to_hold_area (line_start, bytes_left);
+       save_to_hold_area (xmemdup (line_start, bytes_left), bytes_left);
     }
 
   b->num_lines = lines;
@@ -442,6 +438,7 @@ static void
 free_buffer (struct buffer_record *buf)
 {
   free (buf->buffer);
+  buf->buffer = NULL;
 }
 
 /* Append buffer BUF to the linked list of buffers that contain
@@ -495,7 +492,7 @@ load_buffer (void)
   if (bytes_wanted < hold_count)
     bytes_wanted = hold_count;
 
-  do
+  while (1)
     {
       b = get_new_buffer (bytes_wanted);
       bytes_avail = b->bytes_alloc; /* Size of buffer returned. */
@@ -504,8 +501,7 @@ load_buffer (void)
       /* First check the `holding' area for a partial line. */
       if (hold_count)
        {
-         if (p != hold_area)
-           memcpy (p, hold_area, hold_count);
+         memcpy (p, hold_area, hold_count);
          p += hold_count;
          b->bytes_used += hold_count;
          bytes_avail -= hold_count;
@@ -515,11 +511,18 @@ load_buffer (void)
       b->bytes_used += read_input (p, bytes_avail);
 
       lines_found = record_line_starts (b);
-      bytes_wanted = b->bytes_alloc * 2;
       if (!lines_found)
        free_buffer (b);
+
+      if (lines_found || have_read_eof)
+       break;
+
+      if (xalloc_oversized (2, b->bytes_alloc))
+       xalloc_die ();
+      bytes_wanted = 2 * b->bytes_alloc;
+      free_buffer (b);
+      free (b);
     }
-  while (!lines_found && !have_read_eof);
 
   if (lines_found)
     save_buffer (b);
Index: tests/misc/csplit
===================================================================
RCS file: /fetish/cu/tests/misc/csplit,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -p -u -r1.6 -r1.7
--- tests/misc/csplit   23 Jun 2004 15:07:04 -0000      1.6
+++ tests/misc/csplit   10 Sep 2005 14:06:29 -0000      1.7
@@ -86,4 +86,13 @@ EOF
 cmp err experr || fail=1
 test $fail = 1 && diff err experr 2> /dev/null
 
+# Ensure that lines longer than the initial buffer length don't cause
+# trouble (e.g. reading from freed memory, resulting in corrupt output).
+# This test failed at least in coreutils-5.2.1 and 5.3.0, and was fixed
+# in 5.3.1.
+rm -f in out exp err experr xx??
+printf 'x%8199s\nx\n%8199s\nx\n' x x > in
+csplit in '/x/' '{*}' > /dev/null || fail=1
+cat xx?? | cmp - in || fail=1
+
 (exit $fail); exit $fail




reply via email to

[Prev in Thread] Current Thread [Next in Thread]