[PATCH 1/3] sed: Fix infinite loop on some false multi-byte matches

bug-gnu-utils

From:	Stanislav Brabec
Subject:	[PATCH 1/3] sed: Fix infinite loop on some false multi-byte matches
Date:	Fri, 10 Feb 2012 20:39:37 +0100

sed may hang in an infinite loop during replacements on some strings in
charsets that can form a false match on the boundary of two characters
(e.g. EUC-JP).

str_append() in execute.c makes a bad assumption:

if (n > 0)

n is declared as size_t, and on glibc on Linux x86_64, size_t seems to
be defined as unsigned long. The code apparently does not expect that
the condition is also true for (size_t) -2 (incomplete multi-byte
character).

A further bug lies a bit deeper and may require a fix of the same code
in glibc, depending on configure parameters. See
http://sourceware.org/bugzilla/show_bug.cgi?id=13637 for the glibc fix.

re_search_internal(), in case 6 of its switch (match_kind), finds a
possible match. In the case of our false match, verification of the
match that does not respect multi-byte characters fails, and
match_regex() returns the index of this false match.

Going deeper, re_search_internal() calls re_string_reconstruct() and
that calls re_string_skip_chars().

re_string_skip_chars() is an I18N-specific function that advances
character by character up to the indexed character. It operates on
whole multi-byte characters.

In a correct run, it returns the correct index of the next character
to inspect. When the bug occurs, __mbrtowc called from there returns
-2 (incomplete multi-byte character). Why? It seems to be caused by
remain_len being equal to 1, even though there are still 6 bytes to
inspect ("\267\357a\277\267\275").

I believe that remain_len is computed incorrectly:

sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()

      remain_len = pstr->len - rawbuf_idx;

pstr->len seems to be the length of the remaining part of the string,
while rawbuf_idx is the index of the remaining part of the string in
the original (raw) string.

I am not quite familiar with the code, but I believe that the expression
should be:
remain_len = pstr->raw_len - rawbuf_idx;


Example:

State when stopped in the first iteration of re_string_skip_chars():

Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
  raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>, 
  mbs = 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}

Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
  raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>, 
  mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}


If my observation is correct, the bug is not EUC-JP specific.

Bug triggers:
- The charset must be capable of forming a false match on the boundary
  of two characters. EUC-JP fits this requirement, UTF-8 probably does
  not.
- There is a match that is true byte-wise (ASCII) but false in the
  locale-specific charset.
- This false match must appear at an exact place near two thirds of the
  string.

Index: sed-4.2.1/sed/execute.c
===================================================================
--- sed-4.2.1.orig/sed/execute.c
+++ sed-4.2.1/sed/execute.c
@@ -261,7 +261,7 @@ str_append(to, string, length)
            n = 1;
          }
 
-        if (n > 0)
+        if ((n != (size_t) -2) && (n > 0))
          {
            string += n;
            length -= n;
Index: sed-4.2.1/lib/regex_internal.c
===================================================================
--- sed-4.2.1.orig/lib/regex_internal.c
+++ sed-4.2.1/lib/regex_internal.c
@@ -499,7 +499,7 @@ re_string_skip_chars (re_string_t *pstr,
     {
       wchar_t wc2;
       Idx remain_len;
-      remain_len = pstr->len - rawbuf_idx;
+      remain_len = pstr->raw_len - rawbuf_idx;
       prev_st = pstr->cur_state;
       mbclen = __mbrtowc (&wc2, (const char *) pstr->raw_mbs + rawbuf_idx,
                          remain_len, &pstr->cur_state);


-- 
Best Regards / S pozdravem,

Stanislav Brabec
software developer
---------------------------------------------------------------------
SUSE LINUX, s. r. o.                          e-mail: address@hidden
Lihovarská 1060/12                            tel: +49 911 7405384547
190 00 Praha 9                                  fax: +420 284 028 951
Czech Republic                                    http://www.suse.cz/
