bug-gnu-utils

Re: bug introduced in gawk 3.1.1, still in 3.1.3


From: Aharon Robbins
Subject: Re: bug introduced in gawk 3.1.1, still in 3.1.3
Date: Thu, 29 Jan 2004 17:15:02 +0200

Greetings.  Re this:

> To: address@hidden
> Subject: bug introduced in gawk 3.1.1, still in 3.1.3
> Date: Wed, 28 Jan 2004 16:23:30 -0700 (MST)
> From: address@hidden (Bill Bruno)
>
> Here's the bug:
>
> cpg[95]% ./gawk '{sub(/[a-z]/,"&"); print}'
> aaa
> &aa
>
> I get this in 3.1.1 and 3.1.3.  In 3.1.0 I get the correct
> behavior:
>
> motif[5]% gawk '{sub ( /[a-z]/, "&"); print}'
> aaa
> aaa
>
> I find the same problem in gsub, where it is more relevant
> because this command can be used to count occurrences of
> a regexp without changing the string.  If there is a more
> standard way to do that, please tell me.
>
> I guess the workaround is to duplicate the string first...
> Bill

As I said in my earlier mail, this is related to the locale in use.
With LC_ALL=C, it doesn't happen.  The fix is included below.  For free,
you get a bonus bug fix: with --posix gawk will now follow the 2001
POSIX standard for sub and gsub.  Thank you for shopping at gnu.org. (:-)

Enjoy.

Arnold
-------------------------------------------
Thu Jan 29 17:04:51 2004  Arnold D. Robbins  <address@hidden>

        * builtin.c (sub_common): Fix logic for `&' in replacement for
        multibyte case.  Simplify code a bit.

Sun Jan 18 12:01:29 2004  Arnold D. Robbins  <address@hidden>

        * builtin.c (sub_common): Add comment and support for 2001 POSIX
        behavior when --posix in effect.

--- ../gawk-3.1.3/builtin.c     2003-07-07 01:08:08.000000000 +0300
+++ builtin.c   2004-01-29 17:04:28.000000000 +0200
@@ -1956,6 +2001,33 @@
  */
 
 /*
+ * 1/2004:  The gawk sub/gsub behavior dates from 1996, when we proposed it
+ * for POSIX.  The proposal fell through the cracks, and the 2001 POSIX
+ * standard chose a more simple behavior.
+ *
+ * The relevant text is to be found on lines 6394-6407 (pages 166, 167) of the
+ * 2001 standard:
+ * 
+ * sub(ere, repl[, in ])
+ *     Substitute the string repl in place of the first instance of the extended regular
+ *     expression ERE in string in and return the number of substitutions. An ampersand
+ *     ('&') appearing in the string repl shall be replaced by the string from in that
+ *     matches the ERE. An ampersand preceded with a backslash ('\') shall be
+ *     interpreted as the literal ampersand character. An occurrence of two consecutive
+ *     backslashes shall be interpreted as just a single literal backslash character. Any
+ *     other occurrence of a backslash (for example, preceding any other character) shall
+ *     be treated as a literal backslash character. Note that if repl is a string literal (the
+ *     lexical token STRING; see Grammar (on page 170)), the handling of the
+ *     ampersand character occurs after any lexical processing, including any lexical
+ *     backslash escape sequence processing. If in is specified and it is not an lvalue (see
+ *     Expressions in awk (on page 156)), the behavior is undefined. If in is omitted, awk
+ *     shall use the current record ($0) in its place.
+ *
+ * Because gawk has had its behavior for 7+ years, that behavior is remaining as
+ * the default, with the POSIX behavior available for do_posix. Fun, fun, fun.
+ */
+
+/*
  * NB: `howmany' conflicts with a SunOS 4.x macro in <sys/param.h>.
  */
 
@@ -2068,7 +2140,15 @@
                                        repllen--;
                                        scan++;
                                }
-                       } else {        /* (proposed) posix '96 mode */
+                       } else if (do_posix) {
+                               /* \& --> &, \\ --> \ */
+                               if (scan[1] == '&' || scan[1] == '\\') {
+                                       repllen--;
+                                       scan++;
+                               } /* else
+                                       leave alone, it goes into the output */
+                       } else {
+                               /* gawk default behavior since 1996 */
                                if (strncmp(scan, "\\\\\\&", 4) == 0) {
                                        /* \\\& --> \& */
                                        repllen -= 2;
@@ -2130,22 +2210,24 @@
                         * making substitutions as we go.
                         */
                        for (scan = repl; scan < replend; scan++)
+                               if (*scan == '&'
 #ifdef MBS_SUPPORT
-                               if ((gawk_mb_cur_max == 1
-                                        || (repllen > 0 && mb_indices[scan - repl] == 1))
-                                       && (*scan == '&'))
-#else
-                               if (*scan == '&')
+                                   /*
+                                    * Don't test repllen here. A simple "&" could
+                                    * end up with repllen == 0.
+                                    */
+                                   && (gawk_mb_cur_max == 1
+                                        || mb_indices[scan - repl] == 1)
 #endif
+                               ) {
                                        for (cp = matchstart; cp < matchend; cp++)
                                                *bp++ = *cp;
+                               } else if (*scan == '\\'
 #ifdef MBS_SUPPORT
-                               else if ((gawk_mb_cur_max == 1
+                                   && (gawk_mb_cur_max == 1
                                         || (repllen > 0 && mb_indices[scan - repl] == 1))
-                                                && (*scan == '\\')) {
-#else
-                               else if (*scan == '\\') {
 #endif
+                               ) {
                                        if (backdigs) { /* gensub, behave sanely */
                                                if (ISDIGIT(scan[1])) {
                                                        int dig = scan[1] - '0';
@@ -2161,7 +2243,13 @@
                                                        scan++;
                                                } else  /* \q for any q --> q */
                                                        *bp++ = *++scan;
-                                       } else {        /* posix '96 mode, bleah */
+                                       } else if (do_posix) {
+                                               /* \& --> &, \\ --> \ */
+                                               if (scan[1] == '&' || scan[1] == '\\')
+                                                       scan++;
+                                               *bp++ = *scan;
+                                       } else {
+                                               /* gawk default behavior since 1996 */
                                                if (strncmp(scan, "\\\\\\&", 4) == 0) {
                                                        /* \\\& --> \& */
                                                        *bp++ = '\\';
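
For anyone who wants to experiment with the 2001 POSIX rules quoted in the
comment above without building gawk, here is a minimal standalone sketch in C.
It is not gawk code: posix_expand_repl() and its buffer interface are
hypothetical, invented purely for illustration. It assumes single-byte text
(so it has no equivalent of the mb_indices check the patch touches), and it
takes the replacement string as it looks after awk's own lexical escape
processing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of gawk: expand `repl' following the
 * 2001 POSIX rules for sub/gsub.
 *   &    --> the matched text
 *   \&   --> a literal &
 *   \\   --> a single literal backslash
 *   any other backslash is copied through unchanged
 * Writes a NUL-terminated result into `out' (capacity `outsize') and
 * returns its length, or (size_t) -1 if the buffer is too small.
 */
size_t
posix_expand_repl(const char *repl, const char *match,
		  char *out, size_t outsize)
{
	const char *scan;
	size_t mlen, n = 0;

	assert(repl != NULL && match != NULL && out != NULL);
	mlen = strlen(match);

	for (scan = repl; *scan != '\0'; scan++) {
		if (*scan == '&') {
			/* unescaped &: copy the matched text */
			if (n + mlen >= outsize)
				return (size_t) -1;
			memcpy(out + n, match, mlen);
			n += mlen;
			continue;
		}
		/* \& --> &, \\ --> \ ; otherwise the backslash stays */
		if (*scan == '\\' && (scan[1] == '&' || scan[1] == '\\'))
			scan++;
		if (n + 1 >= outsize)
			return (size_t) -1;
		out[n++] = *scan;
	}
	if (n >= outsize)
		return (size_t) -1;
	out[n] = '\0';
	return n;
}
```

With a match of "ab", a replacement consisting of backslash, backslash,
ampersand should come out as a backslash followed by "ab", while backslash,
ampersand yields a literal "&". A backslash in front of any other character is
left in the output, which is the part that differs from gawk's default
behavior.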



