bug-gnu-utils

Re: gawk: Wrong behavior in binary mode


From: Aharon Robbins
Subject: Re: gawk: Wrong behavior in binary mode
Date: Thu, 18 Dec 2008 05:15:30 +0200

Hi All.

Let's be clear here as to what the issues are and what our goals are.

1. In a multibyte locale (e.g., en_US.utf8), when handed bytes that do
not make a valid multibyte string, gawk produces incorrect results for
length() and index().  This is a bug.  The patch I sent out earlier
fixes this bug.  I am pretty sure I pushed it out to the Savannah CVS
gawk-stable tree but if I didn't, I will shortly.

This fixes the original problem as reported by address@hidden
and fulfills my "obligations" as gawk maintainer. (:-)
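
For readers who want to see the failure mode in item 1 concretely, here
is a minimal sketch of a character count that tolerates invalid input.
It is NOT the code from the patch I sent; count_chars() is a made-up
name, and the policy of counting each bad byte as one character is an
assumption chosen for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in buf[0..len) under the current locale.
 * A byte that does not begin a valid multibyte sequence is counted
 * as one character, so bad input can neither stall the scan nor
 * push it past the end of the buffer.
 * Illustrative only; this is not gawk's implementation.
 */
size_t
count_chars(const char *buf, size_t len)
{
	mbstate_t state;
	size_t count = 0;
	size_t i = 0;

	memset(&state, 0, sizeof(state));
	while (i < len) {
		size_t n = mbrlen(buf + i, len - i, &state);

		if (n == (size_t) -1 || n == (size_t) -2) {
			/* invalid or truncated sequence: consume one byte */
			memset(&state, 0, sizeof(state));
			n = 1;
		} else if (n == 0) {
			/* embedded NUL byte: still one character */
			n = 1;
		}
		i += n;
		count++;
	}
	return count;
}
```

With this policy, a stray 0xFF byte in a UTF-8 locale is counted as one
character and the scan resumes at the next byte.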

2. Eli, not unreasonably, claims that there are times when a user really
wants "hands off my data" and that gawk (and other utilities) should
provide a mechanism for this.

However, there is no formal "requirement" for such a feature, in that
there is no standard that states this, nor is there existing practice
for such a feature in other versions of awk.  So, we have moved into
the realm of "nice to have".

3. I initially answered that LC_ALL=C meets this requirement. This is

        - Immediately available
        - Standard
        - Portable

Technically, I could stop here.
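
As a rough illustration of why LC_ALL=C does the job: in the "C" locale
every character is a single byte, so MB_CUR_MAX is 1 and no multibyte
processing applies. The helper name mb_cur_max_for() below is
hypothetical and is not part of gawk.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Switch every locale category to `name' and report the longest
 * multibyte character, in bytes, that the new locale can produce.
 * Returns 0 if the locale is not available on this system.
 * Hypothetical helper for illustration; not part of gawk.
 */
size_t
mb_cur_max_for(const char *name)
{
	if (setlocale(LC_ALL, name) == NULL)
		return 0;
	return (size_t) MB_CUR_MAX;
}
```

Note that LC_ALL switches all locale categories at once, not just
character handling.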

4. But, as Eli correctly points out, this is a big hammer, with other
undesirable effects if all a user wants is "hands off my data".
Which leads us to the patch below, which provides minimal changes for
a new option that should do the trick.
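
In outline, the idea is to keep a program-wide copy of MB_CUR_MAX and
force it to 1 when the new option is given, so that every multibyte
code path is skipped. The sketch below uses hypothetical names
(cached_mb_cur_max, init_mb_handling, use_multibyte_path) standing in
for the real variables; the actual change is in the diff that follows.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached, overridable MB_CUR_MAX. */
static size_t cached_mb_cur_max = 1;

/*
 * Call once after setlocale() and option parsing.
 * A binary request forces single-byte handling unless POSIX mode
 * is also on, in which case the locale's own value is kept.
 */
void
init_mb_handling(int want_binary, int posix_mode)
{
	if (want_binary && !posix_mode)
		cached_mb_cur_max = 1;	/* hands off my data! */
	else
		cached_mb_cur_max = (size_t) MB_CUR_MAX;
}

/* Nonzero when string functions should take the multibyte path. */
int
use_multibyte_path(void)
{
	return cached_mb_cur_max > 1;
}
```

Code that would otherwise convert to wide characters checks
use_multibyte_path() first and falls back to plain byte operations
when it returns zero.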

(See what a nice guy I am? :-)

This will make its way to the development CVS shortly.

Enjoy,

Arnold
--------------------------------------------------
Index: ChangeLog
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/ChangeLog,v
retrieving revision 1.3
diff -u -r1.3 ChangeLog
--- ChangeLog   16 Nov 2008 20:05:33 -0000      1.3
+++ ChangeLog   18 Dec 2008 03:11:50 -0000
@@ -1,3 +1,10 @@
+Wed Dec 17 09:54:00 2008  Arnold D. Robbins  <address@hidden>
+
+       * main.c (do_binary): New variable for new option -b which
+       makes gawk not mess with multibyte strings.
+       (opttab): Add option entry for -b / --binary.
+       (main): If do_binary, set gawk_mb_cur_max to 1.
+
 Sat Oct 27 22:43:50 2007  Arnold D. Robbins  <address@hidden>
 
        * re.c (resetup): Add RE_INVALID_INTERVAL_ORD to syntax bits if
Index: main.c
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/main.c,v
retrieving revision 1.2
diff -u -r1.2 main.c
--- main.c      16 Nov 2008 19:23:56 -0000      1.2
+++ main.c      18 Dec 2008 03:12:48 -0000
@@ -137,6 +137,7 @@
 int do_profiling = FALSE;      /* profile and pretty print the program */
 int do_dump_vars = FALSE;      /* dump all global variables at end */
 int do_tidy_mem = FALSE;       /* release vars when done */
+int do_binary = FALSE;         /* hands off my data! */
 
 int in_begin_rule = FALSE;     /* we're in a BEGIN rule */
 int in_end_rule = FALSE;       /* we're in an END rule */
@@ -193,6 +194,7 @@
        { "help",               no_argument,            NULL,           'u' },
        { "exec",               required_argument,      NULL,           'S' },
        { "use-lc-numeric",     no_argument,            & use_lc_numeric, 1 },
+       { "binary",             no_argument,            & do_binary,     'b' },
 #if defined(YYDEBUG) || defined(GAWKDEBUG)
        { "parsedebug",         no_argument,            NULL,           'D' },
 #endif
@@ -212,7 +214,7 @@
        int c;
        char *scan;
        /* the + on the front tells GNU getopt not to rearrange argv */
-       const char *optlist = "+F:f:v:W;m:rD";
+       const char *optlist = "+F:f:v:W;m:rDb";
        int stopped_early = FALSE;
        int old_optind;
        extern int optind;
@@ -370,6 +372,10 @@
                        do_intervals = TRUE;
                        break;
 
+               case 'b':
+                       do_binary = TRUE;
+                       break;
+
                case 'W':       /* gawk specific options - now in getopt_long */
                        fprintf(stderr, _("%s: option `-W %s' unrecognized, ignored\n"),
                                argv[0], optarg);
@@ -504,6 +510,15 @@
        if (do_lint && os_is_setuid())
                warning(_("running %s setuid root may be a security problem"), myname);
 
+#ifdef MBS_SUPPORT
+       if (do_binary) {
+               if (do_posix)
+                       warning(_("`--posix' overrides `--binary'"));
+               else
+                       gawk_mb_cur_max = 1;    /* hands off my data! */
+       }
+#endif
+
        /*
         * Force profiling if this is pgawk.
         * Don't bother if the command line already set profiling up.
Index: doc/awkcard.in
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/awkcard.in,v
retrieving revision 1.3
diff -u -r1.3 awkcard.in
--- doc/awkcard.in      16 Nov 2008 20:05:45 -0000      1.3
+++ doc/awkcard.in      18 Dec 2008 03:11:02 -0000
@@ -247,6 +247,12 @@
 expand, tab(%);
 ls
 l lw(2.2i).
+\*(FC\-b\*(FR, \*(FC\-\^\-binary\*(FR%T{
+Treat all input data as single-byte characters. I.e.,
+don't attempt to
+process strings as multibyte characters.
+Overridden by \*(FC\-\^\-posix\*(FR.
+T}
 \*(FC\-\^\-compat\*(FR, \*(FC\-\^\-traditional\*(FR
 %T{
 disable \*(GK-specific extensions
Index: doc/gawk.1
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/gawk.1,v
retrieving revision 1.3
diff -u -r1.3 gawk.1
--- doc/gawk.1  16 Nov 2008 20:05:45 -0000      1.3
+++ doc/gawk.1  18 Dec 2008 03:11:05 -0000
@@ -221,6 +221,18 @@
 scripts.
 .TP
 .PD 0
+.B \-b
+.TP
+.PD
+.B \-\^\-binary
+Treat all input data as single-byte characters. In other words,
+don't pay any attention to the locale information when attempting to
+process strings as multibyte characters.
+The
+.B "\-\^\-posix"
+option overrides this one.
+.TP
+.PD 0
 .B "\-W compat"
 .TP
 .PD 0
Index: doc/gawk.texi
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/gawk.texi,v
retrieving revision 1.3
diff -u -r1.3 gawk.texi
--- doc/gawk.texi       16 Nov 2008 20:05:45 -0000      1.3
+++ doc/gawk.texi       18 Dec 2008 03:11:08 -0000
@@ -16549,6 +16549,18 @@
 The following list describes @command{gawk}-specific options:
 
 @table @code
+@item -b
+@itemx --binary
+@cindex @code{-b} option
+@cindex @code{--binary} option
+Causes @command{gawk} to treat all input data as single-byte characters.
+Normally, @command{gawk} follows the POSIX standard and attempts to process
+its input data according to the current locale. This can often involve
+converting multi-byte characters into wide characters (internally), and
+can lead to problems or confusion if the input data does not contain valid
+multi-byte characters. This option is an easy way to tell @command{gawk}:
+``hands off my data!''.
+
 @item -W compat
 @itemx -W traditional
 @itemx --compat



