Re: gawk: Wrong behavior in binary mode
From: Aharon Robbins
Subject: Re: gawk: Wrong behavior in binary mode
Date: Thu, 18 Dec 2008 05:15:30 +0200
Hi All.
Let's be clear here as to what the issues are and what our goals are.
1. In a multibyte locale (e.g., en_US.utf8), when handed bytes that do
not make a valid multibyte string, gawk produces incorrect results for
length() and index(). This is a bug. The patch I sent out earlier
fixes this bug. I am pretty sure I pushed it out to the Savannah CVS
gawk-stable tree but if I didn't, I will shortly.
This fixes the original problem as reported by address@hidden
and fulfills my "obligations" as gawk maintainer. (:-)
2. Eli, not unreasonably, claims that there are times when a user really
wants "hands off my data" and that gawk (and other utilities) should
provide a mechanism for this.
However, there is no formal "requirement" for such a feature, in that
there is no standard that states this, nor is there existing practice
for such a feature in other versions of awk. So, we have moved into
the realm of "nice to have".
3. I initially answered that LC_ALL=C meets this requirement. This is
- Immediately available
- Standard
- Portable
Technically, I could stop here.
4. But, as Eli correctly points out, this is a big hammer, with other
undesirable effects if all a user wants is "hands off my data".
Which leads us to the patch below, which provides minimal changes for
a new option that should do the trick.
(See what a nice guy I am? :-)
This will make its way to the development CVS shortly.
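
To make the idea concrete, here is a minimal sketch in C of the two pieces
discussed above: counting characters without being derailed by invalid bytes
(point 1), and forcing one byte per character when the new option is given
(point 4). This is not gawk's actual code. The function names
effective_mb_cur_max() and char_length() are hypothetical, invented for
illustration; only the precedence of --posix over --binary is taken from the
patch below. mbrtowc() is the standard C library call.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not gawk's code: pick the character width to use.
 * Mirrors the precedence in the patch, where --posix overrides --binary.
 */
size_t effective_mb_cur_max(size_t locale_max, int do_binary, int do_posix)
{
    if (do_binary && !do_posix)
        return 1;           /* hands off my data: one byte, one character */
    return locale_max;
}

/*
 * Hypothetical helper: count characters in the first nbytes of s.
 * With mb_cur_max <= 1 every byte is a character. Otherwise decode with
 * mbrtowc() under the current locale, counting any invalid or truncated
 * byte as one character instead of giving up.
 */
size_t char_length(const char *s, size_t nbytes, size_t mb_cur_max)
{
    mbstate_t state;
    size_t count = 0;
    size_t i = 0;

    if (mb_cur_max <= 1)
        return nbytes;

    memset(&state, 0, sizeof state);
    while (i < nbytes) {
        size_t n = mbrtowc(NULL, s + i, nbytes - i, &state);

        if (n == (size_t) -1 || n == (size_t) -2 || n == 0) {
            /* invalid, incomplete, or NUL: consume exactly one byte */
            memset(&state, 0, sizeof state);
            n = 1;
        }
        i += n;
        count++;
    }
    return count;
}
```

A caller would pass MB_CUR_MAX (after setlocale()) as locale_max, then hand
the result of effective_mb_cur_max() to char_length(). With the binary flag
set, the two bytes of a UTF-8 encoded character count as two characters,
which is exactly the "hands off my data" behavior.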
Enjoy,
Arnold
--------------------------------------------------
Index: ChangeLog
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/ChangeLog,v
retrieving revision 1.3
diff -u -r1.3 ChangeLog
--- ChangeLog 16 Nov 2008 20:05:33 -0000 1.3
+++ ChangeLog 18 Dec 2008 03:11:50 -0000
@@ -1,3 +1,10 @@
+Wed Dec 17 09:54:00 2008 Arnold D. Robbins <address@hidden>
+
+ * main.c (do_binary): New variable for new option -b which
+ makes gawk not mess with multibyte strings.
+ (opttab): Add option entry for -b / --binary.
+ (main): If do_binary, set gawk_mb_cur_max to 1.
+
Sat Oct 27 22:43:50 2007 Arnold D. Robbins <address@hidden>
* re.c (resetup): Add RE_INVALID_INTERVAL_ORD to syntax bits if
Index: main.c
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/main.c,v
retrieving revision 1.2
diff -u -r1.2 main.c
--- main.c 16 Nov 2008 19:23:56 -0000 1.2
+++ main.c 18 Dec 2008 03:12:48 -0000
@@ -137,6 +137,7 @@
int do_profiling = FALSE; /* profile and pretty print the program */
int do_dump_vars = FALSE; /* dump all global variables at end */
int do_tidy_mem = FALSE; /* release vars when done */
+int do_binary = FALSE; /* hands off my data! */
int in_begin_rule = FALSE; /* we're in a BEGIN rule */
int in_end_rule = FALSE; /* we're in an END rule */
@@ -193,6 +194,7 @@
{ "help", no_argument, NULL, 'u' },
{ "exec", required_argument, NULL, 'S' },
{ "use-lc-numeric", no_argument, & use_lc_numeric, 1 },
+ { "binary", no_argument, & do_binary, 'b' },
#if defined(YYDEBUG) || defined(GAWKDEBUG)
{ "parsedebug", no_argument, NULL, 'D' },
#endif
@@ -212,7 +214,7 @@
int c;
char *scan;
/* the + on the front tells GNU getopt not to rearrange argv */
- const char *optlist = "+F:f:v:W;m:rD";
+ const char *optlist = "+F:f:v:W;m:rDb";
int stopped_early = FALSE;
int old_optind;
extern int optind;
@@ -370,6 +372,10 @@
do_intervals = TRUE;
break;
+ case 'b':
+ do_binary = TRUE;
+ break;
+
case 'W': /* gawk specific options - now in getopt_long */
fprintf(stderr, _("%s: option `-W %s' unrecognized, ignored\n"),
argv[0], optarg);
@@ -504,6 +510,15 @@
if (do_lint && os_is_setuid())
warning(_("running %s setuid root may be a security problem"),
myname);
+#ifdef MBS_SUPPORT
+ if (do_binary) {
+ if (do_posix)
+ warning(_("`--posix' overrides `--binary'"));
+ else
+ gawk_mb_cur_max = 1; /* hands off my data! */
+ }
+#endif
+
/*
* Force profiling if this is pgawk.
* Don't bother if the command line already set profiling up.
Index: doc/awkcard.in
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/awkcard.in,v
retrieving revision 1.3
diff -u -r1.3 awkcard.in
--- doc/awkcard.in 16 Nov 2008 20:05:45 -0000 1.3
+++ doc/awkcard.in 18 Dec 2008 03:11:02 -0000
@@ -247,6 +247,12 @@
expand, tab(%);
ls
l lw(2.2i).
+\*(FC\-b\*(FR, \*(FC\-\^\-binary\*(FR%T{
+Treat all input data as single-byte characters. I.e.,
+don't attempt to
+process strings as multibyte characters.
+Overridden by \*(FC\-\^\-posix\*(FR.
+T}
\*(FC\-\^\-compat\*(FR, \*(FC\-\^\-traditional\*(FR
%T{
disable \*(GK-specific extensions
Index: doc/gawk.1
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/gawk.1,v
retrieving revision 1.3
diff -u -r1.3 gawk.1
--- doc/gawk.1 16 Nov 2008 20:05:45 -0000 1.3
+++ doc/gawk.1 18 Dec 2008 03:11:05 -0000
@@ -221,6 +221,18 @@
scripts.
.TP
.PD 0
+.B \-b
+.TP
+.PD
+.B \-\^\-binary
+Treat all input data as single-byte characters. In other words,
+don't pay any attention to the locale information when attempting to
+process strings as multibyte characters.
+The
+.B "\-\^\-posix"
+option overrides this one.
+.TP
+.PD 0
.B "\-W compat"
.TP
.PD 0
Index: doc/gawk.texi
===================================================================
RCS file: /d/mongo/cvsrep/gawk-devel/doc/gawk.texi,v
retrieving revision 1.3
diff -u -r1.3 gawk.texi
--- doc/gawk.texi 16 Nov 2008 20:05:45 -0000 1.3
+++ doc/gawk.texi 18 Dec 2008 03:11:08 -0000
@@ -16549,6 +16549,18 @@
The following list describes @command{gawk}-specific options:
@table @code
+@item -b
+@itemx --binary
+@cindex @code{-b} option
+@cindex @code{--binary} option
+Causes @command{gawk} to treat all input data as single-byte characters.
+Normally, @command{gawk} follows the POSIX standard and attempts to process
+its input data according to the current locale. This can often involve
+converting multi-byte characters into wide characters (internally), and
+can lead to problems or confusion if the input data does not contain valid
+multi-byte characters. This option is an easy way to tell @command{gawk}:
+``hands off my data!''.
+
@item -W compat
@itemx -W traditional
@itemx --compat