12-eregexp-2.patch


From:	Akim Demaille
Subject:	12-eregexp-2.patch
Date:	02 Oct 2001 16:16:55 +0200
User-agent:	Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Artificial Intelligence)
Here is my lost proposal for ERE macros, updated to
post-post-m4_token_data changes.

Index: ChangeLog
from  Akim Demaille  <address@hidden>

        * modules/gnu.c (m4_regexp_do, m4_patsubst_do): Extracted from
        previous builtins `regexp' and `patsubst'.
        (regexp, patsubst): Use them.
        (eregexp, epatsubst): New builtins.
        * doc/m4.texinfo (Patsubst, Regexp): Rename and complete as...
        (Epatsubst and Patsubst, Eregexp and Regexp): these.
        (Extensions): More info on REs.
        
Index: NEWS
--- NEWS Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ NEWS Mon, 01 Oct 2001 16:27:32 +0200 akim
@@ -3,6 +3,8 @@
 
 Version beta 1.4q - August 2001, by Gary V. Vaughan
 
+* Support for the experimental `changeword' has been dropped.
+
 * `m4 --hashsize' and `-H' are still accepted, but have no effect.  M4
   will grow its internal symbol table if the symbol density is having
   an effect on performance.
@@ -13,9 +15,9 @@
   development and debugging of the base modules without the need to
   recompile all of m4 with each modification.
 
-* `configure --with-modules="gnu m4 traditional load changeword"', for
-  example, will build an m4 binary with the named modules preloaded, ready
-  to be activated (even on static lib only machines) with the `-m' option
+* `configure --with-modules="gnu m4 traditional load"', for example,
+  will build an m4 binary with the named modules preloaded, ready to
+  be activated (even on static lib only machines) with the `-m' option
   or using the `load' builtin.
 
 * M4 has no builtins or macros in core, they are all loaded from modules
@@ -30,11 +32,15 @@
 * New builtin `unload' to remove loaded modules (and the builtins and user
   macros they define) from the running m4 interpreter.
 
+* New builtins `eregexp' and `epatsubst' to use Extended Regular Expressions
+  syntax in lieu of Basic Regular Expressions as used by `regexp' and
+  `patsubst'.
+
 * The names of all currently loaded modules are returned by the new
   builtin, ``modules''.
- 
+
 * Loadable modules can define new builtin functions or text expansion
-  macros. 
+  macros.
 
 * The module code has been rewritten to use libltdl, the libtool dynamic
   loader, which means GNU m4 can now load (and unload) modules just about
@@ -49,7 +55,7 @@
   directory for examples of usage.
 
 * A new V2 format for frozen files that saves module and syntax information.
- 
+
 Version beta 1.4o - January 2000, by Rene' Seindal
 
 * Modules can be loaded from the command line with --load-module
@@ -138,7 +144,7 @@
   macros, ie, new macros can be written in C.  Depends on the dlopen()
   interface, and is currently only tested on Linux.  Enabled at configure
   time with `--with-modules'.  Documentation is in src/module.c and
-  module/README. 
+  module/README.
 
 * Implement a GNU message catalog for French (Franc,ois Pinard).
 
Index: doc/m4.texinfo
--- doc/m4.texinfo Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ doc/m4.texinfo Mon, 01 Oct 2001 16:26:42 +0200 akim
@@ -9,6 +9,11 @@
 
 @set beta
 
+@c A simple macro for optional variables.
+@macro ovar{varname}
address@hidden@address@hidden
+@end macro
+
 @dircategory Text Processing Tools
 @direntry
 * m4: (m4).                    A powerful macro processor.
@@ -202,10 +207,10 @@ subsequent changes by Fran@,{c}ois Pinar
 
 * Len::                         Calculating length of strings
 * Index::                       Searching for substrings
-* Regexp::                      Searching for regular expressions
+* Eregexp and Regexp::          Searching for regular expressions
 * Substr::                      Extracting substrings
 * Translit::                    Translating characters
-* Patsubst::                    Substituting text by regular expression
+* Epatsubst and Patsubst::      Substituting text by regular expression
 * Format::                      Formatting strings (printf-like)
 
 Macros for doing arithmetic
@@ -578,8 +583,8 @@ @node Manual
 The @samp{Builtin} declaration declares that this macro is implemented
 as an m4 builtin; any parenthesised word immediately following is the
 name of the module that must be loaded.  The standards modules include
-@samp{m4} (which is always available), @samp{gnu} (for gnu specific m4
-extensions) and @samp{traditional} (for compatibility with System V
+@samp{m4} (which is always available), @samp{gnu} (for @sc{gnu} specific
+m4 extensions) and @samp{traditional} (for compatibility with System V
 m4).
 
 @noindent
@@ -2808,10 +2813,10 @@ @node Text handling
 @menu
 * Len::                         Calculating length of strings
 * Index::                       Searching for substrings
-* Regexp::                      Searching for regular expressions
+* Eregexp and Regexp::          Searching for regular expressions
 * Substr::                      Extracting substrings
 * Translit::                    Translating characters
-* Patsubst::                    Substituting text by regular expression
+* Epatsubst and Patsubst::      Substituting text by regular expression
 * Format::                      Formatting strings (printf-like)
 @end menu
 
@@ -2854,29 +2859,29 @@ @node Index
 @result{}-1
 @end example
 
-@node Regexp
+@node Eregexp and Regexp
 @section Searching for regular expressions
 
 @cindex regular expressions
 @cindex GNU extensions
-@deffn {Builtin (gnu)} regexp (@var{string}, @var{regexp}, @w{opt @var{replacement})}
+@deffn {Builtin (gnu)} eregexp (@var{string}, @var{regexp}, @w{opt @var{replacement})}
 Searching for regular expressions is done with the builtin
 @code{regexp}, which searches for @var{regexp} in @var{string}.  The
-syntax for regular expressions is the same as in GNU Emacs.
-@xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
-Manual}.
+syntax for regular expressions is similar to that of GNU Awk, Perl, and
+Egrep: the so-called ``Extended Regular Expressions''.
 
 If @var{replacement} is omitted, @code{regexp} expands to the index of
-the first match of @var{regexp} in @var{string}.  If @var{regexp} does
-not match anywhere in @var{string}, it expands to -1.
+the first match of @var{regexp} in @var{string}.  If @var{replacement}
+is specified and @var{regexp} matches, it expands to @var{replacement}.  If
+@var{regexp} does not match anywhere in @var{string}, it expands to -1.
 
 The builtin macro @code{regexp} is recognized only when given arguments.
 @end deffn
 
 @example
-regexp(`GNUs not Unix', `\<[a-z]\w+')
+eregexp(`GNUs not Unix', `\<[a-z]\w+')
 @result{}5
-regexp(`GNUs not Unix', `\<Q\w*')
+eregexp(`GNUs not Unix', `\<Q\w*')
 @result{}-1
 @end example
 
@@ -2886,6 +2891,23 @@ @node Regexp
 @samp{\&} being the text the entire regular expression matched.
 
 @example
+eregexp(`GNUs not Unix', `\w(\w+)$', `*** \& *** \1 ***')
+@result{}*** Unix *** nix ***
+@end example
+
+Original regular expressions were much less powerful (basically only
address@hidden was available), and as new operators were implemented, to keep
+backward compatibility, they were mapped onto invalid sequences, such as
address@hidden(}.  The following macro is the exact peer of @code{eregexp}, but
+using this old and clumsy syntax.
+
+@deffn {Builtin (gnu)} regexp (@var{string}, @var{regexp}, @w{opt @var{replacement})}
+Same as @code{eregexp}, but using the old and clumsy ``Basic Regular
+Expression'' syntax, the same as in GNU Emacs.  @xref{Regexps, , Syntax
+of Regular Expressions, emacs, The GNU Emacs Manual}.
+@end deffn
+
+@example
 regexp(`GNUs not Unix', `\w\(\w+\)$', `*** \& *** \1 ***')
 @result{}*** Unix *** nix ***
 @end example
@@ -2953,7 +2975,7 @@ @node Translit
 while converting them to lowercase.  The two first cases are by far the
 most common.
 
-@node Patsubst
+@node Epatsubst and Patsubst
 @section Substituting text by regular expression
 
 @cindex regular expressions
@@ -2963,8 +2985,8 @@ @node Patsubst
 @deffn {Builtin (gnu)} patsubst (@var{string}, @var{regexp}, @w{opt @var{replacement})}
 Global substitution in a string is done by @code{patsubst}, which
 searches @var{string} for matches of @var{regexp}, and substitutes
-@var{replacement} for each match.  The syntax for regular expressions is
-the same as in GNU Emacs.
+@var{replacement} for each match.  It uses Extended Regular Expressions
+syntax.
 
 The parts of @var{string} that are not covered by any match of
 @var{regexp} are copied to the expansion.  Whenever a match is found, the
@@ -2981,6 +3003,66 @@ @node Patsubst
 The @var{replacement} argument can be omitted, in which case the text
 matched by @var{regexp} is deleted.
 
+The builtin macro @code{patsubst} is recognized only when given
+arguments.
+@end deffn
+
+When used with two arguments, while @code{eregexp} returns the position
+of the match, @code{epatsubst} deletes it:
+
+@example
+epatsubst(`GNUs not Unix', `^', `OBS: ')
+@result{}OBS: GNUs not Unix
+epatsubst(`GNUs not Unix', `\<', `OBS: ')
+@result{}OBS: GNUs OBS: not OBS: Unix
+epatsubst(`GNUs not Unix', `\w*', `(\&)')
+@result{}(GNUs)() (not)() (Unix)
+epatsubst(`GNUs not Unix', `\w+', `(\&)')
+@result{}(GNUs) (not) (Unix)
+epatsubst(`GNUs not Unix', `[A-Z][a-z]+')
+@result{}GN not @comment
+@end example
+
+Here is a slightly more realistic example, which capitalizes individual
+words or whole sentences, by substituting calls of the macros
+@code{upcase} and @code{downcase} into the strings.
+
+@example
+define(`upcase',   `translit(`$*', `a-z', `A-Z')')dnl
+define(`downcase', `translit(`$*', `A-Z', `a-z')')dnl
+define(`capitalize1',
+       `eregexp(`$1', `^(\w)(\w*)', `upcase(`\1')`'downcase(`\2')')')dnl
+define(`capitalize',
+       `epatsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
+capitalize(`GNUs not Unix')
+@result{}Gnus Not Unix
+@end example
+
+While @code{eregexp} replaces the whole input with the replacement as
+soon as there is a match, @code{epatsubst} replaces each
address@hidden of a match and preserves non matching pieces:
+
+@example
+define(`patreg',
+`epatsubst($@@)
+eregexp($@@)')dnl
+patreg(`bar foo baz Foo', `foo|Foo', `FOO')
+@result{}bar FOO baz FOO
+@result{}FOO
+patreg(`aba abb 121', `(.)(.)\1', `\2\1\2')
+@result{}bab abb 212
+@result{}bab
+@end example
+
+
+@deffn {Builtin (gnu)} patsubst (@var{string}, @var{regexp}, @w{opt @var{replacement})}
+Same as @code{epatsubst}, but using Basic Regular Expression syntax, see
+@ref{Eregexp and Regexp}, for more details.
+@end deffn
+
+@c No longer interesting for the documentation per se, but good
+@c for testing.
+@ignore
 @example
 patsubst(`GNUs not Unix', `^', `OBS: ')
 @result{}OBS: GNUs not Unix
@@ -2994,24 +3076,17 @@ @node Patsubst
 @result{}GN not @comment
 @end example
 
-The builtin macro @code{patsubst} is recognized only when given
-arguments.
-@end deffn
-
-Here is a slightly more realistic example, which capitalizes individual
-word or whole sentences, by substituting calls of the macros
-@code{upcase} and @code{downcase} into the strings.
-
 @example
-define(`upcase', `translit(`$*', `a-z', `A-Z')')dnl
+define(`upcase',   `translit(`$*', `a-z', `A-Z')')dnl
 define(`downcase', `translit(`$*', `A-Z', `a-z')')dnl
 define(`capitalize1',
-     `regexp(`$1', `^\(\w\)\(\w*\)', `upcase(`\1')`'downcase(`\2')')')dnl
+       `regexp(`$1', `^\(\w\)\(\w*\)', `upcase(`\1')`'downcase(`\2')')')dnl
 define(`capitalize',
-     `patsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
+       `patsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
 capitalize(`GNUs not Unix')
 @result{}Gnus Not Unix
 @end example
+@end ignore
 
 @node Format
 @section Formatted output
@@ -3684,9 +3759,17 @@ @node Extensions
 is modeled after the C library function @code{printf} (@pxref{Format}).
 
 @item
-Searches and text substitution through regular expressions are
-supported by the @code{regexp} (@pxref{Regexp}) and @code{patsubst}
-(@pxref{Patsubst}) builtins.
+Searches and text substitution through regular expressions are supported
+by the @code{eregexp}, @code{regexp} (@pxref{Eregexp and Regexp}) and
+@code{epatsubst}, @code{patsubst} (@pxref{Epatsubst and Patsubst})
+builtins.
+
+@item
+The syntax for regular expressions has never been clearly formalized for M4.
+While OpenBSD M4 uses extended regular expressions for @code{regexp}
+and @code{patsubst}, @sc{gnu} M4 uses basic regular expressions.  Use
+@code{eregexp} (@pxref{Eregexp and Regexp}) and @code{epatsubst}
+(@pxref{Epatsubst and Patsubst}) for extended regular expressions.
 
 @item
 The output of shell commands can be read into @code{m4} with
Index: modules/gnu.c
--- modules/gnu.c Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ modules/gnu.c Mon, 01 Oct 2001 16:26:42 +0200 akim
@@ -45,7 +45,9 @@
 #define RE_SYNTAX_ERE \
   (/* Allow char classes. */                                   \
     RE_CHAR_CLASSES                                            \
-  /* Be picky. */                                              \
+  /* Anchors are OK in groups. */                              \
+  | RE_CONTEXT_INDEP_ANCHORS                                   \
+  /* Be picky, `/^?/', for instance, makes no sense. */                \
   | RE_CONTEXT_INVALID_OPS                                     \
   /* Allow intervals with `{' and `}', forbid invalid ranges. */\
   | RE_INTERVALS | RE_NO_BK_BRACES | RE_NO_EMPTY_RANGES                \
@@ -73,6 +75,8 @@
        BUILTIN(changesyntax,   FALSE,  TRUE  ) \
        BUILTIN(debugmode,      FALSE,  FALSE ) \
        BUILTIN(debugfile,      FALSE,  FALSE ) \
+       BUILTIN(eregexp,        FALSE,  TRUE  ) \
+       BUILTIN(epatsubst,      FALSE,  TRUE  ) \
        BUILTIN(esyscmd,        FALSE,  TRUE  ) \
        BUILTIN(format,         FALSE,  TRUE  ) \
        BUILTIN(indir,          FALSE,  TRUE  ) \
@@ -304,11 +308,13 @@
 /**
  * regexp(STRING, REGEXP, [REPLACEMENT])
  **/
-M4BUILTIN_HANDLER (regexp)
+
+static void
+m4_regexp_do (struct obstack *obs, int argc, m4_symbol **argv,
+             int syntax)
 {
   const char *victim;          /* first argument */
   const char *regexp;          /* regular expression */
-  const char *repl;            /* replacement string */
 
   struct re_pattern_buffer *buf;/* compiled regular expression */
   struct re_registers regs;    /* for subexpression matches */
@@ -321,7 +327,7 @@
   victim = M4ARG (1);
   regexp = M4ARG (2);
 
-  buf = m4_regexp_compile (M4ARG(0), regexp, RE_SYNTAX_BRE);
+  buf = m4_regexp_compile (M4ARG(0), regexp, syntax);
   if (!buf)
     return;
 
@@ -331,21 +337,38 @@
   if (startpos  == -2)
     {
       M4ERROR ((warning_status, 0,
-               _("Error matching regular expression `%s'"), regexp));
+               _("%s: error matching regular expression `%s'"),
+               M4ARG (0), regexp));
       return;
     }
 
   if (argc == 3)
     m4_shipout_int (obs, startpos);
   else if (startpos >= 0)
-    {
-      repl = M4ARG (3);
-      substitute (obs, victim, repl, &regs);
-    }
+    substitute (obs, victim, M4ARG (3), &regs);
 
   return;
 }
 
+
+/**
+ * regexp(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (regexp)
+{
+  m4_regexp_do (obs, argc, argv, RE_SYNTAX_BRE);
+}
+
+/**
+ * eregexp(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (eregexp)
+{
+  m4_regexp_do (obs, argc, argv, RE_SYNTAX_ERE);
+}
+
+
+
 /* Substitute all matches of a regexp occuring in a string.  Each match of
    the second argument (a regexp) in the first argument is changed to the
    third argument, with \& substituted by the matched text, and \N
@@ -354,7 +377,8 @@
 /**
  * patsubst(STRING, REGEXP, [REPLACEMENT])
  **/
-M4BUILTIN_HANDLER (patsubst)
+static void m4_patsubst_do (struct obstack *obs, int argc,
+                            m4_symbol **argv, int syntax)
 {
   const char *victim;          /* first argument */
   const char *regexp;          /* regular expression */
@@ -372,7 +396,7 @@
   victim = M4ARG (1);
   length = strlen (victim);
 
-  buf = m4_regexp_compile (M4ARG(0), regexp, RE_SYNTAX_BRE);
+  buf = m4_regexp_compile (M4ARG(0), regexp, syntax);
   if (!buf)
     return;
 
@@ -418,6 +442,22 @@
   obstack_1grow (obs, '\0');
 
   return;
+}
+
+/**
+ * patsubst(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (patsubst)
+{
+  m4_patsubst_do (obs, argc, argv, RE_SYNTAX_BRE);
+}
+
+/**
+ * epatsubst(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (epatsubst)
+{
+  m4_patsubst_do (obs, argc, argv, RE_SYNTAX_ERE);
 }
 
 /* Implementation of "symbols" itself.  It builds up a table of pointers to
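
For readers who want to see the BRE/ERE difference outside of m4, here
is a minimal standalone sketch.  It is not part of the patch: it uses
the POSIX regcomp/regexec interface rather than the GNU re_* functions
that modules/gnu.c calls, and the names match_index, bre_index and
ere_index are made up for illustration.  It only mirrors the shape of
the change, one worker taking the syntax as a parameter plus two thin
wrappers, and the two-argument behaviour of `regexp' (index of the
first match, or -1).

```c
#include <regex.h>

/* Hypothetical helper, not from the patch: return the offset of the
   first match of PATTERN in STRING, -1 if nothing matches, or -2 if
   PATTERN does not compile.  CFLAGS selects the syntax, playing the
   role of the `syntax' argument of m4_regexp_do.  */
static int
match_index (const char *string, const char *pattern, int cflags)
{
  regex_t re;
  regmatch_t m[1];
  int res;

  if (regcomp (&re, pattern, cflags) != 0)
    return -2;
  res = regexec (&re, string, 1, m, 0) == 0 ? (int) m[0].rm_so : -1;
  regfree (&re);
  return res;
}

/* Basic Regular Expressions: groups are written \( and \).  */
static int
bre_index (const char *string, const char *pattern)
{
  return match_index (string, pattern, 0);
}

/* Extended Regular Expressions: groups are written ( and ).  */
static int
ere_index (const char *string, const char *pattern)
{
  return match_index (string, pattern, REG_EXTENDED);
}
```

With the string "GNUs not Unix", bre_index with the pattern
\([A-Za-z]\)\([A-Za-z]*\)$ and ere_index with ([A-Za-z])([A-Za-z]*)$
should both return 9, the offset of "Unix".  Passing the ERE spelling
to bre_index should return -1, because an unescaped parenthesis is a
literal character in BRE.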