12-eregexp-2.patch
From: Akim Demaille
Subject: 12-eregexp-2.patch
Date: 02 Oct 2001 16:16:55 +0200
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Artificial Intelligence)
Here is my lost proposal for ERE macros, updated to
post-post-m4_token_data changes.
Index: ChangeLog
from Akim Demaille <address@hidden>
* modules/gnu.c (m4_regexp_do, m4_patsubst_do): Extracted from
previous builtins `regexp' and `patsubst'.
(regexp, patsubst): Use them.
(eregexp, epatsubst): New builtins.
* doc/m4.texinfo (Patsubst, Regexp): Rename and complete as...
(Epatsubst and Patsubst, Eregexp and Regexp): these.
(Extensions): More info on REs.
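
For readers who have not seen the two syntaxes side by side, here is a
minimal C sketch of the difference the new builtins expose. It is not
part of the patch: it assumes the POSIX regcomp/regexec interface rather
than the GNU re_search API that modules/gnu.c uses, and the helper names
group_count and first_match are made up for illustration. Passing
REG_EXTENDED selects ERE, passing 0 selects BRE. POSIX has no `\w', so
`[A-Za-z]' stands in for it.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of groups PATTERN defines under the
   syntax selected by CFLAGS, or -1 if it does not compile.  */
int group_count(const char *pattern, int cflags)
{
    regex_t re;
    int n;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    n = (int) re.re_nsub;
    regfree(&re);
    return n;
}

/* Hypothetical helper: index of the first match of PATTERN in TEXT,
   -1 when nothing matches (as two-argument regexp/eregexp do),
   -2 when PATTERN does not compile.  */
int first_match(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    regmatch_t m;
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -2;
    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    return rc == 0 ? (int) m.rm_so : -1;
}
```

With these, group_count ("(x)", REG_EXTENDED) reports one group, while
group_count ("(x)", 0) reports none because BRE treats bare parentheses
as literal characters; the BRE spelling of the same group is `\(x\)'.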
Index: NEWS
--- NEWS Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ NEWS Mon, 01 Oct 2001 16:27:32 +0200 akim
@@ -3,6 +3,8 @@
Version beta 1.4q - August 2001, by Gary V. Vaughan
+* Support for the experimental `changeword' has been dropped.
+
* `m4 --hashsize' and `-H' are still accepted, but have no effect. M4
will grow its internal symbol table if the symbol density is having
an effect on performance.
@@ -13,9 +15,9 @@
development and debugging of the base modules without the need to
recompile all of m4 with each modification.
-* `configure --with-modules="gnu m4 traditional load changeword"', for
- example, will build an m4 binary with the named modules preloaded, ready
- to be activated (even on static lib only machines) with the `-m' option
+* `configure --with-modules="gnu m4 traditional load"', for example,
+ will build an m4 binary with the named modules preloaded, ready to
+ be activated (even on static lib only machines) with the `-m' option
or using the `load' builtin.
* M4 has no builtins or macros in core, they are all loaded from modules
@@ -30,11 +32,15 @@
* New builtin `unload' to remove loaded modules (and the builtins and user
macros they define) from the running m4 interpreter.
+* New builtins `eregexp' and `epatsubst' to use Extended Regular Expressions
+ syntax in lieu of Basic Regular Expressions as used by `regexp' and
+ `patsubst'.
+
* The names of all currently loaded modules are returned by the new
builtin, ``modules''.
-
+
* Loadable modules can define new builtin functions or text expansion
- macros.
+ macros.
* The module code has been rewritten to use libltdl, the libtool dynamic
loader, which means GNU m4 can now load (and unload) modules just about
@@ -49,7 +55,7 @@
directory for examples of usage.
* A new V2 format for frozen files that saves module and syntax information.
-
+
Version beta 1.4o - January 2000, by Rene' Seindal
* Modules can be loaded from the command line with --load-module
@@ -138,7 +144,7 @@
macros, ie, new macros can be written in C. Depends on the dlopen()
interface, and is currently only tested on Linux. Enabled at configure
time with `--with-modules'. Documentation is in src/module.c and
- module/README.
+ module/README.
* Implement a GNU message catalog for French (Franc,ois Pinard).
Index: doc/m4.texinfo
--- doc/m4.texinfo Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ doc/m4.texinfo Mon, 01 Oct 2001 16:26:42 +0200 akim
@@ -9,6 +9,11 @@
@set beta
address@hidden A simple macro for optional variables.
address@hidden ovar{varname}
address@hidden@address@hidden
address@hidden macro
+
@dircategory Text Processing Tools
@direntry
* m4: (m4). A powerful macro processor.
@@ -202,10 +207,10 @@ subsequent changes by Fran@,{c}ois Pinar
* Len:: Calculating length of strings
* Index:: Searching for substrings
-* Regexp:: Searching for regular expressions
+* Eregexp and Regexp:: Searching for regular expressions
* Substr:: Extracting substrings
* Translit:: Translating characters
-* Patsubst:: Substituting text by regular expression
+* Epatsubst and Patsubst:: Substituting text by regular expression
* Format:: Formatting strings (printf-like)
Macros for doing arithmetic
@@ -578,8 +583,8 @@ @node Manual
The @samp{Builtin} declaration declares that this macro is implemented
as an m4 builtin; any parenthesised word immediately following is the
name of the module that must be loaded. The standards modules include
address@hidden (which is always available), @samp{gnu} (for gnu specific m4
-extensions) and @samp{traditional} (for compatibility with System V
address@hidden (which is always available), @samp{gnu} (for @sc{gnu} specific
+m4 extensions) and @samp{traditional} (for compatibility with System V
m4).
@noindent
@@ -2808,10 +2813,10 @@ @node Text handling
@menu
* Len:: Calculating length of strings
* Index:: Searching for substrings
-* Regexp:: Searching for regular expressions
+* Eregexp and Regexp:: Searching for regular expressions
* Substr:: Extracting substrings
* Translit:: Translating characters
-* Patsubst:: Substituting text by regular expression
+* Epatsubst and Patsubst:: Substituting text by regular expression
* Format:: Formatting strings (printf-like)
@end menu
@@ -2854,29 +2859,29 @@ @node Index
@result{}-1
@end example
address@hidden Regexp
address@hidden Eregexp and Regexp
@section Searching for regular expressions
@cindex regular expressions
@cindex GNU extensions
address@hidden {Builtin (gnu)} regexp (@var{string}, @var{regexp}, @w{opt
@var{replacement})}
address@hidden {Builtin (gnu)} eregexp (@var{string}, @var{regexp}, @w{opt
@var{replacement})}
Searching for regular expressions is done with the builtin
@code{regexp}, which searches for @var{regexp} in @var{string}. The
-syntax for regular expressions is the same as in GNU Emacs.
address@hidden, , Syntax of Regular Expressions, emacs, The GNU Emacs
-Manual}.
+syntax for regular expressions is similar to that of GNU Awk, Perl, and
+Egrep: the so-called ``Extended Regular Expressions''.
If @var{replacement} is omitted, @code{regexp} expands to the index of
-the first match of @var{regexp} in @var{string}. If @var{regexp} does
-not match anywhere in @var{string}, it expands to -1.
+the first match of @var{regexp} in @var{string}. If @var{replacement}
+is specified and @var{regexp} matches, then it expands into @var{replacement}. If
address@hidden does not match anywhere in @var{string}, it expands to -1.
The builtin macro @code{regexp} is recognized only when given arguments.
@end deffn
@example
-regexp(`GNUs not Unix', `\<[a-z]\w+')
+eregexp(`GNUs not Unix', `\<[a-z]\w+')
@result{}5
-regexp(`GNUs not Unix', `\<Q\w*')
+eregexp(`GNUs not Unix', `\<Q\w*')
@result{}-1
@end example
@@ -2886,6 +2891,23 @@ @node Regexp
@samp{\&} being the text the entire regular expression matched.
@example
+eregexp(`GNUs not Unix', `\w(\w+)$', `*** \& *** \1 ***')
address@hidden Unix *** nix ***
address@hidden example
+
+Original regular expressions were much less powerful (basically only
address@hidden was available), and as new operators were implemented, to keep
+backward compatibility, they were mapped onto invalid sequences, such as
address@hidden(}. The following macro is the exact peer of @code{eregexp}, but
+using this old and clumsy syntax.
+
address@hidden {Builtin (gnu)} regexp (@var{string}, @var{regexp}, @w{opt
@var{replacement})}
+Same as @code{eregexp}, but using the old and clumsy ``Basic Regular
+Expression'' syntax, the same as in GNU Emacs. @xref{Regexps, , Syntax
+of Regular Expressions, emacs, The GNU Emacs Manual}.
address@hidden deffn
+
address@hidden
regexp(`GNUs not Unix', `\w\(\w+\)$', `*** \& *** \1 ***')
@result{}*** Unix *** nix ***
@end example
@@ -2953,7 +2975,7 @@ @node Translit
while converting them to lowercase. The two first cases are by far the
most common.
address@hidden Patsubst
address@hidden Epatsubst and Patsubst
@section Substituting text by regular expression
@cindex regular expressions
@@ -2963,8 +2985,8 @@ @node Patsubst
@deffn {Builtin (gnu)} patsubst (@var{string}, @var{regexp}, @w{opt
@var{replacement})}
Global substitution in a string is done by @code{patsubst}, which
searches @var{string} for matches of @var{regexp}, and substitutes
address@hidden for each match. The syntax for regular expressions is
-the same as in GNU Emacs.
address@hidden for each match. It uses Extended Regular Expressions
+syntax.
The parts of @var{string} that are not covered by any match of
@var{regexp} are copied to the expansion. Whenever a match is found, the
@@ -2981,6 +3003,66 @@ @node Patsubst
The @var{replacement} argument can be omitted, in which case the text
matched by @var{regexp} is deleted.
+The builtin macro @code{patsubst} is recognized only when given
+arguments.
address@hidden deffn
+
+When used with two arguments, while @code{eregexp} returns the position
+of the match, @code{epatsubst} deletes it:
+
address@hidden
+epatsubst(`GNUs not Unix', `^', `OBS: ')
address@hidden: GNUs not Unix
+epatsubst(`GNUs not Unix', `\<', `OBS: ')
address@hidden: GNUs OBS: not OBS: Unix
+epatsubst(`GNUs not Unix', `\w*', `(\&)')
address@hidden(GNUs)() (not)() (Unix)
+epatsubst(`GNUs not Unix', `\w+', `(\&)')
address@hidden(GNUs) (not) (Unix)
+epatsubst(`GNUs not Unix', `[A-Z][a-z]+')
address@hidden not @comment
address@hidden example
+
+Here is a slightly more realistic example, which capitalizes individual
+words or whole sentences, by substituting calls of the macros
address@hidden and @code{downcase} into the strings.
+
address@hidden
+define(`upcase', `translit(`$*', `a-z', `A-Z')')dnl
+define(`downcase', `translit(`$*', `A-Z', `a-z')')dnl
+define(`capitalize1',
+ `eregexp(`$1', `^(\w)(\w*)', `upcase(`\1')`'downcase(`\2')')')dnl
+define(`capitalize',
+ `epatsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
+capitalize(`GNUs not Unix')
address@hidden Not Unix
address@hidden example
+
+While @code{eregexp} replaces the whole input with the replacement as
+soon as there is a match, @code{epatsubst} replaces each
address@hidden of a match and preserves non matching pieces:
+
address@hidden
+define(`patreg',
+`epatsubst($@@)
+eregexp($@@)')dnl
+patreg(`bar foo baz Foo', `foo|Foo', `FOO')
address@hidden FOO baz FOO
address@hidden
+patreg(`aba abb 121', `(.)(.)\1', `\2\1\2')
address@hidden abb 212
address@hidden
address@hidden example
+
+
address@hidden {Builtin (gnu)} patsubst (@var{string}, @var{regexp}, @w{opt
@var{replacement})}
+Same as @code{epatsubst}, but using Basic Regular Expression syntax, see
address@hidden and Regexp}, for more details.
address@hidden deffn
+
address@hidden No longer interesting for the documentation per se, but good
address@hidden for testing.
address@hidden
@example
patsubst(`GNUs not Unix', `^', `OBS: ')
@result{}OBS: GNUs not Unix
@@ -2994,24 +3076,17 @@ @node Patsubst
@result{}GN not @comment
@end example
-The builtin macro @code{patsubst} is recognized only when given
-arguments.
address@hidden deffn
-
-Here is a slightly more realistic example, which capitalizes individual
-word or whole sentences, by substituting calls of the macros
address@hidden and @code{downcase} into the strings.
-
@example
-define(`upcase', `translit(`$*', `a-z', `A-Z')')dnl
+define(`upcase', `translit(`$*', `a-z', `A-Z')')dnl
define(`downcase', `translit(`$*', `A-Z', `a-z')')dnl
define(`capitalize1',
- `regexp(`$1', `^\(\w\)\(\w*\)', `upcase(`\1')`'downcase(`\2')')')dnl
+ `regexp(`$1', `^\(\w\)\(\w*\)', `upcase(`\1')`'downcase(`\2')')')dnl
define(`capitalize',
- `patsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
+ `patsubst(`$1', `\w+', `capitalize1(`\&')')')dnl
capitalize(`GNUs not Unix')
@result{}Gnus Not Unix
@end example
address@hidden ignore
@node Format
@section Formatted output
@@ -3684,9 +3759,17 @@ @node Extensions
is modeled after the C library function @code{printf} (@pxref{Format}).
@item
-Searches and text substitution through regular expressions are
-supported by the @code{regexp} (@pxref{Regexp}) and @code{patsubst}
-(@pxref{Patsubst}) builtins.
+Searches and text substitution through regular expressions are supported
+by the @code{eregexp}, @code{regexp} (@pxref{Eregexp and Regexp}) and
address@hidden, @code{patsubst} (@pxref{Epatsubst and Patsubst})
+builtins.
+
address@hidden
+The syntax for regular expressions has never been clearly formalized for M4.
+While OpenBSD M4 uses extended regular expressions for @code{regexp}
+and @code{patsubst}, @sc{gnu} M4 uses basic regular expressions. Use
address@hidden (@pxref{Eregexp and Regexp}) and @code{epatsubst}
+(@pxref{Epatsubst and Patsubst}) for extended regular expressions.
@item
The output of shell commands can be read into @code{m4} with
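
The documentation above contrasts eregexp, which replaces the whole
input at the first match, with epatsubst, which rewrites every match and
keeps the text in between. The loop behind the second behaviour can be
sketched as follows. This is only an illustration under stated
assumptions: it uses POSIX regexec instead of the re_search call in
modules/gnu.c, the names put and ere_replace_all are hypothetical, and
the replacement is copied literally (no `\&' or `\N' expansion).

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Append N bytes of SRC to OUT; returns nonzero if they do not fit.  */
static int put(char *out, size_t cap, size_t *len, const char *src, size_t n)
{
    if (*len + n + 1 > cap)
        return 1;
    memcpy(out + *len, src, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Hypothetical helper: replace every ERE match of PATTERN in TEXT by
   REPL, copying unmatched text through.  Returns 0, or -1 on error.  */
int ere_replace_all(const char *text, const char *pattern, const char *repl,
                    char *out, size_t cap)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;
    const char *p = text;
    int status = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (status == 0 && *p != '\0') {
        /* Past the start of TEXT, `^' must no longer match.  */
        int eflags = (p == text) ? 0 : REG_NOTBOL;
        if (regexec(&re, p, 1, &m, eflags) != 0)
            break;
        /* Copy the skipped text, then the replacement.  */
        if (put(out, cap, &len, p, (size_t) m.rm_so)
            || put(out, cap, &len, repl, strlen(repl)))
            status = -1;
        p += m.rm_eo;
        /* After an empty match, copy one character to make progress.  */
        if (status == 0 && m.rm_so == m.rm_eo && *p != '\0') {
            if (put(out, cap, &len, p, 1))
                status = -1;
            p++;
        }
    }
    /* Whatever follows the last match is copied verbatim.  */
    if (status == 0 && put(out, cap, &len, p, strlen(p)))
        status = -1;
    regfree(&re);
    return status;
}
```

Called on `bar foo baz Foo' with the pattern `foo|Foo' and the
replacement `FOO', this yields `bar FOO baz FOO', the same first line as
the patreg example in the documentation.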
Index: modules/gnu.c
--- modules/gnu.c Mon, 01 Oct 2001 16:22:19 +0200 akim
+++ modules/gnu.c Mon, 01 Oct 2001 16:26:42 +0200 akim
@@ -45,7 +45,9 @@
#define RE_SYNTAX_ERE \
(/* Allow char classes. */ \
RE_CHAR_CLASSES \
- /* Be picky. */ \
+ /* Anchors are OK in groups. */ \
+ | RE_CONTEXT_INDEP_ANCHORS \
+ /* Be picky, `/^?/', for instance, makes no sense. */ \
| RE_CONTEXT_INVALID_OPS \
/* Allow intervals with `{' and `}', forbid invalid ranges. */\
| RE_INTERVALS | RE_NO_BK_BRACES | RE_NO_EMPTY_RANGES \
@@ -73,6 +75,8 @@
BUILTIN(changesyntax, FALSE, TRUE ) \
BUILTIN(debugmode, FALSE, FALSE ) \
BUILTIN(debugfile, FALSE, FALSE ) \
+ BUILTIN(eregexp, FALSE, TRUE ) \
+ BUILTIN(epatsubst, FALSE, TRUE ) \
BUILTIN(esyscmd, FALSE, TRUE ) \
BUILTIN(format, FALSE, TRUE ) \
BUILTIN(indir, FALSE, TRUE ) \
@@ -304,11 +308,13 @@
/**
* regexp(STRING, REGEXP, [REPLACEMENT])
**/
-M4BUILTIN_HANDLER (regexp)
+
+static void
+m4_regexp_do (struct obstack *obs, int argc, m4_symbol **argv,
+ int syntax)
{
const char *victim; /* first argument */
const char *regexp; /* regular expression */
- const char *repl; /* replacement string */
struct re_pattern_buffer *buf;/* compiled regular expression */
struct re_registers regs; /* for subexpression matches */
@@ -321,7 +327,7 @@
victim = M4ARG (1);
regexp = M4ARG (2);
- buf = m4_regexp_compile (M4ARG(0), regexp, RE_SYNTAX_BRE);
+ buf = m4_regexp_compile (M4ARG(0), regexp, syntax);
if (!buf)
return;
@@ -331,21 +337,38 @@
if (startpos == -2)
{
M4ERROR ((warning_status, 0,
- _("Error matching regular expression `%s'"), regexp));
+ _("%s: error matching regular expression `%s'"),
+ M4ARG (0), regexp));
return;
}
if (argc == 3)
m4_shipout_int (obs, startpos);
else if (startpos >= 0)
- {
- repl = M4ARG (3);
- substitute (obs, victim, repl, &regs);
- }
+ substitute (obs, victim, M4ARG (3), &regs);
return;
}
+
+/**
+ * regexp(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (regexp)
+{
+ m4_regexp_do (obs, argc, argv, RE_SYNTAX_BRE);
+}
+
+/**
+ * eregexp(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (eregexp)
+{
+ m4_regexp_do (obs, argc, argv, RE_SYNTAX_ERE);
+}
+
+
+
/* Substitute all matches of a regexp occuring in a string. Each match of
the second argument (a regexp) in the first argument is changed to the
third argument, with \& substituted by the matched text, and \N
@@ -354,7 +377,8 @@
/**
* patsubst(STRING, REGEXP, [REPLACEMENT])
**/
-M4BUILTIN_HANDLER (patsubst)
+static void m4_patsubst_do (struct obstack *obs, int argc, m4_symbol **argv,
+ int syntax)
{
const char *victim; /* first argument */
const char *regexp; /* regular expression */
@@ -372,7 +396,7 @@
victim = M4ARG (1);
length = strlen (victim);
- buf = m4_regexp_compile (M4ARG(0), regexp, RE_SYNTAX_BRE);
+ buf = m4_regexp_compile (M4ARG(0), regexp, syntax);
if (!buf)
return;
@@ -418,6 +442,22 @@
obstack_1grow (obs, '\0');
return;
+}
+
+/**
+ * patsubst(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (patsubst)
+{
+ m4_patsubst_do (obs, argc, argv, RE_SYNTAX_BRE);
+}
+
+/**
+ * epatsubst(STRING, REGEXP, [REPLACEMENT])
+ **/
+M4BUILTIN_HANDLER (epatsubst)
+{
+ m4_patsubst_do (obs, argc, argv, RE_SYNTAX_ERE);
}
/* Implementation of "symbols" itself. It builds up a table of pointers to
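
The shape of the refactoring in modules/gnu.c (one worker that takes the
syntax as a parameter, plus one thin wrapper per builtin) can be tried
outside m4 with a small stand-alone sketch. It assumes POSIX
regcomp/regexec in place of m4_regexp_compile and re_search, and the
names append, expand_do, bre_expand and ere_expand are hypothetical; the
expansion of `\&' and `\1'..`\9' only approximates what substitute does.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Append N bytes of SRC to OUT; returns -1 if they do not fit.  */
static int append(char *out, size_t cap, size_t *len, const char *src, size_t n)
{
    if (*len + n + 1 > cap)
        return -1;
    memcpy(out + *len, src, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Shared worker: on the first match of PATTERN (compiled with CFLAGS)
   in TEXT, write REPL to OUT with \& and \1..\9 expanded.
   Returns 0 on a match, 1 on no match, -1 on error or overflow.  */
static int expand_do(const char *text, const char *pattern, int cflags,
                     const char *repl, char *out, size_t cap)
{
    regex_t re;
    regmatch_t m[10];
    size_t len = 0;
    int rc;

    if (cap == 0)
        return -1;
    out[0] = '\0';
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 10, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return 1;
    if (rc != 0)
        return -1;

    for (const char *p = repl; *p != '\0'; p++) {
        int group = -1;
        if (p[0] == '\\' && p[1] == '&')
            group = 0;
        else if (p[0] == '\\' && p[1] >= '1' && p[1] <= '9')
            group = p[1] - '0';
        if (group < 0) {
            /* Ordinary character: copy it through.  */
            if (append(out, cap, &len, p, 1) != 0)
                return -1;
            continue;
        }
        p++;  /* Skip the character after the backslash.  */
        /* Groups that did not take part in the match expand to nothing.  */
        if (m[group].rm_so >= 0
            && append(out, cap, &len, text + m[group].rm_so,
                      (size_t) (m[group].rm_eo - m[group].rm_so)) != 0)
            return -1;
    }
    return 0;
}

/* Thin wrappers: the only difference is the syntax flag.  */
int bre_expand(const char *text, const char *pattern, const char *repl,
               char *out, size_t cap)
{
    return expand_do(text, pattern, 0, repl, out, cap);
}

int ere_expand(const char *text, const char *pattern, const char *repl,
               char *out, size_t cap)
{
    return expand_do(text, pattern, REG_EXTENDED, repl, out, cap);
}
```

Both wrappers produce `*** Unix *** nix ***' for the input `GNUs not
Unix' and the replacement `*** \& *** \1 ***' when given
`[A-Za-z]([A-Za-z]*)$' (ERE) and `[A-Za-z]\([A-Za-z]*\)$' (BRE)
respectively, which mirrors the eregexp/regexp pair in the
documentation.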