Re: Unquoted special characters in regexps
From: Luc Teirlinck
Subject: Re: Unquoted special characters in regexps
Date: Mon, 6 Mar 2006 23:52:44 -0600 (CST)
Richard Stallman wrote:

    I think the manual needs to explain both levels--the first level so
    beginners can begin to understand, and the second level for precise
    thinking about counterintuitive regexps.

    I could certainly do that, but I am terribly overloaded. Would
    someone else like to try it?

What about the following patch, which I can install if desired?
It includes one unrelated change dealing with a problem I noticed in
the process. It moves a paragraph occurring currently in the
description of `*' to the description of `+'. (Although, from diff's
perspective, it instead moves the definition of `+' up till before
that paragraph. Everything is relative, I guess.) The reason is that
the paragraph discusses the regexp "\(x+y*\)*a" before the meaning of
`+' is explained. This makes `x+y' look like the sum of x and y.
Also the remarks in the paragraph apply to both `*' and `+'.
===File ~/searching.texi-diff===============================
*** searching.texi 06 Feb 2006 16:02:08 -0600 1.68
--- searching.texi 06 Mar 2006 23:47:42 -0600
***************
*** 235,246 ****
Regular expressions have a syntax in which a few characters are
special constructs and the rest are @dfn{ordinary}. An ordinary
! character is a simple regular expression that matches that character and
! nothing else. The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future. Any other character
! appearing in a regular expression is ordinary, unless a @samp{\}
! precedes it.
For example, @samp{f} is not a special character, so it is ordinary, and
therefore @samp{f} is a regular expression that matches the string
--- 235,249 ----
Regular expressions have a syntax in which a few characters are
special constructs and the rest are @dfn{ordinary}. An ordinary
! character is a simple regular expression that matches that character
! and nothing else. The special characters are @samp{.}, @samp{*},
! @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future. The character
! @samp{]} is special if it ends a character alternative (see later).
! The character @samp{-} is special inside a character alternative. A
! @samp{[:} and balancing @samp{:]} enclose a character class inside a
! character alternative. Any other character appearing in a regular
! expression is ordinary, unless a @samp{\} precedes it.
For example, @samp{f} is not a special character, so it is ordinary, and
therefore @samp{f} is a regular expression that matches the string
***************
*** 301,306 ****
--- 304,316 ----
The next alternative is for @samp{a*} to match only two @samp{a}s. With
this choice, the rest of the regexp matches successfully.@refill
+ @item @samp{+}
+ @cindex @samp{+} in regexp
+ is a postfix operator, similar to @samp{*} except that it must match
+ the preceding expression at least once. So, for example, @samp{ca+r}
+ matches the strings @samp{car} and @samp{caaaar} but not the string
+ @samp{cr}, whereas @samp{ca*r} matches all three strings.
+
Nested repetition operators take a long time, or even forever, if they
lead to ambiguous matching. For example, trying to match the regular
expression @samp{\(x+y*\)*a} against the string
***************
*** 311,323 ****
it causes an infinite loop. To avoid these problems, check nested
repetitions carefully.
- @item @samp{+}
- @cindex @samp{+} in regexp
- is a postfix operator, similar to @samp{*} except that it must match
- the preceding expression at least once. So, for example, @samp{ca+r}
- matches the strings @samp{car} and @samp{caaaar} but not the string
- @samp{cr}, whereas @samp{ca*r} matches all three strings.
-
@item @samp{?}
@cindex @samp{?} in regexp
is a postfix operator, similar to @samp{*} except that it must match the
--- 321,326 ----
***************
*** 468,473 ****
--- 471,504 ----
can act. It is poor practice to depend on this behavior; quote the
special character anyway, regardless of where it appears.@refill
+ As a @samp{\} is not special inside a character alternative, it can
+ never remove the special meaning of @samp{-} or @samp{]}. So you
+ should not quote these characters when they have no special meaning
+ either. This would not clarify anything, since backslashes can
+ legitimately precede these characters where they @emph{have} special
+ meaning, as in @code{[^\]} (@code{"[^\\]"} for Lisp string syntax),
+ which matches any single character except a backslash.
+
+ In practice, most @samp{]} that occur in regular expressions close a
+ character alternative and hence are special. However, occasionally a
+ regular expression may try to match a complex pattern of literal
+ @samp{[} and @samp{]}. In such situations, it sometimes may be
+ necessary to carefully parse the regexp from the start to determine
+ which square brackets enclose a character alternative. For example,
+ @code{[^][]]} consists of the complemented character alternative
+ @code{[^][]}, which matches any single character that is not a square
+ bracket, followed by a literal @samp{]}.
+
+ The exact rules are that at the beginning of a regexp, @samp{[} is
+ special and @samp{]} not. This lasts until the first unquoted
+ @samp{[}, after which we are in a character alternative; @samp{[} is
+ no longer special (except if it starts a character class) but @samp{]}
+ is special, unless it immediately follows the special @samp{[} or that
+ @samp{[} followed by a @samp{^}. This lasts until the next special
+ @samp{]} that does not end a character class. This ends the character
+ alternative and restores the ordinary syntax of regular expressions;
+ an unquoted @samp{[} is special again and a @samp{]} not.
+
@node Char Classes
@subsubsection Character Classes
@cindex character classes in regexp
***************
*** 740,747 ****
@kindex invalid-regexp
Not every string is a valid regular expression. For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}. If
an invalid regular expression is passed to any of the search functions,
an @code{invalid-regexp} error is signaled.
--- 771,778 ----
@kindex invalid-regexp
Not every string is a valid regular expression. For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}. If
an invalid regular expression is passed to any of the search functions,
an @code{invalid-regexp} error is signaled.
============================================================
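
As a sanity check on the "exact rules" paragraph in the patch, here is a
minimal Python sketch that walks a regexp from the start and labels each
square bracket as opening an alternative, closing it, delimiting a
character class, or being literal. It is only a model of the rules as
worded above, not Emacs's actual regexp compiler: the function name
classify_brackets and the role labels are made up for illustration, a
character class is assumed to be `[:' plus lowercase letters plus `:]',
and `-' ranges are ignored.

```python
import re

# Assumed shape of a character class such as [:alpha:]. Real Emacs also
# validates the class name; this sketch does not.
_CHAR_CLASS = re.compile(r"\[:[a-z]+:\]")


def classify_brackets(regexp):
    """Return (index, role) for every square bracket in REGEXP.

    Roles: "open" and "close" delimit a character alternative,
    "class-open" and "class-close" delimit a character class inside one,
    and "literal" marks a bracket that only matches itself.
    """
    roles = []
    i, n = 0, len(regexp)
    while i < n:
        c = regexp[i]
        if c == "\\":
            # Outside an alternative, a backslash quotes the next character.
            if i + 1 >= n:
                raise ValueError("regexp ends with a single backslash")
            if regexp[i + 1] in "[]":
                roles.append((i + 1, "literal"))
            i += 2
            continue
        if c == "]":
            # Outside an alternative, `]' is not special.
            roles.append((i, "literal"))
            i += 1
            continue
        if c != "[":
            i += 1
            continue

        # A special `[' starts a character alternative.
        roles.append((i, "open"))
        i += 1
        if i < n and regexp[i] == "^":
            i += 1
        first = True  # a `]' in first position is literal
        closed = False
        while i < n:
            c = regexp[i]
            if c == "]" and not first:
                roles.append((i, "close"))
                i += 1
                closed = True
                break
            m = _CHAR_CLASS.match(regexp, i)
            if m:
                roles.append((i, "class-open"))
                roles.append((m.end() - 1, "class-close"))
                i = m.end()
            else:
                # Backslash is ordinary here, so only brackets are recorded.
                if c in "[]":
                    roles.append((i, "literal"))
                i += 1
            first = False
        if not closed:
            raise ValueError("regexp ends inside a character alternative")
    return roles
```

Calling classify_brackets on the example from the patch, `[^][]]', labels
the bracket at index 4 as the one that closes the alternative and the
final bracket at index 5 as literal, which agrees with the reading given
in the new text. For `[^\]' the backslash is skipped as an ordinary
character and the following `]' closes the alternative.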