grep branch, master, updated. v3.11-52-g6ee8562
From: Paul Eggert
Subject: grep branch, master, updated. v3.11-52-g6ee8562
Date: Mon, 16 Dec 2024 16:43:10 -0500 (EST)
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".
The branch, master has been updated
via 6ee856200a8a2a901a95900766aec66914331862 (commit)
from 19e301ad53931c8832789d283e2e5303fbf3452b (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=6ee856200a8a2a901a95900766aec66914331862
commit 6ee856200a8a2a901a95900766aec66914331862
Author: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon Dec 16 01:40:55 2024 -0700
grep: revert recent \d change
I misread the email thread and thought there was consensus
for the \d change, but there wasn't, so revert the change.
Also, document the resulting confusion
somewhat better than it was documented before.
* src/pcresearch.c, tests/pcre-ascii-digits, tests/pcre-utf8-w:
Revert recent \d change, restoring the behavior to that of grep 3.11.
diff --git a/NEWS b/NEWS
index 7e482b5..4294fc6 100644
--- a/NEWS
+++ b/NEWS
@@ -4,14 +4,6 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
- With -P, \d now matches all decimal digits, not just ASCII digits.
- That is, \d is equivalent to [[:digit:]], not to [0-9].
- This is more compatible with plain Perl, and reverts to the
- behavior of grep 3.9. If you prefer \d to mean [0-9] and
- have a PCRE2 version later than 10.42 installed, you can
- prefix your regular expression with (?aD).
- [bug introduced in grep 3.10]
-
Searching a directory with at least 100,000 entries no longer fails
with "Operation not supported" and exit status 2. Now, this prints 1
and no diagnostic, as expected:
diff --git a/doc/grep.texi b/doc/grep.texi
index cc6f26d..c18b94c 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1173,6 +1173,40 @@ invoked with @option{-u} or @option{-U}.
In contrast, in a UTF-8 locale @command{grep} and @command{git grep}
always treat data as UTF-8.
+@item
+In Perl and @command{git grep -P}, @samp{\d} matches all Unicode digits,
+even if they are not ASCII.
+For example, @samp{\d} matches
+@ifnottex
+``٣''
+@end ifnottex
+(U+0663 ARABIC-INDIC DIGIT THREE).
+In contrast, in @samp{grep -P}, @samp{\d} matches only
+the ten ASCII digits, regardless of locale.
+In @command{pcre2grep}, @samp{\d} ordinarily behaves like Perl and
+@command{git grep -P}, but when given the @option{--posix-digit} option
+it behaves like @samp{grep -P}.
+(On all platforms, @samp{\D} matches the complement of @samp{\d}.)
+
+@item
+The pattern @samp{[[:digit:]]} matches all Unicode digits
+in Perl, @samp{grep -P}, @command{git grep -P}, and @command{pcre2grep},
+so you can use it
+to get the effect of Perl's @samp{\d} on all these platforms.
+In other words, in Perl and @command{git grep -P},
+@samp{\d} is equivalent to @samp{[[:digit:]]},
+whereas in @samp{grep -P}, @samp{\d} is equivalent to @samp{[0-9]},
+and @command{pcre2grep} ordinarily follows Perl but
+when given @option{--posix-digit} it follows @samp{grep -P}.
+
+(On all these platforms, @samp{[[:digit:]]} is equivalent to @samp{\p@{Nd@}}
+and to @samp{\p@{General_Category: Decimal_Number@}}.)
+
+@item
+If @command{grep} is built with PCRE2 version 10.43 (2024) or later,
+@samp{(?aD)} causes @samp{\d} to behave like @samp{[0-9]} and
+@samp{(?-aD)} causes it to behave like @samp{[[:digit:]]}.
+
@item
Although PCRE tracks the syntax and semantics of Perl's regular
expressions, the match is not always exact. Perl
@@ -1180,22 +1214,6 @@ evolves and a Perl installation may predate or postdate the PCRE2
installation on the same host, or their Unicode versions may differ,
or Perl and PCRE2 may disagree about an obscure construct.
-For example, on UTF-8 data @samp{[0-9]} matches only ASCII digits,
-whereas @samp{\d} ordinarily is like @samp{[[:digit:]]},
-@samp{\p@{Nd@}} and @samp{\p@{General_Category: Decimal_Number@}}
-and also matches non-ASCII digits like
-@c This does not display correctly in PDF with texinfo 7.1
-@c and pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023/Fedora 40).
-@ifnottex
-``٣''
-@end ifnottex
-(U+0663 ARABIC-INDIC DIGIT THREE).
-You can change this by starting a regular expression with
-@samp{(?aD)}, which causes @samp{\d} to act like @samp{[0-9]}.
-However, @samp{(?aD)} and its inverse @samp{(?-aD)} are available only
-if @command{grep} is built with PCRE2 version 10.43 (2024) or later.
-(@samp{\D} always matches the complement of @samp{\d}.)
-
@item
By default, @command{grep} applies each regexp to a line at a time,
so the @samp{(?s)} directive (making @samp{.} match line breaks)
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 4a08531..4d79425 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -33,6 +33,9 @@
# define PCRE2_ERROR_DEPTHLIMIT PCRE2_ERROR_RECURSIONLIMIT
# define pcre2_set_depth_limit pcre2_set_recursion_limit
#endif
+#ifndef PCRE2_EXTRA_ASCII_BSD
+# define PCRE2_EXTRA_ASCII_BSD 0
+#endif
/* Use PCRE2_MATCH_INVALID_UTF if supported and not buggy;
see <https://github.com/PCRE2Project/pcre2/issues/224>.
@@ -165,11 +168,19 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
if (! localeinfo.using_utf8)
die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
- flags |= PCRE2_UTF | PCRE2_UCP;
+ flags |= PCRE2_UTF;
/* If supported, consider invalid UTF-8 as a barrier not an error. */
flags |= MATCH_INVALID_UTF;
+ /* If PCRE2_EXTRA_ASCII_BSD is available, use PCRE2_UCP
+ so that \d does not have the undesirable effect of matching
+ non-ASCII digits. Otherwise (i.e., with PCRE2 10.42 and earlier),
+ escapes like \w have only their ASCII interpretations,
+ but that's better than the confusion that would ensue if \d
+ matched non-ASCII digits. */
+ flags |= PCRE2_EXTRA_ASCII_BSD ? PCRE2_UCP : 0;
+
#if 0
/* Do not match individual code units but only UTF-8. */
flags |= PCRE2_NEVER_BACKSLASH_C;
@@ -180,12 +191,16 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
if (rawmemchr (pattern, '\n') != patlim)
die (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
+#ifdef PCRE2_EXTRA_MATCH_LINE
+ uint32_t extra_options = (PCRE2_EXTRA_ASCII_BSD
+ | (match_lines ? PCRE2_EXTRA_MATCH_LINE : 0));
+ pcre2_set_compile_extra_options (ccontext, extra_options);
+#endif
+
void *re_storage = nullptr;
if (match_lines)
{
-#ifdef PCRE2_EXTRA_MATCH_LINE
- pcre2_set_compile_extra_options (ccontext, PCRE2_EXTRA_MATCH_LINE);
-#else
+#ifndef PCRE2_EXTRA_MATCH_LINE
static char const *const xprefix = "^(?:";
static char const *const xsuffix = ")$";
idx_t re_size = size + strlen (xprefix) + strlen (xsuffix);
diff --git a/tests/pcre-ascii-digits b/tests/pcre-ascii-digits
index 50fe251..b5ddc94 100755
--- a/tests/pcre-ascii-digits
+++ b/tests/pcre-ascii-digits
@@ -29,27 +29,27 @@ printf '\331\240\331\241\331\242\331\243\331\244' > in || framework_failure_
printf '\331\245\331\246\331\247\331\250\331\251' >> in || framework_failure_
printf '\n' >> in || framework_failure_
-# Ensure that (?aD)\d matches no character.
-returns_ 1 grep -P '(?aD)\d' in > out || fail=1
+# Ensure that \d matches no Arabic-Indic digits.
+returns_ 1 grep -P '\d' in > out || fail=1
compare /dev/null out || fail=1
-# Ensure that (?aD)^\D+$ matches the entire line.
-grep -P '(?aD)^\D+$' in > out || fail=1
+# Ensure that ^\D+$ matches all the Arabic-Indic digits.
+grep -P '^\D+$' in > out || fail=1
compare in out || fail=1
# When built with PCRE2 10.43 and newer, one may use (?aD) and (?-aD)
-# to toggle between modes. (?aD) makes \d == [0-9].
-# (?-aD), the default, makes \d match all digits.
-# Use mixed digits as input: Arabic 0 and ASCII 4: ٠4
+# to toggle between modes. (?aD) is the default (making \d == [0-9]).
+# (?-aD) relaxes \d, making it match "all" digits.
+# Use mixed digits as input: Arabic-Indic digit zero and ASCII 4.
printf '\331\2404\n' > in2 || framework_failure_
-returns_ 1 grep -P '(?aD)\d\d' in2 > out || fail=1
+returns_ 1 grep -P '\d\d' in2 > out || fail=1
compare /dev/null out || fail=1
-grep -P '\d(?aD)\d' in2 > out || fail=1
+grep -P '(?-aD)\d(?aD)\d' in2 > out || fail=1
compare in2 out || fail=1
-returns_ 1 grep -P '(?aD)\d(?-aD)\d' in2 > out || fail=1
+returns_ 1 grep -P '\d(?-aD)\d' in2 > out || fail=1
compare /dev/null out || fail=1
Exit $fail
diff --git a/tests/pcre-utf8-w b/tests/pcre-utf8-w
index 86ff8eb..1229da4 100755
--- a/tests/pcre-utf8-w
+++ b/tests/pcre-utf8-w
@@ -16,6 +16,8 @@ require_pcre_
echo . | grep -qP '(*UTF).' 2>/dev/null \
|| skip_ 'PCRE unicode support is compiled out'
+echo 0 | grep -qP '(?aD)\d' \
+ || skip_ 'PCRE 10.42 and older lack PCRE2_EXTRA_ASCII_BSD'
fail=0
-----------------------------------------------------------------------
Summary of changes:
NEWS | 8 --------
doc/grep.texi | 50 +++++++++++++++++++++++++++++++++----------------
src/pcresearch.c | 23 +++++++++++++++++++----
tests/pcre-ascii-digits | 20 ++++++++++----------
tests/pcre-utf8-w | 2 ++
5 files changed, 65 insertions(+), 38 deletions(-)
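
The \d behavior described in the doc/grep.texi hunk above depends on the
grep version and on the PCRE2 version it was built with, so it may help to
probe the local installation directly. The following is only a sketch and is
not part of the grep sources: the probe function, the UTF8_LOCALE variable,
and the assumption that a locale named C.UTF-8 is installed are all
hypothetical. It reports "unsupported" when grep exits with status 2 or
more, for example when -P is unavailable or when PCRE2 is too old to parse
(?aD).

```shell
#!/bin/sh
# Hypothetical probe script; not part of the grep sources.
# Assumption: a UTF-8 locale named C.UTF-8 exists; override via UTF8_LOCALE.
utf8_locale=${UTF8_LOCALE:-C.UTF-8}

# U+0663 ARABIC-INDIC DIGIT THREE as octal-escaped UTF-8 bytes,
# the same encoding style that tests/pcre-ascii-digits uses.
arabic_three=$(printf '\331\243')

# probe PATTERN INPUT
# Print "match", "no match", or "unsupported" depending on the exit
# status of grep -P when PATTERN is applied to a line containing INPUT.
probe () {
  printf '%s\n' "$2" | LC_ALL=$utf8_locale grep -qP -- "$1" 2>/dev/null
  case $? in
    0) echo 'match' ;;
    1) echo 'no match' ;;
    *) echo 'unsupported' ;;
  esac
}

# Compare each pattern against an ASCII digit and a non-ASCII digit.
for pattern in '\d' '[0-9]' '[[:digit:]]' '(?aD)\d' '(?-aD)\d'; do
  printf '%-12s ASCII 3: %-12s U+0663: %s\n' "$pattern" \
    "$(probe "$pattern" 3)" "$(probe "$pattern" "$arabic_three")"
done
```

With a grep that includes this commit, the \d row would be expected to show
a match for ASCII 3 only. The [[:digit:]] and (?-aD)\d rows vary with the
PCRE2 version, since the src/pcresearch.c hunk sets PCRE2_UCP only when
PCRE2_EXTRA_ASCII_BSD is available.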
hooks/post-receive
--
grep