From: Greg Chicares
Subject: Re: [lmi] Continuing deboostification with removing dependency on Boost.Regex
Date: Fri, 28 May 2021 14:14:16 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

On 5/27/21 5:04 PM, Vadim Zeitlin wrote:

[...after drastic snippage, the options are...]

> 0. Still switch to std::regex, but optimize test_coding_rules
> 1. Switch to some other C++ regex library. My favourite one remains PCRE
> 2. Rewrite test_coding_rules in some other language
First of all, should we consider:
  https://github.com/hanickadot/compile-time-regular-expressions
to see whether that's a magic answer to our regex woes?
Does 'test_coding_rules' become fast enough if we rewrite
at least some of its expressions using this library? Adding
one 14000-line header that would be used by 'test_coding_rules'
seems better than installing a 'boost' monolith. (I'm not sure
whether its license is compatible with the GPL, but we could look
into that after determining whether this is a feasible solution.)

Assuming that doesn't work...let's start with some measurements:

/opt/lmi/src/lmi[0]$wc -l test_coding_rules.cpp 
1314 test_coding_rules.cpp

/opt/lmi/src/lmi[0]$git grep 'static boost::regex const' test_coding_rules.cpp 
|wc -l
19

AIUI, the concern is limited to about nineteen regexes.
I'm not willing to rewrite a 1300-line program in a new
language that I don't know in order to optimize nineteen
or fewer expressions. I say "or fewer" because it's easy
to rewrite some of them, e.g.:

  postinitial_tab(R"([^\n]\t)");
  r(R"(# *endif\n)");

in blazingly fast K&R C; or maybe they're already fast
enough with std::regex.
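
To illustrate, a hand-rolled scan for the first of those two
patterns might look like this (a sketch only; the function name
is mine, not anything in lmi):

```cpp
#include <cstddef>
#include <string>

// Hand-rolled equivalent of the regex R"([^\n]\t)": does the
// text contain a tab preceded by any character other than a
// newline? No regex engine required.
bool has_postinitial_tab(std::string const& s)
{
    for(std::size_t i = 1; i < s.size(); ++i)
    {
        if('\t' == s[i] && '\n' != s[i - 1])
            return true;
    }
    return false;
}
```

A plain loop like that costs one pass over the file, with no
compilation or backtracking at all.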

I tried to identify the performance hotspots by switching
to std::regex, but there's an incompatibility:

https://lists.nongnu.org/archive/html/lmi/2016-06/msg00060.html
| >  FWIW replacing boost::regex with std::regex is trivial, they have exactly
| > the same API. However a couple of regexes will need to be adjusted because
| > C++11 std::regex doesn't support Perl extended syntax for some reason. A
| > simple grep finds only 4 occurrences of it and 3 of them use the syntax
| > which is supported by ECMAScript regexes and hence C++11, the only problem
| > is the use of "(?-s)" to turn off the "s" flag in regex_test.cpp, so this
| > test would need to be modified to avoid using it, but it shouldn't be a
| > problem.
| 
| Since you know how to grep for perl regexes, I think I'd better leave
| this task to you.
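
The incompatibility quoted above is easy to demonstrate: the
default ECMAScript grammar used by std::regex rejects perl's
inline-flag syntax, so constructing such a pattern throws. (A
minimal sketch; the helper is hypothetical.)

```cpp
#include <regex>

// Returns true iff 'pattern' is accepted by std::regex's default
// (ECMAScript) grammar. Perl inline flags such as "(?-s)" are not
// part of that grammar, so they make the constructor throw
// std::regex_error.
bool pattern_is_accepted(char const* pattern)
{
    try
    {
        std::regex r(pattern);
        return true;
    }
    catch(std::regex_error const&)
    {
        return false;
    }
}
```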

Because you've shared speed measurements, I assume you've
already made those changes locally; would you share them
with me please?

Until we know which regexes are actually costly, I don't
believe we understand the problem well enough to proceed to
a discussion of solutions. For example, if measurements show
that this regex is really costly:

  static boost::regex const forbidden(R"([\x00-\x08\x0e-\x1f\x7f-\x9f])");

then an inline comment already states an alternative that
might suggest a simple, tidy solution:

  LC_ALL=C grep -P '[\x00-\x08\x0e-\x1f\x7f-\x9f]' list-of-files

but more likely we'd eliminate that test, because we've
already decided to allow arbitrary UTF-8 (now that nobody's
going to use an ancient msw editor like 'notepad' for lmi).
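
If we kept the test, though, that character class is just three
byte ranges, so a regex is overkill; a sketch of a plain byte
scan (the function name is hypothetical):

```cpp
#include <string>

// Plain-loop equivalent of the regex
// R"([\x00-\x08\x0e-\x1f\x7f-\x9f])": true iff the text contains
// a C0 control other than \t \n \v \f \r, DEL, or a C1 control.
bool contains_forbidden_byte(std::string const& s)
{
    for(unsigned char c : s)
    {
        if(c <= 0x08 || (0x0e <= c && c <= 0x1f) || (0x7f <= c && c <= 0x9f))
            return true;
    }
    return false;
}
```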

How many regexes will prove to be really costly? Ten? Three?
Once we identify the culprits, my preferences would be:
 - first, optimize the regexes themselves
 - then, weigh each one's cost against its benefits, and
     ask whether some less powerful but much faster test
     would serve the same purpose well enough
 - then, consider pushing the worst offenders out of
     C++ and into a standard utility such as 'grep'
 - as a last resort, consider using a different library

