From: Vadim Zeitlin
Subject: Re: [lmi] Continuing deboostification with removing dependency on Boost.Regex
Date: Fri, 28 May 2021 18:20:28 +0200

On Fri, 28 May 2021 14:14:16 +0000 Greg Chicares <gchicares@sbcglobal.net> 
wrote:

GC> On 5/27/21 5:04 PM, Vadim Zeitlin wrote:
GC> 
GC> [...after drastic snippage, the options are...]
GC> 
GC> > 0. Still switch to std::regex, but optimize test_coding_rules
GC> > 1. Switch to some other C++ regex library. My favourite one remains PCRE
GC> > 2. Rewrite test_coding_rules in some other language
GC> First of all, should we first consider:
GC>   https://github.com/hanickadot/compile-time-regular-expressions
GC> to see whether that's a magic answer to our regex woes?

 This falls under solution 1. But when I suggested looking into this
library (CTRE) previously, you asked me whether I understood its code well
enough to make changes to it if necessary, and my honest answer was that I
definitely didn't, and that hasn't changed since then. Of course, I don't
really understand PCRE either, but I'm pretty sure that there will be
other people maintaining it for the foreseeable future, so from a
maintainability perspective PCRE is surely a much safer bet than CTRE.
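
 For reference, a check written with CTRE looks roughly like this (a
sketch assuming C++20 and the single-header ctre.hpp; the pattern is only
an example, not one of the actual test_coding_rules expressions):

    #include <ctre.hpp>
    #include <string_view>

    // The pattern is a template argument, so it is parsed and turned into
    // matching code at compile time; there is no runtime construction cost.
    bool line_has_trailing_whitespace(std::string_view line)
    {
        return static_cast<bool>(ctre::search<"[ \t]+$">(line));
    }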

GC> Does 'test_coding_rules' become fast enough if we rewrite
GC> at least some of its expressions using this library?

 It almost certainly will, yes. In principle, it should even be faster
than PCRE. But I think that PCRE should be fast enough, while being much
safer to depend on.
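
 To illustrate what depending on PCRE would mean in practice, here is a
minimal sketch of the 8-bit PCRE2 C API (error handling omitted; the
pattern would be compiled once and then reused for every file):

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>
    #include <string_view>

    // Compile a pattern once; "pattern" here is just a placeholder for
    // one of the test_coding_rules expressions.
    pcre2_code* compile(char const* pattern)
    {
        int err {};
        PCRE2_SIZE erroffset {};
        return pcre2_compile
            (reinterpret_cast<PCRE2_SPTR>(pattern)
            ,PCRE2_ZERO_TERMINATED
            ,0              // options
            ,&err
            ,&erroffset
            ,nullptr        // default compile context
            );
    }

    // True if the regex matches anywhere in the given text.
    bool search(pcre2_code const* re, std::string_view s)
    {
        pcre2_match_data* md = pcre2_match_data_create_from_pattern(re, nullptr);
        int rc = pcre2_match
            (re
            ,reinterpret_cast<PCRE2_SPTR>(s.data())
            ,s.size()
            ,0              // start offset
            ,0              // options
            ,md
            ,nullptr        // default match context
            );
        pcre2_match_data_free(md);
        return 0 <= rc;
    }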

GC> I'm not willing to rewrite a 1300-line program in a new
GC> language that I don't know in order to optimize nineteen
GC> or fewer expressions.

 For once I did manage to predict your reaction correctly, so I'm not
really surprised, but in my defense, I only brought this solution up
because the original idea to do it (in Perl) was yours. And, of course,
because I do see other advantages in having this program in some other
language, preferably one not requiring compilation, as I wrote in my
first post.


GC> https://lists.nongnu.org/archive/html/lmi/2016-06/msg00060.html
GC> | >  FWIW replacing boost::regex with std::regex is trivial, they have exactly
GC> | > the same API. However a couple of regexes will need to be adjusted because
GC> | > C++11 std::regex doesn't support Perl extended syntax for some reason. A
GC> | > simple grep finds only 4 occurrences of it and 3 of them use the syntax
GC> | > which is supported by ECMAScript regexes and hence C++11, the only problem
GC> | > is the use of "(?-s)" to turn off the "s" flag in regex_test.cpp, so this
GC> | > test would need to be modified to avoid using it, but it shouldn't be a
GC> | > problem.
GC> | 
GC> | Since you know how to grep for perl regexes, I think I'd better leave
GC> | this task to you.
GC> 
GC> Because you've shared speed measurements, I assume you've
GC> already made those changes locally; would you share them
GC> with me please?

 I'll have to rebase my ~5 year old commit on master first, as,
unsurprisingly, it doesn't apply cleanly, but once I do, I could also redo
the benchmarks, and maybe the profiling, myself. Would you like me to do
that, or do you really want to do it on your own?
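
 As a reminder of what the change looks like mechanically, it really is
little more than a namespace swap; an illustration, not the actual diff
from my branch:

    #include <regex>
    #include <string>

    // Before (Boost):
    //   boost::regex const forbidden(R"([\x00-\x08\x0e-\x1f\x7f-\x9f])");
    // After (standard library), same pattern, same call:
    std::regex const forbidden(R"([\x00-\x08\x0e-\x1f\x7f-\x9f])");

    bool contains_forbidden(std::string const& line)
    {
        // regex_search has the same signature in both libraries; only the
        // few patterns relying on Perl extensions such as "(?-s)" need to
        // be reworded for std::regex's ECMAScript grammar.
        return std::regex_search(line, forbidden);
    }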

GC> Until we know which regexes are actually costly, I don't
GC> believe we understand the problem well enough to proceed to
GC> a discussion of solutions. For example, if measurements show
GC> that this regex is really costly:
GC> 
GC>   static boost::regex const forbidden(R"([\x00-\x08\x0e-\x1f\x7f-\x9f])");
GC> 
GC> then an inline comment already states an alternative that
GC> might suggest a simple, tidy solution:
GC> 
GC>   LC_ALL=C grep -P '[\x00-\x08\x0e-\x1f\x7f-\x9f]' list-of-files

 Shelling out to do a search from a C++ program would be the weirdest
optimization I have ever seen...
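
 If the point is just to avoid the regex engine for this particular check,
a plain byte scan inside the program would do the same job without leaving
the process; just a sketch:

    #include <algorithm>
    #include <string>

    // Equivalent to the "forbidden" character class above, but as a simple
    // range check on each byte: no regex engine involved at all.
    bool contains_forbidden_byte(std::string const& s)
    {
        return std::any_of
            (s.begin()
            ,s.end()
            ,[](unsigned char c)
                {
                return c <= 0x08
                    || (0x0e <= c && c <= 0x1f)
                    || (0x7f <= c && c <= 0x9f)
                    ;
                }
            );
    }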

GC> How many regexes will prove to be really costly? Ten? Three?
GC> Once we identify the culprits, my preferences would be:
GC>  - first, optimize the regexes themselves
GC>  - then, weigh each one's cost against its benefits, and
GC>      ask whether some less powerful but much faster test
GC>      would serve the same purpose well enough
GC>  - then, consider pushing the worst offenders out of
GC>      C++ and into a standard utility such as 'grep'
GC>  - as a last resort, consider using a different library

 Does this mean that we shouldn't even attempt parallelizing the checks?
It should be an almost guaranteed order-of-magnitude win (because modern
systems have O(10) CPUs), so I'd strongly consider doing it independently
of anything else. Do you object to this?
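
 Just distributing the per-file checks over the available CPUs with
std::async would already help, roughly like this (a sketch: process_file()
is a hypothetical stand-in for whatever test_coding_rules does for a
single file, and the files are assumed to be checkable independently):

    #include <filesystem>
    #include <future>
    #include <vector>

    // Hypothetical stand-in for the existing per-file checking logic.
    void process_file(std::filesystem::path const& f);

    void check_all(std::vector<std::filesystem::path> const& files)
    {
        std::vector<std::future<void>> tasks;
        tasks.reserve(files.size());
        for(auto const& f : files)
            {
            // std::launch::async runs each task on its own thread; a real
            // implementation would more likely use a bounded thread pool.
            tasks.push_back(std::async(std::launch::async, process_file, f));
            }
        for(auto& t : tasks)
            {
            t.get(); // rethrows any exception from the worker
            }
    }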

 Thanks,
VZ


