lmi
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[lmi] Results of comparison of {boost,std}::regex, CTRE and PCRE


From: Vadim Zeitlin
Subject: [lmi] Results of comparison of {boost,std}::regex, CTRE and PCRE
Date: Thu, 10 Jun 2021 23:36:28 +0200

 Hello,

 I think I finally know enough to suggest the best replacement for
Boost.Regex in test_coding_rules.cpp, so even though I didn't fully finish
everything yet, I'd like to already summarize my results so far to see if
you agree with my choice.

 First of all, here are the results of benchmarking the run-time of
test_coding_rules using the following command (in zsh, to be able to use
the "~" glob character in order to discard the effect of changes to this
file itself in the different versions)

        $ test_coding_rules *~test_coding_rules.cpp

on one particular (not very fast) machine for the following versions of lmi
(please note that CTRE/PCRE ones are not finished and rely on headers not
included in the repository, so you won't be able to compile them currently,
I only link to them in order to show the changes to the lmi sources that
would be needed to use them):

0. Original f9276c05f (Rework minimum bounds for solves, 2021-05-27).
1. Using std::regex (https://github.com/vadz/lmi/commit/2026e7227).
2. Using CTRE (https://github.com/vadz/lmi/commit/67b7cf33c).
3. Using PCRE (https://github.com/vadz/lmi/commit/f1a35217b).

        +------------------+------+------+------+------+
        | Compiler\Version |   0  |   1  |   2  |   3  |
        |------------------+------+------+------+------+
        |       gcc 11     | 10.5 | 17.8 |  1.3 |  1.6 |
        |      clang 12    |  6.1 |238   | 24   |  1.7 |
        +------------------+------+------+------+------+

(result for (1) for clang is _not_ a typo).

 As previously mentioned, all the results can be improved by a factor of
2-3 by using GNU parallel, here are the timings for

        $ parallel -m test_coding_rules ::: *~test_coding_rules.cpp

for completeness:

        +------------------+------+------+------+------+
        | Compiler\Version |   0  |   1  |   2  |   3  |
        |------------------+------+------+------+------+
        |       gcc 11     |  3.0 |  5.8 |  0.6 |  0.7 |
        |      clang 12    |  1.8 | 72   |  7.8 |  0.7 |
        +------------------+------+------+------+------+

 And the last table I'd like to show is the one for compilation time of
test_coding_rules.o:

        +------------------+------+------+------+------+
        | Compiler\Version |   0  |   1  |   2  |   3  |
        |------------------+------+------+------+------+
        |       gcc 11     |  4.8 |  8.3 | 42.9 |  4.2 |
        |      clang 12    |  4.3 |  5.6 | 33.0 |  3.3 |
        +------------------+------+------+------+------+


 It's a bit of a letdown because I had started by thinking that PCRE was
going to be the best choice even before starting to do all this, but
looking at the tables above it's IMO hard to argue that it isn't.
Experimenting with CTRE was interesting and I don't think the time spent on
it was wasted and, at the very least, I can confirm that CTRE does have
excellent performance, at least with gcc. For the simplest possible regex,
matching a literal string, it is only ~2 times slower compared to calling
contains(), while PCRE is ~10 times slower and {boost,std}::regex are ~50
times slower. Of course, for such simple tests using contains() is still
faster and simpler, but, generally speaking, CTRE beats PCRE (although
there are some exceptions).

 However CTRE has a number of disadvantages too:

- It's described as experimental and not deemed for production by its
  author and while we were not planning to use it for anything critical,
  it still risks being a problem because I'm almost sure that its API will
  change in incompatible ways if it's going to evolve at all.

- It lacks several features used in the current code, such as
  case-insensitive matching (hence the need to write /[wW][iI][nN]32/ in
  enforce_taboos()) and doesn't implement backtracking properly, which
  means that some regexes become more complicated (although maybe more
  efficient), and probably some more that are not used yet, but might be in
  the future, i.e. it's not feature-complete neither.

- Its compile times are pretty bad. The version benchmarked here still uses
  std::regex for many tests that were not worth optimizing. If we wanted to
  switch all regexes to CTRE, the compilation time would probably double as
  it's directly proportional to the number of regexes used in the code.

- It still has to be combined with something else (std::regex in the
  version tested) for matches that can't be done at compile-time.

- Using it requires a lot of changes to the existing code. Compare the
  output of "git show --stat test_coding_rules.cpp" for the various
  commits:

    - std::regex: 89 insertions(+), 86 deletions(-)
    - PCRE: 63 insertions(+), 91 deletions(-)
    - CTRE: 209 insertions(+), 210 deletions(-)

  And the changes when switching to CTRE are much less trivial and involve
  changing both the code and some of the regexes, while switching to
  std::regex requires only changing a couple of regexes and switching to
  PCRE requires only a small change to a single regex.


 So the only reasonable choices, IMO, are either sticking with std::regex
and living with its bad performance, maybe by partially compensating it
with GNU parallel, or switching to PCRE.

 The latter has a number of advantages:

- It is significantly faster. Waiting 18s for the full test_coding_rules
  run on this machine is too annoying, and while using GNU parallel helps,
  6s is still quite slow, while with the PCRE version I don't even need to
  use parallel as 1.6s is fast enough. I also didn't spend any time at all
  on optimizing the PCRE version, as its performance was already good
  enough for me, but I'm pretty sure that I could improve it more if
  needed.

- Its performance is consistent between compilers and platforms. Moreover,
  it's good (basically the same) even when not using the optimizations, as
  all the real work is done inside the PCRE library, while the performance
  of debug/non-optimized builds of all the other versions is much worse.

- It is robust and personally I trust PCRE, used by many critical projects,
  more than I trust std::regex that has a (not completely undeserved, even
  if it is clearly much improved in the latest libstdc++ versions)
  reputation of being low quality, and so probably is not used that much.
  I didn't run into any problems with it, while I found several in CTRE
  and a couple with libstdc++ std::regex implementation.


 Of course, it does have 2 disadvantages too:

1. It requires PCRE library.
2. It needs a C++ wrapper as using PCRE C API directly is too painful.

 For (2), I've written a ~500 line (including a lot of blank lines and
comments) pcre_regex.hpp header, which I don't show yet, because I need to
clean it up and add more extensive tests for it, which provides pcre::regex
class and pcre::search() and pcre::replace() functions that are very
similar to boost::regex_search() and regex_replace(), as well as
pcre::search_all() which allows for much simpler iteration over regex
matches than boost or std::sregex_iterator using a simple range for loop:

    for(auto const& z : pcre::search_all(f.data(), R"(...)"))
        {
        ... use z[0] for the full match or z[1] for the first captured
            expression here ...
        }

I don't know if you'd like to make this header part of lmi or handle it as
an external dependency, but in any case I'm ready to maintain it. And while
having this separate header is a disadvantage, compared to just using
<regex> from the standard library, the API above is clearly much nicer than
dealing with sregex_iterator directly.

 For (1), we can rely on the system package under Linux, but we'd need to
compile PCRE ourselves under MSW. This is not difficult to do, but, as I
mentioned before, I'd like to also switch to using PCRE for wxRegEx, and if
I do this, we could get the compiled PCRE library as a byproduct of
building wx without doing anything extra in lmi itself.

 Solving both of these problems shouldn't take that long, but I'd still
like to ask for your agreement before continuing to work on it. It's not
urgent at all, as I have plenty of other things to do in the meanwhile, but
I don't want to spend more time on polishing my PCRE C++ wrapper if you
decide that we don't need it and std::regex is, finally, good enough, for
example.


 To summarize, the current state is:

- I have a branch with ready to be submitted changes with some minor
  improvements to the current version, that could (and should) be merged
  in any case.

- I have a commit replacing boost::regex with std::regex that is also
  (almost) ready to be submitted -- I just need to test it a bit more.

- I also have not quite finished, but already working, changes using PCRE
  that I need to finalize before submitting them.


 Please let me know how would you like to proceed and thanks in advance,
VZ

Attachment: pgp1Ek0Ub14yh.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]