From: Vadim Zeitlin
Subject: [lmi] First experience with std::regex from gcc 11 and CTRE in test_coding_rules
Date: Wed, 2 Jun 2021 22:56:24 +0200

 Hello,

 I'd like to summarize my findings so far, to give you an opportunity to
give feedback, should you have any.

 So far I have made some changes to the Boost.Regex version in preparation
for the switch to std::regex. Notably, I removed the statistics display
(which is now done by "make check_concinnity" only) and split the regex in
check_include_guards() that contained ".*" in the middle into separate
start and end parts, because using ".*" with std::regex on even moderately
large files results in a crash due to a stack overflow. (I usually try to
be understanding and forgiving with other people, especially in writing,
but here I have no idea what the author of this code could have been
thinking when they decided to implement the "*" quantifier using recursion,
as this is a painfully obviously horrible idea.) These changes haven't
changed the performance of the Boost.Regex version significantly.
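
 To illustrate the kind of split I mean, here is a minimal sketch. The
function name and the exact guard format are made up for this example and
are not the actual check_include_guards() code: the idea is only that the
start of the guard is found with a regex not containing ".*", and the end
is then checked separately, so no quantifier ever has to run over the whole
file contents.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not lmi's check_include_guards()): return true if
// 'contents' has "#ifndef X" immediately followed by "#define X" and ends
// with "#endif // X". The guard format is an assumption for this sketch.
bool has_include_guards(std::string const& contents)
{
    // Start part only: no ".*" spanning the body of the file.
    static std::regex const start(R"(#ifndef (\w+)\n#define \1\n)");

    std::smatch m;
    if(!std::regex_search(contents, m, start))
        return false;

    // End part: a plain suffix comparison using the captured guard name.
    std::string const tail = "#endif // " + m[1].str() + "\n";
    return
           tail.size() <= contents.size()
        && 0 == contents.compare
            (contents.size() - tail.size()
            ,tail.size()
            ,tail
            );
}
```

 The only quantifier left is "\w+", which matches just the guard name, so
however deep the matcher recurses for it no longer depends on the file
size.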

 After removing the statistics display I could also easily test running
"parallel -m test_coding_rules ::: *" instead of running it directly. This
results in a big improvement (~3.5 times faster) for me, but not as big as
might be expected on the 8-CPU machine where I've been testing this so far,
so there is definitely some overhead due to using GNU parallel, and I'd
expect an even bigger speedup if we implemented the parallelism
internally.
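
 By "internally" I mean something along the lines of the sketch below. It
is purely illustrative: check_in_parallel() and its parameters are
hypothetical names, the per-file check is assumed to be thread-safe and not
to throw, and a real version would also have to serialize the diagnostics
output, which this sketch ignores.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not part of test_coding_rules: call check() on every
// file using n_threads worker threads and return the number of files for
// which it returned false. check() is assumed to be thread-safe and not to
// throw.
std::size_t check_in_parallel
    (std::vector<std::string> const&                files
    ,std::function<bool(std::string const&)> const& check
    ,unsigned int n_threads = std::thread::hardware_concurrency()
    )
{
    if(0 == n_threads)
        n_threads = 1; // hardware_concurrency() may return 0 if unknown.

    std::atomic<std::size_t> next{0};     // Index of the next file to claim.
    std::atomic<std::size_t> failures{0}; // Number of failed checks so far.

    auto const worker = [&]
        {
        for(;;)
            {
            std::size_t const i = next.fetch_add(1);
            if(files.size() <= i)
                return;
            if(!check(files[i]))
                ++failures;
            }
        };

    std::vector<std::thread> pool;
    for(unsigned int t = 0; t < n_threads; ++t)
        pool.emplace_back(worker);
    for(auto& t : pool)
        t.join();

    return failures.load();
}
```

 Each worker claims the next unprocessed file via the atomic counter, so
files taking longer to check don't leave the other threads idle.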

 Next, I've switched to std::regex. The good news is that it's much less
horrible in gcc 11, and the new version is "only" ~70% slower than the
Boost.Regex one (but twice as slow when used with parallel). To understand
just how "good" the performance of std::regex in libstdc++ is, consider
that the regex implementation in libc++, used by clang, is 40 *times*
slower than Boost.Regex.

 Next I tried using SRELL (http://www.akenotsuki.com/misc/srell/en/), which
is a library I've found only recently, and which looked appealing because
it's supposed to provide exactly the same interface as std::regex, so
testing with it should have been very simple. Unfortunately, in practice
this isn't really the case: while compiling the code with it was indeed
trivial, the run-time results differ and many tests don't pass any more.
Some of this was due to mistakes in regexes used in the code (mostly
those that I've introduced when translating them from Boost-ese to
standardese), and I've fixed a couple of them, but I haven't checked all
the other ones yet, as it looked like it was going to take more time than I
had, so I've decided to proceed directly to the next step, i.e.

 CTRE (https://github.com/hanickadot/compile-time-regular-expressions), as
previously discussed. Here my experience is also mixed. On one hand, the
library does provide all the expected convenience of compile-time regexes,
i.e. regex syntax errors are detected at compile-time, which is very nice,
even though the avalanche of errors they result in is not for the faint of
heart. Moreover, access to captures is also checked at compile-time (as one
of the problems with SRELL was using an out-of-range capture group index, I
especially appreciate this). OTOH the library doesn't really seem to be
"production quality": it doesn't have any actual documentation and trying
to compile it as part of lmi resulted in quite a few warnings, not all of
which were obviously harmless. To compensate for this, my PR fixing a bunch
of -Wundef warnings in it was accepted by Hana 5 minutes after I submitted
it, so at the very least the author is indeed very responsive.

 But by far the main problem with CTRE is compile times. I haven't
finished translating all the regexes to CTRE yet, i.e. right now I'm using
a hybrid version in which some regexes are defined at compile-time, while
the other ones still use std::regex, but the compile time has already
ballooned from 4.8s for the current master and 8.3s for the std::regex
version to more than 30s for the current one. Because each regex
corresponds to a separate instantiation of a bunch of templates, the
compile time is directly proportional to the number of regexes used, so I'm
very afraid that converting all of them to CTRE will make compiling just
this file take several minutes. I realize that this might not be a huge
deal in the grand scheme of things, but it's definitely annoying when
working on this file and having to recompile it regularly. I guess I'll
still try to finish converting all the regexes, or at least as many as
possible, to CTRE to see just how slow the compilation and how fast the
run-time will get with it. But I think the best solution would be to
replace many regexes with something else (e.g. many of those used with
phyloanalyze() are just fixed strings) rather than blindly using CTRE for
everything, because it's definitely not free at all in terms of
compile-time.
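
 For the fixed strings, "something else" could be as simple as a plain
substring search. A minimal sketch, in which contains_literal() is a
hypothetical helper rather than anything existing in lmi:

```cpp
#include <string_view>

// Hypothetical helper: for a pattern without any regex metacharacters, a
// plain substring search finds exactly what an unanchored regex search
// would, with no regex to compile either at build time or at run time.
bool contains_literal(std::string_view haystack, std::string_view needle)
{
    return std::string_view::npos != haystack.find(needle);
}
```

 This only applies to patterns that really are literal, of course:
anything using anchors, character classes or quantifiers would still need
a regex.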

 And then there is another problem with CTRE: it's an _exclusively_
compile-time library. This means that regexes like private taboos (defined
in a separate file) can't be checked with it and we'll have to keep using
std::regex for them, especially because they require support for matching
ignoring case, which CTRE doesn't (and won't, at least in the near future)
provide. Maybe less obviously, the use of the current year in the copyright
check also makes that check unsuitable for use with CTRE.
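
 To make these two cases more concrete, here is a rough sketch of what has
to stay with std::regex. All function names and patterns below are invented
for illustration and are not the ones used in test_coding_rules: the point
is just that both the pattern text and the year are only known at run time.

```cpp
#include <ctime>
#include <regex>
#include <string>

// Hypothetical: the current calendar year, only known at run time.
int current_year()
{
    std::time_t const t = std::time(nullptr);
    return 1900 + std::localtime(&t)->tm_year;
}

// Hypothetical: a copyright pattern ending with the given year. The pattern
// itself is an assumption for this sketch, not lmi's actual one.
std::regex copyright_regex(int year)
{
    return std::regex("Copyright \\(C\\)[0-9, ]*" + std::to_string(year));
}

// Hypothetical: search for a taboo whose pattern was read from a separate
// file at run time, ignoring case.
bool matches_taboo(std::string const& text, std::string const& taboo_pattern)
{
    std::regex const re
        (taboo_pattern
        ,std::regex::ECMAScript | std::regex::icase
        );
    return std::regex_search(text, re);
}
```

 A real implementation would construct each std::regex once and reuse it
for all the files instead of rebuilding it for every call, as
matches_taboo() does here for brevity.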

 Anyhow, as I said, for now I'll continue with CTRE and try to use it as
much as possible, but I won't look to get rid of std::regex completely, as
it will be impossible due to the reasons above. It shouldn't really be
necessary from a performance point of view either, as just 3 functions
(enforce_taboos(), check_css() and check_defect_markers()) take 85% of the
total run-time, so optimizing them should be sufficient to get most of the
potential speedup.

 I'll also try to return to SRELL to see if its errors may indicate other
(potential) problems in what we're doing or if they're really just bugs in
the library itself. In any case, I don't think we're going to seriously
consider using it in production, as it doesn't seem to be that much faster
than std::regex, but I couldn't run the full benchmarks using it yet.

 And I'd also still like to try using PCRE. It's practically the gold
standard in this domain and everybody compares themselves with it, so it
looks like we should at least check how it works for us.

 Please let me know if you have any comments, thanks in advance,
VZ
