From: Greg Chicares
Subject: Re: [lmi] First experience with std::regex from gcc 11 and CTRE in test_coding_rules
Date: Wed, 2 Jun 2021 22:38:09 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

On 6/2/21 8:56 PM, Vadim Zeitlin wrote:
[...]
>  So far I have made some changes to the Boost.Regex version in preparation
> for the switch to std::regex, notably removing the statistics display
> (which is now done by "make check_concinnity" only) and splitting the regex
> in check_include_guards() that contains ".*" in the middle into start and
> end parts, because using ".*" with std::regex results in a crash due to a
> stack overflow (I usually try to be understanding and forgiving with other
> people, especially in writing, but here I have no idea what the author of
> this code could have been thinking when they decided to implement the "*"
> quantifier using recursion, as this is a painfully, obviously horrible idea).

What library has this recursive Kleene-star implementation--the
libstdc++ version corresponding to gcc version eleven?
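
For concreteness, I imagine the split you describe looks roughly like
this--the guard pattern is made up for illustration, not lmi's actual
check_include_guards() logic:

    #include <regex>
    #include <string>

    // Instead of one pattern whose ".*" spans the whole file (which can
    // exhaust the stack in a backtracking implementation), search for the
    // beginning and the end separately.
    bool has_include_guard(std::string const& contents, std::string const& guard)
    {
        std::regex const start {"#ifndef +" + guard + "\\s+#define +" + guard};
        std::regex const end   {"#endif *// *" + guard};
        return std::regex_search(contents, start)
            && std::regex_search(contents, end);
    }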

>  After removing the statistics display I could also easily test running
> "parallel -m test_coding_rules ::: *" instead of running it directly. This
> results in a big improvement (~3.5 times faster) for me, but not as big as
> might be expected on the 8-CPU machine where I've been testing this so far,
> so there is definitely some overhead due to using GNU parallel and I'd
> expect an even bigger speed up if we implemented the parallelism
> internally.

Okay, as expected--an incremental improvement, and an expectation of
further incremental improvement--but a revolutionary change is wanted.
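
If we did implement the parallelism internally, a rough sketch might be
(process_file() here is a hypothetical stand-in for running all the
checks on one file):

    #include <future>
    #include <string>
    #include <vector>

    // Hypothetical: runs every check on one file, returning its defect count.
    int process_file(std::string const& filename);

    // Distribute files across asynchronous tasks instead of across
    // separate processes launched by GNU parallel.
    int process_all(std::vector<std::string> const& filenames)
    {
        std::vector<std::future<int>> tasks;
        for(auto const& f : filenames)
            tasks.push_back(std::async(std::launch::async, process_file, f));
        int defects = 0;
        for(auto& t : tasks)
            defects += t.get();
        return defects;
    }

A real implementation would presumably cap the number of concurrent
tasks rather than launching one thread per file.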

>  Next, I've switched to std::regex. The good news is that it's much less
> horrible in gcc 11 and the new version is "only" ~70% slower (but twice as
> slow when used with parallel). To understand just how "good" the
> performance of std::regex in libstdc++ is, consider that the regex
> implementation in libc++, used by clang, is 40 *times* slower than
> Boost.Regex.

So, with gcc eleven's libstdc++, it's half the speed of boost.
That is revolutionary. But then again I suppose the speed gains
could be taken away in a future version. Assuming the gains
persist, is half of boost's speed good enough?

>  Next I tried using SRELL (http://www.akenotsuki.com/misc/srell/en/), which
> is a library I've found only recently, and which looked appealing because
> it's supposed to provide exactly the same interface as std::regex, so
> testing with it should have been very simple.

That looks like a single person's project, and we'd have to
consider whether its rejection:
  http://www.akenotsuki.com/misc/srell/en/proposal/
might cause its author to lose interest in maintaining it.

I couldn't tell what regex engine he uses. If he had used
PCRE, he'd achieve total world domination.
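
If its interface really is a faithful copy of std::regex's, then trying
it out could amount to little more than an alias--a sketch only, with a
made-up LMI_USE_SRELL macro (and assuming its single header is named
"srell.hpp"):

    #if defined LMI_USE_SRELL
    #   include "srell.hpp"
        namespace re = srell;
    #else
    #   include <regex>
        namespace re = std;
    #endif

    // Client code then uses re::regex, re::smatch, re::regex_search, ...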

> Unfortunately in practice [...snip...]
> so I've decided to proceed directly to the next step, i.e.
> 
>  CTRE (https://github.com/hanickadot/compile-time-regular-expressions), as
> previously discussed. Here my experience is also mixed. On one hand, the
> library does provide all the expected convenience of compile-time regexes,
> i.e. regex syntax errors are detected at compile-time, which is very nice,
> even though the avalanche of errors they result in is not for the faint of
> heart.

In the past, I think I developed some regexes using grep or
sed, and pasted them into C++ after getting them to work; so
this doesn't seem to be a big disadvantage.
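
For example, a minimal CTRE call might look like this (C++20, made-up
pattern): a typo in the pattern becomes a compile-time error rather
than a run-time throw.

    #include <ctre.hpp>

    #include <string_view>

    // The pattern is a template argument, so it is parsed during compilation.
    bool flags_defect_marker(std::string_view line)
    {
        return static_cast<bool>(ctre::search<"TODO|FIXME">(line));
    }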

> [...] the library doesn't really seem to be
> "production quality": [...] To compensate for this, my PR fixing a bunch
> of -Wundef in it was accepted 5 minutes after I submitted it by Hana, so at
> the very least the author is indeed very responsive.

That doesn't sound too bad, because we'd plan to use this library
(or some other) only for 'test_coding_rules': it wouldn't be used
at all in anything we distribute to end users.

>  But by far the main problem with CTRE is compile times. [...] it's
> directly proportional to the number of regexes used, so I'm very afraid
> that converting all of them to CTRE will make compiling just this file take
> several minutes. I realize that this might not be a huge deal in the grand
> scheme of things, but it's definitely annoying when working on this file
> and having to recompile it regularly. I guess I'll still try to finish
> converting all the regexes, or at least as many as possible, to CTRE to see
> just how slow the compilation and how fast the run-time speed get with it,

Yes, especially the run-time speed. If, say, it's (surprisingly)
not faster than gcc-11's std::regex, then we'd drop CTRE.

> but I think the best solution would be to replace many regexes with
> something else (e.g. many of those used with phyloanalyze() are just fixed
> strings), rather than blindly using CTRE for everything, because it's
> definitely not free at all in terms of compile time.

In many cases, we could just scan strings using C. Perhaps in
some other cases we'd find std::regex would be fast enough.
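
For instance, a fixed-string check needs no regex engine at all--the
string here is made up for illustration:

    #include <string>

    // A "pattern" that is really a fixed string reduces to substring search.
    bool uses_config_header(std::string const& contents)
    {
        return std::string::npos != contents.find("#include \"config.hpp\"");
    }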

Maybe we could split 'test_coding_rules.cpp' into several files
that could be compiled separately in parallel. We'd only need
to recompile each one when it actually changes, i.e., rarely.

And we could compile 'test_coding_rules' with only a single
architecture, pc-linux-gnu obviously. It's an auxiliary tool.
We don't recompile git for msw and run it under wine.

>  And then there is another problem with CTRE: it's an _exclusively_
> compile-time library. This means that regexes like private taboos (defined
> in a separate file) can't be checked with it and we'll have to keep using
> std::regex for them, especially because they require support for matching
> ignoring case, which CTRE doesn't (and won't, at least in the near future)
> provide.

I could reexamine those. Some might not be wanted any more.
Others might be suitable for putting in the public repository.
Or maybe we could compile any private taboos separately and
just link everything together.
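
Whatever the linking arrangement, the taboo check itself would stay on
a run-time engine, roughly like this (load_taboo_patterns() is
hypothetical):

    #include <regex>
    #include <string>
    #include <vector>

    // Hypothetical: reads taboo patterns from their separate file.
    std::vector<std::string> load_taboo_patterns();

    // Patterns known only at run time need a run-time engine; icase
    // supplies the case-insensitive matching the taboos require.
    bool violates_taboo(std::string const& contents)
    {
        for(auto const& pattern : load_taboo_patterns())
        {
            std::regex const taboo {pattern, std::regex::icase};
            if(std::regex_search(contents, taboo))
                return true;
        }
        return false;
    }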

> Maybe less obviously, but the use of the current year in the
> copyright check also makes it unsuitable for use with CTRE.

I'd guess that just means we'd have to recompile it every year?
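
Or we could keep just that one check on a run-time engine and feed it
the year when the tool runs--e.g., something along these lines, with a
purely illustrative pattern:

    #include <regex>
    #include <string>

    // The expected year is known only at run time (unless the tool is
    // rebuilt annually), so the pattern is assembled as a string.
    bool copyright_is_current(std::string const& contents, int year)
    {
        std::regex const copyright
            {"Copyright \\(C\\)[^\\n]*" + std::to_string(year)};
        return std::regex_search(contents, copyright);
    }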

>  Anyhow, as I said, for now I'll continue with CTRE [...] 3 functions
> (enforce_taboos(), check_css() and check_defect_markers()) take 85% of the
> total run-time, so optimizing them should be sufficient to get most of the
> potential theoretical speed up.

Perhaps focusing on those three functions only, or even on just
one of them, would tell us how fast CTRE might be.

>  And I'd also still like to try using PCRE. It's practically the gold
> standard in this domain and everybody compares themselves with it, so it
> looks like we should at least check how it works for us.

At first, I was thinking this would mean writing a C++ wrapper
for PCRE, kind of like the way SRELL's API was designed (i.e.,
by copying the standard). But maybe it would be much less work
to use PCRE's C API directly, for the three functions that take
most of the time--or, as a useful experiment, for just one.
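
A bare-bones use of the PCRE2 C API from C++ might look something like
this (error handling pared down to nothing, pattern supplied by the
caller); the point would be to compile each pattern once and reuse it
across files, which this sketch omits:

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>

    #include <string>

    bool pcre2_contains(std::string const& subject, char const* pattern)
    {
        int         error_code   = 0;
        PCRE2_SIZE  error_offset = 0;
        pcre2_code* re = pcre2_compile
            (reinterpret_cast<PCRE2_SPTR>(pattern)
            ,PCRE2_ZERO_TERMINATED
            ,0                  // options
            ,&error_code
            ,&error_offset
            ,nullptr            // compile context
            );
        if(nullptr == re)
            return false;
        pcre2_match_data* md = pcre2_match_data_create_from_pattern(re, nullptr);
        int rc = pcre2_match
            (re
            ,reinterpret_cast<PCRE2_SPTR>(subject.data())
            ,subject.size()
            ,0                  // start offset
            ,0                  // options
            ,md
            ,nullptr            // match context
            );
        pcre2_match_data_free(md);
        pcre2_code_free(re);
        return 0 < rc;
    }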

