Re: [lmi] Continuing deboostification with removing dependency on Boost.
From: Greg Chicares
Subject: Re: [lmi] Continuing deboostification with removing dependency on Boost.Regex
Date: Fri, 28 May 2021 20:37:39 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0
On 5/28/21 4:20 PM, Vadim Zeitlin wrote:
> On Fri, 28 May 2021 14:14:16 +0000 Greg Chicares <gchicares@sbcglobal.net>
> wrote:
>
> GC> On 5/27/21 5:04 PM, Vadim Zeitlin wrote:
> GC>
> GC> [...after drastic snippage, the options are...]
> GC>
> GC> > 0. Still switch to std::regex, but optimize test_coding_rules
> GC> > 1. Switch to some other C++ regex library. My favourite one remains PCRE
> GC> > 2. Rewrite test_coding_rules in some other language
> GC> First of all, should we first consider:
> GC> https://github.com/hanickadot/compile-time-regular-expressions
> GC> to see whether that's a magic answer to our regex woes?
>
> This falls under the solution (1). But when I had suggested looking into
> this library (CTRE) previously, you asked me whether I understood its code
> well enough to make changes to it if necessary and my honest answer was
> that I definitely didn't -- and this didn't change since then.
Back then, it seemed like so many other projects today that become
abandonware tomorrow. But now I see that its author, Hana Dusíková,
is the Chair of SG7 and the author of
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1433r0.pdf
and has toured the world giving talks about it, so she's an estimable
person who's unlikely to abandon this work soon.
Apparently it's been tested with gcc, clang, and msvc, and is compatible
with C++20 and proposed for C++23, so it should be pretty immune to
software rot for years to come.
> Of course, I
> don't really understand PCRE neither, but I'm pretty sure that there will
> be other people maintaining it for the foreseeable future, so from
> maintainability perspective PCRE is surely a much safer bet than CTRE.
PCRE has a couple of different C++ bindings, but at a very casual glance
they seem not to be all that good. Our experience with a poor-quality
C++ wrapper (libxml++) for an eminently stable C library (libxml2)
makes me cautious. And if taking this path leads us to developing a
superior C++ wrapper (like xmlwrapp), is that worth the effort?
> GC> Does 'test_coding_rules' become fast enough if we rewrite
> GC> at least some of its expressions using this library?
>
> It almost probably will, yes.
If you would like to carry out the p1433r0-ization, I'll certainly
prioritize reviewing it.
> GC> I'm not willing to rewrite a 1300-line program in a new
> GC> language that I don't know in order to optimize nineteen
> GC> or fewer expressions.
>
> For once I did manage to predict your reaction correctly, so I'm not
> really surprised, but in my defense, I've only brought this solution up
> because the original idea to do it (in Perl) was yours.
Back then, it seemed like perl-5 was ubiquitous, and online discussions
everywhere abounded in perl one-liners. I figured that 'sed -e' and
'perl -pe' were pretty much the same thing, and we could write a 'sh'
script to use either.
But writing an entire program in a foreign language is a different
matter entirely. Perl was always idiosyncratic, and now it would seem
to have lost its lustre.
> GC> https://lists.nongnu.org/archive/html/lmi/2016-06/msg00060.html
> GC> | > FWIW replacing boost::regex with std::regex is trivial, they have exactly
> GC> | > the same API. However a couple of regexes will need to be adjusted because
> GC> | > C++11 std::regex doesn't support Perl extended syntax for some reason. A
> GC> | > simple grep finds only 4 occurrences of it and 3 of them use the syntax
> GC> | > which is supported by ECMAScript regexes and hence C++11, the only problem
> GC> | > is the use of "(?-s)" to turn off the "s" flag in regex_test.cpp, so this
> GC> | > test would need to be modified to avoid using it, but it shouldn't be a
> GC> | > problem.
> GC> |
> GC> | Since you know how to grep for perl regexes, I think I'd better leave
> GC> | this task to you.
> GC>
> GC> Because you've shared speed measurements, I assume you've
> GC> already made those changes locally; would you share them
> GC> with me please?
>
> I'll have to rebase my ~5 year old commit on master first, as,
> unsurprisingly, it doesn't apply cleanly, but while/when I do it, I could
> also redo the benchmarks, and maybe profiling, myself. Would you like me to
> do it or do you really want to do it on your own?
I would prefer not to do it on my own.
> GC> Until we know which regexes are actually costly, I don't
> GC> believe we understand the problem well enough to proceed to
> GC> a discussion of solutions. For example, if measurements show
> GC> that this regex is really costly:
> GC>
> GC> static boost::regex const forbidden(R"([\x00-\x08\x0e-\x1f\x7f-\x9f])");
> GC>
> GC> then an inline comment already states an alternative that
> GC> might suggest a simple, tidy solution:
> GC>
> GC> LC_ALL=C grep -P '[\x00-\x08\x0e-\x1f\x7f-\x9f]' list-of-files
>
> Shelling out to do a search from a C++ program would be the weirdest
> optimization I have ever seen...
I'm thinking not of 'test_coding_rules', but of 'make check_concinnity',
which is a conglomeration of
(1) 'sh' commands,
(2) an 'xmllint' invocation, and
(3) the compilation and execution of 'test_coding_rules'
so it's no big deal to move a test from (3) to (1).
> GC> How many regexes will prove to be really costly? Ten? Three?
> GC> Once we identify the culprits, my preferences would be:
> GC> - first, optimize the regexes themselves
[...]
> Does this mean that we shouldn't even attempt parallelizing the checks?
> It should be an almost guaranteed order of magnitude factor win (because
> modern systems have O(10) CPUs), so I'd strongly consider doing it
> independently of anything else. Do you object to this?
Interesting question. In principle, of course it's a great idea,
and I compile code in parallel all the time. In practice, the
difficulty [or so I thought] is that "external" parallelization:
    test_coding_rules$(EXEEXT) src/*
would lose the summary statistics that the program gathers and
prints, which I'd much rather preserve; so it would need to be
parallelized "internally", e.g., via threads. But this is the
twenty-first century, and we have threads in the standard library.
My only objection would be that this takes effort to code. However...
...stepping back and taking a fresh look, those statistics are:
    692 source files
    196647 source lines
    277 marked defects
which could be generated by a tiny regex-free C++ program (or a
couple of shell commands), so maybe I was mistaken, and we could
use the builtin parallelism of 'gnu make -jN' to run a new
'test_coding_rules_without_summary_statistics' against an arbitrary
list of files independently, without threading. So the only
hesitation I had evaporates, and the suggestion becomes compelling.