From: Vadim Zeitlin
Subject: [lmi] Continuing deboostification with removing dependency on Boost.Regex
Date: Thu, 27 May 2021 19:04:26 +0200

 Hello,

 Now that we don't use Boost.Filesystem any more, the only other compiled
Boost library that we still require is Boost.Regex and it would be nice to
remove the dependency on it as well (spoiler: if we can do it, we should
also be able to get rid of the dependencies on a couple of other,
header-only, Boost libraries used in 4 of lmi's tests relatively easily and
then become totally Boost-free).

 The problem with doing it is that the obvious solution, which consists in
replacing boost::regex with std::regex, has a serious drawback in
practice: the latter class is much slower (in all implementations, but
especially in libstdc++, which lmi uses for the official builds), see e.g.
https://lists.gnu.org/archive/html/lmi/2016-07/msg00028.html for some
benchmarks. Unlike back in 2016, I now know the reasons for the difference
in performance between {boost,std}::regex but, unfortunately, knowing that
it's due to the ABI compatibility constraints that the std implementation
labours under, doesn't really help because it just means that this problem
is not going to be resolved any time soon (i.e. definitely not until C++23,
and maybe not even then).

 So I'm afraid there is still no perfect solution, just several imperfect
ones. But, unlike in 2016, I think that the benefit/cost analysis now
favours removing Boost.Regex because it paves the way for removing the
dependency on Boost entirely in the near future, and so I think we should
choose one
of the solutions below -- unless you can propose a better one, of course.
For now here are the only possibilities I see:


0. Still switch to std::regex, but optimize test_coding_rules to bring the
   absolute numbers down to something more reasonable than the 30s under
   Linux and 90s under Wine that I saw in my benchmarks.

   + This is the simplest solution conceptually.
   + In the long run there is still hope that std::regex will improve
     and when it does, this should obviously be the best approach.
   - I can't guarantee that we're going to be able to gain a factor of
     6 under Linux (although I think we should, when checking all files,
     just by parallelizing the checks to more than 6 CPUs), and I doubt
     that we would gain a factor of 80 under Wine. Some gain there is
     plausible, as it should be possible to significantly improve the
     match time of at least some regexes, but, as usual, the proof of the
     pudding is in the eating, so I won't know how much it helps before
     actually doing it. So this still risks being slower than the current
     version -- although hopefully not as horribly slow as right now.
   - Optimizations will probably make the code more complicated than it is
     now, even if we'll try to keep it as simple as possible -- but there
     is not much scope for making the code faster by simplifying it here,
     so all changes will only move it in the direction of greater
     complexity, unfortunately.
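
   To give an idea of what the parallelization mentioned above could
   look like, here is a minimal sketch using only the standard library.
   It is not the actual test_coding_rules code: file_to_check,
   count_trailing_space() and check_all() are hypothetical names, and the
   single trailing-whitespace regex merely stands in for the real checks.
   Each worker handles every n-th file, so no two workers ever write to
   the same result slot.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <iterator>
#include <regex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one file to be checked.
struct file_to_check
{
    std::string name;
    std::string contents;
};

// Hypothetical check: count the lines ending in blanks or tabs.
// Only whitespace followed by a newline is counted, so trailing
// whitespace on an unterminated last line is not detected.
std::size_t count_trailing_space(file_to_check const& f, std::regex const& re)
{
    auto const first = std::sregex_iterator(f.contents.begin(), f.contents.end(), re);
    auto const last = std::sregex_iterator();
    return static_cast<std::size_t>(std::distance(first, last));
}

// Run the check on all files, spreading them over the available CPUs.
std::vector<std::size_t> check_all(std::vector<file_to_check> const& files)
{
    // Compiled once; the workers only read it.
    static std::regex const trailing_space("[ \t]+\n");

    std::vector<std::size_t> results(files.size());
    std::size_t const n_workers =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());

    std::vector<std::future<void>> workers;
    for(std::size_t w = 0; w < n_workers; ++w)
    {
        workers.push_back(std::async(std::launch::async, [&, w]
        {
            // Worker w handles files w, w + n_workers, w + 2 * n_workers...
            for(std::size_t i = w; i < files.size(); i += n_workers)
            {
                results[i] = count_trailing_space(files[i], trailing_space);
            }
        }));
    }

    // Wait for all workers; get() also rethrows any exception from a worker.
    for(auto& f : workers)
    {
        f.get();
    }
    return results;
}
```

   Even this small example needs explicit bookkeeping for workers and
   result slots, which illustrates the extra complexity that the
   optimizations would bring.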


1. Switch to some other regex library usable from C++. My favourite one
   remains PCRE (or PCRE2 by now, but this doesn't change much). Ideally
   I'd actually like to switch to using PCRE in wxWidgets (which has, for
   a very long time, used its own fork of Henry Spencer's regex library,
   which is less fully-featured, slower and, worst of all, completely
   unmaintained), which would include it as a submodule, and then we could
   easily use the version already built as part of wxWidgets from lmi too
   (i.e. I do _not_ propose using wxRegEx, as you'd be totally justified to
   balk at introducing a dependency on wx in test_coding_rules).

   + We should get performance as good as with Boost.Regex or better.
   - We would still depend on a 3rd party library.
    ± But: PCRE is a C library and so building it is simpler than building
      Boost.Regex and shouldn't need to be updated with every gcc/C++
      version update.
    ± It could be entirely hidden in wx build process, as discussed above.
   - This would require more work than just doing s/boost/std/g.


2. Rewrite test_coding_rules in some other language and use its built-in
   regex support. Originally you had expressed interest in rewriting it
   in Perl, but I think the main argument for it was that Perl was already
   available in the MSW build environment due to Git's dependency on it. Now
   that we only use Debian for building, I don't think this argument is as
   important as it was, and so IMHO we should rather use whichever language
   you would feel the most comfortable with, provided that it can be
   apt-got easily. Which means that we could use Raku (https://raku.org/),
   which, just like Perl, has the advantage of not needing to be built
   ("scripting language"), which IMO is important for a program like this
   which may need to be used independently of lmi build system (remember
   the Git hook discussion); Rust (https://www.rust-lang.org/), which does
   need to be compiled but due to its built-in build system it's as simple
   as a single command and it's still separate from the main lmi build
   system, so it's still somewhat advantageous; or even something like Nim
   (https://nim-lang.org/) which can be either executed as a script or
   compiled into native code.

   The main question is how to do it in a satisfactory way from your point
   of view, i.e. how to allow you to easily modify the existing tests and
   add new ones without overburdening you with the details of the language
   used to implement them. You had proposed totally separating the
   "language" part from the tests part, but I just don't think that it's
   realistically possible to do this completely. We could, of course, have
   a table of regexes to apply, and maybe even put the conditions for
   applying them into the same table, but I don't think we can express all
   of the tests in this way. IOW to really separate the checks themselves
   from the rest, we'd need some DSL for describing these checks and while
   all the languages listed above do allow defining DSLs, it's not obvious
   at all to me that such a DSL would be simpler than the language it is
   used from.
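
   For illustration, here is a minimal sketch of such a table, written in
   C++ with std::regex only to keep it comparable with the existing
   program. Everything in it is hypothetical: the rule struct, the
   has_suffix() helper and the two example rules are assumptions made up
   for this sketch and are not taken from test_coding_rules.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical description of one check: the condition for applying it,
// the pattern that must not occur, and the complaint to report.
struct rule
{
    std::function<bool(std::string const&)> applies_to; // takes the file name
    std::regex forbidden;
    std::string message;
};

bool has_suffix(std::string const& s, std::string const& suffix)
{
    return suffix.size() <= s.size()
        && 0 == s.compare(s.size() - suffix.size(), suffix.size(), suffix);
}

// The table itself: adding a simple check means adding one entry here.
std::vector<rule> const& rules()
{
    static std::vector<rule> const table
        {{[](std::string const& n)
              {return has_suffix(n, ".cpp") || has_suffix(n, ".hpp");}
         ,std::regex(R"(\bNULL\b)")
         ,"use nullptr instead of NULL"
         }
        ,{[](std::string const& n) {return !has_suffix(n, ".make");}
         ,std::regex("\t")
         ,"tab character outside a makefile"
         }
        };
    return table;
}

// Apply every applicable rule to one file and collect the complaints.
std::vector<std::string> check(std::string const& name, std::string const& contents)
{
    std::vector<std::string> complaints;
    for(auto const& r : rules())
    {
        if(r.applies_to(name) && std::regex_search(contents, r.forbidden))
        {
            complaints.push_back(name + ": " + r.message);
        }
    }
    return complaints;
}
```

   Checks of the form "this pattern must not occur in these files" fit
   the table well, but anything that needs to relate several matches to
   each other would have to be written outside of it, which is the
   limitation described above.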

   So I'm afraid you'd still have to know at least something about the
   language this program is written in and not recoil in horror whenever
   you see it. From this point of view, using a statically-typed language
   (fully so, such as Rust or Nim, or even only partially so, such as Raku)
   would be IMNSHO preferable to using a dynamically-typed language such as
   Perl (or Python or, should you consider joining the dark side,
   JavaScript), as it's always better to get compilation rather than
   run-time errors when doing something you're not quite sure about. But,
   really, it's a question for you, as only you can decide whether you're
   going to be comfortable with any of them or none at all and I'd
   recommend at least looking at the examples shown in the documentation
   to make up your mind about them. Here are some more links:

   - Raku: several links from https://www.raku.org/resources/
   - Rust: https://doc.rust-lang.org/stable/rust-by-example/
   - Nim: https://learnxinyminutes.com/docs/nim/ or
     https://nim-lang.org/docs/tut1.html

   If you do manage to convince yourself that you might be comfortable
   with one of them, I think this solution has many advantages:

   + Code should be significantly shorter and clearer in any other
     language providing higher level operations than C++.
   + Scripting languages make it easier and faster to experiment with the
     changes.
   + Parallelizing the checks will be much simpler and safer too:
     * Rust claims to allow for "fearless concurrency", and it's not an
       idle boast: multithread-unsafe code does not _compile_ in Rust,
       which is very nice, if you ask me.
     * Raku provides higher level facilities which let you avoid writing
       explicit multithreaded code in the first place, which is arguably
       even nicer.
   + Performance should be pretty good:
     * Nim's regex support is based on the same PCRE as I'd like to use
       for C++ anyhow.
     * Rust has a fast regex implementation in its regex crate (used by
       ripgrep, a.k.a. rg (https://github.com/BurntSushi/ripgrep), which is
       the one grep replacement that I've finally switched to after
       considering switching to ack, ag, ... in the past -- but they just
       didn't seem sufficiently better than grep to justify it, while rg
       definitely did seem so, and is).
     * I'm more doubtful about Raku, as its regex support is crazy
       powerful and nice to use, but not necessarily known (at least to me)
       for its performance.
   - Just the one discussed above: the program will be written in another
     language, different from C++ and APL, and you would still need to know
     at least something about this language in order to use it.
     * From this point of view, Nim almost certainly has the least steep
       learning curve, which is why I include it, even though it's not a
       language I know myself particularly well or have much experience
       with -- but have heard good things about and my limited contact with
       it was pleasant enough.
     * Rust is IMNSHO the most promising one as I think it will replace C++
       for a lot of projects the latter is currently used for.
     * Raku is IMHO the most fun to use.


 While I'm writing all this because I'd like you to make a choice, I can't
prevent myself from writing what my own choice would be and also what
recommendation I'd like to make. I just hope you don't draw any
unfavourable conclusions about my psychological state from the fact that
the two directly contradict each other, because:

- My own first choice would definitely be to rewrite test_coding_rules in
  another language because I see no reason to do it in C++ and am sure that
  it would be smaller, simpler and, compared with a std::regex version,
  significantly faster after the rewrite.

  My second choice would be to switch to using PCRE. It's somewhat of a
  self-interested one because I'd like to gain experience using PCRE in
  order to use it in wx, but its appeal also lies in not having to depend
  on vagaries of std::regex implementations and PCRE is, more or less, the
  industry standard and I'm quite confident in it being stable enough for
  the foreseeable future.

  Working on optimizing the existing code would be my last choice because
  this would complicate the code even further, rather than simplify it, as
  with the first choice, and I don't really see any compelling interest in
  doing this.


- But from your point of view, I think what you really want is to make the
  fewest changes that allow us to get rid of Boost.Regex
  without suffering a catastrophic performance drop. And, if so, starting
  with optimizing the existing code with std::regex probably is the best
  thing to do because if it solves the problem well enough (i.e. we get a
  50% slowdown rather than a 50x one), it would require the least effort
  on your part.

  If this optimization effort fails, or if you decide that the new and
  faster version is too complicated, switching to PCRE would be the next
  least intrusive thing from your point of view, especially if we do use
  the version included as submodule in wxWidgets.

  And, finally, I have the impression that you only agreed to use Perl
  under duress when we discussed this the last time and while any of the
  languages discussed above is more readable than Perl, I suspect you
  might still be only very cautiously enthusiastic about any of them.


 However I could well be wrong in my attempts to read your mind, so maybe
your actual point of view is completely different. It would be great if you
could please share it and express your thoughts about this matter and
decide what you would like to do. Again, none of the proposals
above is perfect, but I think they're all better than doing nothing and
remaining stuck with Boost.Regex or switching to std::regex and making the
program unusable.

 Thanks in advance!
VZ
