Re: [lmi] Best way to integrate PCRE

lmi
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [lmi] Best way to integrate PCRE

From:	Greg Chicares
Subject:	Re: [lmi] Best way to integrate PCRE
Date:	Fri, 27 Aug 2021 16:21:44 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0
On 8/21/21 6:37 PM, Vadim Zeitlin wrote:
> On Wed, 28 Jul 2021 19:45:30 +0000 Greg Chicares <gchicares@sbcglobal.net> 
> wrote:
> 
> GC> On 2021-07-28 18:45, Vadim Zeitlin wrote:
> GC> > On Wed, 28 Jul 2021 16:10:34 +0000 Greg Chicares 
> <gchicares@sbcglobal.net> wrote:
> GC> 
> GC> [...simply withdraw support for msw builds of 
> 'test_coding_rules{EXE_EXT}'...]
> GC> 
> GC> >  Note that this will also require making parts of makefiles conditional 
> on
> GC> > the target host, which wasn't the case so far.
[...]
> GC> That doesn't change the decision, strategically. Tactically, I'd favor
> GC> a minimalist approach along the lines of this untested patch:
> GC> 
> GC> --------8<--------8<--------8<--------8<--------8<--------
[...abbreviated]
> GC> +  ifeq (x86_64-pc-linux-gnu,$(LMI_TRIPLET))
> GC> +    default_targets += \
> GC> +      test_coding_rules$(EXEEXT) \
> GC> +
> GC> +  endif
> GC> -------->8-------->8-------->8-------->8-------->8--------
> 
>  Thanks for this idea! I've followed it initially but, after much
> toing-and-froing, finally decided to use a different approach: I still
> always build test_coding_rules${EXEEXT}, but it just outputs an error if
> it's built under MSW.
> 
>  I believe this is the best solution because
> 
> - It results in more clear error under MSW if something attempts to run
>   test_coding_rules there.
> 
> - It avoids any possibility of breaking something else because
>   test_coding_rules executable is not available when it's supposed to be.
> 
> - It still allows to build the fully functional version under MSW by simply
>   predefining LMI_HAS_PCRE if you do have PCRE library there.
> 
> but please let me know if you disagree before I submit my patch.

I'm not as close to this as you are, so it's hard for me to say
whether one strategy or another is better. But I can speak in
terms of outcomes. For instance:

> GC> We'd also need to change 'hooks/pre-commit', e.g. (untested):
> 
>  With my solution above this is not necessary any more as it would run and
> just output an error under MSW.

Then I guess it might simply emit a message like:
  "'test_coding_rules' not implemented for msw."
That's okay with me if I deliberately try to run it under msw.
But of course I wouldn't want that outcome otherwise. Let me
try to say that more precisely:

(1) In this scenario:

  $ LMI_COMPILER=gcc ; LMI_TRIPLET=i686-w64-mingw32    ; . 
/opt/lmi/src/lmi/set_toolchain.sh
  $ $make test_coding_rules.exe

if I get a useless binary that only says "not implemented",
that's okay.

(2) In a scenario where I may have done this several months ago:

  $ LMI_COMPILER=gcc ; LMI_TRIPLET=i686-w64-mingw32    ; . 
/opt/lmi/src/lmi/set_toolchain.sh

and then, having forgotten about that, I do this:

  $ make $coefficiency check_concinnity 2>&1 |less -S -N

then I want a concinnity check, not a "not implemented" message.

How could that be accomplished? IIRC, years ago we did that for
$(XMLLINT). Grepping for that, I find ('grep' output indented):

  posix_fhs.make:XMLLINT := xmllint

pc-linux-gnu: Just use the system binary.

  msw_cygwin.make:XMLLINT := $(localbindir)/xmllint

cygwin: Use a locally-built binary. The idea was that we always
build libxml anyway, obtaining this as a convenient by-product;
but we either don't have a system binary for it, or that system
binary might be out of date.

  msw_generic.make:#   XMLLINT := $(PERFORM) $(localbindir)/xmllint
  msw_generic.make:    XMLLINT := xmllint
  
"generic" (meaning 'wine', I guess): We have an architecture-specific
build already, but we choose not to use it--IIRC, because the debian
binary is faster.

Just now I looked for "architecture-independent" in FHS-3.0 (FWIW)
and it looks like only /usr/share and /usr/local/share are
designated that way--but they seem to be intended for data.
As for binaries, if I type
  $ grep
I expect to get the operating system's 'grep', say, '/bin/grep';
that's the kind of behavior I want to achieve--i.e., if I type
  $ make $coefficiency check_concinnity
then I just want a concinnity test: if it's identical for all
architectures, that's okay, because its outcome shouldn't vary by
architecture anyway. I guess the place to install that is
  $(prefix)/bin
unless we want to add a new directory for such things, e.g.
  $(prefix)/sbin
for binaries built with one and only one "dominant" architecture.

(If we like both of those, I'd actually prefer $(prefix)/sbin/
because it wouldn't be obliterated by 'make uninstall'. It could
be objected that 'sbindir' should be the architecture-dependent
  $(exec_prefix)/sbin
to which I would reply that we could use an idiosyncratic
directory, say:
  # dominant-architecture binaries to be used always
  xbindir := $(prefix)/xbin
which we'd add to $PATH in all applicable scripts and makefiles.)

>  Finally, the real reason for writing this email, is that I've realized I
> forgot to ask another question: what do you think should be done with
> regex_test.cpp?

Retain it, removing boost options with no standard parallels.

> In my current version, it was changed to use std::regex
> instead of boost::regex and while I could just leave it like this, I don't
> think it makes much sense to have unit tests for the functionality provided
> by the standard library.

It's more than that. For example, test_input_sequence_regex() is captioned:

  /// Motivation: to validate data from external systems. To facilitate
  /// maintenance of xml schemata, a regex is constructed and displayed
  /// for every lmi sequence type.

Other tests demonstrate
  /// the high cost of line-by-line searching in a vector of strings
for std::string::find() vs. regex searches for text that occurs early,
late, and never in a sizable string. This is useful information.

Other code references these tests:
  /// Add a newline at the beginning of the string, and require
  /// a newline at the end, so that "\n" can be used in regexen
  /// instead of '^' and '$' anchors--see 'regex_test.cpp'.
and we don't want to leave those references dangling. Generally,
I prefer to leave well enough alone.

> And while it did make sense to compare Boost.Regex
> performance with different combinations of "m" and "s" modifiers, such
> modifiers are not supported by std::regex, so these tests don't make much
> sense neither.

Apparently boost implemented "m" and "s" but std::regex did not. AFAICT:
 - "m" toggles whether '^' and '$' match on newline boundaries, and
 - "s" toggles whether '.' matches a newline.
I've never thought about such issues except when writing regexes in
'test_coding_rules', a long time ago. I remember that these issues posed
challenges then, but I've never written things like this fluently, and
I've never studied std::regex in detail.

Is it planned to add "m"? See:

https://cplusplus.github.io/LWG/issue2503
| multiline
| Specifies that ^ shall match the beginning of a line and $ shall match
| the end of a line, if the ECMAScript engine is selected.

If so, then is contains_regex2()...

  /// Match a regex as with Perl's '-s'.
  bool contains_regex2(std::string const& regex)
  {return boost::regex_search(text, boost::regex("(?-s)" + regex));}

...the only thing we should consider removing?

> The only way to approximate these modifiers that I found was
> to use basic regex syntax (as opposed to the ECMA script default one), but
> it doesn't seem really useful to compare them as we're almost certainly not
> going to use the basic syntax for anything.

I don't pretend to understand all of this, so let me ask: why wouldn't
we ever use BREs?

Is the "basic" syntax what's tested by contains_regex1(), which today says:
    boost::regex const r(regex, boost::regex::sed);
? The comments preceding that function seem to explain why it's not
necessarily a good idea. If that's exactly your point, then it would
be helpful, for me personally, to retain that unit test, because I
probably never will memorize the rationale--the test's purpose then is
not to test std::regex's correctness, but rather to explore alternatives
and document why some are not used.

> OTOH it could be useful to
> compare std::regex performance with PCRE2, but this would require adding
> tests for LMI_HAS_PCRE to this test as well and I don't know if it's really
> worth it.

If that's not difficult, I think it would be most illuminating, and
needing to add LMI_HAS_PCRE doesn't seem like a significant disadvantage.

>  My personal preference would actually be to just remove this test
> completely because I don't think it's very useful, but I guess you
> would disagree with this

I strongly prefer to keep as much of it as possible, discarding only
any boost-specific features that aren't in std::regex and aren't
proposed to be added to std::regex.

> But if we keep it, should we use std::regex in the "unit test" part of it,
> or PCRE?

Why not both?

> And what should we be comparing in the "benchmark" part of the
> test?

Why not compare everything we use, and every available thing that
we chose not to use?

>  Please let me know if you have any preference or if you think we can just
> leave this test be with the minimum changes required to make it compile
> with std::regex instead of boost::regex and maybe improve (or remove) it
> later.

Minimal changes for now: okay.
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [lmi] Best way to integrate PCRE, Vadim Zeitlin, 2021/08/21
- Re: [lmi] Best way to integrate PCRE, Greg Chicares <=
Prev by Date: Re: [lmi] Best way to integrate PCRE
Previous by thread: Re: [lmi] Best way to integrate PCRE
Index(es):
- Date
- Thread