emacs-devel

Re: rx.el sexp regexp syntax (WAS: Off Topic)


From: Alan Mackenzie
Subject: Re: rx.el sexp regexp syntax (WAS: Off Topic)
Date: Fri, 25 May 2018 18:17:10 +0000
User-agent: Mutt/1.9.4 (2018-02-28)

Hello again, Pierre.

On Fri, May 25, 2018 at 18:47:59 +0200, Pierre Neidhardt wrote:

> Alan Mackenzie <address@hidden> writes:

> >> rx.el is one of the best concepts I've discovered in a long time.
> >> It's another instance of "Don't come up with a new (mini)language when
> >> Lisp can do better": it's easier to learn, more flexible, easier to
> >> write, much easier to read and as a consequence much more maintainable.

> > Much easier than what?  Than the putative mini-language that doesn't get
> > written?

> I meant that in my opinion rx is easier to write than regexps.  That it
> is not popular is the root of the question here.

I think it will be easier only for beginners.

> >> I think it's high time we moved away from traditional regexps and
> >> embraced the concept of rx.el.  I'm thinking of implementing it for
> >> Guile.

> > There's nothing stopping anybody from using rx.el.  However, people have
> > mostly _not_ used it.  The "I think it's high time ...." suggests in
> > some way forcing people to use it.  Before mandating something like
> > this, I think we should find out why it's not already in common use.

> Sorry if you felt I was forcing, that wasn't my intention.  I was
> referring to the long period regexps have been around.

> I thought the reason it's not already in common use had already been
> discussed: it's barely referenced anywhere, it needs more advertising.

> Correct me if this is wrong.

It may be part of the explanation.  But more salient, I think, is that
hackers prefer powerful means of expression.  A single character in a
string regexp has the power of a sexp in the corresponding rx regexp.
Paul Graham (at http://www.paulgraham.com) has had quite a bit to say
about this in the (distant) past.  Conciseness of expression is where
it's at.

> >> At the moment the rx.el implementation is built on top of Emacs regexps
> >> which are implemented in C.  I believe this does not use the power of
> >> Lisp as much as it could.

> > But would any alternative use the power of regexps?

> Yes, rx.el is a drop-in replacement for regexps.  What do you mean?

I'm not sure, any more.  Sorry.

> > Emacs has a (moderately large) cache of regexps, so that building the
> > automatons is done very rarely.  Possibly just once per regexp in each
> > session of Emacs.

> That's the whole point: if possible (see below), remove the requirements
> for regexp cache management.

I don't think that would be wise.  Manipulating the cache is far faster
than generating the automatons at each use.

[ .... ]

> >> The rx.el library/concept could alleviate this issue altogether: because
> >> we express the automaton directly in Lisp, the parsing step is not
> >> needed and thus the building cost could be tremendously reduced.

> >> So the rx.el building steps

> >>   rx expression -> regexp string -> C regexp automaton

> >> could boil down to simply

> >>   rx automaton

> > I don't see what you're trying to save, here.  At some stage, the regexp
> > source, in whatever form, needs to be converted to an automaton.

> Yes, that's what I meant with "rx automaton".  My suggestion (not
> necessarily for Emacs Lisp) is to remove the step that converts the rx
> symbolic automaton to a string, and the conversion from a string to the
> actual automaton.

OK.  That would save only a little, at automaton building time, which
likely would happen just once in any Emacs session.
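
As a minimal sketch of the first step of the pipeline under discussion
(rx expression -> regexp string), assuming a stock Emacs with rx.el
available: the `rx' macro and the `rx-to-string' function both produce
an ordinary regexp string, which is then handed to the C matcher.  The
`my-' names below are hypothetical, chosen only for illustration.

```elisp
(require 'rx)

;; The `rx' macro translates its forms into a plain regexp string
;; when the macro is expanded.
(defconst my-digits-rx
  (rx line-start (one-or-more digit) line-end)
  "Hypothetical example: a line consisting only of digits.")

;; `rx-to-string' does the same translation at run time, which is
;; useful when the form is built dynamically.  The second argument
;; (NO-GROUP) asks for no enclosing shy group.
(defconst my-digits-dynamic
  (rx-to-string '(seq line-start (one-or-more digit) line-end) t)
  "Hypothetical example: the same pattern, built at run time.")

;; Both values are ordinary strings and go through the usual matcher.
(string-match-p my-digits-rx "12345")      ; non-nil: the line is all digits
(string-match-p my-digits-dynamic "12a45") ; nil: "a" is not a digit
```

Nothing in this sketch bypasses the string stage; it only shows where
that stage sits.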

> > Are you suggesting here building an interpreter in Lisp directly to
> > execute rx expressions?

> Yes, but maybe in Guile or some other Lisp.  Don't know if it's feasible
> in Emacs Lisp.

> >> It would be interesting to compare the performance.  This also means
> >> that there would be no need for caching on behalf of the supporting
> >> language.

> > I will predict that an rx interpreter built in Lisp will be two orders
> > of magnitude slower than the current regexp machine, where both the
> > construction of an automaton, and the byte-code interpreter which runs
> > it are written in C (and probably quite optimised C at that).

> Obviously, and this is the prime reason why the author of rx.el
> implemented it on top of C regexp.  My point was that with a fast Lisp
> (or a specifically designed C support), a Lisp automaton would be just
> as fast: the Lisp code would directly map the equivalent C automaton.

> Again, I have no clue if that's doable in Emacs Lisp.

It might be.  But it might be a lot of work for little benefit.

> > I can't get excited about rx syntax, which I'm sure would be just as
> > tedious, and possibly more difficult to read than a standard regexp.

> Have you used rx?

No.  Neither have I used Cobol (much).

> The whole point of the library is to increase readability, and it does
> a great job at it in my opinion.

You seem to want to increase the readability for beginners, for people
who have laboriously to slog through an expression trying to make sense
of each bit of it.  I don't think experienced regexp users have
difficulty with the syntax.  I don't, for one.

There was a time when people thought that

    ADD 1 TO A GIVING B

was more readable than

    b = a + 1;

, and generations of programmers suffered as a result.

> > Analogously, as a musician, I read standard musical notation (with
> > sets of five lines and dots) far more easily and fluently than I could
> > any "simplified" system designed for beginners, which would be bloated
> > by comparison.

> rx.el is meant to be "simplified for beginners".  You could also reverse
> the analogy in saying that regexps are the "simplified version for
> beginners"... The analogy does not map very well.

> A better analogy would be the mapping between assembly and the
> hexadecimal codes of CPU instructions: I don't think many people find
> hexadecimal codes more explicit than assembly verbs and symbols
> (although most assembly languages abuse abbreviations, but the
> intention is there).

Hexadecimal CPU codes aren't, and aren't intended to be, human-readable.
String regular expressions are.

> > Regular expressions can be difficult.  I don't believe this difficulty
> > lies, in the main, in the compact notation used to express them.  Rather
> > it lies in the concepts and the semantics of the regexp elements, and
> > being able to express a "mental automaton" in regexp semantics.

> The semantics of rx and regexp do not differ.  The difference is
> purely syntactic.

Yes.

> Let's consider some points:

> - rx can be written over multiple lines and indented.  This is a great
>   readability booster for groups, which can be _grouped_ together with
>   linebreaks and indentation.

rx MUST be written over several lines and indented.  A string regexp, by
contrast, usually fits onto a single line.

> - rx does not require escaping any character with backslashes.  This
>   is always a great source of confusion when switching from BRE to ERE,
>   between different interpreters and when storing regexp in Lisp strings
>   where backslashes must be escaped themselves for instance.

It is an inconvenience, yes, but I think you're exaggerating its
importance somewhat.  In rx, literal characters have to be "escaped" by
string quotes.  This might be an irritation.

> - Symbols with non-trivial meanings in regexp (e.g. \<, :, ^, etc.) have
>   a trivial _English_ counterpart in rx: (respectively "word-start",
>   nothing, "line-start" _and_ "not").

The "English" counterpart used in rx is bulky and difficult to learn.
Somehow, you've got to learn that it's "word-start" and not
"word-beginning", that it's "not" and not "non", and so on.  This is more
difficult than just learning \< and ^.  If your native language isn't
English, it might be much more difficult.

> - No more special-case symbols like "-" for ranges or "^" (negation when
>   first character in square brackets).  Thus less cognitive burden.

That remains in dispute.

> - The "^" has a double-meaning in regexp: "line-start" and "not".

Yes, it is context dependent.  I don't think this causes confusion in
practice.
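
To make the points above concrete, here is a rough side-by-side sketch
of one pattern in both notations, assuming a stock Emacs with rx.el
available.  The pattern itself and the `my-' names are hypothetical,
picked only to exercise the features under dispute: `\<', a range, a
negated set, a backslash-escaped literal and a line anchor.

```elisp
(require 'rx)

;; String notation: fits on one line, but every backslash is doubled
;; because the regexp lives inside a Lisp string.
(defconst my-string-re "\\<[a-z][^ ]*\\.$"
  "Hypothetical: a word starting with a-z, ending in a dot at end of line.")

;; rx notation: no backslashes, but literals are quoted and the whole
;; thing spreads over several lines.
(defconst my-rx-re
  (rx word-start
      (any "a-z")
      (zero-or-more (not (any " ")))
      "."
      line-end)
  "Hypothetical: the same pattern written with rx.")

;; Both are plain regexp strings by now, so they are used identically.
(list (string-match-p my-string-re "see foo.")
      (string-match-p my-rx-re "see foo."))
```

Whether the second form counts as more readable or merely longer is
exactly the point in dispute.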

> The list goes on.

Well, so far, on this list, two or three people have said they "like"
rx.el.  Nobody has said "I'm going to be using rx.el in my programs from
now on".  I don't think they will.

We'll see.

> --
> Pierre Neidhardt

-- 
Alan Mackenzie (Nuremberg, Germany).


