bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)


From: Ihor Radchenko
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Tue, 02 May 2023 21:21:39 +0000

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> The Org parser is basically a giant `cond' of a number of regexp
>> matches. See `org-element--current-element'.
>
> A common way to handle this is to build a big regexp to match many cases at 
> the same time, essentially transforming
>
> (cond ((looking-at RE1) ...)
>       ((looking-at RE2) ...)
>       ...)
>
> to
>
>     (looking-at (rx (or (group RE1) (group RE2) ...)))
>     (cond ((match-beginning 1) ...)
>           ((match-beginning 2) ...)
>           ...)
>
> This reduces the number of regexps used and is also typically faster.
> (Essentially this is what `syntax-propertize-rules` does but in a more 
> specialised context.)

I tried this, and it did not give any noticeable improvement.
Most likely, because the actual `cond' is

  (cond ((looking-at "foo) (<parser function calling many more regex 
searches>)) ...)

Actually, I started looking into C-level perf data right after trying to
consolidate the regexps into one giant looking-at form and not seeing
any improvement. At that point, I already cached most of the
dynamically generated regexps in there and ran out of reasonable ideas.

> Using tree-sitter for this could very well be even faster but it's not 
> guaranteed to be available.

The available tree-sitter grammars for Org are about 1.5-2x faster in my
previous tests, but they do less granular parsing of fields. And not
complete. Org is not context-free and does not fit well into GLR.

And we are not going to use tree sitter for development to avoid
increasing contribution barriers.

> Otherwise it's very much a matter of optimisation of everything, including 
> regexps. Minimise backtracking.
> If you want to match five or more dashes, use "------*" instead of 
> "-\\{5,\\}". And so on.

This example sounds like something that regexp compilation should be
able to optimize, no? I do not easily see why the latter should cause
more CPU time compared to the former.

> It's also obviously a good idea not to generate regexps dynamically each time 
> if you can help it, and minimise consing in general.

Sure. I was able to shave a few seconds off using this idea.
Other than regexp compilation hotspot, I only see re-writing parser
non-recursively (`org-element--parse-elements' and
`org-element--parse-objects').

>>> Introducing regexp objects that could store compiled regexps and be
>>> used instead of strings would be quite some work but probably
>>> worthwhile.
>> 
>> What exactly needs to be done? Assuming that regexp objects are not
>> going to be readable, for simplicity.
>
> A proper design, for starters. For example, we probably want them to be 
> usable in customised variables which calls for them to be readable.

Or, alternatively, the parsed regexps can be attached to string objects
internally. Then, regexp cache lookup will degenerate to looking into a
string object slot.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





reply via email to

[Prev in Thread] Current Thread [Next in Thread]