bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Fri, 5 May 2023 18:26:52 +0200
On 5 May 2023 at 12:31, Ihor Radchenko <yantar92@posteo.net> wrote:
> Not exactly. The actual statistics is the following (of course, it is a
> subject of the actual parsed file structure).
>
> Below, I measured time spent in different branches of cond.
This is useful. It looks like drawers and list items consume a lot of time. I
know very little about Org, but from afar it looks like all drawers share the
same basic form. Can't you recognise them with a single regexp and then branch
on the drawer type for subtype-specific treatment?
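A minimal sketch of that idea, not Org's actual code: `my-drawer-open-re` and `my-drawer-type-at-point` are hypothetical names, and the character class used for the drawer name is an assumption about which characters a name may contain.

```elisp
;; Hypothetical sketch: one generic regexp recognises any drawer opening
;; line; the drawer type is decided afterwards from the captured name.
(defconst my-drawer-open-re
  "^[ \t]*:\\([^ \t\n:]+\\):[ \t]*$"
  "Assumed generic form of a drawer opening line; group 1 is the name.")

(defun my-drawer-type-at-point ()
  "Return a symbol describing the drawer opening at point, or nil.
Point is assumed to be at the beginning of a line."
  (when (looking-at my-drawer-open-re)
    ;; Single match done; now branch on the name instead of running
    ;; one regexp per drawer type.
    (pcase (upcase (match-string 1))
      ("PROPERTIES" 'property-drawer)
      ("END" nil)                       ; closing line, not an opening
      (_ 'drawer))))
```

With point at the start of a `:PROPERTIES:` line this returns `property-drawer`; any other drawer name yields the generic `drawer`, and subtype-specific parsing can hang off that branch.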
There are micro-inefficiencies in the regexps here and there that you might
want to try fixing (although I can't promise any noticeable gain from doing so):
(defconst org-property-drawer-re
  (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
          "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
          "[ \t]*:END:[ \t]*$"))
Look at the middle line in particular. Translated to rx, that part becomes
(*? (* (in "\t "))
    ":" (+ (not (syntax whitespace))) ":"
    (? (in "\t ") (* nonl))
    (* (in "\t "))
    "\n")
There are too many ways this could match. Maybe you could change it to
(*? (* (in " \t"))
    ":" (+ (not (in " \t\n:"))) ":"
    (* nonl)
    "\n")
which prevents a lot of unnecessary backtracking and does away with parsing
structure that doesn't matter here.
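To check that such a rewrite still accepts the same drawers, the two variants can be compared side by side. This is only a sketch: the `my-` names are hypothetical, and the new pattern is the suggestion above wrapped in the unchanged first and last lines. Note that the looser middle part also accepts a value that follows the key without a separating blank.

```elisp
(require 'rx)

;; The pattern as quoted from Org.
(defconst my-property-drawer-re-old
  (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
          "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
          "[ \t]*:END:[ \t]*$"))

;; Same first and last lines; only the middle part is the suggested rewrite.
(defconst my-property-drawer-re-new
  (rx bol (* (in " \t")) ":PROPERTIES:" (* (in " \t")) "\n"
      (*? (* (in " \t"))
          ":" (+ (not (in " \t\n:"))) ":"
          (* nonl)
          "\n")
      (* (in " \t")) ":END:" (* (in " \t")) eol))

(defun my-match-end-at-start (regexp text)
  "Return the end position of REGEXP matched at the start of TEXT, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (and (looking-at regexp) (match-end 0))))
```

Calling `my-match-end-at-start` with each regexp on the same sample drawer should give equal results; wrapping the calls in `benchmark-run` on realistic data would show whether the change is measurable.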
Another example:
(defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$")
which is
(: bol
   (* (in "\t "))
   ":"
   (group (+ (| wordchar (in "_-")))) ; <--
   ":"
   (* (in "\t "))
   eol)
Making reasonable assumptions about characters, the line marked with an arrow
could become
(group (+ (not (in " \t\n:"))))
but it's fine if you want to exclude more characters here, as long as you avoid
leaving backtrack points everywhere. (Character syntax is kind of expensive
too.)
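For illustration, here is the drawer regexp with the marked line replaced, next to the original. The `my-` names are hypothetical. The practical difference is that the new form accepts any name free of blanks, newlines and colons, which is the "reasonable assumption" in question.

```elisp
(require 'rx)

;; Original: the name is built from word-syntax characters, `-' and `_'.
(defconst my-drawer-re-old
  "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$")

;; Rewrite: a plain negated character set, no syntax-table lookups and
;; no alternation inside the repetition.
(defconst my-drawer-re-new
  (rx bol
      (* (in " \t"))
      ":"
      (group (+ (not (in " \t\n:"))))
      ":"
      (* (in " \t"))
      eol))

(defun my-drawer-name (regexp line)
  "Return the drawer name that REGEXP captures in LINE, or nil."
  (and (string-match regexp line)
       (match-string 1 line)))
```

Both patterns capture the same name for an ordinary line such as `  :LOGBOOK:`; they diverge only for names containing characters outside the original set, for example a dot.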
Regarding list items, are you still calling (org-item-re) each time?
>> Now if as you suggest the parsing is dominated by sequences of regexps in
>> the branches, it prompts the questions: which branches, what regexps, why
>> are there so many of them, and is there anything that can be done to reduce
>> their number?
>
> Oh. No. The parsing is dominated by `org-element--current-element'. I
> can clearly see it because the profiler hits
> `org-element--current-element', not the branches.
Well, there must be regexps being matched elsewhere, since you showed early on
that the working set is above 40, not the ca. 20 in org-element--current-element.
> I just had no idea what to make of your suggestion about
>
> Run on a reduced dataset, and see if the sequence of regexps being
> exercised, and their frequencies, are consistent with what you
> expect.
Stupid printf-debugging actually, nothing fancier than that.
I'll see if I can put together a patch for you a bit later on.
> (looking-at
> (rx
> (or
> (group-n 1 (regexp org-element--latex-begin-environment))
> (group-n 2 (regexp org-element-drawer-re))
> (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
> (group-n 7 (regexp org-element-dynamic-block-open-re))
> (seq (group-n 4 (regexp "[ \t]*#\\+"))
> (or
> (seq "BEGIN_" (group-n 5 (1+ (not space))))
> (group-n 6 "CALL:")
> (group-n 8 (1+ (not space)) ":")
> ))
> (group-n 9 (regexp org-footnote-definition-re))
> (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
> (group-n 11 "%%("))))
This actually incurs some unnecessary run-time cost: the (regexp ...) forms
make this expand to a `concat` call to construct this rather long regexp each
time. Either only recompute it when any of the variables
(org-element--latex-begin-environment etc) change, or if you intend them to be
compile-time constants, make sure they are expanded as such.
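A sketch of the difference, using two hypothetical stand-in variables (`my-drawer-re`, `my-footnote-re`) in place of the Org ones. It assumes the component regexps are fixed by the time the combined constant is defined.

```elisp
(require 'rx)

;; Hypothetical stand-ins for the Org variables named above.
(defvar my-drawer-re "[ \t]*:[^ \t\n:]+:[ \t]*$")
(defvar my-footnote-re "\\[fn:[0-9]+\\]")

;; Variant A: `rx' cannot know the value of a variable at expansion time,
;; so the form expands to code that assembles the regexp string on
;; every call.
(defun my-at-element-slow-p ()
  "Return non-nil if an element starts at point (rebuilds the regexp)."
  (looking-at-p (rx (or (group-n 1 (regexp my-drawer-re))
                        (group-n 2 (regexp my-footnote-re))))))

;; Variant B: assemble the string once, when the constant is defined,
;; and reuse it.  This is only valid if the components do not change.
(defconst my-combined-re
  (rx (or (group-n 1 (regexp my-drawer-re))
          (group-n 2 (regexp my-footnote-re)))))

(defun my-at-element-fast-p ()
  "Return non-nil if an element starts at point (precomputed regexp)."
  (looking-at-p my-combined-re))
```

If the components are user-customisable, the combined string would instead have to be rebuilt whenever one of them changes, for instance from a setter function.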
> is actually slightly slower overall compared to a series of `looking-at-p'.
> AFAIU, because the `looking-at' needs to allocate match-data vector for
> all these 11 groups, which leads to
> ;; 6.78% emacs emacs [.] process_mark_stack
> floating up in the perf top.
Quite sure that's the concat calls. Match data doesn't actually contribute to
any GC-level consing unless you reify it by calling `match-data`, or indirectly
through `save-match-data` (which I see that you are using in several places --
try not to).
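To make the distinction concrete, here is a small sketch with hypothetical function names and a made-up heading regexp. The first function reifies the match data into a Lisp list; the second reads the match registers directly.

```elisp
(defun my-heading-title-reified ()
  "Return the heading title at point via a reified match-data list."
  (when (looking-at "^\\*+ \\(.*\\)$")
    ;; `match-data' allocates a fresh list (with markers, in a buffer)
    ;; on every call; `save-match-data' calls it internally as well.
    (let ((md (match-data)))
      (buffer-substring-no-properties (nth 2 md) (nth 3 md)))))

(defun my-heading-title-direct ()
  "Return the heading title at point without reifying the match data."
  (when (looking-at "^\\*+ \\(.*\\)$")
    ;; `match-beginning' and `match-end' only read the existing
    ;; registers; no list is built.
    (buffer-substring-no-properties (match-beginning 1) (match-end 1))))
```

Both return the same string; only the first creates extra garbage for the collector on each call.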