emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Reliable after-change-functions (via: Using incremental parsing in E


From: Stephen Leake
Subject: Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
Date: Wed, 01 Apr 2020 15:38:26 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (windows-nt)

Eli Zaretskii <address@hidden> writes:

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text?  

If the parser copies the text into an internal buffer, that reader
function should only be called once per call to the parser. Parsers used
to based around small buffers that would read in a file a chunk at a
time, but that is not necessary on any machine that can run Emacs. Since
Emacs has the entire file in memory, the parser can too.

However, if we are really trying to avoid copying text (which is very
premature optimization), then the reader function will be called many
times during parsing (to fetch each word), and possibly during the
grammar actions (to compute indent or face).

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary.  

To some extent, that depends on the language.

The parser must be able to complete a parse, to generate a complete
syntax tree. I'll assume no error correction for a moment; more below.

In C or C++ body files, "a complete parse" is typically one variable or
function declaration. So if Emacs can reliably find the beginning and
end of those declarations, it could pass just the ones containing the
region of interest to the parser. Tree-sitter (if it supports this at
all, or is modified to) would end up with a forest of small parse trees,
rather than one large one. They might get merged if large chunks of text
are parsed together.

In Ada and Java, and most C++ header files, "a complete parse" is a
file; it contains an Ada package spec or body, or a Java or C++ class,
or a C++ namespace.

There are many "small languages" for which "a complete parse" is similar
to a "statement". Bash shell, for example. They could pass just the
statement, but only if Emacs can reliably find the start and end (not
always easy).

It is also possible to modify the language grammar to allow smaller
pieces of code to be a complete parse; ada-mode does this, making a
single declaration or statement "a complete parse", in order to support
"partial parse". That can easily lead to errors in indent, since the
indent of the start of the text portion is unknown (ada-mode simply
assumes it is correct in the buffer).

Another reason to allow smaller code chunks to be a complete parse is to
allow parsing the code fragments that appear in grammar actions; the
ELPA package wisitoken-grammar-mode uses this for wisitoken grammar
files with Ada actions.

In sum, the short answer is "yes, you must parse the whole file, unless
your language is particularly simple".

Since we need to support the worst case, we should assume the whole file
must be parsed at least once.

> Can we pass to the parser just a small chunk (say, 500 bytes) of the
> buffer around the window-full to be displayed next? If this presents
> problems, what are those problems?

In wisi, the error correction code will fill in the missing text so a
complete parse is possible. Since some of that is guesses, the results
may not be very good. Tree-sitter also has error correction; I'm not
clear how good it is.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary.  

yes, because copying text is fast compared to everything else going on.

> My suggestion is first to figure out how to do this stuff efficiently
> from within Emacs itself, as if the module interface were not part of
> the equation. We can add that aspect back later.

There are two times the wisi code that wraps the parser needs access to
the buffer; first to copy the text, second to add text properties
(faces, indent values, navigation markers). There are usually many text
properties output by each parse.

The positions and values of the text properties are computed by
functions that run after the complete syntax tree has been produced. In
wisi, those functions are added directly in the grammar source file
(where they are called "post-parse grammar actions"). In tree-sitter, I
assume they are called from some mode-author-written code that traverses
the syntax tree (wisi provides that internally). Except I see below that
the emacs tree-sitter package stores the syntax tree in the buffer.

One option here is to try to standardize on an elisp representation of a
syntax tree, and have both the wisi and tree-sitter parsers provide
that. Then the grammar actions could be implemented in elisp. I suspect
that would be very slow; elisp is just not good at traversing large
complex data structures. That is not just my bias showing (I _much_
prefer doing as much as possible in Ada); I first wrote the ada-mode
parser and grammar actions in elisp, and then did a complete rewrite in
Ada, gaining significant speed. Although I never considered passing the
syntax tree to elisp as a single object, so maybe that could work well.

There is no universal standard for representing "a syntax tree". In
wisi, the tree is directly produced by the LR shift and reduce
operations, and thus is very close the the grammar expressed in BNF. I
don't know what the tree-sitter parse tree looks like. AdaCore provides
a parser similar in purpose to the wisi parsers
(https://github.com/AdaCore/libadalang), that also does more of what an
Ada compiler does (which could allow even better font-lock and navigation).
To support those additional operations, the syntax tree is quite
different from the ada-mode one.

In general, each parser library, and even each grammar author, will have
different representations for the syntax tree.

So if we want to support different parsers, I think it is best to define
the Emacs "parser API" as "give text to parser; accept text properties
from parser".

LSP (via eglot) provides other things the parser can return; code
completion menus, for example. And for indent and face, it returns
formatted text with markdown. I plan to translate that to text
properties to integrate LSP into wisi. Whether LSP requires a full
initial parse is up to the LSP server author (LSP itself provides both
"here's the full text" and "here's partial text" messages); they have
the same considerations discussed above.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC.  It is best avoided.  Thus my
> questions above.

I'm not sure how "convert syntax tree to elisp" compares to "consing
strings". I would certainly expect it to cause a lot of GC.

>> > Btw, what do you do with the tree returned by the tree-sitter parser?
>> > store it in some buffer-local variable?  If so, how much memory does
>> > such a tree take, and when, if ever, is that memory released?
>> >
>> 
>> It's stored in a buffer-local variable. I haven't measured the memory
>> they take. Memory is released when the tree object is garbage-collected
>> (it's a `user-ptr').

Is it an elisp structure (or accesible from elisp)? Have you written
code that traverses it to provide faces and indentation?

-- 
-- Stephe

PS; I have the beginnings of a migraine while typing this, so some of it
may not make sense. Sigh.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]