Re: Reliable after-change-functions (via: Using incremental parsing in E

El mié., 1 de abril de 2020 22:22, Tuấn-Anh Nguyễn <address@hidden> escribió:

On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <address@hidden> wrote:
>
> > From: Tuấn-Anh Nguyễn <address@hidden>
> > Date: Thu, 2 Apr 2020 00:55:45 +0700
> > Cc: address@hidden
> >
> > > Did you consider using the API where an application can provide a
> > > function to return text at a given offset? Such a function could be
> > > relatively easily implemented for Emacs.
> > >
> >
> > I don't understand what you mean. Below I'll explain how it works
> > currently. [...] If dynamic modules have direct access to the
> > buffer text, none of the above is an issue.
> >
> > Such direct access can be enabled by something like this:
> >
> > char* (*access_buffer_text) (emacs_env *env,
> > emacs_value buffer,
> > ptrdiff_t byte_offset,
> > ptrdiff_t *size_inout);
> >
> > Of course, such an API would require extensive documentation on how it
> > must be used, to ensure safety and correctness.
>
> I think you are moving too fast, and keep the current implementation
> in sight too much.
>

I'm actually moving too slow here. I have thought about this part quite
a bit, but I'm currently focusing on other things, partially because
this is not painful bottleneck.

> What I suggest is to step back and see how such direct access, if it
> were available, could be used with tree-sitter. Let's forget about
> modules for a moment and consider tree-sitter linked with Emacs and
> capable of calling any C function in core. How would you use that?
>
> Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one
> question to answer is what to do with byte sequences that are not
> valid UTF-8. Any suggestions or ideas? How does tree-sitter handle
> invalid byte sequences in general?
>

I haven't checked yet. It will probably bail out, which is usually the
desired behavior. The tree-sitter's author is likely open to making this
behavior configurable here, though. Alternatively, the direct access
function can offer different behaviors: as-is, bail-out, skip-over, or
null-out (tree-sitter will skip over null bytes, IIRC).

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text? If so, we need to
> take care of that issue.
>

With direct access, no Lisp code will be run between these calls.

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary. Can we pass to the parser just a small
> chunk (say, 500 bytes) of the buffer around the window-full to be
> displayed next? If this presents problems, what are those problems?
>

In principle (not in tree-sitter ATM), and in very specific cases, yes.
IMO that's the wrong focus on a premature optimization anyway. As others
noted, even in the pathological case of xdisp.c, the performance is
acceptable. Also keep in mind that syntax highlighting is just one
application. Other use cases usually want a full parse tree.

If we really want to tackle this issue, there are other approaches to
consider, e.g. background parsing, or parsing up until a time limit, and
resume parsing when Emacs is idle. Tree-sitter's API supports the
latter.

But again, both thought exercises and my usage so far point to this
being a non-issue.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary. My suggestion is first to figure out how to do this stuff
> efficiently from within Emacs itself, as if the module interface were
> not part of the equation. We can add that aspect back later.
>

My opinion is that it's better to experiment with this kind of stuff
out-of-core. It can move forward faster that way, allowing more lessons
to be learned. Real lessons, involving real-world use cases, not thought
exercises.

In a somewhat similar vein, writing emacs-tree-sitter highlighted real
issues with dynamic modules, which I'm going to write up sometime.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC. It is best avoided. Thus my
> questions above.
>
> > > Btw, what do you do with the tree returned by the tree-sitter parser?
> > > store it in some buffer-local variable? If so, how much memory does
> > > such a tree take, and when, if ever, is that memory released?
> > >
> >
> > It's stored in a buffer-local variable. I haven't measured the memory
> > they take. Memory is released when the tree object is garbage-collected
> > (it's a `user-ptr').
>
> So if I have many hundreds of buffers, I could have such a tree in
> each one of them indefinitely? Perhaps that's one more design issue
> to consider, given that the parsing is so fast. Similar to what we do
> with image and face caches -- we flush them from time to time, to keep
> the memory footprint in check. So a buffer that was not current more
> than some time interval ago could have its tree GCed.
>

That can work. Alternatively, tree-sitter can add support for "folding"
subtrees, as Stefan suggested.

--
Tuấn-Anh Nguyễn
Software Engineer

From:	Jorge Javier Araya Navarro
Subject:	Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
Date:	Wed, 1 Apr 2020 23:19:31 -0600