emacs-devel

Re: Update on tree-sitter structure navigation


From: Yuan Fu
Subject: Re: Update on tree-sitter structure navigation
Date: Sat, 2 Sep 2023 15:09:08 -0700


> On Sep 1, 2023, at 11:52 PM, Ihor Radchenko <yantar92@posteo.net> wrote:
> 
> Yuan Fu <casouri@gmail.com> writes:
> 
>> In the months after wrapping up tree-sitter stuff in emacs-29, I was
>> thinking about how to implement structural navigation and extracting
>> information from the parser with tree-sitter. In emacs-29 we have
>> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
>> was thinking maybe we can generalize this to support getting an arbitrary
>> “thing” at point, moving around them, and getting information like the
>> name of a defun, its arglist, parent of a class, type of a variable
>> declaration, etc, in a language-agnostic way.
> 
> Note that Org mode also does all of these using
> https://orgmode.org/worg/dev/org-element-api.html
> 
> It would be nice if we could converge on a more consistent interface
> across all the modes. For example, by extending `thing-at-point' to handle
> parsed elements, not just simplistic regexp-based "thing" boundaries
> exposed by `thing-at-point' now.
> 
> Org approaches getting name/begin/end/arguments using a common API:
> 
> (org-element-property :begin NODE)
> (org-element-property :end NODE)
> (org-element-property :contents-begin NODE)
> (org-element-property :contents-end NODE)
> (org-element-property :name NODE)
> (org-element-property :args NODE)
> 
> Language-agnostic "thing"s will certainly be welcome, especially given
> that tree-sitter grammars use inconsistent naming schemes, which have to
> be learned separately, and may even change with grammar versions.
> 
> I think that both NODE types and attributes can be standardized.

If we come up with a thing-at-point interface that provides more information
than the current (BEG . END), tree-sitter can surely support it as a backend.
We just need someone to come up with it :-) But I don’t see how this interface
can support semantic information like the arglist of a defun or the type of a
declaration: these things are not universal to all “nodes”.
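
To illustrate the (BEG . END) shape, a tree-sitter backend for bounds could look roughly like the sketch below. It assumes treesit-thing-at-point takes a thing and a tactic (`nested' or `top-level'), as on the development branch; my/treesit-bounds-of-thing is a hypothetical name, not an existing function.

```elisp
(require 'treesit)

(defun my/treesit-bounds-of-thing (thing)
  "Return (BEG . END) of the smallest THING enclosing point, or nil.
THING is a symbol defined in `treesit-thing-settings' for the
language at point.  Hypothetical helper, for illustration only."
  (let ((node (treesit-thing-at-point thing 'nested)))
    (when node
      (cons (treesit-node-start node)
            (treesit-node-end node)))))
```

Anything beyond the two positions, such as an arglist or a declared type, would still need a separate accessor specific to that kind of thing.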

> 
>> Also, at the time, we only support defining things by a regexp
>> matching a node’s type, which is often not enough.
>> 
>> And it would be nice to somehow take advantage of the tree-sitter
>> queries for the features I mentioned above. Tree-sitter queries are what
>> every other editor uses for virtually all tree-sitter-related
>> features. But in Emacs, we mostly use them only for font-lock.
> 
> I recall one user asking about something like VIM's textobjects via
> tree-sitter queries. Example:
> https://github.com/nvim-treesitter/nvim-treesitter-textobjects/blob/master/queries/cpp/textobjects.scm

I think that’s something that can be implemented with thing definitions.


>> Here’s the progress as of now:
>> 
>> - Functions like treesit-search-forward, treesit-induce-sparse-tree,
>> treesit-thing-at-point, treesit--navigate-thing, etc, support a richer
>> set of predicates now. Besides regexp matching the type, the predicate
>> can also be a predicate function, or (REGEXP . FUNC), or compound
>> predicates like (or PRED PRED) or (not PRED).
> 
> Slightly unrelated, but do you have any idea if it can be faster to use
> Emacs' regexp search combined with treesit-thing-at-point vs. pure
> tree-sitter query?

Not really.

> 
>> - There’s now a variable treesit-thing-settings, which holds
>> definitions for things. Then, instead of passing the predicate to the
>> functions I mentioned above, you can save the predicate in
>> treesit-thing-settings under a symbol, say ‘sexp’, and pass the symbol
>> instead, just like thing-at-point.el. (We’ll work on integrating with
>> thing-at-point.el later.)
> 
> This sounds similar to textobjects I linked above.
> One question: how will it integrate with multiple parsers in one buffer?

This is only concerned with checking whether a node satisfies the definition
of a “thing”; it doesn’t care how you get the node. Retrieving a node through
either treesit-node-at or other functions already works with multiple parsers.

Also the “thing” definition is language-specific.
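
A minimal sketch of what per-language definitions might look like follows. It assumes treesit-thing-settings is an alist keyed by language whose entries are (THING PREDICATE) lists, and that treesit-node-match-p accepts a thing symbol plus an optional IGNORE-MISSING argument; the node type names are illustrative and depend on the grammar you have installed.

```elisp
(require 'treesit)

;; Assumed format: an alist keyed by language; each entry is a list of
;; (THING PREDICATE) definitions.  Normally a major mode would set this.
(setq-local treesit-thing-settings
            '((c
               (defun "function_definition")
               ;; Compound predicate: any node that is not punctuation.
               (sexp (not "\\`[{};,]\\'")))
              (python
               (defun "\\`\\(?:function\\|class\\)_definition\\'"))))

;; Checking a node does not depend on how the node was obtained.
;; `treesit-node-at' picks a parser for the position on its own.
(let ((node (treesit-node-at (point))))
  (when node
    ;; The third argument asks for nil instead of an error when the
    ;; node's language has no definition for `sexp'.
    (treesit-node-match-p node 'sexp t)))
```

Because the definition is looked up under the language of the node being tested, the same symbol can mean different things in different embedded languages.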

> 
>> - I can’t think of a good way to integrate tree-sitter queries with
>> the navigation functions we have right now. Most importantly,
>> tree-sitter queries always search top-down, and you can’t limit the
>> depth it searches. OTOH, our navigation functions work by traversing
>> the tree node-to-node.
> 
> Could you elaborate on the difficulties you encountered?

Ideally I’d like to pass a query and a node to treesit-node-match-p, which
returns t if the query matches the node. But queries don’t work like that. They
search the node and return all the matches within that node, which could be
wasteful.
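
To make the cost concrete, here is a sketch of how such a check would have to be emulated with the existing query API. my/node-matches-query-p is a hypothetical helper; the sketch assumes that treesit-query-capture considers the start node itself as well as its descendants.

```elisp
(require 'treesit)
(require 'seq)

(defun my/node-matches-query-p (node query)
  "Return non-nil if QUERY captures NODE itself.
Hypothetical helper: it runs QUERY over NODE's whole subtree and
then discards every capture that is not NODE, which is the
wasteful part."
  (seq-some (lambda (captured) (treesit-node-eq captured node))
            ;; NODE-ONLY = t returns a plain list of captured nodes.
            (treesit-query-capture node query nil nil t)))
```

For example, (my/node-matches-query-p (treesit-node-at (point)) '((identifier) @id)) would walk everything under the node even though only the node itself is of interest.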

> 
>> Some other things on the TODO list that people can take a stab at:
>> 
>> - Solve the grammar versioning/breaking-change problem: tree-sitter grammars 
>> don’t have a version number, so every time the author changes the grammar, 
>> our queries break, and loading the mode only produces a giant error.
> 
> Could we somehow get a hash of the library? That way, we can at least
> detect if something has changed.

All we get is a binary dynamic library. So I don’t think so.

> 
>> - Major mode fallback/inheritance: this has been discussed many times, but 
>> no good solution has emerged.
> 
> I think that integration of tree-sitter with navigation functions might
> be a step towards solving this problem. If common Emacs commands can
> automatically choose between tree-sitter and classic implementations, it
> might become easier to unify foo-ts-mode with foo-mode.

Unifying tree-sitter and non-tree-sitter modes creates many problems. I’m 
rather thinking about some way to share some configuration between two modes. 
We’ve had many discussions before with no fruitful conclusion.

> 
>> - Isolated ranges. For many embedded languages, each block should be 
>> independent of the others, but currently all the embedded blocks are 
>> connected together and parsed by a single parser. We probably need to spawn 
>> a parser for each block. I’ll probably work on this one next.
> 
> Do you mean that a single parser sees subsequent block as a continuation
> of the previous?

Exactly.

Yuan

