emacs-devel
Re: Tree-sitter maturity


From: Björn Bidar
Subject: Re: Tree-sitter maturity
Date: Sun, 05 Jan 2025 01:21:23 +0200
User-agent: Gnus/5.13 (Gnus v5.13)

Lynn Winebarger <owinebar@gmail.com> writes:

> On Wed, Jan 1, 2025 at 3:23 PM Björn Bidar <bjorn.bidar@thaodan.de> wrote:
>> Lynn Winebarger <owinebar@gmail.com> writes:
>> >> Tree sitter, as wonderful as it is, strikes me as a bit of a Rube
>> >> Goldberg machine architecturally: JS *and* Rust *and* C? Really? :-)
>> >
>> > They evidently decided to use JSON and a simple schema to specify the
>> > concrete grammar, instead of creating a DSL for the purpose.
>> > Javascript is just a convenient way for embedding code into JSON the
>> > same way LISP programmers use lisp to generate S-expressions.  Once
>> > you have the JSON format generated, javascript is not used.
>> >
>> > The rest of the project is really composed of orthogonal components,
>> > the GLR grammar compiler (written in Rust) and the run-time GLR
>> > parsing engine, written in C.  The grammar compiler produces the
>> > parsing tables in the form of C source code that is compiled together
>> > with the library for a single library per grammar, but the C library
>> > does not actually require the parsing tables to be statically known at
>> > compile-time, at least the last I looked, barring some really obscure
>> > dependence.  The procedural interface to the parser just takes a
>> > pointer to the parser table data structure at run-time.
>> >
>> > Since GLR grammars are basically arbitrary (ambiguous) LR(1) grammars,
>> > the parser run-time has to implement a fairly sophisticated algorithm
>> > (graph-stacks) to be efficient.  Having implemented the LALR parser
>> > generator at least 3 times in the last couple of decades (just for my
>> > own use), generating the parse tables looks like a much simpler (and
>> > well-understood) problem to solve than the GLR run-time.  More
>> > importantly, the efficiency of the grammar compiler is not all that
>> > critical compared to the run-time.
>> >
>>
>> Additional alternatives to Node are already a good step.
>> Using WASM as the output format also does not sound bad, assuming there
>> is some abstraction on the tree-sitter library side.
>
> I'm not sure why WASM would be interesting.  AFAICT, it's just another
> set of bindings to the C library, maybe with the tables compiled into
> WASM binary module (or whatever the correct term should be - I'm not a
> WASM expert).  In any case, AFAIK Emacs has no particular capability
> for using WASM files as dynamic libraries in general.  Maybe if Emacs
> itself was compiled to WASM, in which case I suppose the function for
> dynamically loading libraries would implicitly load such modules.
>
> OTOH, the generated WASM bindings might provide an example of using
> the tree-sitter DLL with the in-memory parse table structure not
> embedded in the tree-sitter DLL.  Is that what you meant?

Maybe I misunderstood, but my assumption was that the newer WASM parsers
would be less prone to breakage. But if it's just about compiling the
same generated code to WASM, then I don't see the benefit either.
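
To make the quoted point about run-time tables concrete, here is a
minimal sketch of an engine that receives its parse tables as plain
data at call time instead of having them compiled in. It is a
deterministic LR driver, not GLR, and it is not tree-sitter's code or
API: the names `parse` and `PAREN_TABLES`, and the toy grammar
S -> ( S ) | x, are assumptions made up for this illustration.

```python
# Hypothetical hand-built LR tables for the toy grammar
#   rule 1: S -> ( S )
#   rule 2: S -> x
# The tables are ordinary data; the engine below knows nothing about
# the grammar except what it finds in them.
PAREN_TABLES = {
    "action": {
        (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
        (1, "$"): ("accept",),
        (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
        (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
        (4, ")"): ("shift", 5),
        (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
    },
    "goto": {(0, "S"): 1, (2, "S"): 4},
    # rule number -> (left-hand side, number of right-hand side symbols)
    "rules": {1: ("S", 3), 2: ("S", 1)},
}


def parse(tables, tokens):
    """Return True if `tokens` is accepted by the grammar in `tables`."""
    stack = [0]                      # stack of LR states
    tokens = list(tokens) + ["$"]    # "$" marks end of input
    pos = 0
    while True:
        act = tables["action"].get((stack[-1], tokens[pos]))
        if act is None:              # no table entry: syntax error
            return False
        if act[0] == "shift":
            stack.append(act[1])
            pos += 1
        elif act[0] == "reduce":
            lhs, length = tables["rules"][act[1]]
            del stack[len(stack) - length:]   # pop the rule's symbols
            stack.append(tables["goto"][(stack[-1], lhs)])
        else:                        # "accept"
            return True


print(parse(PAREN_TABLES, "((x))"))  # accepted
print(parse(PAREN_TABLES, "(x"))     # rejected: missing ")"
```

Swapping in a different table dictionary changes the language the same
`parse` function accepts, which is the separation between engine and
tables described in the quoted text.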


>> > I agree, a generic grammar capturing the structures of most
>> > programming languages would be useful.  It is definitely possible to
>> > extract the syntactic/semantic concepts from C++ and Python to create
>> > such a grammar, if you are willing to allow nested grammars
>> > appropriately delimited.  For example, a constructor context would
>> > delimit an expression in a data language that is embedded in a
>> > constructor context that may itself have delimited value contexts
>> > where the functional/procedural grammar may appear, ad infinitum.  The
>> > procedural and data grammars are distinct but mutually recursive.
>> > That would be if the form appeared in an rvalue-context.  For l-value
>> > expressions, the same constructor delimiting syntax can become a
>> > binding form, at least, with subexpressions of binding forms also
>> > being binding forms.  As long as the scanner is dynamically set
>> > according to the grammar context (and recognizes/signals the closing
>> > delimiter), the grammar can be made non-ambiguous because a given
>> > character will produce context-appropriate terminal symbols.
>>
>> What kind of scanner are you referring to? Something that works like a
>> binding generator but for AST?
>
> Aside from being useful for generic templating purposes, such a
> generic grammar would be of use for the purpose Daniel described, i.e.
> a layer of abstraction usable for almost any modern language, even in
> polyglot texts.
>

This is exactly what I was wondering too. Some languages embed others
or are hybrids. Good examples would be Python inside a template, and
QML, which is markup but also JavaScript depending on the context.
A more flexible grammar system would help here.
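
As a rough sketch of the context-switched scanner idea quoted above,
the toy lexer below keeps a stack of modes and emits different terminal
symbols for the same character depending on the mode it is in. The
`{{ ... }}` delimiters, the token names and the function `scan` are all
assumptions invented for this example, not taken from tree-sitter or
any real template language.

```python
def scan(src):
    """Tokenize template text with embedded {{ expression }} regions."""
    tokens = []
    modes = ["text"]          # mode stack; the top entry selects the rules
    i = 0
    while i < len(src):
        if modes[-1] == "text":
            # In text mode everything up to the opening delimiter is
            # one opaque TEXT terminal.
            j = src.find("{{", i)
            end = len(src) if j == -1 else j
            if end > i:
                tokens.append(("TEXT", src[i:end]))
            if j == -1:
                break
            tokens.append(("OPEN", "{{"))
            modes.append("expr")          # enter the embedded grammar
            i = j + 2
        else:
            # In expr mode the scanner recognizes the closing
            # delimiter and signals it by popping the mode.
            if src.startswith("}}", i):
                tokens.append(("CLOSE", "}}"))
                modes.pop()
                i += 2
            elif src[i].isspace():
                i += 1
            elif src[i].isalnum() or src[i] == "_":
                j = i
                while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                    j += 1
                # identifiers and numbers are lumped together as ATOM
                tokens.append(("ATOM", src[i:j]))
                i = j
            else:
                tokens.append(("OP", src[i]))
                i += 1
    return tokens


for token in scan("a + b = {{ a + b }}"):
    print(token)
```

The first "+" ends up inside a TEXT token while the second one becomes
an OP token, so a grammar built on top of this scanner never has to
decide which language a given character belongs to.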

Kinda like reinventing Semantic again..

>> > As for vendoring, I just doubt you will get much buy-in in this forum.
>> > There are corporate-type free/open-source software projects that
>> > prioritize uniformity in build environments and limiting the scope of
>> > bugs that can arise from the build process/dependencies that vendor at
>> > the drop of the hat.  Then there are "classic" free software projects
>> > that have amalgamated the work of many individual contributors, and
>> > those contributors often prioritize control of the software running on
>> > their systems for whatever reason (but eliminating non-free software
>> > is definitely one of them), and they often can/will contribute patches
>> > for that purpose.  The second camp *hates* vendoring because it
>> > subverts their control of their computational resources.  At least,
>> > that's the dichotomy I see. There are probably finer points I'm
>> > missing or mischaracterizing.
>>
>> From my point of view as a distribution packager there are several
>> reasons why vendoring can be bad, and some contexts in which keeping
>> vendored code is the better decision.
>>
>> But in this context it complicates the build process, as now each grammar
>> has to be built for Emacs in addition to other editors.
>> The Emacs package now pulls in more build dependencies at build time,
>> which complicates the build process as the dependency list grows.
>>
>> Besides, bundled dependencies are not allowed unless there's no way to
>> avoid them. It is not about control or anything.
>
> That sounds like something I would interpret as control.  Distro
> creators/maintainers are prime candidates for wanting to maintain
> control of the build/run-time environment, as they are responsible for
> everything they bundle working together.  Perhaps "control of their
> computational resources" is more specific than I intended in my
> previous posting.
>

Yeah, you are right.


