Re: lexer support?


From: Lex Spoon
Subject: Re: lexer support?
Date: Fri, 1 Mar 2024 12:42:11 -0500

Kaz, are you disagreeing on the general shape of my proposal, or are you
just taking issue with calling it "lexer support"? I often edit my emails
down because everyone says they are too long. Maybe I should have said,
"better lexer support".

I agree about the capabilities of Bison that you describe, and I agree with
most of your design points. I think you actually conceded the general point,
though, that Bison's existing lexer support is pretty spare? For example, you
wrote:

The grammar language isn't very expressive for that (e.g. for repeat
matching similar to a regex Kleene star, you have to write recursive
rules, and alternation is likewise clumsy). But it has the power.


The test of a good tool is not just that it provides power at all, but that
it makes that power accessible. I think we agree that it would be nice for
developers if Bison could support something like [a-z][a-z0-9_]+ directly?
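
To make the gap concrete, here is a rough sketch, purely hypothetical and not
taken from any existing grammar, of what that one-line pattern turns into when
you encode it directly in grammar rules, assuming the lexer hands back
single-character tokens named LOWER, DIGIT, and UNDERSCORE:

%token LOWER DIGIT UNDERSCORE   /* hypothetical single-character tokens */
%%
ident
  : LOWER ident_rest            /* the leading [a-z] */
  ;
ident_rest
  : ident_char                  /* the "+": one trailing character... */
  | ident_rest ident_char       /* ...or more, via left recursion */
  ;
ident_char
  : LOWER
  | DIGIT
  | UNDERSCORE
  ;

Three rules and six productions to say what the regex says in one short line;
that is the kind of accessibility I have in mind.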

Some things you can do in Lex are going to be difficult...


Quite. At the risk of being snarky, that is why I spent a year and a half
on this project. You end up needing to go to a separate tool, in practice,
and as things stand, there are a lot of unnecessary little problems that
come up.

...namely those that are made possible by start conditions... You can't
easily put a Bison-generated parser into different states in which
different rules are active or dormant


I actually think you gave up too soon here. Bison parsers have their own
way of managing modes, and I think there is interesting research to be done
on combining that with lexical modes. To see what I mean, consider
this parse rule:

stmt: "val" id "=" expr.


Imagine the parser has just shifted the "=" token and is now going to try
and gobble an expr. That's a mode! The parser knows that the next token has
to be one that can start an expression. The lexer has no idea about this
mode, though, and so the lexer has to track its own modes in its own way. I
think a knock-on benefit from Bison having a unified grammar file would be
to enable research on this kind of thing.
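
For comparison, the way people usually get the lexer to track that mode today
is a hand-rolled lexical tie-in: a parser action flips a flag, and a
hand-written yylex() consults it. Here is a minimal sketch, with hypothetical
token names and a hypothetical reading_expr flag, rather than code from any
real grammar:

%{
int reading_expr = 0;   /* hypothetical flag shared with a hand-written yylex() */
%}
%token VAL "val" ID NUM
%%
stmt
  : "val" id "="
      { reading_expr = 1; }    /* mid-rule action: an expression must come next */
    expr
      { reading_expr = 0; }    /* back to statement mode */
  ;
id   : ID ;
expr : NUM | ID ;

Even this has sharp edges: by the time the mid-rule action runs, yylex() may
already have fetched the lookahead token that the flag was meant to influence.
That coordination is exactly the kind of thing a unified grammar file could
take off the developer's hands.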

This can all be done in Bison...


I am not sure this is true for larger examples, even though it's true for
the toy calculator example that we use all the time. In a larger grammar, a
developer will run into harder problems than just unrolling Kleene star.
Here are some reasons to believe that it's helpful to use a lexer generator
such as my namesake:

1. The Lex+Yacc combo has been enormously successful in industry. It's
taught in the majority of compiler classes, and, remarkably, people
actually use that approach when they go out on the job. In my long career,
I can't immediately remember a parser that didn't have a separate layer for
the lexer. Most of the time, the lexer specifically uses Flex or a tool
derived from it. Even hand-written parsers, such as the one in Scala, tend
to have a separate layer for the lexer
<https://github.com/scala/scala/blob/2.13.x/src/compiler/scala/tools/nsc/ast/parser/Scanners.scala>,
and the general style of the Scala lexer is pretty similar to a
DFA-based lexer. Other things are possible, but the general Lex+Yacc combo
is *the* way that people implement parsers right now. I'm enough of a geek
to go there with you a little bit, but we would be way out on a limb to
really say that Lex-style lexers are obsolete or are unnecessary.

2. Language specifications often separate out the lexical tokens as their
own thing. When the spec is structured a certain way, it is often helpful
for the implementation to be structured the same way. As a
concrete example, the rules for significant newlines are generally based on
the tokens that occur before and after the newline. Such a rule is only
easy to implement if the implementation has some notion of a token stream.

3. Bison's design and mindshare are overwhelmingly oriented around the idea
of a separate lex routine. For example, the manual states
<https://www.gnu.org/software/bison/manual/html_node/Language-and-Grammar.html>,
"(These tokens can be subdivided into characters, but that is a matter of
lexicography, not grammar.)".

4. The developers of Yacc and Lex did not believe it would work well to use
Yacc by itself. The first two sentences of the original Lex paper
<https://epaperpress.com/lexandyacc/download/lex.pdf> talk about
integration with "a parse tool", and the paper mentions Yacc by name over
20 times. The original Yacc paper
<https://www.cs.utexas.edu/users/novak/yaccpaper.htm> mentions Lex by name,
and furthermore says you wouldn't want to use Yacc by itself. For example,
the Yacc paper says, "Such low-level rules tend to waste time and space,
and may complicate the specification beyond Yacc's ability to deal with
it.".

5. I am groping here, but it seems like it would be a problem if the 1 in
LALR(1) stood for one character instead of one token. A Bison parser looks
ahead by one token to decide what to do, and sometimes it needs more than
one character
to know what kind of token it is looking at. Here's a concrete example of
what I mean. Suppose the parse stack has these symbols on it:  ["begin"
"if" exp "then"]. If the next token is "else", then the parser should shift
the "else" and is eventually going to reduce an if-then-else expression.
Now imagine that instead of a whole token, Bison just sees a single
character, "e". What does it do? It's facing a shift-reduce conflict.
Suppose it shifts it, and the input continues l s e. In this case, the
shift worked out well. Bison will reduce an "else" token, go a little
further, and eventually reduce an entire if-then-else expression. However,
the shift may not have been the correct thing to do. If the next characters
are e n d, then this strategy won't work. It shifts e n d, reduces it to
"end", and gives us this parse stack: ["begin" "if" exp "then" "end"]. Now
the parser needs to reduce the three symbols in the middle of the stack,
but Bison parsers aren't allowed to reduce in the middle. So, we can't
embed the lex rules in the parse rules without abandoning the general
approach of LALR. (A small character-level grammar that illustrates this
situation is sketched right after this list.)
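
To put point 5 in grammar form, here is a hypothetical character-level
fragment, modeled loosely on the begin/if/then trace above, with 'x' standing
in for any other expression:

block    : begin_kw stmt end_kw ;
stmt     : if_kw exp then_kw exp else_kw exp
         | if_kw exp then_kw exp
         ;
exp      : 'x' ;                 /* stand-in for a real expression */
begin_kw : 'b' 'e' 'g' 'i' 'n' ;
if_kw    : 'i' 'f' ;
then_kw  : 't' 'h' 'e' 'n' ;
else_kw  : 'e' 'l' 's' 'e' ;
end_kw   : 'e' 'n' 'd' ;

With whole ELSE and END tokens, one token of lookahead separates the two stmt
alternatives. At the character level, though, the state after "if exp then
exp" sees the single character 'e' both where else_kw could be shifted and
where the shorter stmt should be reduced, so Bison reports a shift/reduce
conflict. That is the sense in which the 1 of LALR(1) really wants to be one
token rather than one character.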

It's fun to meet you, Kaz, and to geek out about all of this!

Lex Spoon

