[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH 00/17] RFC: multiple start symbols
From: |
Akim Demaille |
Subject: |
[PATCH 00/17] RFC: multiple start symbols |
Date: |
Sun, 20 Sep 2020 10:37:32 +0200 |
Hi,
This is something I wanted to work on for quite a long time: allowing to
define several entry points in the grammar. What follows is far from
being complete:
- it works only with yacc.c,
- I have not yet looked at special cases such as push parsers,
- there is no documentation,
- nothing in the test suite (but examples/c/lexcalc does demonstrate it).
However, before proceeding I would like to get feedback.
Start symbols can be either nonterminal, or tokens (which might be useful
for people willing to write test cases).
Start symbols may have be valueless, or have a semantic value. In the
latter case, "api.value.type union" _must_ be used, so that Bison knows
exactly the type of the semantic value, which allows to provide a dedicated
parsing function returning the symbol's value.
Let's consider lexcalc as an example:
%define api.value.type union
%token <int> NUM "number"
%type <int> exp expression line
%start input expression NUM
produces three parsing functions.
Since "input" is valueless, its generated parsing function is:
// Return type when parsing one input.
typedef struct
{
int yystatus;
int yynerrs;
} yyparse_input_t;
// Parse one input.
yyparse_input_t yyparse_input (void);
Having new functions is a chance to get the signature right: the traditional
one really looks like a hack on top of a design made for global variables:
capturing the semantic value of the parse, getting the number of errors,
etc. was clumsy. Hence these new structs, returned by the specific parsing
functions.
In the case of symbols with values, one gets:
// Return type when parsing one expression.
typedef struct
{
int yyvalue;
int yystatus;
int yynerrs;
} yyparse_expression_t;
// Parse one expression.
yyparse_expression_t yyparse_expression (void);
If you're curious, multiple-start-symbols support uses an old trick:
inserting a "switching token" for each start symbol, and have the "real"
initial symbol branch on the corresponding rule. In the case of lexcalc,
the report shows this:
0 $accept: YY_PARSE_NUM "number" $end
1 | YY_PARSE_expression expression $end
2 | YY_PARSE_input input $end
The parsing function just passes the switching token as initial token to the
"real" parsing function (yyparse_yyimpl):
yyparse_expression_t
yyparse_expression (void)
{
yyparse_expression_t yyres;
yyparse_yyimpl_t yyimpl;
yyres.yystatus = yyparse_yyimpl (YY_PARSE_expression, &yyimpl);
yyres.yyvalue = yyimpl.yyvalue.TOK_expression;
yyres.yynerrs = yyimpl.yynerrs;
return yyres;
}
Having several initial rules changes the parser termination: we used to
check if we reached _the_ final state, but now there are several. Rather
than making the condition more complex and costly (it was run each time we
entered a new state), I am now recognizing the reduction of the start rules
themselves (so this is checked only when reducing, in which/case we already
have a switch/case statement for the action).
So the generated actions look like this:
switch (yyn)
{
case 1: /* $accept: YY_PARSE_NUM "number" $end */
{ (yyimpl->yyvalue.TOK_NUM) = (yyvsp[-1].TOK_NUM); YYACCEPT; }
break;
case 2: /* $accept: YY_PARSE_expression expression $end */
{ (yyimpl->yyvalue.TOK_expression) = (yyvsp[-1].TOK_expression); YYACCEPT; }
break;
case 3: /* $accept: YY_PARSE_input input $end */
{ YYACCEPT; }
break;
The branch is available https://github.com/akimd/bison/pull/42 too.
Comments most welcome.
Akim Demaille (17):
gram: more debugging information
reader: get ready to create several initial rules
parser: expose a list of symbols
regen
multistart: turn start symbols into rules on $accept
regen
multistart: adjust computation of initial core and adjust reports
multistart: also check the HTML report
multistart: pass the list of start symbols to the backend
multistart: equip yacc.c
multistart: toy with it in lexcalc
todo: more
multistart: adjust reader checks for generated rules
multistart: use b4_accept instead of action post-processing
multistart: allow tokens as start symbols
yacc.c: also count calls to YYERROR in yynerrs
multistart: also give access to yynerrs
TODO | 38 +
data/README.md | 8 +-
data/skeletons/bison.m4 | 14 +
data/skeletons/yacc.c | 101 +-
examples/c/lexcalc/lexcalc.test | 30 +-
examples/c/lexcalc/parse.y | 54 +-
examples/c/lexcalc/scan.l | 4 +-
src/gram.c | 19 +-
src/graphviz.c | 14 +-
src/lr0.c | 7 +-
src/main.c | 1 +
src/output.c | 22 +
src/parse-gram.c | 576 +++++------
src/parse-gram.h | 14 +-
src/parse-gram.y | 27 +-
src/print-xml.c | 9 +-
src/print.c | 7 +-
src/reader.c | 234 +++--
src/reader.h | 9 +-
src/reduce.c | 19 +-
src/scan-code.h | 7 +
src/scan-code.l | 18 +-
src/symtab.c | 22 +-
src/symtab.h | 5 -
tests/reduce.at | 66 +-
tests/report.at | 1597 +++++++++++++++++++++++++++++++
26 files changed, 2469 insertions(+), 453 deletions(-)
--
2.28.0
- [PATCH 00/17] RFC: multiple start symbols,
Akim Demaille <=
- [PATCH 01/17] gram: more debugging information, Akim Demaille, 2020/09/20
- [PATCH 02/17] reader: get ready to create several initial rules, Akim Demaille, 2020/09/20
- [PATCH 03/17] parser: expose a list of symbols, Akim Demaille, 2020/09/20
- [PATCH 04/17] regen, Akim Demaille, 2020/09/20
- [PATCH 05/17] multistart: turn start symbols into rules on $accept, Akim Demaille, 2020/09/20
- [PATCH 06/17] regen, Akim Demaille, 2020/09/20
- [PATCH 07/17] multistart: adjust computation of initial core and adjust reports, Akim Demaille, 2020/09/20
- [PATCH 08/17] multistart: also check the HTML report, Akim Demaille, 2020/09/20
- [PATCH 09/17] multistart: pass the list of start symbols to the backend, Akim Demaille, 2020/09/20
- [PATCH 10/17] multistart: equip yacc.c, Akim Demaille, 2020/09/20