bug-bison

RE: Getting involved in Bison


From: Morales Cayuela, Victor (NSB - CN/Hangzhou)
Subject: RE: Getting involved in Bison
Date: Wed, 16 Oct 2019 05:29:54 +0000

Hello!

Since this is my first contribution to this project, I would like to start
with something easy.  First I'd like to get used to the workflow: code style,
review, testing, etc.  As a first task I could do the graph generator cleanup
that you mentioned.  I have already looked at the TODO list and other issues,
but I can't tell which of them would also be easy to start with.

As for my skills, I write good C and C++ (C++14) and have long experience
with both languages in projects of millions of lines.  I am a code reviewer
at my company and have received a few awards, so this part shouldn't be a
problem.  I haven't used m4 before, but I will start learning it as soon as
possible.

By the way, how long do you usually estimate for a feature to be delivered?
Could you also let me know how to check out the source?  Do I need to
register with some Git repository first?  I've never worked on an open source
project, so I'm not sure how they are managed.

Regards,
Victor

-----Original Message-----
From: Akim Demaille <address@hidden> 
Sent: Tuesday, October 15, 2019 2:42 PM
To: Paul Eggert <address@hidden>
Cc: Morales Cayuela, Victor (NSB - CN/Hangzhou) <address@hidden>; Bison Bugs 
<address@hidden>
Subject: Re: Getting involved in Bison

Hi Victor!

> Le 15 oct. 2019 à 06:19, Paul Eggert <address@hidden> a écrit :
> 
> On 10/14/19 7:12 PM, Morales Cayuela, Victor (NSB - CN/Hangzhou) wrote:
> 
>> Could you let me know in which areas you would need help?
> 
> Thanks for volunteering. Akim is the best person to ask.

Thanks :)

> Also, I suggest looking at Bison's TODO file for some ideas.
> 
> https://git.savannah.gnu.org/cgit/bison.git/tree/TODO

Which was the impetus I needed to update it, see below.


For a small project, Bison is quite big, and it requires very different
skills depending on where you, Victor, would like to work.  I strongly
recommend starting with simple things (simple does not mean dumb).

On the backend side (aka skeleton), in C++, how about implementing push
parsers?  That would be very useful in several projects I know.  It is
moderately difficult to implement "by hand", but you'll certainly find that
m4 is a weird beast.  One path would be to generate a usual pull parser for,
say, arithmetic, rework it by hand into a push parser, and later see how to
move these changes into lalr1.cc.
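
To make the pull/push distinction concrete, here is a minimal hand-written
sketch in plain C.  It is not generated by Bison and it is not the lalr1.cc
interface: the toy grammar (NUM ('+' NUM)* END) and every name in it
(push_parser, push_parse, pull_parse, array_scanner) are hypothetical.  The
point is only that a push parser keeps its state in an object and receives
one token per call, while a pull parser owns the loop and calls the scanner.

```c
#include <assert.h>

/* Hypothetical token kinds for a toy grammar: NUM ('+' NUM)* END.  */
typedef enum { TOK_NUM, TOK_PLUS, TOK_END } token_kind;

typedef struct { token_kind kind; int value; } token;

/* Result of feeding one token to the push parser.  */
typedef enum { PUSH_MORE, PUSH_DONE, PUSH_ERROR } push_status;

/* All parsing state lives here instead of on the C call stack, which
   is what lets the caller drive the parser one token at a time.  */
typedef struct
{
  int expect_num;   /* 1 when the next token must be a NUM.  */
  int total;        /* Running value of the sum.  */
} push_parser;

static void push_parser_init (push_parser *p)
{
  assert (p);
  p->expect_num = 1;
  p->total = 0;
}

/* Push interface: the caller owns the loop and hands over each token.  */
static push_status push_parse (push_parser *p, token t)
{
  assert (p);
  if (p->expect_num)
    {
      if (t.kind != TOK_NUM)
        return PUSH_ERROR;
      p->total += t.value;
      p->expect_num = 0;
      return PUSH_MORE;
    }
  switch (t.kind)
    {
    case TOK_PLUS:
      p->expect_num = 1;
      return PUSH_MORE;
    case TOK_END:
      return PUSH_DONE;
    default:
      return PUSH_ERROR;
    }
}

/* Pull interface rebuilt on top of the push one: the parser owns the
   loop and asks the scanner for tokens.  Returns 0 on success.  */
static int pull_parse (token (*next_token) (void *), void *scanner,
                       int *result)
{
  push_parser p;
  push_parser_init (&p);
  for (;;)
    {
      push_status s = push_parse (&p, next_token (scanner));
      if (s == PUSH_ERROR)
        return 1;
      if (s == PUSH_DONE)
        {
          *result = p.total;
          return 0;
        }
    }
}

/* A trivial scanner over a token array, for demonstration only.  */
typedef struct { const token *toks; int pos; } array_scanner;

static token array_next (void *data)
{
  array_scanner *s = data;
  return s->toks[s->pos++];
}
```

Writing the pull entry point as a thin loop around the push one is one
possible layering; a real skeleton has to keep the whole LALR stack in the
parser object, not just two integers.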

In bison itself (the generator), for a simple start, I would recommend cleaning 
up the graph generation.  Today it's sort of OOP with an abstract interface for 
graph, and a concrete implementation for Dot.  This is because decades ago we 
supported a format called VCG, which has disappeared since then.  I think we 
should flatten this to a direct interface for Dot, removing all the useless 
abstractions.
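
As an illustration of what "a direct interface for Dot" could mean, here is
a small C sketch.  It is not the current Bison code: the function names
(dot_begin, dot_node, dot_edge, dot_end, dot_quoted) and the exact output
layout are assumptions made for the example.  Each function writes Dot
syntax straight to a FILE, with no abstract graph layer in between.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Print STR as a double-quoted string, backslash-escaping double
   quotes and backslashes.  */
static void dot_quoted (FILE *out, const char *str)
{
  fputc ('"', out);
  for (; *str; ++str)
    {
      if (strchr ("\"\\", *str))
        fputc ('\\', out);
      fputc (*str, out);
    }
  fputc ('"', out);
}

/* Open a directed graph called NAME.  */
static void dot_begin (FILE *out, const char *name)
{
  assert (out && name);
  fputs ("digraph ", out);
  dot_quoted (out, name);
  fputs ("\n{\n", out);
}

/* Emit node ID with the given LABEL.  */
static void dot_node (FILE *out, int id, const char *label)
{
  fprintf (out, "  %d [label=", id);
  dot_quoted (out, label);
  fputs ("]\n", out);
}

/* Emit an edge FROM -> TO with the given LABEL.  */
static void dot_edge (FILE *out, int from, int to, const char *label)
{
  fprintf (out, "  %d -> %d [label=", from, to);
  dot_quoted (out, label);
  fputs ("]\n", out);
}

/* Close the graph opened by dot_begin.  */
static void dot_end (FILE *out)
{
  fputs ("}\n", out);
}
```

A caller would invoke dot_begin once, then dot_node and dot_edge for each
state and transition, then dot_end.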

There are many more possible things, but it really depends on what you'd
like to work on, and how fluent you are in C (for bison the generator) and m4
(the skeletons).


diff --git a/TODO b/TODO
index f3f08ce1..d2c56b73 100644
--- a/TODO
+++ b/TODO
@@ -7,9 +7,6 @@ breaks.
 Also, we seem to teach YYPRINT very early on, although it should be
 considered deprecated: %printer is superior.
 
-** glr.cc
-move glr.c into the yy namespace
-
 ** improve syntax errors (UTF-8, internationalization)
 Bison depends on the current locale.  For instance:
 
@@ -58,7 +55,7 @@ Maybe we should exhibit the YYUNDEFTOK token.  It could also be assigned a
 semantic value so that yyerror could be used to report invalid lexemes.
 
 * Bison 3.6
-** Unit rules
+** Unit rules / Injection rules (Akim Demaille)
 Maybe we could expand unit rules (or "injections", see
 https://homepages.cwi.nl/~daybuild/daily-books/syntax/2-sdf/sdf.html), i.e.,
 transform
@@ -77,10 +74,12 @@ Practice' is impossible to find, but according to 'Parsing Techniques: a
 Practical Guide', it includes information about this issue.  Does anybody
 have it?
 
-** Injection rules
-See above.
+** clean up (Akim Demaille)
+Do not work on these items now, as I (Akim) have branches with a lot of 
+changes in this area (hitting several files), and no desire to have to 
+fix conflicts.  Addressing these items will happen after my branches 
+have been merged.
 
-** clean up
 *** lalr.c
 Introduce a goto struct, and use it in place of from_state/to_state.
 Rename states1 as path, length as pathlen.
@@ -130,6 +129,84 @@ $ ./tests/testsuite -l | grep errors | sed q
   38: input.at:1730      errors
 
 * Short term
+** Stop indentation in diagnostics
+Before Bison 2.7, we printed "flatly" the dependencies in long diagnostics:
+
+    input.y:2.7-12: %type redeclaration for exp
+    input.y:1.7-12: previous declaration
+
+In Bison 2.7, we indented them
+
+    input.y:2.7-12: error: %type redeclaration for exp
+    input.y:1.7-12:     previous declaration
+
+Later we quoted the source in the diagnostics, and today we have:
+
+    /tmp/foo.y:1.12-14: warning: symbol FOO redeclared [-Wother]
+        1 | %token FOO FOO
+          |            ^~~
+    /tmp/foo.y:1.8-10:      previous declaration
+        1 | %token FOO FOO
+          |        ^~~
+
+The indentation is no longer helping.  We should probably get rid of 
+it, or maybe keep it only when -fno-caret. GCC displays this as a "note":
+
+    $ g++-mp-9 -Wall /tmp/foo.c -c
+    /tmp/foo.c:1:10: error: redefinition of 'int foo'
+        1 | int foo, foo;
+          |          ^~~
+    /tmp/foo.c:1:5: note: 'int foo' previously declared here
+        1 | int foo, foo;
+          |     ^~~
+
+Likewise for Clang, contrary to what I believed (because "note:" is 
+written in black, so it doesn't show in my terminal :-)
+
+    $ clang++-mp-8.0 -Wall /tmp/foo.c -c
+    clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
+    /tmp/foo.c:1:10: error: redefinition of 'foo'
+    int foo, foo;
+             ^
+    /tmp/foo.c:1:5: note: previous definition is here
+    int foo, foo;
+        ^
+    1 error generated.
+
+** Better design for diagnostics
+The current implementation of diagnostics is adhoc, it grew 
+organically.  It works as a series of calls to several functions, with 
+dependency of the latter calls on the former.  For instance:
+
+      complain (&sym->location,
+                sym->content->status == needed ? complaint : Wother,
+                _("symbol %s is used, but is not defined as a token"
+                  " and has no rules; did you mean %s?"),
+                quote_n (0, sym->tag),
+                quote_n (1, best->tag));
+      if (feature_flag & feature_caret)
+        location_caret_suggestion (sym->location, best->tag, stderr);
+
+We should rewrite this in a more FP way:
+
+1. build a rich structure that denotes the (complete) diagnostic.
+   "Complete" in the sense that it also contains the suggestions, the list
+   of possible matches, etc.
+
+2. send this to the pretty-printing routine.  The diagnostic structure
+   should be sufficient so that we can generate all the 'format' of
+   diagnostics, including the fixits.
+
+If properly done, this diagnostic module can be detached from Bison and 
+be put in gnulib.  It could be used, for instance, for errors caught by 
+xgettext.
+
+There's certainly already something alike in GCC.  At least that's the 
+impression I get from reading the "-fdiagnostics-format=FORMAT" part of
+this page:
+
+https://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Message-Formatting-Options.html
+
 ** consistency
 token vs terminal
 
@@ -139,11 +216,10 @@ itself uses int (for yylen for instance), yet stack is based on size_t.
 
 Maybe locations should also move to ints.
 
-** C
-Introduce state_type rather than spreading yytype_int16 everywhere?
-
-** glr.c
-yyspaceLeft should probably be a pointer diff.
+Paul Eggert already covered most of this.  But before publishing these 
+changes, we need to ask our C++ users if they agree with that change, 
+or if we need some migration path.  Could be a %define variable, or 
+simply %require "3.5".
 
 ** Graphviz display code thoughts
 The code for the --graph option is over two files: print_graph, and
@@ -164,9 +240,6 @@ Little effort seems to have been given to factoring these files and their
 rint{,-xml} counterpart. We would very much like to re-use the pretty format
 of states from .output for the graphs, etc.
 
-Also, the underscore in print_graph.[ch] isn't very fitting considering the 
-dashes in the other filenames.
-
 Since graphviz dies on medium-to-big grammars, maybe consider an other tool?
 
 ** push-parser
@@ -224,11 +297,13 @@ since it is no longer bound to a particular parser, it's just a
 (standalone symbol).
 
 * Various
-** Rewrite glr.cc in C++
+** Rewrite glr.cc in C++ (Valentin Tolmer)
 As a matter of fact, it would be very interesting to see how much we can
 share between lalr1.cc and glr.cc.  Most of the skeletons should be common.
 It would be a very nice source of inspiration for the other languages.
 
+Valentin Tolmer is working on this.
+
 ** YYERRCODE
 Defined to 256, but not used, not documented.  Probably the token
 number for the error token, which POSIX wants to be 256, but which
@@ -298,10 +373,21 @@ other improvements and also made it faster (probably because memory
 management is performed once instead of three times).  I suggest that
 we do the same in yacc.c.
 
+(Some time later): it's also very nice to have three stacks: it's more 
+dense as we don't lose bits to padding.  For instance the typical stack 
+for states will use 8 bits, while it is likely to consume 32 bits in a struct.
+
+We need trustworthy benchmarks for Bison, for all our backends.  Akim 
+has a few things scattered around; we need to put them in the repo, and 
+make them more useful.
+
 ** yysyntax_error
 The code bw glr.c and yacc.c is really alike, we can certainly factor
 some parts.
 
+This should be worked on when we also address the expected improvements 
+for error generation (e.g., i18n).
+
 
 * Report
 
@@ -341,7 +427,26 @@ LORIA, INRIA Nancy - Grand Est, Nancy, France
 
 * Extensions
 ** Multiple start symbols
-Would be very useful when parsing closely related languages.
+Would be very useful when parsing closely related languages.  The idea 
+is to declare several start symbols, for instance
+
+    %start stmt expr
+    %%
+    stmt: ...
+    expr: ...
+
+and to generate parse(), parse_stmt() and parse_expr().  Technically, 
+the above grammar would be transformed into
+
+   %start yy_start
+   %token YY_START_STMT YY_START_EXPR
+   %%
+   yy_start: YY_START_STMT stmt | YY_START_EXPR expr
+
+so that there are no new conflicts in the grammar (as would undoubtedly 
+happen with yy_start: stmt | expr).  Then adjust the skeletons so that 
+this initial token (YY_START_STMT, YY_START_EXPR) be shifted first in 
+the corresponding parse function.
 
 ** Better error messages
 The users are not provided with enough tools to forge their error messages.
@@ -359,6 +464,12 @@ should make this reasonably easy to implement.
 Bruce Mardle <address@hidden>
 https://lists.gnu.org/archive/html/bison-patches/2015-09/msg00000.html
 
+However, there are many other things to do before having such a 
+feature, because I don't want a % equivalent to #include (which we all 
+learned to hate).  I want something that builds "modules" of grammars, 
+and assembles them together, paying attention to keep separate bits 
+separated, in pseudo name spaces.
+
 ** Push parsers
 There is demand for push parsers in Java and C++.  And GLR I guess.
 
@@ -385,6 +496,10 @@ must be in the scanner: we must not parse what is in a switched off
 part of %if.  Akim Demaille thinks it should be in the parser, so as
 to avoid falling into another CPP mistake.
 
+(Later): I'm not sure there's actually a good case for this.  People who need 
+that feature can use m4/cpp on top of Bison.  I don't think it is worth 
+the trouble in Bison itself.
+
 ** XML Output
 There are couple of available extensions of Bison targeting some XML
 output.  Some day we should consider including them.  One issue is
@@ -404,6 +519,9 @@ XML output for GNU Bison
 https://lists.gnu.org/archive/html/bug-bison/2016-06/msg00000.html
 http://www.cs.cornell.edu/andru/papers/cupex/
 
+Andrew Myers and Vincent Imbimbo are working on this item, see
+https://github.com/akimd/bison/issues/12
+
 * Coding system independence
 Paul notes:
 
@@ -433,6 +551,7 @@ It is unfortunate that there is a total order for precedence.  It
 makes it impossible to have modular precedence information.  We should
 move to partial orders (sounds like series/parallel orders to me).
 
+This is a prerequisite for modules.
 
 * $undefined
 From Hans:
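
The two steps listed under "Better design for diagnostics" in the patch
above (build a complete diagnostic value, then hand it to one
pretty-printing routine) could look roughly like this in C.  This is only a
sketch under assumptions: the diagnostic struct, its fields, and
diagnostic_print are hypothetical names, not existing Bison or gnulib code,
and the output format is simplified (no caret, no indentation).

```c
#include <assert.h>
#include <stdio.h>

typedef enum { SEV_NOTE, SEV_WARNING, SEV_ERROR } severity;

/* Step 1: a value that denotes the complete diagnostic, including
   the suggestion and any related note.  */
typedef struct diagnostic diagnostic;
struct diagnostic
{
  severity sev;
  const char *location;    /* E.g., "input.y:2.7-12".  */
  const char *message;     /* Already formatted text.  */
  const char *suggestion;  /* Proposed replacement, or NULL.  */
  const diagnostic *note;  /* Related diagnostic, or NULL.  */
};

static const char *severity_name (severity s)
{
  switch (s)
    {
    case SEV_NOTE:    return "note";
    case SEV_WARNING: return "warning";
    case SEV_ERROR:   return "error";
    }
  return "unknown";
}

/* Step 2: a single routine renders the whole chain, so the layout is
   decided in one place instead of across several calls.  */
static void diagnostic_print (FILE *out, const diagnostic *d)
{
  assert (out);
  for (; d; d = d->note)
    {
      fprintf (out, "%s: %s: %s\n",
               d->location, severity_name (d->sev), d->message);
      if (d->suggestion)
        fprintf (out, "  suggestion: %s\n", d->suggestion);
    }
}
```

Because the printer sees the whole value at once, switching to another
output format would mean writing another function over the same struct,
without touching the call sites that build diagnostics.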

