[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Bison lexer
From: |
Hans Åberg |
Subject: |
Re: Bison lexer |
Date: |
Sat, 15 Sep 2018 11:30:31 +0200 |
> On 15 Sep 2018, at 07:02, Akim Demaille <address@hidden> wrote:
>
>> Le 29 août 2018 à 15:56, Hans Åberg <address@hidden> a écrit :
>>
>> I did that, too: I wrote some DFA/NFA code, and incidentally found the most
>> efficient method make action matches via a reverse NFA lookup, cf. [1-3].
>> Also, I have made UTF-8/32 to octet character class translations.
>>
>> 1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
>> 2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
>> 3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html
>
> That was interesting.
Thanks. I wanted a dynamic lexer and and at least a partially dynamic parser so
users define their own operators. One thing that remains with the lexer is the
backreferenses (see below).
> I found that Tim Shen exposed his work on
> <regex> https://www.youtube.com/watch?v=N_rkHzhXueo.
I haven't seen all.
> When it comes to conversion from expressions to automaton, I’m
> a big fan of Brzozozski’s derivatives, that, in addition, easily
> supported complement and intersection.
I just have a C++ automaton class that builds the NFA directly through
operators corresponding to those of a regular expression. The NFA then has no
empty transitions, and a set of start states, which correspond to the singel
DFA start state, in the subset construction. For example, for alteration just
take the union of both the NFAs and their start state sets. A working regex
implementation then has some additions, such as loops for count matches.
> No idea about group
> captures, and certainly not backward references.
Then when building the NFA, its start and end states form a group, which can be
identified with unique number, if you so will. Backreferences I think of
working so that when it appearing, one makes a lookup of its value by the
reverse NFA method I give, and then inserting it as a dynamic NFA. Strictly,
the value of the backreference may then change as one comes to a new one, but I
suspect those that invented the concept have not considered that.
> redgrep implements this approach. This talks touches the case
> of capturing groups.
>
> https://www.youtube.com/watch?v=CMhqlRBfVX4&t=8s&frags=pl%2Cwn
The method I give in effect the sub-NFA that the input string uses, so the
group capture is automatic. Then working together towards DFA minimalization,
it turns out that one cannot even use the DFA, because different group markers
may be merged in to the same DFA state. Any DF minimalization must then keep
track of that. So it may be similar to LALR state merging, where one must to
keep track of the whole rules.