Debugging Grammar rules
I've spent a lot of time working on and debugging grammar rules. As a result, I've accumulated some street experience. In the process, I've also refined the spark parser to make it easier to debug.
First, the parser has an instance method called checkSets() which does some grammar verification. It returns four lists:
- non-terminals on the left-hand side that are not on the right-hand side of any rule. Note that the start symbol is something that is not expected to be on the right-hand side of a rule.
- non-terminals (not tokens) that are on the right-hand side of some rule, but not on the left-hand side of a rule.
- right-recursive rules
- tokens or terminal symbols seen in the grammar
The last two items are not errors like the first two; they are more for information. An Earley parser parses left-recursive rules faster than right-recursive rules, so when possible, right-recursive rules should be rewritten. The list of terminal symbols may be useful in conjunction with other phases like lexing, or possibly semantic actions, which may also produce a list of terminal symbols to compare against.
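To illustrate the right-recursion check, here is a minimal sketch. I'm assuming the usual spark_parser import here, and the exact representation of the entries in the returned lists may differ by version:

from spark_parser import GenericParser

class RightRecursive(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_right_recursive(self, args):
        """
        expr ::= NUMBER ADD expr
        expr ::= NUMBER
        """

parser = RightRecursive()
lhs, rhs, tokens, right_recursive = parser.checkSets()
# right_recursive should flag expr ::= NUMBER ADD expr.
# The left-recursive rewrite parses faster in an Earley parser:
#     expr ::= expr ADD NUMBER
#     expr ::= NUMBER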
Below I'll give grammars that have errors or non-optimal behavior. These examples come from the test_checker.py unit test.
Consider this grammar:
class UnusedLHS(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """
When this Python code is run:
parser = UnusedLHS()
lhs, rhs, tokens, right_recursive = parser.checkSets()
then the variable lhs will be set(['factor']), because factor isn't on the right-hand side of any rule. If we wrote this as:
class UnusedLHS(GenericParser):
    def __init__(self, start='calc'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        calc ::= expr
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """
We would have the same lhs value. Even though calc doesn't appear on the right-hand side of a rule, that is okay because it is the start symbol.
Consider this grammar:
class UnexpandedNonterminal(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_unexpanded(self, args):
        """
        expr ::= expr ADD term
        expr ::= term2
        term ::= NUMBER
        """
When this Python code is run:
parser = UnexpandedNonterminal()
lhs, rhs, tokens, right_recursive = parser.checkSets()
then the variable rhs will be set(['term2']), because term2 doesn't appear on the left-hand side of a rule and it isn't a terminal symbol. The parser assumes symbols in all capitals are terminal symbols.
A useful technique I've found for tracing problems is to simplify: both the test case you try and the grammar.
In contrast to the first spark parser, you can now add comments. Please use them. Comments can also be used, when testing a feature of a grammar, to block out grammar rules that are irrelevant to what you are testing. This often simplifies tracing a grammar.
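For example, here is a sketch, assuming (as in recent versions of this parser) that # begins a comment inside a grammar docstring:

    def p_expr_rules(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        term ::= NUMBER

        # The rules below are irrelevant to the ADD bug being
        # chased, so they are blocked out while debugging:
        # expr ::= expr SUBTRACT term
        # term ::= term MULT NUMBER
        """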
Similarly, one should simplify the input program that you are tracing the grammar on.
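For instance, here is a sketch. The Token class and scan() tokenizer are stand-ins I made up for illustration, and parser is any parser instance like the ones above; substitute whatever lexer and parser you are actually using:

class Token:
    """A minimal token; spark matches tokens against terminal names with ==."""
    def __init__(self, kind, attr=None):
        self.kind, self.attr = kind, attr

    def __eq__(self, other):
        return self.kind == other

    def __hash__(self):
        return hash(self.kind)

    def __repr__(self):
        return self.kind

def scan(text):
    """A toy tokenizer, just enough for this sketch."""
    return [Token("ADD") if part == "+" else Token("NUMBER", part)
            for part in text.split()]

# Rather than tracing the grammar over the original failing input:
#     parser.parse(scan("1 + 2 * ( 3 - 4 ) + 5"))
# cut the input down as far as it will go while the bad behavior
# still reproduces; the trace is much shorter and easier to read:
parser.parse(scan("1 + 2"))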
Sometimes I've found it useful to put in a custom rule temporarily just to see that I get a match.
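For example, here is a sketch; the method name and rule below are ones I made up for debugging:

    def p_debug_temporary(self, args):
        """
        # TEMPORARY debugging rule: accept a lone factor as a
        # complete expr, just to confirm that factor reduces on
        # this input at all. Remove once the problem is found.
        expr ::= factor
        """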
This parser is extremely flexible in the grammars it accepts and in the means it provides for filtering things out.
Although this isn't preferred, one can write a very general set of rules and then check the specifics in a reduction rule.
Here is an example. In Python 3, function definitions can be annotated like this:
def foo(a: int):
A fragment of the bytecode in some versions of Python 3, e.g. 3.4, looks like this:
LOAD_NAME 'int'
LOAD_CONST ('a',)
LOAD_CONST <code object foo>
And the grammar rule for the annotation arg, the middle LOAD_CONST instruction above, is:
annotate_arg ::= expr
The two LOAD_CONST instructions could also match a keyword argument.
The code:
def foo(*, a=5):
gives this bytecode fragment:
LOAD_CONST 'a'
LOAD_CONST 5
And the grammar rule for a keyword argument is:
kwarg ::= LOAD_CONST expr
The way the Python decompiler distinguishes these cases is to have reduction checks for both annotate_arg and kwarg. For annotate_arg, the token's attribute has to be a tuple for the rule to reduce. For kwarg, the token's attribute has to be a string.
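In code, the idea looks roughly like this. This is a minimal sketch of the checks themselves, not the decompiler's actual hook; the Token stand-in and the function names are made up for illustration:

from collections import namedtuple

# A stand-in token carrying just the attribute the checks inspect.
Token = namedtuple("Token", ["kind", "attr"])

def annotate_arg_ok(token):
    # annotate_arg reduces only when LOAD_CONST pushed a tuple
    # (the tuple of annotated parameter names).
    return isinstance(token.attr, tuple)

def kwarg_ok(token):
    # kwarg reduces only when LOAD_CONST pushed a string
    # (the keyword-argument name).
    return isinstance(token.attr, str)

print(annotate_arg_ok(Token("LOAD_CONST", ("a",))))  # True
print(kwarg_ok(Token("LOAD_CONST", "a")))            # True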