Debugging Grammar rules

Introduction

I've spent a lot of time working on and debugging grammar rules. As a result, I've accumulated some street experience.

Also in the process, I've refined the spark parser to make it easier to debug.

checkSets()

First, the parser has an instance method called checkSets() which does some grammar verification. It returns four sets:

  • non-terminals on the left-hand side that are not on the right-hand side of any rule. Note that the start symbol is also something that is not expected to be on the right-hand side of a rule, so it is excluded here.
  • non-terminals (not tokens) that are on the right-hand side of some rule, but not on the left-hand side of a rule.
  • right-recursive rules
  • tokens or terminal symbols seen in the grammar

The last two items are not errors like the first two, but are more for information. An Earley parser parses left-recursive rules faster than right-recursive rules, so when possible right-recursive rules should be rewritten. The list of terminal symbols may be useful in conjunction with other phases, like lexing or possibly semantic actions, which may also produce a list of terminal symbols to compare against.
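
For example, here is a right-recursive way of writing a list of statements and its preferred left-recursive rewrite. The stmts and stmt symbols are made up for illustration:

# Right recursive; an Earley parser handles this more slowly:
stmts ::= stmt stmts
stmts ::= stmt

# Left recursive rewrite; matches the same strings, but parses faster:
stmts ::= stmts stmt
stmts ::= stmt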

Below I'll give grammars that have errors or non-optimal behavior.

These examples come from the test_checker.py unit test.

Unused LHS

Consider this grammar:

from spark_parser import GenericParser

class UnusedLHS(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        factor ::= term
        term ::= NUMBER
        """

When this Python code is run:

        parser = UnusedLHS()
        lhs, rhs, tokens, right_recursive = parser.checkSets()

then the variable lhs will be set(['factor']), because factor isn't on the right-hand side of any rule. If we wrote this as:

class UnusedLHS(GenericParser):
    def __init__(self, start='calc'):
        GenericParser.__init__(self, start)

    def p_unused_lhs(self, args):
        """
        calc   ::= expr
        expr   ::= expr ADD term
        expr   ::= term
        factor ::= term
        term   ::= NUMBER
        """

we would have the same lhs value. Even though calc doesn't appear on the right-hand side of a rule, that is okay because it is the start symbol.

Unexpanded Nonterminal

Consider this grammar:

from spark_parser import GenericParser

class UnexpandedNonterminal(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_unexpanded(self, args):
        """
        expr ::= expr ADD term
        expr ::= term2
        term ::= NUMBER
        """

When this Python code is run:

        parser = UnexpandedNonterminal()
        lhs, rhs, tokens, right_recursive = parser.checkSets()

then the variable rhs will be set(['term2']), because term2 doesn't appear on the left-hand side of a rule and it isn't a terminal symbol. The parser assumes symbols in all capitals are terminal symbols.

Grammar hacking

Simplify

A useful technique I've found for tracking down problems is to simplify: both the test case you try and the grammar itself.

In contrast to the first spark parser, you can now add comments. Please use them. Comments can also be used when testing a feature of a grammar to block out grammar rules that are irrelevant to what you are testing. This often simplifies tracing a grammar.

For instance, while testing how ADD parses in the calc grammar above, irrelevant rules can be blocked out with comments. A hypothetical sketch, assuming the parser's end-of-line # comment syntax; the SUBTRACT rule is made up for illustration:
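
    def p_expr_rules(self, args):
        """
        expr ::= expr ADD term
        expr ::= term
        # Blocked out while testing the ADD rule:
        # expr ::= expr SUBTRACT term
        term ::= NUMBER
        """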

Similarly, simplify the program that you are tracing the grammar on.

For example, if a long expression misparses, whittle it down to the shortest input that still shows the problem before turning on tracing. A hypothetical sketch, where scan() stands in for whatever tokenizer feeds the parser:
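
        # Instead of tracing the grammar on the full failing input:
        #     parser.parse(scan("1 + 2 + 3 + 4 + 5"))
        # pare it down to the smallest input that still fails:
        parser.parse(scan("1 + 2"))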

Overspecify

Sometimes I've found it useful to put in a custom rule temporarily just to see that I get a match.

For example, with the calc grammar above, one might temporarily add an overly specific rule; if it reduces, you know the token sequence is arriving as you expect. A hypothetical sketch:
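
    def p_debug(self, args):
        """
        # Temporary, overly specific rule; remove after debugging.
        expr ::= NUMBER ADD NUMBER
        """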

Overgeneralize

This parser is extremely flexible in the grammars that it accepts and in the means it provides for filtering things out.

Reduction Rules

Although this isn't preferred, one can write a very general set of rules and then check specifics in a reduction rule.

Here is an example. In Python 3, function definitions can be annotated, like this:

def foo(a: int):

A fragment of the bytecode in some versions of Python 3, e.g. 3.4, looks like this:

LOAD_NAME                'int'
LOAD_CONST               ('a',)
LOAD_CONST               <code object foo>

And the grammar rule for the annotation arg, the middle LOAD_CONST instruction above, is:

annotate_arg ::= expr

The two LOAD_CONST instructions could also match a keyword argument. The code:

def foo(*, a=5):

gives this bytecode fragment:

LOAD_CONST            'a'
LOAD_CONST            5

And the grammar rule for a keyword argument is:

kwarg ::= LOAD_CONST expr

The way the Python decompiler distinguishes these cases is to have reduction rules for both annotate_arg and kwarg. For annotate_arg, the token's attribute has to be a tuple for the rule to reduce. For kwarg, the token's attribute has to be a string.
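
Below is a minimal sketch of what such checks might look like, assuming spark_parser's check_reduce registry and reduce_is_invalid() callback. The class name, start symbol, toy grammar, and the tokens[first].attr attribute are illustrative assumptions, not code from the actual decompiler:

from spark_parser import GenericParser

class AnnotateVsKwarg(GenericParser):
    def __init__(self, start='annotate_arg'):
        GenericParser.__init__(self, start)
        # Ask the parser to call reduce_is_invalid() before reducing
        # these rules:
        self.check_reduce['annotate_arg'] = 'tokens'
        self.check_reduce['kwarg'] = 'tokens'

    def p_rules(self, args):
        """
        annotate_arg ::= expr
        kwarg        ::= LOAD_CONST expr
        expr         ::= LOAD_CONST
        """

    def reduce_is_invalid(self, rule, ast, tokens, first, last):
        # Returning True vetoes the reduction.
        lhs = rule[0]
        if lhs == 'annotate_arg':
            # The annotation arg's LOAD_CONST must carry a tuple.
            return not isinstance(tokens[first].attr, tuple)
        elif lhs == 'kwarg':
            # The keyword-argument name's LOAD_CONST must carry a string.
            return not isinstance(tokens[first].attr, str)
        return False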

Tracing rules

[to be completed]
