
How does this code work?


When I first looked at uncompyle2 I was a little confused as to how this code worked, even though I am familiar with compilers and compiler technology.

Part of the confusion, I think, was due to the terminology used here and how it differs from its use in other settings. Part was a little bit of hype about how deparsing is all table-driven now, and part was just the novelty of everything.

So here I'd like to clarify the terminology used and describe how the code works.

Scanning

"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.

Here, the input to the "scanner" is not ASCII characters but CPython bytecodes. And the tokens are CPython byte-code instructions. Scanning here has very little or nothing to do with regular expressions.

I think a more appropriate word here would be "Disassembler". At any rate that is closer to how the various scanners here function. However, in contrast to disassembly, we make additional transformations described below.

So already I have a little explaining to do about how we get the bytecodes. In Python, the marshal library handles this: its load routine reads the marshalled portion of a CPython-compiled file and turns it into a "code" object. To extract instructions from the code object, Python's dis module can be used.

However, we can't rely on these version-specific modules; we need modules that can handle bytecode from a Python version different from the one we are running. We also need to deal with the fact that Python bytecode opcodes and the "code" object type have changed across versions. We handle both problems via a cross-version disassembly package called xdis.
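
To make the version issue concrete, here is a minimal sketch of the same-version approach just described, using only the standard library. It assumes a .pyc file written by the running interpreter on Python 3.7 or later, where the .pyc header is 16 bytes; coping with other versions' headers, opcodes, and code types is exactly what xdis is for.

    import dis
    import marshal

    def disassemble_pyc(path):
        # Works only for .pyc files written by the *running* interpreter.
        with open(path, "rb") as f:
            f.read(16)                      # skip magic, flags, date/hash, size
            code = marshal.loads(f.read())  # unmarshal into a "code" object
        dis.dis(code)                       # these instructions are our raw tokens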

As the term "disassembly" (in contrast to "scanning") implies, there's more to this than what could be done via regular expressions. Regular-expression matching can be done in a single pass over the input; the disassembly routines in the xdis package make more than one pass. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.
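
As an illustration of that extra pass (this is not the actual xdis code), the jump targets can be collected with the standard dis module like so:

    import dis

    def jump_targets(code):
        # Collect the offsets that jump instructions point at; these are the
        # offsets that get the ">>" label mark in a disassembly listing.
        targets = set()
        for inst in dis.get_instructions(code):
            if inst.opcode in dis.hasjrel or inst.opcode in dis.hasjabs:
                targets.add(inst.argval)    # argval is the resolved target offset
        return targets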

In sum, via the xdis package we get a series of structured bytecode instructions that we can use as tokens to start with.

But we massage the instruction stream even more. This is called "ingestion". Again this makes passes over the instruction stream. It does this for the sole purpose of making grammar parsing easier.

Consider for example this Python code:

    assert True

The Python disassembler library disassembles this as:

  1        0 LOAD_NAME               0 (True)
           3 POP_JUMP_IF_TRUE       12
           6 LOAD_GLOBAL             1 (AssertionError)
           9 RAISE_VARARGS           1
      >>  12 LOAD_CONST              0 (None)
          15 RETURN_VALUE

Note: the ">>" mark was added as a result of the pass mentioned before that looks for the targets of jump instructions.

In our disassembler, we change the LOAD_GLOBAL (AssertionError) instruction into a made-up instruction we call LOAD_ASSERT.
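
A minimal sketch of that kind of rewrite might look like the following; uncompyle6's real scanners use a richer token class, but the idea is the same:

    import dis

    def ingest(code):
        # Turn instructions into (name, argument) token pairs, renaming
        # LOAD_GLOBAL (AssertionError) to the made-up opcode LOAD_ASSERT.
        tokens = []
        for inst in dis.get_instructions(code):
            name = inst.opname
            if name == "LOAD_GLOBAL" and inst.argval == "AssertionError":
                name = "LOAD_ASSERT"
            tokens.append((name, inst.argval))
        return tokens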

This allows us to write a grammar rule:

        assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1

rather than:

        assert ::= assert_expr jmp_true LOAD_GLOBAL RAISE_VARARGS_1

Overall this makes the parsing faster since we've reduced the number of grammar rules that match. It also reduces the amount of grammar ambiguity.

LOAD_CONST can also be turned into LOAD_LAMBDA, LOAD_GENEXPR, LOAD_SETCOMP, or LOAD_LISTCOMP as appropriate.

Other processing turns JUMP_ABSOLUTE opcodes into CONTINUE, JUMP_FORWARD, or JUMP_BACK as appropriate; also RETURN_VALUE can be changed to RETURN_END_IF when appropriate.

In Python 2, CALL_FUNCTION opcode names are turned into CALL_FUNCTION_n, where n is the number of positional arguments the call has. In Python 3 this transformation is also done, but in a parsing-customization phase described in the next section.
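
The renaming itself is simple. A sketch of the Python 2 case, where CPython encodes the positional-argument count in the low byte of CALL_FUNCTION's operand (and the keyword-argument count in the high byte), is:

    def call_function_token_name(oparg):
        # e.g. an operand of 2 (two positional arguments) gives "CALL_FUNCTION_2"
        return "CALL_FUNCTION_%d" % (oparg & 0xff)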

At the beginning of some basic blocks, COME_FROM instructions are inserted. In all cases the replacement instruction names are made up and are different from real CPython instruction names. For earlier Python versions this helped a lot in writing grammar rules that easily match control structures. However, as CPython's jump optimization has gotten more advanced, we need more sophisticated rules to track some of this down.
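
A simplified sketch of the COME_FROM idea, assuming tokens are (offset, name, argument) triples rather than uncompyle6's real token objects, is to splice a pseudo-token in front of each jump target:

    def insert_come_froms(tokens, jump_targets):
        # jump_targets: offsets computed by the earlier jump-target pass.
        result = []
        for offset, name, arg in tokens:
            if offset in jump_targets:
                result.append((offset, "COME_FROM", None))  # made-up pseudo-instruction
            result.append((offset, name, arg))
        return result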

Some of this might not be strictly necessary, but I think it helps clarity, which is a big concern when you have to debug a problem. And as mentioned above, it probably speeds up parsing too.

Parsing

As just described, "scanning" or ingesting has already done a fair bit of setup on the tokens that get fed into the parser. Even with all of this, the code allows for still more parser setup before the parsing occurs. These are called "custom parse rules".

CPython bytecode is stack-based: instructions push and pop values from an evaluation stack. For example, the CPython bytecode for a + b is

           0 LOAD_NAME               0 (a)
           3 LOAD_NAME               1 (b)
           6 BINARY_ADD

and the grammar rule that covers this is:

    binary_expr ::= expr expr binary_op

Some instructions, like CALL_FUNCTION, take an arbitrary number of arguments. As mentioned in the last section, before parsing proper we have changed the opcode name to include the number of required positional arguments, e.g. CALL_FUNCTION_0. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.

But what grammar rule would you write for an instruction with an arbitrary number of arguments like CALL_FUNCTION?

We don't want to handle these with grammar rules like this:

    call_function ::= CALL_FUNCTION_0
    call_function ::= expr CALL_FUNCTION_1
    call_function ::= expr expr CALL_FUNCTION_2
    ...

which would slow down parsing immensely. And we'd also have to put a limit on the number of positional arguments allowed. So instead we add the grammar rules that are specific to those kinds of calls that will be encountered in the instruction sequence to be parsed. That is, we will add:

    call_function ::= expr CALL_FUNCTION_1

only if we encounter a CALL_FUNCTION in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-argument function calls there are.
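
Here is a minimal sketch of that customization step. It assumes a spark-style parser with an addRule(rules, func) method, which uncompyle6's parsers inherit; the token shape and the no-op semantic action are placeholders:

    def nop_func(self, args):
        # Placeholder semantic action; rule-specific actions come later.
        return None

    def customize_call_rules(parser, tokens):
        seen = set()
        for name, _ in tokens:
            if name.startswith("CALL_FUNCTION_") and name not in seen:
                seen.add(name)
                n = int(name.rsplit("_", 1)[1])   # positional-argument count
                rule = "call_function ::= %sCALL_FUNCTION_%d" % ("expr " * n, n)
                parser.addRule(rule, nop_func)    # each distinct rule added once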

Semantic Analysis

The semantic analysis pass is what generates Python source code or source-code fragments from the parse tree produced by the parser. In uncompyle/decompile this is done by the Walker class. In uncompyle6 there are two modules, pysource and fragments, and the main classes that walk the tree are called SourceWalker and FragmentsWalker. The two are similar, and FragmentsWalker is a subclass of SourceWalker. However, since FragmentsWalker's needs are a little different, it is a separate class and module.

As with parsing, the semantic analysis phase allows custom rules to be added before walking the tree. In particular, each of the custom function-call grammar rules needs a semantic-action counterpart.

The Walker classes are subclasses of GenericASTTraversal, which is included in the spark module. The main function in there is the method that traverses the tree. It is called, somewhat erroneously, "preorder". For each node with typestring name name, if the instance has a method called n_name, that is called before walking the children. If there is no such method defined, the method self.default() is called instead. If the node has a method called n_name_exit, that is called after all the children have been walked. So in this sense this function is both preorder and postorder combined.
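
In simplified form, assuming nodes expose their typestring as node.type and are iterable over their children (the spark module's real implementation also handles pruning exceptions), the dispatch looks roughly like this:

    def preorder(walker, node):
        # Call n_<type> (or default) before visiting the children ...
        func = getattr(walker, "n_" + node.type, walker.default)
        func(node)
        for child in node:
            preorder(walker, child)
        # ... and n_<type>_exit, if it exists, after all children are done.
        exit_func = getattr(walker, "n_" + node.type + "_exit", None)
        if exit_func:
            exit_func(node)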

For a detailed explanation of the machinations in semantic analysis see Table driven semantic actions.
