
How does this code work?


When I first looked at uncompyle2 I was a little confused as to how this code worked, even though I am familiar with compilers and compiler technology.

Part of the confusion, I think, was due to the terminology used and how it compares with its use in other settings. Part was a little bit of hype about how deparsing is all table-driven now, and part was just the novelty of everything.

So here I'd like to clarify the terminology used and describe how the code works.

Scanning

"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.

Here, the input to the "scanner" is not ASCII characters but CPython bytecode, and the tokens are CPython bytecode instructions. Scanning here has very little or nothing to do with regular expressions.

I think a more appropriate word here would be "Disassembler". At any rate that is closer to how the various scanners here function. However, in contrast to disassembly, we make additional transformations described below.

So already I have a little explaining to do about how we get the bytecode. In Python, the marshal library handles this: its load function reads the marshalled data in a CPython-compiled file (after the file's header) and turns it into a "code" object. To extract instructions from the code object, the Python module dis can be used.
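
For concreteness, here is a minimal sketch of that using only the standard library. The file name is hypothetical, the script has to be run by the same Python version that produced the bytecode, and the 16-byte header size assumes CPython 3.7 or later (earlier versions use a shorter header):

    import dis
    import marshal

    with open("example.cpython-37.pyc", "rb") as f:  # hypothetical file name
        f.read(16)               # skip the header: magic, bit field, date, size
        code = marshal.load(f)   # the module's top-level "code" object

    dis.dis(code)                # print the instructions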

However we can't always use these version-specific modules: we need to handle bytecode from a Python version other than the one we are running. We also need to deal with the fact that Python bytecode opcodes and the "code" type have changed across versions. We handle both problems via a separate cross-version disassembly package called xdis.

As the term "disassembly" (in contrast to "scanning") implies, there's more to this than what could be done via regular expressions. Regular-expression matching can be done in a single pass over the input; the disassembly routines in the xdis module make more than a single pass. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.

In sum, via the xdis package we get a series of structured bytecode instructions that we can use as tokens to start with.

But we massage the instruction stream even more. This is called "ingestion". Again this makes passes over the instruction stream. It does this for the sole purpose of making grammar parsing easier.

Consider for example this Python code:

    assert True

The Python disassembler library disassembles this as:

  1        0 LOAD_NAME               0 (True)
           3 POP_JUMP_IF_TRUE       12
           6 LOAD_GLOBAL             1 (AssertionError)
           9 RAISE_VARARGS           1
      >>  12 LOAD_CONST              0 (None)
          15 RETURN_VALUE

Note: the ">>" mark was added as a result of the pass mentioned before that finds the targets of jump instructions.
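
Finding jump targets is not unique to decompilation; the standard library's dis.findlabels() does the same thing on raw bytecode, returning the offsets that are jump targets. A small illustration (the offsets printed depend on the Python version running it; 12 would match the Python 2 listing above):

    import dis

    code = compile("assert x", "<example>", "exec")
    print(dis.findlabels(code.co_code))  # offsets of jump targets, e.g. [12]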

In our disassembler, we change the LOAD_GLOBAL (AssertionError) instruction into a made-up instruction we call LOAD_ASSERT.

This allows us to write a grammar rule:

        assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1

rather than:

        assert ::= assert_expr jmp_true LOAD_NAME RAISE_VARARGS_1

LOAD_CONST can also be turned into LOAD_LAMBDA, LOAD_GENEXPR, LOAD_SETCOMP, or LOAD_LISTCOMP as appropriate.

Other processing turns JUMP_ABSOLUTE opcodes into CONTINUE, JUMP_FORWARD, or JUMP_BACK as appropriate; also RETURN_VALUE can be changed to RETURN_END_IF when appropriate.

In Python 2, CALL_FUNCTION opcodes are turned into CALL_FUNCTION_n, where n is the number of positional arguments in the call. In Python 3 this transformation is also done, but it happens in a parsing-customization phase described in the next section.

At the beginning of some basic blocks, COME_FROM instructions are inserted. In all cases the replacement instruction names are made up and are different from real CPython instruction names. For earlier Python versions this helped a lot in writing grammar rules that easily match control structures. However, as jump optimization has gotten more advanced, we need more sophisticated rules for tracking some of this down.
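
Conceptually each of these rewrites is just a pass over the token list. The function below is only an illustrative sketch, not the project's actual ingestion code; it assumes a token object with opname and argval attributes:

    def rename_assert_loads(tokens):
        # Illustrative only: turn LOAD_GLOBAL (AssertionError) into the
        # made-up LOAD_ASSERT opcode so the assert grammar rule can match it.
        for t in tokens:
            if t.opname == "LOAD_GLOBAL" and t.argval == "AssertionError":
                t.opname = "LOAD_ASSERT"
        return tokens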

Is all of this strictly necessary? In some cases, such as for "assert", possibly not. But on the other hand possibly yes, since otherwise there is a greater likelihood of having several grammar rules match (ambiguity). And as I wrote above, I think it helps clarity. It probably speeds up parsing too.

Parsing

As just described, "scanning" or ingestion has already done a fair bit of setup on the tokens that get fed into the parser. Even with all of this, the code allows for still more parser setup before parsing occurs. These are called "custom parse rules".

CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example CPython bytecode for a + b is

           0 LOAD_NAME               0 (a)
           3 LOAD_NAME               1 (b)
           6 BINARY_ADD

and the grammar rule that covers this is:

    binary_expr ::= expr expr binary_op
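
In the SPARK-derived parser used here, rules like this live in the docstrings of p_* methods on the parser class. The following is only a toy sketch, assuming the spark_parser package; the real parser classes and grammars are much larger:

    from spark_parser import GenericParser

    class ToyParser(GenericParser):
        def __init__(self):
            # "binary_expr" is the start symbol of this toy grammar
            GenericParser.__init__(self, "binary_expr")

        def p_binary_expr(self, args):
            """
            binary_expr ::= expr expr binary_op
            expr        ::= LOAD_NAME
            binary_op   ::= BINARY_ADD
            """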

Some instructions like CALL_FUNCTION take an arbitrary number of arguments. As mentioned in the last section, before parsing proper, we have changed the opcode to include the number of required positional arguments, e.g. CALL_FUNCTION_0. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.

But what grammar rule would you write for an instruction with an arbitrary number of arguments like CALL_FUNCTION?

We don't want to handle these with grammar rules like this:

    call_function ::= CALL_FUNCTION_0
    call_function ::= expr CALL_FUNCTION_1
    call_function ::= expr expr CALL_FUNCTION_2
    ...

which would slow down parsing immensely. We'd also have to put a limit on the number of positional arguments allowed. So instead we add only the grammar rules specific to the kinds of calls that will actually be encountered in the instruction sequence to be parsed. That is, we will add:

    call_function ::= expr CALL_FUNCTION_1

only if we encounter a CALL_FUNCTION in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-argument function calls there are.
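
A sketch of the idea, again illustrative rather than the project's actual customization code, using the parser's addRule method and a do-nothing semantic action:

    def nop_func(self, args):
        # placeholder semantic action; building the parse tree is handled elsewhere
        return None

    def add_call_rules(parser, tokens):
        # Add one call_function rule per distinct CALL_FUNCTION_n opcode that
        # actually appears in the token stream, and add each rule only once.
        seen = set()
        for t in tokens:
            name = t.opname
            if name.startswith("CALL_FUNCTION_") and name not in seen:
                seen.add(name)
                n = int(name.rsplit("_", 1)[1])
                rule = "call_function ::= %s%s" % ("expr " * n, name)
                parser.addRule(rule, nop_func)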

Semantic Analysis

The semantic analysis pass is what generates Python source code or source-code fragments from the parse tree produced by the parser. In uncompyle/decompyle this is handled by the Walker class. In uncompyle6 there are two modules, pysource and fragments, and the main classes that walk the tree are called SourceWalker and FragmentsWalker. The two are similar, and FragmentsWalker is a subclass of SourceWalker. However, since FragmentsWalker's needs are a little different, it is a separate class and module.

As with parsing, the semantic-analysis phase allows custom rules to be added before walking the tree. In particular, each of the custom function-call grammar rules needs a semantic-action counterpart.

The Walker classes are subclasses of GenericASTTraversal, which is included in the spark module. The main function in there is the method that traverses the tree. It is called, somewhat erroneously, "preorder". For each node with typestring name name, if the instance has a method called n_name, that is called before walking the node's children. If there is no such method, the method self.default() is called instead. If the instance has a method called n_name_exit, that is called after all the children have been walked. So in this sense this function is both preorder and postorder combined.
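
In outline, the dispatch works roughly like the sketch below; the names (typestring, n_name, default) follow the description above rather than the exact spark source:

    def preorder(self, node):
        name = "n_" + self.typestring(node)
        if hasattr(self, name):
            getattr(self, name)(node)   # per-node action, before the children
        else:
            self.default(node)
        for child in node:
            self.preorder(child)
        name += "_exit"
        if hasattr(self, name):
            getattr(self, name)(node)   # per-node action, after the children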

The default method used here and in the various uncompyle/decompyle derivatives uses three tables, MAP_DIRECT, MAP_R0, and MAP_R.

These use a printf-like syntax to direct substitution from attributes of the nonterminal and its children.

Below we describe how the table-driven semantic actions work and give a list of the format specifiers. The default() and engine() methods implement most of this.

First we determine a table (T) and a path to a table key (K) from the node type (N) (other nodes are shown as O):

       N                  N               N&K
     / | ... \          / | ... \        / | ... \
    O  O      O        O  O      K      O  O      O
              |
              K

    MAP_R0 (TABLE_R0)  MAP_R (TABLE_R)  MAP_DIRECT (TABLE_DIRECT)

The default is a direct mapping. The key K is then extracted from the subtree and used to find a table entry T[K], if any. The result is a format string and arguments (a la printf()) for the formatting engine.

Escapes in the format string are:

    %c      evaluate children N[A] recursively*
    %C      evaluate children N[A[0]]..N[A[1]-1] recursively, separate by A[2]*
    %P      same as %C but sets operator precedence
    %,      print ',' if last %C only printed one item (for tuples--unused)
    %|      tab to current indentation level
    %+      increase current indentation level
    %-      decrease current indentation level
    %{...}  evaluate ... in context of N
    %%      literal '%'
    %p      evaluate N setting precedence

* indicates that an argument (A) is required.

A % may be followed by a number (C) in square brackets, which makes the format engine walk down to N[C] before evaluating the escape code.
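
To make this concrete, here is a sketch of what a couple of table entries could look like for rules shown earlier; the entries in the real source may differ:

    TABLE_DIRECT = {
        # binary_expr ::= expr expr binary_op:
        # emit child 0 (first expr), child 2 (the operator), then child 1
        'binary_expr': ( '%c %c %c', 0, 2, 1 ),

        # assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1:
        # tab to the current indentation, write "assert", recurse into
        # child 0 (the asserted expression), and end the line
        'assert':      ( '%|assert %c\n', 0 ),
    }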

In the fragments deparser, we add an additional format specifier, %x. This takes an argument (src, (dest...)) and copies all of the range attributes from src to dest. For example, in:

    'importstmt': ( '%|import %c%x\n', 2, (2,(0,1)), ),

node 2's range information, indicated by the %c above, is copied to nodes 0 and 1.
