How does this code work?



When I first looked at uncompyle2 I was a little confused as to how this code worked, even though I am familiar with compilers and compiler technology.

Part of the confusion, I think, was due to the terminology used here compared with how the same terms are used in other settings. Part was a little bit of hype about how deparsing is all table-driven now, and part was just the novelty of everything.

So here I'd like to clarify the terminology used and describe how the code works.

Scanning

"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.

Here, the input to the "scanner" is not ASCII characters but CPython bytecodes. And the tokens that the decompiler uses are CPython byte-code instruction names. Scanning here has very little or nothing to do with regular expressions.

So in between what the Earley parser sees and the bytecode that you start out with is a "disassembler". From its output we pick out the instruction name and massage it a little bit. For lack of a better word, in the code I call the massaging "ingesting".

Massaging the token stream beyond disassembly is described below. But first let me start out with the bytecode.

In Python, the marshal library handles this: reading a CPython-compiled file and unmarshaling its body gives back a "code" object. To extract instructions from the code object, the Python module dis can be used.
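
As a same-version illustration of the kind of tokens involved, the standard library is enough. The minimal sketch below compiles a small snippet in the running interpreter and prints the instruction names; the snippet is just an example and, as the next paragraph explains, this version-specific approach is not what the decompiler actually uses:

    # Minimal same-version sketch: compile a snippet and list the
    # instruction names that become the decompiler's tokens.
    import dis

    code = compile("assert True", "<example>", "exec")
    for instr in dis.get_instructions(code):
        print(instr.offset, instr.opname, instr.argrepr)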

However we can't use these version-specific modules; we need modules that can handle bytecode from a different version of Python than the one that is running. We also need to deal with the fact that Python bytecode opcodes and the "code" type have changed over time. We handle both problems via a cross-version disassembly package called xdis.

As the term "disassembly" (in contrast to "scanning") implies, there's more to this than what could be done via regular expressions. Regular expression matching can be done in a single pass over the input. Routines in the xdis module, disassembly make more than a single pass. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.

In sum, via the xdis package we get a series of structured bytecode instructions that we can use as tokens to start with.

Next we massage the instruction stream; this I call "ingestion". Like disassembly, it makes passes over its input, but now the input is the instruction stream produced by disassembly rather than bytecode bytes.

Traditionally a parser only looks at a token stream, so there is benefit in putting useful information that may occur in the operands of an instruction into the token.

Consider for example this Python code:

    assert True

Assuming Python 2.7 bytecode, the Python disassembler library disassembles this as:

  1        0 LOAD_NAME               0 (True)
           3 POP_JUMP_IF_TRUE       12
           6 LOAD_GLOBAL             1 (AssertionError)
           9 RAISE_VARARGS           1
      >>  12 LOAD_CONST              0 (None)
          15 RETURN_VALUE

Note: the ">>" mark was added as a result of the pass mentioned before that looks for the targets of labels.

In our disassembler, we change the LOAD_GLOBAL (AssertionError) instruction into a made-up instruction we call LOAD_ASSERT.

This allows us to write a grammar rule:

        assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1

rather than:

        assert ::= assert_expr jmp_true LOAD_GLOBAL RAISE_VARARGS_1

Overall this makes the parsing faster since we've reduced the number of grammar rules that match. It also reduces the amount of grammar ambiguity.

LOAD_CONST can also be turned into LOAD_LAMBDA, LOAD_GENEXPR, LOAD_SETCOMP, or LOAD_LISTCOMP as appropriate.
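
A rough sketch of this renaming idea follows. It assumes a simple mutable token object with opname and argval attributes; the attribute names and the helper function are illustrative, not uncompyle6's actual API:

    # Hedged sketch of instruction-name specialization during ingestion.
    # Token attribute names here are assumptions, not uncompyle6's API.
    def specialize_tokens(tokens):
        for tok in tokens:
            if tok.opname == "LOAD_GLOBAL" and tok.argval == "AssertionError":
                tok.opname = "LOAD_ASSERT"      # made-up instruction name
            elif tok.opname == "LOAD_CONST" and hasattr(tok.argval, "co_name"):
                # The constant is a code object; rename by what kind it is.
                if tok.argval.co_name == "<lambda>":
                    tok.opname = "LOAD_LAMBDA"
                elif tok.argval.co_name == "<genexpr>":
                    tok.opname = "LOAD_GENEXPR"
        return tokens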

Other processing turns JUMP_ABSOLUTE opcodes into CONTINUE, JUMP_FORWARD, or JUMP_BACK as appropriate. In current Python, JUMP_BACK always indicates a looping jump. We also remove EXTENDED_ARG opcodes and fold their information into the operand of the instruction that follows. This raises the question, though, of what to do with the offset of the removed instruction: the label of the combined instruction is turned into a string that contains both offsets.

Below is an example of a fragment from simple_source/bug37/03_jump_to_jump.py. First we give a fragment of the disassembly using pydisasm (from the xdis package):

...
 16         236 LOAD_FAST                2 (tokens1)
            238 EXTENDED_ARG             1
            240 POP_JUMP_IF_FALSE      260

 17         242 LOAD_FAST                3 (tokens2)
            244 POP_JUMP_IF_FALSE      250

 18         246 JUMP_ABSOLUTE           18
            248 JUMP_FORWARD             8 (to 258)

 19     >>  250 LOAD_FAST                5 (f)
            252 EXTENDED_ARG             1
            254 POP_JUMP_IF_FALSE      264

After massaging, the instructions become:

  16       236  LOAD_FAST                'tokens1'
       238_240  POP_JUMP_IF_FALSE   260  'to 260'

  17       242  LOAD_FAST                'tokens2'
           244  POP_JUMP_IF_FALSE   250  'to 250'

  18       246  CONTINUE             18  'to 18'
           248  JUMP_FORWARD        258  'to 258'
         250_0  COME_FROM           244  '244'

  19       250  LOAD_FAST                'f'
       252_254  POP_JUMP_IF_FALSE   264  'to 264'

Notice how EXTENDED_ARG and POP_JUMP_IF_FALSE instructions have been combined into one instruction, and that the instruction at offset 246 (line 18) has been turned into CONTINUE.
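
The EXTENDED_ARG folding can be sketched roughly as below. The sketch assumes dis.Instruction-like namedtuples and handles only a single EXTENDED_ARG prefix; the combined-offset label format follows the output above, but the code is illustrative rather than uncompyle6's:

    # Hedged sketch: fold EXTENDED_ARG into the instruction that follows,
    # keeping both offsets in a combined label such as "238_240".
    def fold_extended_arg(instructions):
        result = []
        pending = None                # (offset, arg) of a pending EXTENDED_ARG
        for inst in instructions:
            if inst.opname == "EXTENDED_ARG":
                pending = (inst.offset, inst.arg)
                continue
            if pending is not None:
                ext_offset, ext_arg = pending
                inst = inst._replace(
                    arg=(ext_arg << 8) | inst.arg,         # 3.6+ wordcode shift
                    offset=f"{ext_offset}_{inst.offset}",  # keep both offsets
                )
                pending = None
            result.append(inst)
        return result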

Another set of transformations pulls count or length information out of variable-argument instructions and adds those counts to the operation name.

In Python 2, CALL_FUNCTION opcodes are turned into CALL_FUNCTION_n, where n is the number of positional arguments the call has. In Python 3 this transformation is also done, but it happens in a parsing customization phase described in the next section.
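
Under Python 2's encoding, where the low byte of the operand holds the positional-argument count, the renaming amounts to something like the following sketch (token attributes are again assumptions):

    # Sketch of the CALL_FUNCTION renaming. Python 2 packs the positional
    # and keyword argument counts into the low and high bytes of the operand.
    def rename_call(tok):
        if tok.opname == "CALL_FUNCTION":
            pos_args = tok.arg & 0xFF
            tok.opname = "CALL_FUNCTION_%d" % pos_args  # e.g. CALL_FUNCTION_1
        return tok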

The last class of transformations adds additional instructions which the parser can use in matching and for checking validity before applying a grammar reduction rule.

At the beginning of some basic blocks, COME_FROM instructions are inserted. In all cases, the replaced or added instruction names are made up and are different from real CPython instruction names. For earlier Python versions this helped a lot in being able to write grammar rules that easily match control structures. However, as CPython's jump optimization has gotten more advanced, we need more sophisticated rules to track some of this down.

Some of this might not be strictly necessary, but I think it helps clarity, which is a big concern when you have to debug a problem. It can probably speed up parsing too.

While the above is great, over time it has become clear that an even higher level of abstraction is needed, especially when working across many versions of Python: attributes like whether the instruction can jump, the ability to move forward and backward in the instruction stream, a list of the places that point to this instruction, and, in the future, basic-block, dominator, and reverse-dominator information for the instruction.

This still needs to be documented and, in fact, the details need working out too.

Parsing

As just described, "scanning" or ingesting has already done a fair bit of set up to the tokens that get fed into the parser. Even with all of this, the code allows for even more parser setup before the parsing occurs. These are called "custom parse rules".

CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example, the CPython bytecode for a + b is

           0 LOAD_NAME               0 (a)
           3 LOAD_NAME               1 (b)
           6 BINARY_ADD

and the grammar rule that covers this is:

    binary_expr ::= expr expr binary_op

Some instructions like CALL_FUNCTION take an arbitrary number of arguments. As mentioned in the last section, before parsing proper we have changed the opcode name to include the number of required positional arguments, e.g. CALL_FUNCTION_0. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.

But what grammar rule would you write for an instruction with an arbitrary number of arguments like CALL_FUNCTION?

We don't want to handle these with grammar rules like this:

    call_function ::= CALL_FUNCTION_0
    call_function ::= expr CALL_FUNCTION_1
    call_function ::= expr expr CALL_FUNCTION_2
    ...

which would slow down parsing immensely. We'd also have to put a limit on the number of positional arguments allowed. So instead we add only the grammar rules specific to the kinds of calls that are encountered in the instruction sequence to be parsed. That is, we will add:

    call_function ::= expr CALL_FUNCTION_1

only if we encounter a CALL_FUNCTION in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-argument function calls there are.
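
A hedged sketch of that customization idea: scan the token stream once and emit one rule per distinct call arity actually seen. The function name and the way rules are returned are illustrative; the real parser classes add rules through their own methods:

    # Sketch: build one call_function rule per distinct CALL_FUNCTION_n
    # arity present in the token stream, and each such rule only once.
    def customize_call_rules(tokens):
        rules = set()
        for tok in tokens:
            if tok.opname.startswith("CALL_FUNCTION_"):
                suffix = tok.opname.rsplit("_", 1)[1]
                if suffix.isdigit():
                    n = int(suffix)
                    # e.g. "call_function ::= expr expr CALL_FUNCTION_2"
                    rules.add("call_function ::= " + "expr " * n + tok.opname)
        return sorted(rules)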

Semantic Analysis

The semantic analysis pass is what generates Python source code or source-code fragments from the parse tree produced by the parser. In uncompyle/decompile this is called the Walker class. In uncompyle6 there are two modules, pysource and fragments, and the main classes that walk the tree are called SourceWalker and FragmentsWalker. The two are similar, and FragmentsWalker is a subclass of SourceWalker. However, since FragmentsWalker's needs are a little different, it is a separate class and module.

As with parsing, the semantic analysis phase allows custom rules to be added before walking the tree. In particular, each of the custom function rules needs a semantic-action counterpart.

The Walker classes are subclasses of GenericASTTraversal, which is included in the spark module. The main function in there is the method that traverses the tree. It is called, somewhat erroneously, "preorder". For each node whose typestring is name, if the instance has a method called n_name, that is called before walking the children. If there is no such method, the method self.default() is called instead. If there is a corresponding method with an _exit suffix, it is called after all the children have been walked. So in this sense this function is both preorder and postorder combined.
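
The dispatch just described can be sketched as follows; the hook names follow the description above but should be treated as illustrative rather than spark's exact code:

    # Rough sketch of the combined pre/post-order dispatch in "preorder".
    def preorder(self, node):
        name = "n_" + self.typestring(node)
        # Call n_<name> if it exists, otherwise the default handler.
        getattr(self, name, self.default)(node)
        for child in node:
            self.preorder(child)
        # After the children, call the _exit hook if one is defined.
        exit_hook = getattr(self, name + "_exit", None)
        if exit_hook is not None:
            exit_hook(node)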

For a detailed explanation of the machinations in semantic analysis see Table driven semantic actions.
