How does this code work?
When I first looked at uncompyle2 I was a little confused as to how this code worked, even though I am familiar with compilers and compiler technology.
Part of the confusion, I think, was due to the terminology used here compared with how the same terms are used in other settings. Part was a little bit of hype about how deparsing is all table-driven now, and part was just the novelty of everything.
So here I'd like to clarify the terminology used and describe how the code works.
"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.
Here, the input to the "scanner" is not ASCII characters but CPython bytecodes. And the tokens that the decompiler uses are CPython byte-code instruction names. Scanning here has very little or nothing to do with regular expressions.
So in between the bytecode you start out with and what the Earley parser sees sits a "disassembler". From its output we pick out the instruction name and massage it a little. For lack of a better word, in the code I call this massaging "ingestion".
Massaging the token stream beyond disassembly is described below. But first let me start out with the bytecode.
In Python, reading a CPython-compiled file and unmarshalling its contents into a "code" object is done with the marshal library: its load() function reads a marshalled object from an open file, and loads() does the same from a byte string. To extract instructions from the code object, the Python module dis can be used.
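As a concrete illustration, here is a minimal sketch of that standard-library route. It assumes a .pyc file written by the same CPython version that runs the script, with a 3.7+ style 16-byte header; the path passed in at the end is made up.

import dis
import marshal

def disassemble_pyc(path):
    with open(path, "rb") as f:
        f.read(16)              # skip the version-specific .pyc header (16 bytes in 3.7+)
        code = marshal.load(f)  # unmarshal the rest into a code object
    dis.dis(code)               # print a human-readable disassembly

disassemble_pyc("__pycache__/example.cpython-310.pyc")  # hypothetical path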
However, we can't use these version-specific modules; we need modules that can handle bytecode from a different version of Python than the one we are running. We also need to deal with the fact that Python bytecode opcodes and the "code" type have changed across versions. We handle both problems via a cross-version disassembly package called xdis.
As the term "disassembly" (in contrast to "scanning") implies, there's more to this than what could be done via regular expressions. Regular-expression matching can be done in a single pass over the input. Disassembly routines in the xdis module make more than a single pass. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.
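For instance, here is a rough sketch of such a pass using only the standard dis module (so it only works on bytecode for the running interpreter): collect the offsets that are targets of jump instructions, the same information used to print the ">>" marks in the disassembly shown later.

import dis

def jump_targets(code):
    targets = set()
    for inst in dis.get_instructions(code):
        # hasjabs/hasjrel list the opcodes that jump absolutely/relatively
        if inst.opcode in dis.hasjabs or inst.opcode in dis.hasjrel:
            targets.add(inst.argval)   # argval is the resolved target offset
    return targets

def example(x):
    return 1 if x else 2

print(sorted(jump_targets(example.__code__)))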
In sum, via the xdis package we get a series of structured bytecode instructions that we can use as tokens to start with.
Next we massage the instruction stream; this I call "ingestion". Like disassembly, it makes passes over its input, but now the input is the instruction stream produced by disassembly rather than bytecode bytes.
Traditionally a parser looks only at a token stream, so there is benefit in folding useful information from an instruction's operands into the token itself.
Consider for example this Python code:
assert True
Assuming Python 2.7 bytecode, the Python disassembler library disassembles this as:
1 0 LOAD_NAME 0 (True)
3 POP_JUMP_IF_TRUE 12
6 LOAD_GLOBAL 1 (AssertionError)
9 RAISE_VARARGS 1
>> 12 LOAD_CONST 0 (None)
15 RETURN_VALUE
Note: the ">>" mark was added as a result of the pass mentioned before that finds the targets of jump instructions.
In our disassembler, we change the LOAD_GLOBAL (AssertionError) instruction into a made-up instruction we call LOAD_ASSERT.
This allows us to write a grammar rule:
assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1
rather than:
assert ::= assert_expr jmp_true LOAD_GLOBAL RAISE_VARARGS_1
Overall this makes the parsing faster since we've reduced the number of grammar rules that match. It also reduces the amount of grammar ambiguity.
LOAD_CONST can also be turned into LOAD_LAMBDA, LOAD_GENEXPR, LOAD_SETCOMP, or LOAD_LISTCOMP as appropriate.
Other processing turns JUMP_ABSOLUTE opcodes into CONTINUE, JUMP_FORWARD, or JUMP_BACK as appropriate. Currently in Python, JUMP_BACK always indicates a looping jump.
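A heavily simplified sketch of this kind of renaming is below. The Token type and the exact conditions are hypothetical; the real scanners track loop structure and much more state before deciding between CONTINUE, JUMP_FORWARD, and JUMP_BACK.

from collections import namedtuple

Token = namedtuple("Token", "kind attr offset")   # attr holds the operand

def rename_tokens(tokens, loop_starts):
    out = []
    for t in tokens:
        if t.kind == "LOAD_GLOBAL" and t.attr == "AssertionError":
            t = t._replace(kind="LOAD_ASSERT")        # made-up opcode name
        elif t.kind == "JUMP_ABSOLUTE":
            if t.attr in loop_starts and t.attr < t.offset:
                t = t._replace(kind="JUMP_BACK")      # looping jump
            elif t.attr < t.offset:
                t = t._replace(kind="CONTINUE")       # e.g. a "continue" deep in a loop body
            else:
                t = t._replace(kind="JUMP_FORWARD")
        out.append(t)
    return out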
We also remove EXTENDED_ARG opcodes and fold their information into the operand of the instruction that follows. This raises the question, though, of what to do with the removed instruction's offset: the label of the combined instruction is turned into a string that contains both offsets.
Below is an example of a fragment from simple_source/bug37/03_jump_to_jump.py. First we give a fragment of the disassembly using pydisasm (from the xdis package):
...
16 236 LOAD_FAST 2 (tokens1)
238 EXTENDED_ARG 1
240 POP_JUMP_IF_FALSE 260
17 242 LOAD_FAST 3 (tokens2)
244 POP_JUMP_IF_FALSE 250
18 246 JUMP_ABSOLUTE 18
248 JUMP_FORWARD 8 (to 258)
19 >> 250 LOAD_FAST 5 (f)
252 EXTENDED_ARG 1
254 POP_JUMP_IF_FALSE 264
After massaging, the instructions become:
16 236 LOAD_FAST 'tokens1'
238_240 POP_JUMP_IF_FALSE 260 'to 260'
17 242 LOAD_FAST 'tokens2'
244 POP_JUMP_IF_FALSE 250 'to 250'
18 246 CONTINUE 18 'to 18'
248 JUMP_FORWARD 258 'to 258'
250_0 COME_FROM 244 '244'
19 250 LOAD_FAST 'f'
252_254 POP_JUMP_IF_FALSE 264 'to 264'
Notice how the EXTENDED_ARG and POP_JUMP_IF_FALSE instructions have been combined into one instruction, and that the instruction at offset 246 (line 18) has been turned into CONTINUE.
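The folding itself can be sketched roughly as follows, assuming 3.6+ wordcode where EXTENDED_ARG contributes the high bits of the next instruction's argument; instructions are represented here simply as (offset, opname, arg) triples.

def fold_extended_args(insts):
    out = []
    ext_offsets, ext_arg = [], 0
    for offset, opname, arg in insts:
        if opname == "EXTENDED_ARG":
            ext_offsets.append(offset)
            ext_arg = (ext_arg << 8) | arg      # accumulate the high bits
            continue
        if ext_offsets:
            # keep every offset in the label, e.g. "238_240"
            label = "_".join(str(o) for o in ext_offsets + [offset])
            out.append((label, opname, (ext_arg << 8) | arg))
            ext_offsets, ext_arg = [], 0
        else:
            out.append((str(offset), opname, arg))
    return out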
Another set of transformations pulls count or length information out of variable-argument instructions and adds those counts to the operation name.
In Python 2, CALL_FUNCTION opcodes are turned into CALL_FUNCTION_n, where n is the number of positional arguments the call takes. In Python 3 this transformation is also done, but it happens in a parser customization phase described in the next section.
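A sketch of the Python 2 flavor of this renaming (ignoring keyword arguments, which complicate the real naming):

def specialize_call(opname, oparg):
    if opname == "CALL_FUNCTION":
        npos = oparg & 0xff                  # low byte = positional-argument count
        return "CALL_FUNCTION_%d" % npos
    return opname

assert specialize_call("CALL_FUNCTION", 2) == "CALL_FUNCTION_2"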
The last class of transformations adds additional instructions which the parser can use in matching and in checking validity before applying a grammar reduction rule.
At the beginning of some basic blocks, COME_FROM instructions are inserted. In all cases the replacement instruction names are made up and are different from real CPython instruction names. For earlier Python versions this helped a lot in being able to write grammar rules that easily match control structures. However, as CPython's jump optimization has become more advanced, we need more sophisticated rules for tracking some of this down.
Some of this might not be strictly necessary, but I think it helps clarity, which is a big concern when you have to debug a problem. It can probably speed up parsing too.
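The COME_FROM insertion can be sketched along these lines; the token representation is hypothetical, and the real code is more selective about which jump targets get marked.

def insert_come_froms(tokens):
    # tokens: list of (offset, kind, attr); jump tokens carry their target in attr
    jump_sources = {}
    for offset, kind, attr in tokens:
        if "JUMP" in kind:
            jump_sources.setdefault(attr, []).append(offset)
    out = []
    for offset, kind, attr in tokens:
        for src in jump_sources.get(offset, []):
            out.append((offset, "COME_FROM", src))   # made-up instruction
        out.append((offset, kind, attr))
    return out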
As just described, "scanning" or ingesting has already done a fair bit of set up to the tokens that get fed into the parser. Even with all of this, the code allows for even more parser setup before the parsing occurs. These are called "custom parse rules".
CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example, the CPython bytecode for a + b is:
0 LOAD_NAME 0 (a)
3 LOAD_NAME 1 (b)
6 BINARY_ADD
and the grammar rule that covers this is:
binary_expr ::= expr expr binary_op
Some instructions like CALL_FUNCTION take an arbitrary number of arguments. As mentioned in the last section, before parsing proper we change the opcode name to include the number of required positional arguments, e.g. CALL_FUNCTION_0. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.
But what grammar rule would you write for an instruction with an arbitrary number of arguments like CALL_FUNCTION?
We don't want to handle these with grammar rules like this:
call_function ::= CALL_FUNCTION_0
call_function ::= expr CALL_FUNCTION_1
call_function ::= expr expr CALL_FUNCTION_2
...
which would slow down parsing immensely. And we'd also have to put a limit on the number of positional arguments allowed. So instead we add the grammar rules that are specific to those kinds of calls that will be encountered in the instruction sequence to be parsed. That is, we will add:
call_function ::= expr CALL_FUNCTION_1
only if we encounter a CALL_FUNCTION in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-argument calls there are.
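In sketch form, such a customization pass might look like the following: scan the token names once and emit one rule string per distinct argument count actually seen. Wiring the rules into the real spark parser is not shown.

def custom_call_rules(opnames):
    rules = set()
    for name in opnames:
        if name.startswith("CALL_FUNCTION_"):
            n = int(name.rsplit("_", 1)[1])          # positional-argument count
            rules.add("call_function ::= " + " ".join(["expr"] * n + [name]))
    return sorted(rules)

# Two CALL_FUNCTION_1 calls and one CALL_FUNCTION_3 call yield just two rules.
print(custom_call_rules(
    ["LOAD_NAME", "CALL_FUNCTION_1", "CALL_FUNCTION_1", "CALL_FUNCTION_3"]))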
The semantic analysis pass is what generates Python source code or source code fragments from the parse tree produced by the parser.
In uncompyle/decompile this is called the Walker class. In uncompyle6 there are two modules, pysource and fragments, and the main classes that walk the tree are called SourceWalker and FragmentsWalker. The two are similar, and FragmentsWalker is a subclass of SourceWalker. However, since FragmentsWalker's needs are a little different, it is a separate class and module.
As with parsing, the semantic analysis phase allows custom rules to be added before walking the tree. In particular, each of the custom grammar rules for function calls needs a semantic-action counterpart.
The Walker classes are subclasses of GenericASTTraversal, which is included in the spark module. The main method there traverses the tree; it is called, somewhat erroneously, preorder(). For each node with typestring name, if the instance has a method called n_name, that method is called before walking the node's children. If there is no such method, self.default() is called instead. If the instance has a method called n_name_exit, that is called after all the children have been walked. So in this sense the function is both preorder and postorder combined.
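A stripped-down sketch of this traversal convention follows; the real GenericASTTraversal in the spark module also supports pruning, and the Node class here is a hypothetical stand-in for the parse-tree nodes.

from collections import namedtuple

Node = namedtuple("Node", "kind children")

class MiniWalker:
    def preorder(self, node):
        hook = getattr(self, "n_" + node.kind, self.default)
        hook(node)                                   # before the children
        for child in node.children:
            self.preorder(child)
        exit_hook = getattr(self, "n_" + node.kind + "_exit", None)
        if exit_hook:
            exit_hook(node)                          # after the children

    def default(self, node):
        pass                                         # no n_<kind> handler defined

    def n_call_function(self, node):                 # example handler
        print("call_function node with", len(node.children), "children")

tree = Node("call_function", [Node("expr", []), Node("expr", [])])
MiniWalker().preorder(tree)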
For a detailed explanation of the machinations in semantic analysis see Table driven semantic actions.