How does this code work?
When I first looked at uncompyle2 I was a little confused as to how this code worked, even though I am familiar with compilers and compiler technology.
Part of the confusion, I think, was due to the terminology used here and how it differs from its use in other settings. Part was a little bit of hype about how deparsing is all table-driven now, and part was just the novelty of everything.
So here I'd like to clarify the terminology used and describe how the code works.
"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.
Here, the input to the "scanner" is not ASCII characters but CPython bytecodes. And the tokens are CPython byte-code instructions. Scanning here has very little or nothing to do with regular expressions.
I think a more appropriate word here would be "Disassembler". At any rate that's how the various scanners here function.
So already I have a little explaining to do about how we get the bytecodes. That is done via the Python modules load and marshal; load reads a CPython-compiled file and marshal extracts this into a "code" object.
If we only needed to deparse CPython bytecodes from the same version of CPython that is running, the code for this package could be shortened by using Python's library modules: the disassembly module dis and the marshal library marshal; types.CodeType() can be used to create a new Python code object, and inspect.iscode() can be used to check whether an object is Python code.
But to support deparsing across Python versions, we need to deal with the fact that Python bytecode opcodes and the "code" type have changed. Therefore we need equivalent routines for each version, and this code needs to be written in a way that it can be run on several Python versions.
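For the same-version case, a minimal sketch of that approach (illustrative only, not this package's code, and assuming the 16-byte .pyc header used by CPython 3.7 and later) might look like:

```python
import dis
import inspect
import marshal

def disassemble_pyc(path):
    """Disassemble a .pyc file written by the same CPython version that
    is running. Assumes the 16-byte header of CPython 3.7+."""
    with open(path, "rb") as f:
        f.read(16)                      # skip magic, flags, and source-size/hash fields
        code = marshal.loads(f.read())  # unmarshal the top-level code object
    assert inspect.iscode(code)         # make sure we really got a code object
    dis.dis(code)

# hypothetical file name; any .pyc produced by the running Python will do
disassemble_pyc("__pycache__/example.cpython-310.pyc")
```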
Note that this additional code for cross-version interpretation can also be a virtue, as we could provide a cross-Python-version marshal module.
Now that the issue of getting bytecodes has been addressed, we can get back to "scanning" or rather: disassembly. As the term "disassembly" implies (in contrast to "scanning"), there's more to this than what could be done via regular expressions. Regular-expression matching can be done in a single pass over the input. The disassembly routines in Python's dis module, by contrast, make more than a single pass. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.
In this package there are a couple of additional passes beyond finding the targets of jump instructions. This is done to make parsing more reliable and to allow us to write a clearer grammar for the lower-level parsing.
Consider for example this Python code:
assert True
The Python disassembler library disassembles this as:
1 0 LOAD_NAME 0 (True)
3 POP_JUMP_IF_TRUE 12
6 LOAD_GLOBAL 1 (AssertionError)
9 RAISE_VARARGS 1
>> 12 LOAD_CONST 0 (None)
15 RETURN_VALUE
Note: the ">>" mark was added as a result of the pass mentioned before that looks for the targets of labels.
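In fact the standard dis module exposes that jump-target information directly. A small illustration using only the standard library (not this package's scanners); exact offsets vary by Python version:

```python
import dis

def jump_targets(source):
    """Return the byte offsets that are the target of some jump,
    i.e. the offsets dis marks with ">>"."""
    code = compile(source, "<example>", "exec")
    return [inst.offset for inst in dis.get_instructions(code)
            if inst.is_jump_target]

print(jump_targets("assert x"))   # prints the offsets dis would mark with ">>"
```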
In our disassembler, we change the LOAD_GLOBAL (AssertionError) instruction into a made-up instruction we call LOAD_ASSERT.
This allows us to write a grammar rule:
assert ::= assert_expr jmp_true LOAD_ASSERT RAISE_VARARGS_1
rather than:
assert ::= assert_expr jmp_true LOAD_GLOBAL RAISE_VARARGS_1
which would also require checking, in the semantic routines, that the name LOAD_GLOBAL loads is AssertionError.
LOAD_CONST can also be turned into LOAD_LAMBDA, LOAD_GENEXPR, LOAD_SETCOMP, or LOAD_LISTCOMP as appropriate.
Other processing turns JUMP_ABSOLUTE opcodes into CONTINUE, JUMP_FORWARD, or JUMP_BACK as appropriate; also RETURN_VALUE can be changed to RETURN_END_IF when appropriate.
In Python 2, CALL_FUNCTION opcode names are turned into CALL_FUNCTION_n, where n is the number of positional arguments the call passes. In Python 3 this transformation is also done, but it is done in a parsing customization phase described in the next section.
At the beginning of some basic blocks, COME_FROM instructions are inserted. In all cases the replaced instruction names are made up and are different from real CPython instruction names.
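To make the idea concrete, here is a rough sketch of such a renaming pass. The Token class and the exact checks here are simplified stand-ins for what the real scanners do:

```python
from dataclasses import dataclass

@dataclass
class Token:
    # simplified stand-in for the scanner's token type
    opname: str
    arg: int = 0
    argval: object = None

def rename_tokens(tokens):
    """Sketch of the renaming pass: swap some real opcode names for
    made-up, more specific ones so the grammar can match on them."""
    for tok in tokens:
        if tok.opname == "LOAD_GLOBAL" and tok.argval == "AssertionError":
            tok.opname = "LOAD_ASSERT"
        elif tok.opname == "CALL_FUNCTION":
            # fold the positional-argument count into the opcode name
            tok.opname = "CALL_FUNCTION_%d" % tok.arg
    return tokens
```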
Is all of this strictly necessary? In some cases such as for "assert", probably not. But as I wrote above, I think it helps clarity. It probably speeds up parsing too.
As just described, "scanning" has already done a fair bit of setup on the tokens that get fed into the parser. Even with all of this, the code allows for still more parser setup before parsing occurs; these additions are called "custom parse rules".
CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example, the CPython bytecode for a + b is
0 LOAD_NAME 0 (a)
3 LOAD_NAME 1 (b)
6 BINARY_ADD
and the grammar rule that covers this is:
binary_expr ::= expr expr binary_op
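In the SPARK-style parsers used here, rules like this are written in the docstrings of p_* methods, which SPARK collects by reflection. A simplified sketch (the real parser classes carry many more rules, and in uncompyle6 the base class comes from the spark_parser package):

```python
from spark import GenericASTBuilder   # the bundled SPARK module

class ExampleParser(GenericASTBuilder):
    # SPARK gathers grammar rules from the docstrings of p_* methods.
    def p_binary_expr(self, args):
        """
        binary_expr ::= expr expr binary_op
        binary_op   ::= BINARY_ADD
        binary_op   ::= BINARY_MULTIPLY
        """
```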
Some instructions like CALL_FUNCTION take an arbitrary number of arguments. As mentioned in the last section, before parsing proper we have changed the opcode to include the number of required positional arguments, e.g. CALL_FUNCTION_0. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.
But what grammar rule would you write for an instruction with an arbitrary number of arguments like CALL_FUNCTION?
We don't want to handle these with grammar rules like this:
call_function ::= CALL_FUNCTION_0
call_function ::= expr CALL_FUNCTION_1
call_function ::= expr expr CALL_FUNCTION_2
...
which would slow down parsing immensely. And we'd also have to put a limit on the number of positional arguments allowed. So instead we add grammar rules specific to the kinds of calls that will be encountered in the instruction sequence to be parsed. That is, we will add:
call_function ::= expr CALL_FUNCTION_1
only if we encounter a CALL_FUNCTION in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-argument function calls there are.
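A sketch of what that customization pass might look like; the function and helper names here are illustrative rather than the package's exact API, though SPARK parsers really do allow rules to be added at runtime via addRule():

```python
def nop_func(self, args):
    # placeholder reduction action; the AST builder constructs the tree itself
    pass

def add_custom_call_rules(parser, tokens):
    """Add 'call_function ::= expr ... CALL_FUNCTION_n' exactly once for
    each distinct n that actually appears in the token stream."""
    seen = set()
    for tok in tokens:
        opname = tok.opname
        if opname.startswith("CALL_FUNCTION_") and opname not in seen:
            seen.add(opname)
            n = int(opname.rsplit("_", 1)[1])        # positional-argument count
            rule = "call_function ::= " + "expr " * n + opname
            parser.addRule(rule, nop_func)           # add the rule at runtime
```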
The semantic analysis pass is what generates Python source code or source code fragments from the parse tree produced by the parser.
In uncompyle/decompyle this is called the Walker class. In uncompyle6 there are two modules, pysource and fragments, and the main classes that walk the tree are called SourceWalker and FragmentsWalker. The two are similar, and FragmentsWalker is a subclass of SourceWalker. However, since FragmentsWalker's needs are a little different, it is a separate class and module.
As with parsing, the semantic analysis phase allows for custom rules to be added before walking the tree. In particular, each of the custom call-function rules needs a semantic-action counterpart.
The Walker classes are subclasses of GenericASTTraversal, which is included in the spark module. The main function in there is the method to traverse the tree. It is called, somewhat erroneously, "preorder". For each node with typestring name name, if the instance has a method called n_name, that is called before walking the children. If there is no such method, the method self.default() is called instead. If the instance has a method called name_exit, that is called after all children have been walked. So in this sense this function is both preorder and postorder combined.
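A stripped-down sketch of that dispatch logic, to make the n_name / name_exit convention concrete (the real GenericASTTraversal also supports pruning and a few other details):

```python
class MiniTraversal:
    """Stripped-down sketch of GenericASTTraversal's tree walk."""

    def typestring(self, node):
        return node.kind          # however the node's typestring name is stored

    def preorder(self, node):
        name = self.typestring(node)
        # pre-visit: call n_<name> if the instance defines it, else default()
        getattr(self, "n_" + name, self.default)(node)
        for child in node:
            self.preorder(child)  # walk the children recursively
        # post-visit: call <name>_exit if the instance defines it
        exit_hook = getattr(self, name + "_exit", None)
        if exit_hook is not None:
            exit_hook(node)

    def default(self, node):
        pass
```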
The default method used here and in the various uncompyle/decompyle derivatives uses three tables: MAP_DIRECT, MAP_R0, and MAP_R.
These use a printf-like syntax to direct substitution from attributes of the nonterminal and its children.
Below we describe how table-driven semantic actions work and give a list of the format specifiers. The default() and engine() methods implement most of what is described below.
First we determine a table (T) and a path to a table key (K) from the node type (N) (other nodes are shown as O):
      N                       N                      N&K
   / | ... \               / | ... \              / | ... \
  O  O      O             O  O      K            O  O      O
            |
            K

MAP_R0 (TABLE_R0)      MAP_R (TABLE_R)      MAP_DIRECT (TABLE_DIRECT)
The default is a direct mapping. The key K is then extracted from the subtree and used to find a table entry T[K], if any. The result is a format string and arguments (a la printf()) for the formatting engine.
Escapes in the format string are:
%c evaluate child N[A] recursively*
%C evaluate children N[A[0]]..N[A[1]-1] recursively, separate by A[2]*
%P same as %C but sets operator precedence
%, print ',' if last %C only printed one item (for tuples--unused)
%| tab to current indentation level
%+ increase current indentation level
%- decrease current indentation level
%{...} evaluate ... in context of N
%% literal '%'
%p evaluate N setting precedence
* indicates an argument (A) required.
A % may be followed by a number (C) in square brackets, which makes the format engine walk down to N[C] before evaluating the escape code.
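To make the table-driven idea concrete, here is a toy engine that handles just a few of the escapes above (%c, %|, %+, %-, %%); the real default()/engine() methods handle the full set plus precedence:

```python
def engine(entry, node, walk, state):
    """Toy version of the formatting engine: interpret a table entry
    (format-string, arg, arg, ...) for a tree node.  `walk` recursively
    formats a child node; `state` holds the output and indentation."""
    fmt, args = entry[0], list(entry[1:])
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            state["out"].append(fmt[i])
            i += 1
            continue
        spec = fmt[i + 1]
        if spec == "c":                              # %c: format child N[A] recursively
            walk(node[args.pop(0)])
        elif spec == "|":                            # %|: emit current indentation
            state["out"].append("    " * state["indent"])
        elif spec == "+":                            # %+: indent one level
            state["indent"] += 1
        elif spec == "-":                            # %-: dedent one level
            state["indent"] -= 1
        elif spec == "%":                            # %%: literal "%"
            state["out"].append("%")
        i += 2
```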
In the fragments deparser, we add an additional format specifier, %x. This takes an argument (src, (dest...)) and copies all of the range attributes from src to dest. For example, in:
'importstmt': ( '%|import %c%x\n', 2, (2,(0,1)), ),
the range information of node 2, indicated by the %c above, is copied to nodes 0 and 1.
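A hedged sketch of the effect of %x, assuming fragment nodes record the extent of the text they produced (the start/finish attribute names are illustrative):

```python
def copy_range(node, src, dests):
    """Sketch of what %x does: copy the source-text extent recorded on
    child `src` onto each child listed in `dests`."""
    start, finish = node[src].start, node[src].finish   # illustrative attribute names
    for d in dests:
        node[d].start, node[d].finish = start, finish

# For the importstmt entry above: copy_range(node, 2, (0, 1))
```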