How does this code work?
Table of Contents
- Introduction
- A Tale of Two Kinds of Decompilers
- Use of a Context-Free Grammar Parser
- Decompilation Phases
- Scanning
- Parsing
- Semantic Analysis
Introduction
When I first looked at uncompyle2, I was a little confused as to how this code worked, even though I was very familiar with compilers and compiler technology.
Upon reflection, I think the idea of treating decompilation as a traditional compiler is a good one: it moves this kind of technology towards automation and into the realm of better-understood theory and technology where there has been a lot of research and engineering.
The success of uncompyle6 and decompyle3 is, I think, due to this aspect. Using this approach, it has been able to cover 30 or so separate versions of Python spanning over two decades. And it is able to produce such good results while largely being maintained and updated by one person.
But from the standpoint of most decompilers, this approach is not mainstream and not the way most decompilers work.
And from the standpoint of traditional compilers, as described in the ground-breaking dragon book, this compiler is a little bit different as well. One has to take a step back, understand some of the more foundational principles of compilers, look at things a little bit differently, and adapt a little bit.
From the earliest days, there wasn't much written about how this decompiler worked, let alone the broader idea of how compiling technology needed to be adapted for decompilation. To the extent this was understood by the original authors, it was implicit, and definitely not widely communicated.
I hope to rectify that here.
A Tale of Two Kinds of Decompilers
If you look at the decompiler article on Wikipedia, there are sections on data-flow analysis and type analysis that are just not needed in any decompiler for Python bytecode. That is because objects in a Python bytecode interpreter are the objects of Python; the types don't need to be resolved. Variables have the same name, scope, and use as you find in the Python program.
The Wikipedia article is focused more on general-purpose decompilers. These decompilers start from some machine language and give you some result independent of what programming language you started out with.
In order to do this, the result is often in a "high-level" language of the decompiler's choice. There may be some slight syntactic sugar to make that language look more like a particular high-level language, but you probably won't be able to run the result unmodified through a compiler or interpreter for that high-level language.
In contrast, this Python decompiler decompiles to Python, and specifically the Python version that the bytecode says it was compiled from. It is in a class that is not yet described or circumscribed in the Wikipedia article as of 2022. I think of it as in the class of bytecode for high-level dynamic languages. Ruby and Emacs Lisp are also in this category. I write about this a little in Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages.
In Python, or any given version of Python, there is more or less one translation from Python to bytecode. PyPy alters or adds a few opcodes, but for the most part, its translation to bytecode is exactly the same as CPython's. Graal and Jython are different and compile to the JVM; this decompiler doesn't make an attempt to deparse JVM bytecode back to Python.
Where we sometimes see variation is in the interpreter. Specifically, Pyston and PyPy are JITing interpreters. Because that is below the level of Python bytecode, decompilation for them works the same (or pretty much the same for PyPy) as it does for CPython.
Use of a Context-Free Grammar Parser
We use a grammar to parse Python bytecode back to Python.
There are a number of factors that make this possible and "easy" to do:
- For any given version of Python there is basically one kind of translation from Python to Python bytecode, specifically the CPython translation.
- Python bytecode is pretty high-level (see the short disassembly example just after this list)
- There are relatively short grammars for both Python and Python's AST which we can use to guide us in writing such a grammar for Python bytecode back to Python.
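To get a feel for how high-level the bytecode is, you can ask the standard dis module for the CPython translation of a small function. The exact opcodes vary by version (for example, BINARY_ADD became BINARY_OP in 3.11), but the overall shape tracks the source closely:

```python
import dis

def greet(name):
    return "Hello, " + name

# The disassembly maps almost line-for-line onto the source: load the
# constant "Hello, ", load the variable name, add, return.
dis.dis(greet)
```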
Although we use a grammar, it is an ambiguous grammar. I repeat: the grammar is not LALR(1), LR, SLR, LL(1), etc. It is not something you can use in Yacc or Bison.
For the most part, the grammar is an LR grammar, because the particular Earley-algorithm parser used is pretty efficient at handling that. See again Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages for more details on this kind of parser and potential problems with it.
The use of an ambiguous grammar is frowned upon in designing a programming language. In writing code in a programming language, there shouldn't be ambiguity in what is intended. However, when you decompile bytecode or machine language, there can be several valid representations of that bytecode in Python. Something might decompile to:

    if a and b:
        y = 1

or

    if a:
        if b:
            y = 1

And both of these are equally valid. Writing a decompilation grammar is hard enough without the burden of trying to make it unambiguous, if that can be done at all.
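You can see this for yourself with a small experiment using only the standard library; on many CPython versions the two fragments compile to the same instruction sequence, which is exactly why the decompiler has to choose among equally valid renderings:

```python
import dis

src_and = "if a and b:\n    y = 1\n"
src_nested = "if a:\n    if b:\n        y = 1\n"

# Compile both fragments and compare their disassemblies; on many CPython
# versions the two instruction sequences come out identical.
dis.dis(compile(src_and, "<and>", "exec"))
dis.dis(compile(src_nested, "<nested>", "exec"))
```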
Decompilation Phases
This decompiler has three distinct phases:
- Scanning to a Token stream
- Parsing to a Parse Tree
- Python Code Generation to Python Source Code
These are described below. Each phase has subparts which we will describe in the respective section.
The process is kicked off by calling Code Generation, which first calls the scanner, gets results back from that, passes those on to the parser, gets a tree back from that, and then produces Python source code.
Scanning
"Scanner" is a term used in programming-language compilers and usually refers to the token recognition from ASCII or unicode characters. Often tokens are described by regular expressions. Typical kinds of tokens you'll find are identifiers, reserved words, strings, numbers and so on.
In the decompiler, the input to the scanner is not ASCII characters, but CPython bytecodes which we first disassemble into bytecode instructions.
In Python, the `marshal` library assists in this process: its `load` function reads a CPython-compiled file and unmarshals its contents into a "code" object. To extract instructions from the code object, the Python module `dis` can be used.
However, these modules assume that the bytecode you want to disassemble is the same bytecode as that of the running Python interpreter. When this is not the case, we can't use these modules. Furthermore, earlier versions of the `dis` module do not provide the convenient instruction abstraction. And when using a Python interpreter that is running bytecode different from the bytecode that is to be decompiled, we need to deal with the fact that Python's "code" type, the Python object that holds the bytecode, can vary. We handle all of these problems using a cross-version disassembly library called xdis.
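For the simple same-version case, the standard library alone is enough. The sketch below reads a .pyc file, skips its header (whose size differs across Python versions, one of the details xdis abstracts away), and disassembles the embedded code object:

```python
import dis
import marshal

def disassemble_pyc(path, header_size=16):
    # The header is 16 bytes for CPython 3.7+, 12 for 3.3-3.6, and 8 for
    # Python 2 -- one of the many version differences that xdis hides.
    with open(path, "rb") as f:
        f.read(header_size)              # magic, flags, timestamp, size
        code = marshal.loads(f.read())   # the top-level code object
    dis.dis(code)

# This only works when the .pyc was produced by the running interpreter;
# for anything else, use xdis instead.
# disassemble_pyc("__pycache__/example.cpython-311.pyc")
```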
As the term "disassembly" (in contrast to "scanning") implies, there is more to this than what could easily be done via regular expressions alone. Regular-expression matching can be done in a single pass over the input. Routines in the `xdis` module make more than a single pass over the extracted bytecode. In particular, a pass is made over the instruction sequence to find the targets of jump instructions.
While newer versions of the decompiler, decompyle3 in particular, use the richer instruction format that `xdis` provides, in uncompyle6 there are still remnants of the raw Python bytecode format in the scanners for the older Python versions. This tends to make the older code a little more awkward to work with. This is especially true if you look at the scanner code in uncompyle2.
As an example, in the Python 2.6 scanner, `scanner26.py`, that is in uncompyle6, you may see a reference to `self.code[offset+3]`. The magic value `offset+3` at that point in the code refers to the offset of the next instruction. The increment of 3 is determined by the value of the opcode at `self.code[offset]`. In this place in the code, the opcode is `JUMP_ABSOLUTE`; the `JUMP_ABSOLUTE` opcode with its two-byte operand is a total of 3 bytes in Python 2.6.
Newer code works with instructions rather than bytecode. This simplifies things: at instruction `i` the next instruction is always `self.insts[i+1]`, no matter what the opcode value is or what Python version the bytecode comes from.
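The standard library's dis.get_instructions (Python 3.4+) gives a flavor of this instruction abstraction; xdis provides a similar, cross-version one with additional fields:

```python
import dis

def add(a, b):
    return a + b

# Each Instruction carries opname, opcode, arg, argval, offset, and more,
# so "the next instruction" is simply the next element of the list, no
# matter how many bytes each opcode occupies in the raw bytecode.
insts = list(dis.get_instructions(add))
for i, inst in enumerate(insts):
    print(i, inst.offset, inst.opname, inst.argval)
```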
In the first part of the scanner, it converts the bytecode stream (a sequence of bytes) into bytecode instructions via `xdis`. In this conversion, `EXTENDED_ARG` opcodes and the information associated with them (e.g. their offset values) are folded into the instruction that follows.
This raises the question, though, of what to do with the removed offset. The label of the resulting instruction is turned into a string which contains both offsets.
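Here is a rough, simplified sketch of that folding. It works on plain (offset, opname, arg) triples rather than the instruction objects xdis actually produces, but it shows both the operand folding and the combined-offset label described above:

```python
def fold_extended_args(raw_insts):
    # raw_insts: a list of (offset, opname, arg) triples -- a simplified
    # stand-in for the richer instruction objects xdis produces.
    result = []
    ext, dropped_offsets = 0, []
    for offset, opname, arg in raw_insts:
        if opname == "EXTENDED_ARG":
            # Each EXTENDED_ARG supplies high bits of the operand of the
            # instruction that eventually follows (3.6+ wordcode encoding).
            ext = (ext | arg) << 8
            dropped_offsets.append(offset)
            continue
        full_arg = (ext | arg) if arg is not None else None
        # Keep the dropped offsets in the label so that jumps which targeted
        # the EXTENDED_ARG are not lost.
        label = "_".join(str(o) for o in dropped_offsets + [offset])
        result.append((label, opname, full_arg))
        ext, dropped_offsets = 0, []
    return result

# Example: an EXTENDED_ARG at offset 0 feeding the jump operand at offset 2.
print(fold_extended_args([(0, "EXTENDED_ARG", 1), (2, "JUMP_ABSOLUTE", 4)]))
# -> [('0_2', 'JUMP_ABSOLUTE', 260)]
```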
After converting the bytecode stream into bytecode instructions, the bytecode instructions are munged or ingested into a token stream. In this stream, the name part of the token is what is matched on in the parser. Both the bytecode instruction stream and the token stream use the instruction format that `xdis` uses to produce bytecode instructions.
We describe next the difference between a bytecode instruction and a token (instruction).
In a traditional compiler, the end result of scanning ASCII or Unicode characters is a sequence of tokens, which classifies each of the string fragments seen in the source code while also retaining other information. The idea here is that at the next level, parsing, you don't care whether the string segment was the string "3" or "5", but rather that both of these are of the same category: here, a number. So the token is given its category as a name rather than the string fragment value. Other categories might be string, reserved word, or identifier. From the parser's standpoint, matching is done on the token category, called the token's name, not its specific value, e.g. "3" or "5". Actually, the value here for a number would be the number's representation, not its string representation.
To some extent, the categorization process isn't absolutely necessary; e.g. you could instead add additional grammar rules for each kind of number. But categorizing simplifies things, makes the grammar smaller, and makes it easier to follow the compilation process.
Note that even though the grammar parses on the token name, it still can get access to the specific value that was seen. In particular, it needs to access this value when the compiler goes to produce code instructions. But the idea is that the way it would produce code for working with the number 3 versus 5 would be the same, except that the value in an operand would be different, e.g. 3 versus 5.
In a decompiler the token name is often the instruction's opcode name as found in the bytecode. However, there are some tweaks that we need to pay attention to.
Some bytecode opcode names are general; there is benefit in classifying such an opcode name further.
An example of this is `LOAD_CONST`. The operand here can be a number, a string, a dictionary, a code object, an assertion, or other things.
In some grammar rules the constant has to be a specific kind. For example, the rule for calling functions needs a code object at a particular place in the rule. While it is acceptable to use `LOAD_CONST` for that code object, things are clearer and less ambiguous (and thus faster) if we can match only against constants that are code objects. So when the scanner detects a `LOAD_CONST` instruction with a code-object operand, it will use the token name `LOAD_CODE` instead of the opcode name `LOAD_CONST`.
`LOAD_CONST` can also be turned into `LOAD_ASSERT`, `LOAD_LAMBDA`, `LOAD_GENEXPR`, `LOAD_SETCOMP`, or `LOAD_LISTCOMP` as appropriate.
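A simplified, illustrative sketch of this specialization (the real scanners handle more cases and work on xdis instruction objects, but the test on the code object's name is roughly this):

```python
from types import CodeType

def specialize_load_const(opname, operand):
    # Give LOAD_CONST a more specific token name when the operand tells us
    # more about what kind of constant it is.
    if opname != "LOAD_CONST":
        return opname
    if isinstance(operand, CodeType):
        special = {"<lambda>": "LOAD_LAMBDA", "<genexpr>": "LOAD_GENEXPR",
                   "<setcomp>": "LOAD_SETCOMP", "<listcomp>": "LOAD_LISTCOMP"}
        return special.get(operand.co_name, "LOAD_CODE")
    return "LOAD_CONST"

print(specialize_load_const("LOAD_CONST", compile("x * 2", "<example>", "eval")))
# -> LOAD_CODE
```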
Here are other opcode-name specializations that can happen: `JUMP_ABSOLUTE` can turn into `CONTINUE`, `JUMP_FORWARD`, `JUMP_BACK`, or `BREAK` as appropriate. Currently in Python, `JUMP_BACK` always indicates a looping jump. `CONTINUE` and `BREAK` also have to appear in loops.
To speed parsing up greatly, and to disambiguate potential parsing ambiguity, the decompiler creates custom grammar rules per code object (function, lambda, generator, or comprehension). These custom rules are triggered when the scanner finds specific instructions in the instruction stream. Some of the terminal symbols in these custom rules contain a token name that is fabricated from the opcode plus information that is only available in the operand.
An example of this is found in the bytecode to build a list. The opcode name used in the bytecode here is `BUILD_LIST`. If a particular build-list instruction has an operand value of 2 (this pops two items off of the evaluation stack, and the list will contain those two values), then the scanner will emit a token instruction whose token name is the opcode with the operand value folded into the resulting token name. Here, the token name would be `BUILD_LIST_2`, which effectively replaces the `BUILD_LIST` instruction.
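The operand folding itself is tiny; an illustrative sketch:

```python
def specialize_by_operand(opname, arg):
    # Fold the operand count into the token name for variable-arity opcodes
    # so the grammar can match on, say, BUILD_LIST_2 directly.  The real
    # scanners do this for a whole family of opcodes.
    if opname in ("BUILD_LIST", "BUILD_TUPLE", "BUILD_SET", "CALL_FUNCTION"):
        return "%s_%d" % (opname, arg)
    return opname

print(specialize_by_operand("BUILD_LIST", 2))   # -> BUILD_LIST_2
```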
In massaging the bytecode instruction stream into a token stream, a common theme we see is that information from somewhere is made explicit in the token name. Above, this information has been derived from the operand value of an instruction.
There is one broad area, though, where additional information needs to be made explicit in the token stream: control-flow information.
COME_FROM vs Basic Block and Dominator information
to be continued....
Parsing
As just described, "scanning" or ingesting has already done a fair bit of setup to the tokens that get fed into the parser. Even with all of this, the code allows for even more parser setup before the parsing occurs. This additional setup takes the form of "custom parse rules".
CPython bytecode is stack based: instructions push and pop values from an evaluation stack. For example, the CPython bytecode for `a + b` is:

    0 LOAD_NAME                0 (a)
    3 LOAD_NAME                1 (b)
    6 BINARY_ADD

and the grammar rule that covers this is:

    binary_expr ::= expr expr binary_op
Some instructions, like `CALL_FUNCTION`, take an arbitrary number of arguments. As mentioned in the last section, before parsing proper we have changed the opcode to include the number of required positional arguments, e.g. `CALL_FUNCTION_0`. For a program to parse correctly, every instruction sequence has to be covered by a grammar rule.
But what grammar rule would you write for an instruction with an arbitrary number of arguments like `CALL_FUNCTION`?
We don't want to handle these with grammar rules like this:

    call_function ::= CALL_FUNCTION_0
    call_function ::= expr CALL_FUNCTION_1
    call_function ::= expr expr CALL_FUNCTION_2
    ...

which would slow down parsing immensely. And we'd also have to put a limit on the number of positional arguments allowed. So instead we add grammar rules that are specific to those kinds of calls that will be encountered in the instruction sequence to be parsed. That is, we will add:

    call_function ::= expr CALL_FUNCTION_1

only if we encounter a `CALL_FUNCTION` in the instruction sequence that requires exactly one positional argument. And we add that grammar rule only once, no matter how many single-parameter function calls there are.
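As an illustrative sketch (the helper below is made up; the real code feeds rule strings like these to the underlying Earley parser), the custom call rules for one code object might be generated along these lines:

```python
def custom_call_rules(token_names):
    # token_names: the set of token names seen in one code object's token
    # stream.  For each distinct CALL_FUNCTION_<n>, produce one grammar rule
    # with exactly <n> "expr" nonterminals, mirroring the rules shown above.
    rules = []
    for name in sorted(token_names):
        if name.startswith("CALL_FUNCTION_") and name.rsplit("_", 1)[1].isdigit():
            n = int(name.rsplit("_", 1)[1])
            rules.append("call_function ::= " + "expr " * n + name)
    return rules

print(custom_call_rules({"LOAD_NAME", "CALL_FUNCTION_1", "CALL_FUNCTION_1"}))
# -> ['call_function ::= expr CALL_FUNCTION_1']
```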
Semantic Analysis
The semantic analysis pass is what generates Python source code or source code fragments from the parse tree produced by the parser.
However, before turning the parse tree into source text, there is a "transform" phase which transforms idioms of the tree (and hence source code) into other idioms (and hence presumably more idiomatic source code).
Here are some kinds of transformations that occur:

    if x:
        a
    else:
        if y:
            b

becomes:

    if x:
        a
    elif y:
        b

Or:

    if not a:
        raise AssertionError

becomes:

    assert a

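To give a flavor of what such a transform looks like, here is a toy version of the if/else-to-elif rewrite over a simplified tree representation. The node kind names here are illustrative, not necessarily the ones the decompiler uses:

```python
from collections import namedtuple

# A toy tree node: "kind" is the nonterminal name, "children" a tuple of
# subtrees (or strings for leaves).  The decompiler's real nodes are richer.
Node = namedtuple("Node", "kind children")

def to_elif(node):
    # If the else branch of an if/else consists of nothing but another
    # if-statement, relabel the tree so the source generator can emit
    # "elif" instead of a nested, further-indented "if".
    if node.kind == "ifelsestmt":
        else_suite = node.children[-1]
        if len(else_suite.children) == 1 and else_suite.children[0].kind in ("ifstmt", "ifelsestmt"):
            return Node("ifelifstmt", node.children[:-1] + (else_suite.children[0],))
    return node

tree = Node("ifelsestmt",
            (Node("testexpr", ("x",)),
             Node("suite", ("a",)),
             Node("suite", (Node("ifstmt", (Node("testexpr", ("y",)), Node("suite", ("b",)))),))))
print(to_elif(tree).kind)   # -> ifelifstmt
```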
Docstrings that are found in the code object are added as a child node of the thing the docstring is documenting.
In uncompyle/decompyle, the class that walks the parse tree to produce source text is the `Walker` class in the Walker module. In uncompyle6 there are two modules, `pysource` and `fragments`, and the main classes that walk the tree are called `SourceWalker` and `FragmentsWalker`. The two are similar, and `FragmentsWalker` is a subclass of `SourceWalker`. However, since `FragmentsWalker`'s needs are a little different, it is a separate class and module.
As with parsing, the semantic analysis phase allows for custom rules to be added before walking the tree. In particular, each of the custom function grammar rules needs a semantic-action counterpart.
The Walker classes are subclasses of `GenericASTTraversal`, which is included in the spark module. The main function in there is the method to traverse the tree. It is called, somewhat erroneously, "preorder". For each node with typestring name name, if the instance has a method called `n_name`, that is called before walking the children. If there is no such method defined, the method `self.default()` is called instead. If the instance has a method called `name_exit`, that is called after all children have been walked. So in this sense, this function is both preorder and postorder combined.
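A minimal sketch of that dispatch, assuming nodes with a kind string and a children sequence (the real GenericASTTraversal in spark has more hooks, such as pruning a subtree):

```python
class TreeWalker:
    # A minimal sketch of the preorder/postorder dispatch described above.

    def preorder(self, node):
        name = node.kind
        # n_<name>, if present, is called before the children are visited ...
        getattr(self, "n_" + name, self.default)(node)
        for child in getattr(node, "children", ()):
            if hasattr(child, "kind"):     # skip leaf tokens/strings
                self.preorder(child)
        # ... and <name>_exit, if present, after all children have been.
        exit_hook = getattr(self, name + "_exit", None)
        if exit_hook is not None:
            exit_hook(node)

    def default(self, node):
        # Called when no n_<name> method exists for this node kind.
        pass
```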
For a detailed explanation of the machinations in semantic analysis see Table driven semantic actions.