Skip to content

Adding a tree transformation phase to uncompyle6

R. Bernstein edited this page Jul 12, 2018 · 20 revisions

Table of Contents

Adding a tree-transformation phase to uncompyle6

The problem

In Issue #178 "Poor if statement formatting" it was noted that

if True:
    pass
elif False:
    pass
elif not True:
    pass

is decompiled from bytecode to

if True:
    pass
else:
    if False:
        pass
    else:
        if not True:
            pass

Thoughts around fixing and a note about AST's.

My thoughts on how to structure a decompiler are in Decompilation at Runtime and the Design of a Decompiler for Dynamic Interpreted Languages and in that you'll see in Figure 1 that there is an optional phase in between tree building and final text generation a box called "Macro expansion" where you can transform the tree according to your liking. Here,in that phase we would transform the else: if into elsif while adjusting the indentation. Currently uncompyle6 has no such phase.

Alexander Sieusahai has expressed interest in undertaking this. Thanks Alex! So here is a writeup of my current thoughts around this phase.

Alex mentions though "ast", and already there is a possible source of confusion here. First, the tree we produce is "abstract" in the sense that it imposes a structure on top of the instructions. And yes, it definitely is a "tree". But it is not abstract in the other sense that that it is used to mean in that it is compact to the point of having only the essential nodes and removes unnecessary syntax. In the case of the source code that would be semicolons or newlines. In our situation where we start out with an the opcode names of an instruction stream, we keep all the tokens intact.

Also, it is not the same thing as Python's AST. We do however try to keep the nonterminal or tree node names as similar as possible. So while uncompyle2 may have a tree name like designator, in uncompyle6 it is called store to match the corresponding Python AST name Store. Note that in contrast to AST's CamelCase use lowercase snake_case to remind people that this ain't Python's AST. (Are there any volunteers out there to write a translator to convert uncompyle6's tree into Python's AST?)

Design of a Tree-Transformation Phase

Ok, so with all of that stuff out of the way, let's get to work on this!

We basically want to transform one tree into another. This will mean adding another phase to uncompyle6, but to start out let's just add this to the semantics directory, and perhaps later we'll split that out.

Let's look at the tree with this cool recently-added option --tree+ where I include the templating instructions. For the code above you get:

    ifelsestmt (5): ('%|if %c:\n%+%c%-%|else:\n%+%c%-', 0, 1, 3)
         0. testexpr
            testfalse (2)
                 0. expr
                    L.   1       0  LOAD_NAME             0  'True'
                 1. jmp_false (2)
                     0.              3  JUMP_IF_FALSE         4  'to 10'
                     1.              6  POP_TOP
         1. pass: ('%|pass\n',)
         2. jf_cf_pop (3)
             0. L.   2       7  JUMP_FORWARD         23  'to 33'
             1.           10_0  COME_FROM             3  '3'
             2.             10  POP_TOP
         3. else_suite
            suite_stmts
                stmt
                    ifelsestmt (5): ('%|if %c:\n%+%c%-%|else:\n%+%c%-', 0, 1, 3)
                         0. testexpr
                            testfalse (2)
         4.           33_2  COME_FROM             7  '7'
...

Table-driven semantic actions describes how to interpret the cryptic stuff that starts out ('%|if %c

Boiling this down a bit we see the that basic templating rule used is: ifelsestmt: ('%|if %c:\n%+%c%-%|else:\n%+%c%-', 0, 1, 3)

The (5) previously shown after ifelsestmt just means that there are 5 children. But only use 3 of them, at index 0, 1, and 3 respectively, are used. An "Abstract" Syntax tree would have removed children at index 2 and 4.

If you look at uncompyle6/semantics/consts.py around line 264 you'll see

    'ifelifstmt':	( '%|if %c:\n%+%c%-%c', 0, 1, 3 ),

so this is what we want to get used instead. So what needs to get done is to write a rule in some as-yet-to-be implemented tree transformation system. Such a rule would indicate that if we see a tree pattern like:

ifelsestmt [testexpr testfalse . else_suite [suite_stmts [stmt [ifelsestmt]]] . ]

transform that to:

ifelifstmt [testexpr testfalse . ifelsestmt]

A little about the jargon used above to direct tree transformation. Something like ifelsestmt represents a AST object defined in uncompyle6/parsers/astnode. (And yes, I see that name is poorly named. I will have to change that sometime.) The "kind" field of the object would change from ifelsestmt to ifelifstmt. The brackets indicate that what is enclosed in the brackets are children of the tree node that precede it. And the dot means that you can match any AST object.

I'm sure code to do this kind of thing has been written many times, especially in the LISP community. For example Emacs Lisp has a pcase pattern matching macro. The thing can be done stupidly by iterating over the entire tree for each kind of transform. More clever though is to iterate once and try the set of patterns on each node.

Other places this can be used

The next most glaring situation is where uncompyle6 produces

if not expr:
   RaiseAssertion ...

instead of:

assert expr

Finally I'll point out that there already is a bit of tree hackery inside uncompyle/semantics/pysource.py to remove an assignment to __doc__ from the tree, (see around line 2121) which has:

   if do_doc and self.hide_internal:

or extraneous return statements from the end of the main program, the end of the module, or the end of some functions; See function is_return_none() in pysource.py.

If we had the better tree transformation phase and the rules in a nice dictionary, all of that code would be much cleaner.

Clone this wiki locally