For the past month or so, I’ve been trying to understand what appears to be a black art mostly because of lacking documentation – Python bytecode generation and peephole optimization. Some notes from the study for the benefit of IRC-mate ‘jstatm’ and anyone else living on similar planes of insanity.
Although bytecode applies to objects other than functions such as tracebacks, dictionaries and strings, I am only interested in optimization and flow analysis of class methods, functions and lambdas. Lets get to the action straight away with an example. Here is a small python program to disassemble and display the bytecode of a function in human readable form.
#!/usr/bin/python
import sys
import opcode
# A simple function to disassemble and ponder over
# Sum of even numbers from 0 - 100
def foo_simple():
sum = 0
# For the heck of creating a closure in the byte-code
# use (n+sum) instead of n. Even + Even = Even in any case
for i in filter(lambda n: ( (n+sum) % 2 == 0), range(101)):
sum += i
if 1 > 0:
print 'Nobody Expects The Spanish Inquisition !'
return sum
# The Byte-Code decompiler. Returns a list of tuples that
# represent the opcode and arguments in the code object
# Reference: The python dis module
def decompile(code_object):
code = code_object.co_code
variables = code_object.co_cellvars + code_object.co_freevars
instructions = []
n = len(code)
i = 0
e = 0
while i < n:
i_offset = i
i_opcode = ord(code[i])
i = i + 1
if i_opcode >= opcode.HAVE_ARGUMENT:
# WTF: I can't type 8 in wordpress
# It gets converted to a frigging smiley !
i_argument = ord(code[i]) + (ord(code[i+1]) << (4*2)) + e
i = i +2
if i_opcode == opcode.EXTENDED_ARG:
e = iarg << 16
else:
e = 0
if i_opcode in opcode.hasconst:
i_arg_value = repr(code_object.co_consts[i_argument])
i_arg_type = 'CONSTANT'
elif i_opcode in opcode.hasname:
i_arg_value = code_object.co_names[i_argument]
i_arg_type = 'GLOBAL VARIABLE'
elif i_opcode in opcode.hasjrel:
i_arg_value = repr(i + i_argument)
i_arg_type = 'RELATIVE JUMP'
elif i_opcode in opcode.haslocal:
i_arg_value = code_object.co_varnames[i_argument]
i_arg_type = 'LOCAL VARIABLE'
elif i_opcode in opcode.hascompare:
i_arg_value = opcode.cmp_op[i_argument]
i_arg_type = 'COMPARE OPERATOR'
elif i_opcode in opcode.hasfree:
i_arg_value = variables[i_argument]
i_arg_type = 'FREE VARIABLE'
else:
i_arg_value = i_argument
i_arg_type = 'OTHER'
else:
i_argument = None
i_arg_value = None
i_arg_type = None
instructions.append( (i_offset, i_opcode, opcode.opname[i_opcode],\
i_argument, i_arg_type, i_arg_value) )
return instructions
# Print the byte code in homo-sapien compatible format
def pretty_print(instructions):
print '%5s %-20s %3s %5s %-20s %s' % \
('OFFSET', 'INSTRUCTION', 'OPCODE', 'ARG', 'TYPE', 'VALUE')
for (offset, op, name, argument, argtype, argvalue) in instructions:
print '%5d %-20s (%3d) ' % (offset, name, op),
if argument != None:
print '%5d %-20s (%s)' % (argument, argtype, argvalue),
print
def main():
pretty_print(decompile(foo_simple.func_code))
# print foo_simple()
if __name__ == '__main__':
main()
#:~
If you look at the generated code carefully, you will notice that the CPython virtual machine is heavily stack based. The bytecode of
def foo_stupid():
i = 0
i = i + 1
return i
would look something like this
OFFSET INSTRUCTION OPCODE ARG TYPE VALUE 0 LOAD_CONST (100) 1 CONSTANT (0) 3 STORE_FAST (125) 0 LOCAL VARIABLE (i) 6 LOAD_FAST (124) 0 LOCAL VARIABLE (i) 9 LOAD_CONST (100) 2 CONSTANT (1) 12 BINARY_ADD ( 23) 13 STORE_FAST (125) 0 LOCAL VARIABLE (i) 16 LOAD_FAST (124) 0 LOCAL VARIABLE (i) 19 RETURN_VALUE ( 83)
Some of you would have noticed that the instruction sequence looks senile. (STORE, LOAD) sequences on the same variable such as (3,6) and (13,16) are totally pointless. These are clear candidates for optimization. In fact if we keep building intelligence into the byte-code optimizer, it could eventually output something like
OFFSET INSTRUCTION OPCODE ARG TYPE VALUE 0 LOAD_CONST (100) 1 CONSTANT (1) 3 RETURN_VALUE ( 83)
(For comparison, try writing a C program equivalent to the foo_stupid() function and compile it with gcc -O3 -S and look at the generated assembly. It would simply be a mov $1, %eax followed by a ret for the 32 bit x86 ABI. No frame pointer thank you !)
Quite astonishingly, running the program with python -O (byte code optimization enabled) did nothing to change the emitted byte code for foo_stupid(). Running through optimize_code(), the peephole optimizer in Python/compile.c confirms that foo_stupid() was a bad example to choose !
And lastly, I should mention a couple of projects that relate to Python and Bytecode Optimization in loosely coupled ways:
Psyco:
Psyco’s Just in time specialization is a good example of the kind of optimization that can make a serious difference. psyco emits native machine code during the JIT phase. It uses runtime profiling to find out type information about objects and starts dynamically patching machine code at runtime based on the profiler’s feedback. In simple terms if you have a function like
def f(a, b): return (2*a + 3*b)
Duck typing prevents a lot of optimization because a,b can be integers, long integers, strings, lists, tuples, floats or even complex numbers. If a function dispatch occurs with integer arguments for a and b, pysco creates a version of f that is optimized for integer values of a and b. Further calls with integer arguments jump to this version. Later if you call the same function with floating point arguments, another optimized version of the function is emitted. It hogs memory but runs way too faster than the CPython VM instructions. My only pet-peeve is that the machine code is patched in memory and there is no way to store it to the disk.
PyPy:
The most interesting python project in this area currently is PyPy. PyPy is a self-hosting python implementation that translates Python to lower level languages. PyPy is written in python itself or rather a restricted subset called RPython (Which omits some of the dynamic aspects of python). The best part about PyPy is that the translation into low-level languages is “pluggable”. PyPy can produce CPython or Jython or IronPython or any arbitrary Intermediate code format. Very slick !
I am particularly interested in the LLVM code generation capabilites of PyPy. The Low-Level Virtual Machine (LLVM) project itself is very mature and does some serious optimization of its intermediate code. And when coupled with the fact that it has backends for x86/64, SPARC, Power, Alpha and ARM, this gives us the enviable prospect of translating python code into native machine code.
A very good tool to experiment with the CPython bytecode is the aptly named byteplay project: http://code.google.com/p/byteplay/
Not to forget the excellent peak.util.Assembler
Filed under: Open Source, Programming, Python
Though I didn’t get most part of what you wrote, I tried this for the following function.
def foo_simple(a,b):
if(a>10):
return(1);
return(foo_simple(2*a,b) + 3*b)
and this is what I got
OFFSET INSTRUCTION OPCODE ARG TYPE VALUE
0 LOAD_FAST (124) 0 LOCAL VARIABLE (a)
3 LOAD_CONST (100) 1 CONSTANT (10)
6 COMPARE_OP (106) 4 COMPARE OPERATOR (>)
9 JUMP_IF_FALSE (111) 8 RELATIVE JUMP (20)
12 POP_TOP ( 1)
13 LOAD_CONST (100) 2 CONSTANT (1)
16 RETURN_VALUE ( 83)
17 JUMP_FORWARD (110) 1 RELATIVE JUMP (21)
20 POP_TOP ( 1)
21 LOAD_GLOBAL (116) 0 GLOBAL VARIABLE (foo_simple)
24 LOAD_CONST (100) 3 CONSTANT (2)
27 LOAD_FAST (124) 0 LOCAL VARIABLE (a)
30 BINARY_MULTIPLY ( 20)
31 LOAD_FAST (124) 1 LOCAL VARIABLE (b)
34 CALL_FUNCTION (131) 2 OTHER (2)
37 LOAD_CONST (100) 4 CONSTANT (3)
40 LOAD_FAST (124) 1 LOCAL VARIABLE (b)
43 BINARY_MULTIPLY ( 20)
44 BINARY_ADD ( 23)
45 RETURN_VALUE ( 83)
[...] sigfpe wrote an interesting post today onHere’s a quick excerpt [...]
Interesting post. I’m just starting to look into Java bytecode and most of the stuff in your post is relevant. Check out my post on binary translation. It’s a really high-level overview but I know the JIT compilers use it to make things blazing fast, (near native.)
- cory
The stdlib includes a disassembler: http://python.org/doc/2.5/lib/module-dis.html
And PyPI lists my assembler: http://pypi.python.org/pypi/BytecodeAssembler
To complement Phillip’s posting: Instead of using this huge decompile function, try:
def foo_simple():
[...]
import dis
dis.dis(foo_simple)
Hey Guys,
I did know about the dis module. See Line 20 in the source
However I don’t like two things about it.
1. It just prints the instruction sequence on stdout. I would rather have it as a list of tuples that I can process programatically – detect instruction sequences for example.
2. It prints out blocks of instructions corresponding to source lines. I would rather prefer to group instruction sequences corresponding to Basic Blocks. I have an augmented pretty_print version that does this.
Anyway these are just some random notes from my study. The disassemble function was for my own understanding – not a replacement for the dis() module.
[...] online community. The best part is … it’s all 100% free! Check them out here: Join Hey Nielsen! Exploring Python Bytecode « Thermal Noise saved by 1 others Gaararox125 bookmarked on 01/08/08 | [...]
I’m sure you know this, but the optimizations you’re talking about are not peephole optimizations. In typical compilers, these optimizations are handled on the DAGs, i.e. before code generation or peephole optimization. Peephole optimization typically refers to the last stage in optimization that just removes instructions that make little sense.
I think it was a great idea to peep into the pyc file.
Good work !
jeff
god damned !!!!
is there a website where i can figure out how to use byteplay ?
there seems to be no real documentation about byteplay,
not any instructions on how to use it on the web!!
i want to “disassemble” a .pyc file, modify and “reassemble” it
could anybody explain this step by step for a non python supergeek
is byteplay the best way to to this anyway?
(sorry for swaring!) zeo
[...] idea, it’s just a nudge in that direction, to see what’s possible. Thanks are due to Thermal Noise for getting me pointed in the “right” direction for the bytecode stuff. [...]
Alex Martelli’s Talk: http://www.youtube.com/watch?v=23s9Wc3aWGY
Covers some bytecode towards the end. Not a lot of information – but still useful.
Plus Py3K has a new peephole.c with mildly interesting optimizations. But the best progress here has been through unladen swallow – http://code.google.com/p/unladen-swallow/. They have a LLVM codegen working.
I might do a more up2date “state of the python runtime” post soon.