I've moved my blog to jmcneil.net. This is no longer being updated!

Tuesday, January 20, 2009

Learning How it Really Works

Over the past couple of weeks, I've been trying to dive much deeper into Python than I have in the past. Overall, I have about 4 years of experience writing Python code and a few C extensions. It feels as though I should know more about the platform that has been paying my mortgage.

My goal today was to instantiate a code object manually and piece together a string of bytecode that would simply import another module. I'm by no means an expert at this!

As a guide, I am using the PyCodeObject structure as defined in code.h:

/* Bytecode object */
typedef struct {
    PyObject_HEAD
    int co_argcount;        /* #arguments, except *args */
    int co_nlocals;         /* #local variables */
    int co_stacksize;       /* #entries needed for evaluation stack */
    int co_flags;           /* CO_..., see below */
    PyObject *co_code;      /* instruction opcodes */
    PyObject *co_consts;    /* list (constants used) */
    PyObject *co_names;     /* list of strings (names used) */
    PyObject *co_varnames;  /* tuple of strings (local variable names) */
    PyObject *co_freevars;  /* tuple of strings (free variable names) */
    PyObject *co_cellvars;  /* tuple of strings (cell variable names) */
    /* The rest doesn't count for hash/cmp */
    PyObject *co_filename;  /* string (where it was loaded from) */
    PyObject *co_name;      /* string (name, for reference) */
    int co_firstlineno;     /* first source line number */
    PyObject *co_lnotab;    /* string (encoding addr<->lineno mapping) */
    void *co_zombieframe;   /* for optimization only (see frameobject.c) */
} PyCodeObject;

It's possible to generate a code object from within Python via the types.CodeType class. The documentation states that the object requires 12 arguments, all of which correspond to the above structure members.

class code(object)
| code(argcount, nlocals, stacksize, flags,
| codestring, constants, names,
| varnames, filename, name, firstlineno,
| lnotab[, freevars[, cellvars]])
|
| Create a code object. Not for the faint of heart.
|
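As a quick sanity check, the same fields can be read straight off any code object from Python. The sketch below is mine rather than part of the original experiment: it uses the built-in compile() instead of a .pyc file, and the dictionary of fields is just an illustrative selection. The exact contents of attributes such as co_consts vary between interpreter versions, so they are left out.

```python
# Compile the one-line module without executing it.
code = compile('import this\n', 'testmodule.py', 'exec')

# Each attribute mirrors the PyCodeObject member of the same name.
fields = {
    'co_argcount': code.co_argcount,
    'co_names': code.co_names,
    'co_filename': code.co_filename,
    'co_name': code.co_name,
}
for name in sorted(fields):
    print('%s = %r' % (name, fields[name]))
```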
In my little example, I simply want to "import this", which should trigger a print of The Zen of Python. The first step was to determine what my bytecode string needs to look like. To do this, I wrote a simple Python module and generated a .pyc file. I was then able to open the file and extract what I needed to run my import.

[jeff@marvin ~]$ cat testmodule.py
import this
[jeff@marvin ~]$

Next, import the module so that Python writes testmodule.pyc alongside it.

[jeff@marvin ~]$ python -c 'import testmodule'
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
[jeff@marvin ~]$

Good. Now we have a .pyc file. Next, we'll open the file, skip past the header, and determine what our actual bytecode is.

>>> import dis
>>> import marshal
>>> f = open('testmodule.pyc', 'rb')
>>> f.read(8)
'\xd1\xf2\r\n\x935vI'
>>>

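The first 8 bytes of the file are a header: a 4-byte magic number that identifies the bytecode version, followed by a 4-byte modification timestamp of the source file. Reading them moves the file pointer to the start of the actual marshaled code object. To access that object: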

>>> marshal.load(f)
<code object <module> at 0xb7f43578, file "testmodule.py", line 1>
>>> c = _
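The 8-byte header is specific to the Python 2 interpreter used here. As a hedged sketch for newer interpreters, the snippet below assumes a 16-byte header on Python 3.7 and later (12 bytes on 3.3 through 3.6). The load_pyc helper and the scratch directory are hypothetical names of mine; the module is compiled with py_compile but never imported, so nothing is printed by "this".

```python
import importlib.util
import marshal
import os
import py_compile
import sys
import tempfile

# Assumption: 16-byte header on Python 3.7+, 12 bytes on 3.3-3.6.
HEADER_SIZE = 16 if sys.version_info >= (3, 7) else 12


def load_pyc(path):
    """Return (magic, code object) read from a .pyc file."""
    with open(path, 'rb') as f:
        header = f.read(HEADER_SIZE)
        code = marshal.load(f)
    return header[:4], code


# Hypothetical scratch module: compiled, but never imported or run.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'testmodule.py')
with open(source, 'w') as f:
    f.write('import this\n')
pyc_path = source + 'c'
py_compile.compile(source, cfile=pyc_path, doraise=True)

magic, code = load_pyc(pyc_path)
print(magic == importlib.util.MAGIC_NUMBER, code.co_names)
```

Comparing the first four bytes with importlib.util.MAGIC_NUMBER is a cheap way to confirm that the file was written by the running interpreter before trusting marshal.load on the rest.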

The bytecode string itself is stored in c.co_code. Using the dis module, we can take a look at the bytecode layout.

>>> dis.dis(c)
  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (this)
              9 STORE_NAME               0 (this)
             12 LOAD_CONST               1 (None)
             15 RETURN_VALUE
>>>

OK, so the instruction we're *really* worried about is at offset 6. It is an IMPORT_NAME opcode with a single argument, 0; the (this) that dis prints beside it is the value that argument resolves to. How does that lookup work? The C code that actually executes IMPORT_NAME is located in Python/ceval.c:2088.

...
case IMPORT_NAME:
    w = GETITEM(names, oparg);
    x = PyDict_GetItemString(f->f_builtins, "__import__");
    if (x == NULL) {
        PyErr_SetString(PyExc_ImportError,
                        "__import__ not found");
        break;
    }
...

The first statement executed sheds a bit of light on the details. The '0' argument to the IMPORT_NAME opcode is an index into a names tuple. In our scenario, the corresponding value needs to be the name of the module that we're loading.
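The same lookup can be repeated by hand from Python. This is a minimal sketch that assumes a newer interpreter than the one used in this post, since dis.get_instructions only exists in Python 3.4 and later; the variable names are my own.

```python
import dis

code = compile('import this\n', '<example>', 'exec')

imported = None
# Find the IMPORT_NAME instruction and resolve its operand manually,
# the way GETITEM(names, oparg) does in ceval.c.
for instr in dis.get_instructions(code):
    if instr.opname == 'IMPORT_NAME':
        imported = code.co_names[instr.arg]
        print(instr.offset, instr.opname, instr.arg, imported)
```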

We're going to ignore STORE_NAME. The other opcode we care about is LOAD_CONST. Its argument works the same way, except that it indexes the constants tuple:

...
case LOAD_CONST:
    x = GETITEM(consts, oparg);
...
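A similar hand lookup works for constants. The sketch below again assumes Python 3.4 or later for dis.get_instructions, and it uses a string constant of my own choosing because newer interpreters load some small integers and the trailing None through other opcodes.

```python
import dis

code = compile("greeting = 'hello'\n", '<example>', 'exec')

loaded = []
# Resolve every LOAD_CONST operand against the constants tuple.
for instr in dis.get_instructions(code):
    if instr.opname == 'LOAD_CONST':
        loaded.append(code.co_consts[instr.arg])
print(loaded)
```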

Now we can build up a bytecode string. Note that the indexes in it refer to the constants and names tuples that we will later pass to types.CodeType; we simply haven't built those yet. Our raw bytecode string looks like the following:

d\x01\x00d\x00\x00k\x00\x00d\x00\x00S

As decimal byte values, this is:

100 1 0 100 0 0 107 0 0 100 0 0 83

Running 'dis.dis' on the above code string gives the listing below. With no constants or names tuples to consult, dis can only echo the raw indexes in parentheses:

          0 LOAD_CONST          1 (1)
          3 LOAD_CONST          0 (0)
          6 IMPORT_NAME         0 (0)
          9 LOAD_CONST          0 (0)
         12 RETURN_VALUE
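To make the byte layout explicit, here is a small decoder for this particular string. It is a sketch, not a replacement for dis: the opcode table holds only the three opcodes used here, with the numbers this post relies on (100, 107, 83), and it assumes the Python 2 format in which any opcode at or above 90 is followed by a two-byte little-endian argument. The names OPNAMES and decode are mine.

```python
# Opcode numbers as used by the interpreter in this post.
OPNAMES = {83: 'RETURN_VALUE', 100: 'LOAD_CONST', 107: 'IMPORT_NAME'}
HAVE_ARGUMENT = 90  # opcodes >= this carry a 2-byte argument


def decode(code):
    """Split a Python 2 style bytecode string into (offset, name, arg)."""
    data = bytearray(code)
    out = []
    i = 0
    while i < len(data):
        op = data[i]
        if op >= HAVE_ARGUMENT:
            # Argument is stored low byte first.
            arg = data[i + 1] | (data[i + 2] << 8)
            out.append((i, OPNAMES.get(op, op), arg))
            i += 3
        else:
            out.append((i, OPNAMES.get(op, op), None))
            i += 1
    return out


raw = b'd\x01\x00d\x00\x00k\x00\x00d\x00\x00S'
decoded = decode(raw)
for offset, name, arg in decoded:
    print(offset, name, arg)
```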

So, now we can call types.CodeType and build up a code object using our own home-brewed byte string.

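import types

# LOAD_CONST, LOAD_CONST, IMPORT_NAME, LOAD_CONST, RETURN_VALUE
func_code = 'd\x01\x00d\x00\x00k\x00\x00d\x00\x00S'

# argcount, nlocals, stacksize, flags, codestring, constants, names,
# varnames, filename, name, firstlineno, lnotab.
# stacksize is 2: both LOAD_CONSTs push before IMPORT_NAME pops them.
c = types.CodeType(0, 1, 2, 0, func_code,
                   (None, -1), ('this',), ('this',),
                   'test_filename', 'test_name', 1, '')

eval(c)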
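The constructor call above matches the Python 2 signature; on newer interpreters the argument list is different and version-specific, so hand-assembling a code object is brittle. As a hedged alternative, this sketch assumes Python 3.8 or later, where code objects have a replace() method. It swaps only the names tuple of a compiled "import this" and leaves the bytecode alone. The stand-in module "string" is my choice, picked so that the Zen is not printed.

```python
import string

# Start from compiler output instead of assembling bytes by hand.
original = compile('import this\n', '<example>', 'exec')

# Swap the names tuple; the bytecode itself is untouched, so the
# IMPORT_NAME and STORE_NAME operands now both resolve to 'string'.
patched = original.replace(co_names=('string',))

namespace = {}
exec(patched, namespace)
print(namespace['string'] is string)
```

Because only the tuple changed, this shows that the instruction stream carries indexes rather than names.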

The constants and names tuples are indexed by LOAD_CONST and IMPORT_NAME as shown above, which is how 'IMPORT_NAME 0 (0)' becomes 'IMPORT_NAME 0 (this)'. The third tuple, varnames, is not referenced by any instruction here. The final argument, lnotab, encodes the mapping from bytecode offsets to source line numbers, counted from 'firstlineno'. Passing an empty string simply attributes every instruction to that first line.
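The offset-to-line mapping is easier to see on a code object that spans more than one line. The sketch below uses dis.findlinestarts, which decodes the table without touching its raw encoding; the two-line source is an example of mine. Newer interpreters may also report a synthetic entry for line 0 or None, so only real line numbers are kept.

```python
import dis

code = compile('a = 1\nb = 2\n', '<example>', 'exec')

# Each pair is (bytecode offset, source line) where a new line begins.
line_starts = list(dis.findlinestarts(code))
print(line_starts)

# Keep only real line numbers for a simple summary.
lines = [line for _, line in line_starts if line is not None]
```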

When the hand-built code object c is disassembled, the operands now resolve to the right values:
>>> dis.dis(c)
  1           0 LOAD_CONST               1 (-1)
              3 LOAD_CONST               0 (None)
              6 IMPORT_NAME              0 (this)
              9 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>>

Saved as test.py, the types.CodeType script above does exactly what it should:
[jeff@marvin ~]$ python test.py
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
[jeff@marvin ~]$

1 comment:

Philip Jenvey said...

Note that dis can dump bytecode via the command line:

$ python -m dis testmodule.py
  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (this)
              9 STORE_NAME               0 (this)
             12 LOAD_CONST               1 (None)
             15 RETURN_VALUE