I've moved my blog to jmcneil.net. This is no longer being updated!

Thursday, October 23, 2008

Guppy

Over the course of the last two months or so, I've been consolidating
a small collection of proprietary XML-RPC daemons down into a series of
egg-based plugins for twistd. So far, I'm very happy with the result.

Yesterday, I started doing some initial load testing to watch for file
descriptor and memory leaks. I noticed that the 'twistd' process started
at about 19 MB of memory and slowly grew as long as I kept a steady
stream of method calls going.

I had some initial trouble trying to debug the leak, but I'm actually
pretty happy with what I came up with.

I came across Guppy (http://guppy-pe.sourceforge.net/), which is a very
nice little tool that lets one dump Python heap statistics. The issue I had,
though, is that my application is a daemon. I need to trigger dumps as
I increase load.

My solution: I added the following to one of the modules that contains a subset of my XML-RPC methods.


import os
import time

try:
    from guppy import hpy

    @exposed
    def dumpheap():
        """
        Dumps Python heap information to a temporary file.

        Args: None
        Returns: string filename
        """
        filename = '/tmp/heap.%d.%.5f' % (os.getpid(), time.time())
        context = hpy()
        context.heap().dump(filename)
        return filename

except ImportError:
    pass


The '@exposed' decorator is an internal function that wires up the
XML-RPC plumbing for dumpheap. So now, I can trigger a heap statistics
dump simply by calling an XML-RPC method. Easy!
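The same trigger-a-dump-over-XML-RPC idea can be sketched with nothing but the standard library. Since guppy and my internal '@exposed' decorator aren't available here, tracemalloc stands in for guppy and SimpleXMLRPCServer stands in for the twistd plumbing; all the names below are illustrative, not my production code.

```python
import os
import tempfile
import threading
import time
import tracemalloc
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def dumpheap():
    """Write current heap statistics to a temp file; return the filename.

    tracemalloc is a stdlib stand-in for guppy's hpy() heap dump.
    """
    filename = os.path.join(
        tempfile.gettempdir(), 'heap.%d.%.5f' % (os.getpid(), time.time()))
    snapshot = tracemalloc.take_snapshot()
    with open(filename, 'w') as fh:
        fh.write('heap snapshot pid=%d\n' % os.getpid())
        for stat in snapshot.statistics('lineno')[:25]:
            fh.write('%s\n' % stat)
    return filename


tracemalloc.start()

# Expose dumpheap over XML-RPC on an ephemeral local port.
server = SimpleXMLRPCServer(('127.0.0.1', 0), logRequests=False)
server.register_function(dumpheap)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Trigger a dump remotely, just as a load-test script would mid-run.
proxy = ServerProxy('http://127.0.0.1:%d/' % port)
dumpfile = proxy.dumpheap()
print(dumpfile)
server.shutdown()
```

The payoff is the same as in the daemon: the dump happens inside the running process, at a moment the load-test driver chooses, without restarting anything.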

Thursday, October 16, 2008

Thanks, GoogleBot!

There are very few things I'll write in C these days. I just don't have a reason to use it; if I can write it in Python, I usually will. One of the exceptions is an Apache module that we use to dynamically manage virtual host data (as opposed to flat files). I pull all configurations out of LDAP. I'm able to get some sick scale using a dynamic approach, and I never have to restart my server.

The code has been in existence for about 7 years now in various forms. I'm not the original author, though I've probably replaced about half of it as our needs have changed.

At the core, there is a shared memory cache that Apache children attach to. As data is pulled out of LDAP, it's jammed into that cache. I'm also storing negative entries in order to protect against DoS situations. Data is expired after a configurable time frame. The expiration is handled by a separate daemon.
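The real cache is C code in shared memory, but the caching scheme itself is easy to sketch in Python. This is a toy stand-in, not the module's actual data structure: positive entries hold the LDAP result, negative entries cache a known miss (so repeated lookups for a bogus host don't hammer LDAP), and an expiry pass like the separate daemon's sweeps out anything older than the TTL. The class and method names are invented for illustration.

```python
import time


class TTLCache(object):
    """Expiring lookup cache; a stored None marks a negative entry."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._entries = {}  # key -> (value, inserted_at)

    def put(self, key, value):
        self._entries[key] = (value, time.time())

    def put_negative(self, key):
        # Cache the miss so repeated lookups skip LDAP entirely.
        self._entries[key] = (None, time.time())

    def get(self, key):
        hit = self._entries.get(key)
        if hit is None:
            raise KeyError(key)  # never looked up: caller must go to LDAP
        return hit[0]            # may be None: a known-missing domain

    def expire(self, now=None):
        # The separate expiry daemon would run a pass like this one.
        now = time.time() if now is None else now
        stale = [k for k, (_, ts) in self._entries.items()
                 if now - ts >= self.ttl]
        for key in stale:
            del self._entries[key]


cache = TTLCache(ttl=300.0)
cache.put('domain.com', {'docroot': '/var/www/domain.com'})
cache.put_negative('no-such-host.example')
```

The important distinction is between "not in the cache" (go ask LDAP) and "cached as missing" (answer immediately); conflating the two is what opens the door to a DoS.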

So, about a week ago, we started having issues with nodes locking up. The expiry daemon was sitting at around 100% CPU and Apache would not answer requests. An strace on the expiry process showed no system calls.

These are the fun ones. Probably stuck in a while loop due to a buffer overrun or some such problem.

Well, I stuck some debug code into the expiration process and saw the following:


expire: domain.com
expire: domain2.com
expire: big long string with lots of spaces in it.. more than 128 bytes long ending in arealdomain.com???????


The question marks are the terminal's way of saying "I don't know how
to render that!" It turns out this uncovered a bug: domains over 127
characters were not NULL-terminated when added to the negative cache.
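The C code isn't shown here, but the failure mode is a classic strncpy-style one and is easy to reproduce. A Python sketch of the behavior, assuming a fixed 128-byte cache slot (the buffer size and helper name are my invention for illustration):

```python
BUF_LEN = 128


def copy_into_slot(domain):
    """Mimic a C strncpy into a fixed 128-byte cache slot.

    Like strncpy, this copies at most BUF_LEN bytes and does NOT
    write a NUL terminator when the source fills the whole buffer;
    short strings get NUL-padded to BUF_LEN.
    """
    buf = bytearray(BUF_LEN)  # zeroed, like strncpy's NUL padding
    data = domain.encode('ascii')[:BUF_LEN]
    buf[:len(data)] = data
    return bytes(buf)


short_slot = copy_into_slot('domain.com')
long_slot = copy_into_slot('x' * 200 + 'arealdomain.com')
```

Any C code that later walks the long slot looking for a NUL runs straight off the end of the buffer into whatever memory comes next, which is exactly what the question marks in the debug output were showing.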

In digging further, I checked my access logs. It turns out it was GoogleBot sending that big long string of junk as a 'Host:' header. Each time GoogleBot would hit a specific site on my platform, it would pass that in. It's amazing to me that we've not had this problem before and that GoogleBot was the first agent to trigger it...

Of course, it could always be a fraudulent user agent as I forgot to check the IP ownership before I ditched the logs...