I've moved my blog to jmcneil.net. This is no longer being updated!

Thursday, October 16, 2008

Thanks, GoogleBot!

There are very few things I'll write in C these days. I just don't have a reason to use it. If I can write it in Python, I usually will unless I have a good reason not to. One of the exceptions is an Apache module that we use to dynamically manage virtual host data (as opposed to flat files). I pull all configurations out of LDAP. I'm able to get some sick scale using a dynamic approach and I never have to restart my server.

The code has been in existence for about 7 years now in various forms. I'm not the original author, though I've probably replaced about half of it as our needs change.

At the core, there is a shared memory cache that Apache children attach to. As data is pulled out of LDAP, it's jammed into that cache. I'm also storing negative entries in order to prevent against DOS situations. Data is expired after a configurable time frame. The expiration is handled by a separate daemon.

So, about a week ago, we started having issues with nodes locking up. The expiry daemon was sitting at around 100% CPU and Apache would not answer requests. An strace on the expiry process showed no system calls.

These are the fun ones. Probably stuck in a while loop due to a buffer overrun or some such problem.

Well, I stuck some debug code into the expiration process and I see the following:

expire: domain.com
expire: domain2.com
expire: big long string with lots of spaces in it.. more than 128 bytes long ending in arealdomain.com???????

The question marks being the terminal chars for "I don't know how to render that!" Turns out that uncovered a bug in that domains over 127 chars were not NULL terminated when added to the negative cache.

In digging further, I checked my access logs. It turns out it was GoogleBot sending that big long string of junk as a 'Host:' header. Each time GoogleBot would hit a specific site on my platform, it would pass that in. It's amazing to me that we've not had this problem before and that GoogleBot was the first agent to trigger it...

Of course, it could always be a fraudulent user agent as I forgot to check the IP ownership before I ditched the logs...


Candace Groeninger said...

hahah at first read, i thought the "big long string" was really what you saw in your output and almost fell over laughing. Do you know how many times in my code something competely inappropiate has popped out b/c I forgot to ditch some trace variable I was using in an early version.

When i first got my job, there was an incident where one of my course's loaded, "what the crap" until the XML file populated the page. I threw up immediately after finding out like 80,000 student's were taking the course.

Ps- what you do for a living makes my head whirl. I do ActionScript and want to shoot myself in the face most days.

Jeff McNeil said...

I usually just '#ifdef TRACING' or the language specific equivalent.

On one of the systems I've built, I dynamically look up a bunch of site data in LDAP for each HTTP hit. The tracing routines print stack data to a file under '/tmp' as each call is made. Each hit results in a stack about 5 or 6 calls deep.

When I first pushed that code into production, it ran on five web servers, each handling about 200 hits per second. I had forgotten to rebuild without tracing.

That trace file filled up fast.