My Octopress Blog

Running Sudo With Pssh

The pssh tool is great. Just great. At SEOmoz we use a number of deployment schemes, but every so often I find myself needing to log into 50 machines and perform some simple one-off command. I’ve written such a line many times:

for i in {01..50}; do ssh -t ec2-user@some-host-$i 'sudo ...'; done

This is fine if the command is quick, but it really sucks if it’s something that takes more than just a few seconds. So in the absence of needing to use sudo (and thus the -t flag), pssh makes it easy to run these all in parallel:

pssh -i --host=some-host-{01..50} -l ec2-user 'hostname'

Coercing pssh to create a pseudo-tty to enable sudo commands was a little tricky, though:

# And now I can sudo!
pssh -x '-t -t' -i --host=some-host-{01..50} -l ec2-user 'sudo hostname'

Rough Array Hash Benchmark

I recently went on a mission to find (and perhaps build) a better dictionary. I’ve been looking at Dr. Askitis’ work on so-called HAT-tries, which are something akin to a burst trie. It all seems reasonable enough, and his experimental results seem very promising. In particular, I was looking for an open source version of this data structure that didn’t preclude commercial use (as Askitis’ version does).

HAT-tries rely on another of Askitis’ data structures, the array hash table. Essentially, it’s a hash table, but instead of using a linked list of nodes to hold the key/value pairs that land in a particular slot, each slot is a contiguous buffer of packed records: the number of items in the buffer, the length of each string key, the key itself, and its value. The idea is that this arrangement is much more conducive to caching (and to hardware prefetching, since each slot is a contiguous slab of memory). Contrast this with the more conventional approach, in which each slot is a linked list that must be traversed, chasing pointers across the heap, to find the key/value pair being retrieved.
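To make the layout concrete, here’s a toy Python sketch of the idea (my own illustration, not Askitis’ code): each slot is a single bytearray of packed (key length, key, value) records, so a lookup scans one contiguous buffer. It ignores resizing, deletion, and duplicate keys.

```python
import struct

class ArrayHash:
    """Toy array hash: each slot is one contiguous bytearray of packed
    (key length, key bytes, value) records instead of a linked list."""

    def __init__(self, slots=8):
        self.slots = [bytearray() for _ in range(slots)]

    def _slot(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        k = key.encode()
        # record: 2-byte key length, the key itself, an 8-byte int value
        self._slot(key).extend(struct.pack('<H', len(k)) + k +
                               struct.pack('<q', value))

    def get(self, key):
        buf, k, i = self._slot(key), key.encode(), 0
        while i < len(buf):            # linear scan of one contiguous slab
            (klen,) = struct.unpack_from('<H', buf, i)
            if bytes(buf[i + 2:i + 2 + klen]) == k:
                return struct.unpack_from('<q', buf, i + 2 + klen)[0]
            i += 2 + klen + 8
        raise KeyError(key)
```

The win is that a slot scan touches one cache-friendly slab rather than a chain of heap allocations.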

Taskmaster From DISQUS

I have been waiting for an occasion to use dcramer’s taskmaster, a queueing system meant for large, infrequently-invoked (even one-off) tasks. His original blog post brings up the feature that was particularly striking to me: you don’t put jobs into the queue per se, but instead describe a generator that yields all the jobs that should be put in the queue.

Occasionally at SEOmoz, we want to perform sanity checks on customer accounts, transition from one backend to another, and so on. In particular, we’ve been moving to a new queueing system, and we wanted to go through every customer and ensure that they had a recent crawl and, further, were definitely in the new system. Unfortunately, much of the data we have to check involves a lookup in Cassandra that can’t easily be turned into a bulk operation. Cassandra itself isn’t the problem so much as the latency between requests. So, we spawned off 20 or so workers with taskmaster, each given the details about the customer we needed to verify.

The serial version takes 4-5 hours. It took 15 minutes to get taskmaster installed and grokked, and then the task itself took an hour. Already a win!
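The pattern is easy to sketch without taskmaster itself (this is my own approximation of the idea, not taskmaster’s actual API): a generator describes the jobs, and a small pool of workers pulls them through a bounded queue, so the full job list is never materialized up front.

```python
import queue
import threading

def jobs():
    """Describe the work as a generator; nothing is enqueued up front."""
    for customer_id in range(20):   # stand-in for paging through customers
        yield customer_id

def worker(q, results, lock):
    while True:
        job = q.get()
        if job is None:
            break
        # stand-in for the slow per-customer Cassandra checks
        with lock:
            results.append(job * 2)

q = queue.Queue(maxsize=4)          # bounded, so the generator is pulled lazily
results = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for job in jobs():
    q.put(job)                      # blocks when the queue is full
for _ in threads:
    q.put(None)                     # one shutdown sentinel per worker
for t in threads:
    t.join()
```

Because the queue is bounded, memory stays flat no matter how many customers the generator would eventually yield.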

Redis and Lua for Robust, Portable Libraries

Redis 2.6 has support for server-side Lua scripting. Offhand, this may not seem like a big deal, but it offers some surprisingly powerful features. I’ll give a little background on why I’m interested in it in the first place, and then show why this unassuming feature is so extremely useful for otherwise-impossible atomic operations, as well as for easy language portability and performance.

For example, I’ve recently been working on a Redis-based queueing system (heavily inspired by Resque, but with some added twists), and a lot of the functionality I wanted to support would have been prohibitively difficult without Redis’ support for Lua. In particular, I want to make sure that jobs submitted to this queueing system don’t simply get dropped on the floor. A worker is given a temporary exclusive lock on a job and must either complete it or heartbeat it within a certain amount of time. If it does neither, the worker is presumed to have dropped the job, and the job can be given to a new worker.
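The lock-and-heartbeat semantics look roughly like this pure-Python model (my own sketch; the class, names, and TTL handling are illustrative). The reason Lua matters is that in Redis the whole check-and-renew would run as a single EVAL script, so no other client can interleave between the expiry check and the update:

```python
import time

class JobLock:
    """Model of the per-job lock: a worker holds it for `ttl` seconds and
    must heartbeat before it expires, or the job is up for grabs again."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires = 0.0

    def acquire(self, worker, now=None):
        now = time.time() if now is None else now
        if self.owner is None or now >= self.expires:
            # unowned or expired: the job can be handed to this worker
            self.owner, self.expires = worker, now + self.ttl
            return True
        return False

    def heartbeat(self, worker, now=None):
        now = time.time() if now is None else now
        if self.owner == worker and now < self.expires:
            # still the live owner: renew the lease
            self.expires = now + self.ttl
            return True
        return False
```

In pure Python this check-then-set is racy across processes; as a Lua script inside Redis, it’s atomic for free.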

The Cost of Except in Python

I was curious recently about how much of a performance penalty try/except blocks incur in Python. Specifically: 1) is there much of a cost when no exception is thrown (i.e., do you pay only when something exceptional happens)? and 2) how does it compare to if/else statements where possible? A snippet to answer the first question:

import timeit

withTryNoThrow = '''
	try:
		a = int('5')
	except ValueError:
		pass
'''

withTryThrow = '''
	try:
		a = int('z')
	except ValueError:
		pass
'''

withoutTry = '''
	a = int('5')
'''

results = {
	'withoutTry'    : timeit.Timer(withoutTry    ).timeit(100000),
	'withTryNoThrow': timeit.Timer(withTryNoThrow).timeit(100000),
	'withTryThrow'  : timeit.Timer(withTryThrow  ).timeit(100000)
}

for k, v in results.items():
	print '%20s => %fs' % (k, v)

For me, the results looked something like this:

  withTryNoThrow => 0.082781s
      withoutTry => 0.082880s
    withTryThrow => 0.261147s

It would appear that while catching exceptions is expensive, a try block that never throws is very cheap. I imagine the reason is mostly that throwing actually instantiates an exception object of some kind, which necessarily introduces overhead. In the absence of that object creation, things can be relatively fast.
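As a rough stab at the second question (my own follow-up snippet, not part of the original timing run), compare a try/except lookup against an explicit membership check for a key that is present; the exact numbers will vary by machine and Python version:

```python
import timeit

d = {'a': 1}

def with_try():
    # EAFP style: just index and catch the (never-raised) KeyError
    try:
        return d['a']
    except KeyError:
        return None

def with_if():
    # LBYL style: check membership before indexing
    if 'a' in d:
        return d['a']
    return None

t_try = timeit.timeit(with_try, number=100000)
t_if = timeit.timeit(with_if, number=100000)
```

The interesting comparison is how the gap changes when the key is frequently missing, since that makes the except branch (and its object creation) actually fire.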

Python’s Zlib and Gzip, Performance, and You

I have recently been working with streaming gzipped uploads to S3, and I’ve come to find that python’s built-in gzip library is quite a bit slower than the command-line utility gzip. Thus, I became curious as to where the performance problems were.

Gzip is actually just a file format, apparently most commonly used with zlib’s compression. It provides a file header and a footer and a little bit of metadata, but it really is merely a wrapper around zlib. However, while python’s zlib module is a compiled C extension, the gzip module is a pure python implementation that makes calls to zlib. In fact, the zlib source is included in the Python distribution (at least in 2.7.2) and the C extension is just a wrapper around those function calls.

Interestingly enough, the zlib extension is fast. Really fast. Competitively fast. Even comparable to the gzip command line utility, which, though it does the extra work of writing headers, becomes a nearly apples-to-apples comparison with a large enough input file.

import zlib
import time
import subprocess

with file('testing.in', 'w+') as f:
	test = 'hello how are you today?' * 10000000
	f.write(test)

# First, let's zlib
start = -time.time()
with file('testing.in.Z', 'w+') as outf:
	with file('testing.in') as inf:
		outf.write(zlib.compress(inf.read()))

start += time.time()
print 'zlib: %fs' % start

# Now the subprocess
start = -time.time()
r = subprocess.check_call(['gzip', '-f', 'testing.in'])
start += time.time()
print 'gzip: %fs' % start
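If you want gzip-format output at zlib-extension speed, zlib can emit the gzip container directly: passing wbits = 16 + MAX_WBITS to compressobj tells zlib to write the gzip header and trailer itself (a small sketch; the helper name and compression level are mine):

```python
import zlib

def gzip_bytes(data, level=6):
    # wbits = 16 + MAX_WBITS makes zlib wrap its output in the gzip
    # container (header + CRC32 trailer), entirely in the C extension
    co = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return co.compress(data) + co.flush()

blob = gzip_bytes(b'hello how are you today?' * 1000)
```

The same wbits trick works in reverse: zlib.decompress(blob, 16 + zlib.MAX_WBITS) reads gzip-wrapped data without touching the gzip module.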

Python’s Logging Module - Exceptions

I’m a big fan of python’s logging module. It supports configuration files, multiple handlers (for both writing to the screen while writing to a file, for example), output formatting like crazy, and many other delicious features. One that I’ve only recently encountered is its exception method.

The basic idea of the logging module is that you can get a logger from a factory (that allows multiple pieces of code to easily access the same logical logging entity). From there, you add handlers that output messages to various places (files, screen, sockets, HTTP endpoints, etc.). Every message you log is done at a specific level, and then the configuration of the logger determines whether or not to record messages of a certain severity:

import logging

# Get a logger instance
logger = logging.getLogger('testing')

# Some initialization of handlers here, 
# unimportant in this context

# Print out at various levels
logger.warn('Oops! Something happened')
logger.info('Did you know that X?')
logger.debug('Index is : %i' % ...)

What’s great about the module is that it separates your messages from how they’re displayed and where. For debugging, it’s nice to be able to flip a switch and turn on a more verbose mode. Or for production to tell it to shut up and only log messages that are really critical. What the ‘exception’ method does is to not only log a message as an error, but to also print a nice backtrace of where the error took place!

try:
	# Do something here
	raise Exception('oops!')
except:
	logger.exception('Such-and-such failed! Stack trace to follow:')
	# Stack trace appears in the log
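A quick way to see this in action (a self-contained sketch; the logger name and handler setup are just for the demonstration) is to log to an in-memory stream and inspect what was recorded:

```python
import io
import logging

# Log to an in-memory stream so we can inspect what got recorded
stream = io.StringIO()
logger = logging.getLogger('testing.demo')
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.DEBUG)
logger.propagate = False

try:
    raise ValueError('oops!')
except ValueError:
    # logs the message at ERROR level and appends the current traceback
    logger.exception('Such-and-such failed! Stack trace to follow:')

output = stream.getvalue()
```

The captured output contains both the message and the full traceback, exactly as it would land in your log file.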

Never Trust Callbacks

It’s a lesson that has now been hammered home repeatedly in my head: never trust callbacks. Just don’t. Go ahead and execute them, but if you trust them to not throw exceptions or errors, you are in for a world of unhappiness.

For me, I first learned this lesson when making use of twisted, while writing some convenience classes to smooth over its somewhat odd class structure. (Sidebar: twisted is an extremely powerful framework, but its naming schemes are not what they could be.) Twisted makes heavy use of a deferred model in which callbacks are executed in separate threads, while mission-critical operations run in the main thread. My convenience classes exposed further callbacks that could be overridden in subclasses, but I made the critical mistake of not executing that code inside a try/except block.

Twisted has learned this lesson. In fact, their deferred model makes it very hard to throw a real exception. If your callbacks fail, execution takes a different path – calling errback functions. In fact, twisted is so pessimistic about callbacks (rightly so) that you just can’t make enough exceptions to break out of errback functions. However, wrapped in my convenience classes were pieces of code that were mission critical, and my not catching exceptions in the callbacks I provided was causing me a world of hurt.

That whole experience was enough to make me learn my lesson. Then, a few days ago, I encountered it again in a different library, in a different language, in a different project, where I was exposing callbacks for user-interface code in JavaScript. The logical/functional chunk of code exposed events the UI would be interested in, but it was too trusting, and errors in callbacks ended up skipping over critical parts of the code.

All in all, when exposing callbacks, never trust a callback to not throw an exception. Even if you wrote the callbacks it’s executing (as was the case with both of these instances, at least in the beginning). Callbacks are a courtesy – a chance for code to be notified of an event, but like many courtesies, they can be abused.
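The defensive pattern that came out of both experiences is simple (sketched here in Python, though the JavaScript case is the same shape): run every callback inside try/except, record the failures, and never let them touch the critical path.

```python
def fire(callbacks, event):
    """Invoke each callback with `event`, catching anything it throws so
    the caller's critical path always runs to completion."""
    errors = []
    for cb in callbacks:
        try:
            cb(event)
        except Exception as exc:
            errors.append((cb, exc))   # record the failure; don't propagate
    return errors
```

One misbehaving callback no longer prevents the remaining callbacks, or the code after them, from running.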