
Python and Arbitrary Function Arguments - **kwargs

Python has a pretty useful feature: named arguments. When you call a function, you can explicitly say which value you're providing for a particular argument, and you can even pass them in any order:

def hello(first, last):
	print 'Hello %s %s' % (first, last)

hello(last='Lecocq', first='Dan')

In fact, you can programmatically gain insight into functions with the inspect module. But suppose you want to be able to accept an arbitrary number of parameters: for a printf equivalent, say, or (where I encountered it) when reading a module name from a configuration file along with the arguments to instantiate it. In that case, you'd get the module and class as a string, and then a dictionary of the arguments with which to make an instance of it. Of course, Python always has a way. In this case, **kwargs.
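
As a quick aside, here's what that introspection looks like: a minimal sketch (not from the original post) that reuses the hello function above, using Python 2's inspect.getargspec (newer Pythons would use inspect.signature instead):

import inspect

# Ask the interpreter which arguments hello (defined above) accepts
print(inspect.getargspec(hello).args)
# ['first', 'last']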

The **kwargs syntax is actually dictionary unpacking, taking all the keys in a dictionary and mapping them to argument names. For example, with the hello function above, I could say:

hello(**{'last':'Lecocq', 'first':'Dan'})

Of course, in that case it’s a little verbose, but if you’re getting a dictionary of arguments programmatically, then it’s invaluable. But wait, there’s more! Not only can you use the **dict operator to map a dictionary into parameters, but you can accept arbitrary parameters with it, too!

def kw(**kwargs):
	for key, value in kwargs.items():
		print '%s => %s' % (key, value)

kw(**{'hello':'Howdy!', 'first':'Dan'})
kw(hello='Howdy!', first='Dan')

Magic! No matter how you invoke the function, it has access to the parameters. You can even split the difference, making some parameters named and some parameters variable. For example, if you wanted to create an instance of a class that you passed a name in for, initialized with the arguments you give it:

def factory(module, cls, **kwargs):
	# The built-in __import__ does just what it sounds like
	m = __import__(module)
	# Now get the class in that module
	c = getattr(m, cls)
	# Now make an instance of it, given the args
	return c(**kwargs)

factory('datetime', 'datetime', year=2011, month=11, day=8)
factory('decimal', 'Decimal', value=7)

This is one place where Python’s flexibility is extremely useful.

Kevin Mitnick’s Ghost in the Wires

I recently finished reading one of Kevin Mitnick’s books, Ghost in the Wires. Fantastic. I constantly found it amazing that someone had lived that life, hacking, evading capture, changing identities. Reads like an action movie at many points, and in fact, several movies have been made loosely (and one very loosely) based on his life. Mitnick often talks about how much the “myth of Mitnick” is inflated or distorted, especially in the media and particularly with the movies.

As it turns out, Mitnick lived briefly in Seattle, and with my interest piqued, I figured I might be able to track down his old apartment. He describes going home one day before realizing he was being followed, and in the course of the description he mentions a few street names and the part of town he lived in. And at the end of the book, there's a photo of the apartment, slightly too grainy to read the name of the building, but clear enough to read the number. A little time with Google Maps and I found it! Being so close, I figured I'd drop by to take a picture:

Named Pipes

Yesterday I encountered a concept I hadn't known about: named pipes. They're essentially a path that acts as a pipe for reading and writing. In that sense, you work with them much as you would with file redirection and traditional files, but the data doesn't get stored anywhere permanent. All data that goes through a named pipe is meant to be written once and read once, and it comes with the performance boost of not having to write large chunks to disk.

Pipes, for those who don't know, are the bee's knees. They're the cat's meow. They allow you to (as the name implies) make a pipeline between two or more programs, with the output of one feeding into the input of the next. Suppose, for example, that we want to find out how many files in a directory have names containing '.a'. There's a tool you might know, 'ls', that lists the files in a directory. 'grep' searches for lines of text that match a regular expression. And 'wc' counts the number of bytes, words, lines, etc. in its input.

Typically, each of these operates in isolation, reading from a file (in the case of grep and wc) or from standard input, and all of them write to standard output. A pipe is a way to hook up one process's standard output file descriptor to the standard input file descriptor of another, making one the producer of information and the other the consumer:

ls -l /path/to/some/directory | grep '.a' | wc -l

This is typical of the design of many command line utilities. Most come with an option to read from standard input (usually signaled by the absence of a filename, or by a single '-'), and most do exactly one task well. Each has one very specific purpose, but is generally happy to play along with others.

File redirection is another handy tool related to named pipes. It lets you either read the contents of a file as if it were standard input, or have a process write to a file as if it were standard output. Going back to the earlier example, if we wanted to store a list of all such files in a file of our own called 'list':

ls | grep '.a' > list

Easy as pie. Now... for named pipes. They're also called FIFOs for their first-in-first-out behavior. You can make one with 'mkfifo <filename>', and then feel free to read from it and write to it, perhaps in two different terminals:

# In one terminal:
mkfifo test
cat < test

# In another terminal:
echo 'hello' > test

In the first terminal, cat plugs along doing the one thing it knows how to do: printing what it reads out to standard out. Take a minute for what has just happened to sink in. You were able to have one process wait around until it had something to read... from a pipe. And in a completely different terminal, you had a different process communicate with the first one by opening a file. This is a mechanism that's commonly used for inter-process communication (IPC) for obvious reasons: everyone knows how to read from and write to a file, so it makes use of a known paradigm. But wait, it gets even better.

Suppose you want to aggregate some statistics about how many different types of requests your application serves, but you don't want to build that into the application itself. Or maybe it's an application that you know already writes to a log file. Of course, you could trawl the log file, but there are cases where you don't want the overhead of keeping huge files around, so you'd rather avoid it if possible. You have to be careful here (not all applications play nicely with named pipes, mostly because of the blocking behavior described below), but chances are you can dupe the application into simply logging to a named pipe: remove the log file, make a pipe at that same path, and your work is done. Just read from that pipe periodically to aggregate your statistics. This works particularly well with the python logging module.
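
As a sketch of that idea (not from the original post): suppose the application appends one request per line to /var/log/myapp.log; the path and line format here are purely hypothetical. Swapping the log file for a pipe and tallying lines might look like this:

import os
import collections

# Hypothetical path the application logs to, purely for illustration
LOG_PATH = '/var/log/myapp.log'

# Replace the log file with a named pipe at the same path
if os.path.exists(LOG_PATH):
    os.remove(LOG_PATH)
os.mkfifo(LOG_PATH)

counts = collections.Counter()
# open() blocks here until the application opens the pipe for writing
with open(LOG_PATH) as pipe:
    while True:
        line = pipe.readline()
        if not line:
            break  # the writer closed its end of the pipe
        parts = line.split()
        if parts:
            # Tally requests by the first word of each log line
            counts[parts[0]] += 1

print(counts)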

Reading from and writing to a named pipe can be a little more nuanced, however. Some things to bear in mind:

  • Opening a named pipe can block, so consider opening it non-blocking. It depends on your access pattern, of course, but if you're not sure whether another process has written to the pipe and you don't want that to trip up your reading, non-blocking is the way to go.
  • Named pipes have 'no size.' If you write to a pipe, data gets queued up for the other end to read, but even before that data gets read, stat(1) reports that the file has a size of 0 bytes. So you can't rely on a change in file size to know the pipe is ready for reading.
  • Instead, use select, poll, epoll, etc. to detect readability or writability on the pipe, as in the sketch below. If you're only interested in one file descriptor, you can go ahead and use select, but if you're listening to very many, one of the others is probably a better idea.
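
To make that last point concrete, here's a minimal sketch (not from the original post, and assuming the 'test' FIFO created above) that opens the pipe without blocking and waits for readability with select:

import os
import select

# Open the FIFO made earlier with `mkfifo test`; O_NONBLOCK keeps open()
# from waiting around for a writer to show up
fd = os.open('test', os.O_RDONLY | os.O_NONBLOCK)

# Wait up to 5 seconds for the pipe to have something for us
readable, _, _ = select.select([fd], [], [], 5.0)
if readable:
    data = os.read(fd, 4096)
    # An empty read means no writer currently has the pipe open
    print('Read: %r' % data)
else:
    print('Timed out waiting for data')

os.close(fd)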

System Stats in Python

Turns out, there's a pretty handy package called psutil that allows you to gain insight not only into the currently running process, but also into other processes, physical and virtual memory usage, and CPU usage. For example:

import psutil

psutil.phymem_usage().percent
# 31.2
psutil.virtmem_usage().percent
# 0.0

Pretty handy tool if you’re doing any sort of monitoring!
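
As an aside, later psutil releases replaced phymem_usage and virtmem_usage with virtual_memory() and swap_memory(); a rough sketch against a recent psutil, including a peek at the current process, might look like this:

import os
import psutil

# System-wide memory and CPU usage with the newer API
print(psutil.virtual_memory().percent)
print(psutil.swap_memory().percent)
print(psutil.cpu_percent(interval=1))

# Per-process insight, here for the current process
p = psutil.Process(os.getpid())
print(p.memory_percent())
print(p.cpu_percent(interval=1))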

Libcurl, Curl_multi, and Endless DNS Pain

Curl is an awesome tool. Awesome. For example, suppose we want to fetch a bunch of urls from code as quickly as possible. Enter curl_multi, which lets you manage several requests in flight at the same time. You periodically ask curl which handles have finished successfully and which are in error, and then return those handles to a pool so they can take on new requests.

Any code I use in this post will be in Python, using pycurl, since, well, that's what we're using. The main flow (based on the example provided by pycurl) has been:

import pycurl

# Declare a queue of urls to fetch
queue = [ url1, url2, ... ]
inFlight = 0
# Make a multi handle
multi = pycurl.CurlMulti()
multi.handles = []
# Allocate a bunch of individual handles
for i in range(poolSize):
    c = pycurl.Curl()
    # ...
    multi.handles.append(c)
# Pool of our free handles
pool = multi.handles[:]
# Now, go through our requests:
while len(queue) or inFlight:
    c = pool.pop()
    # Make your request
    inFlight += 1
    while True:
        ret, numHandles = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    while True:
        numQ, ok, err = multi.info_read()
        # ok is a list of completed requests
        for c in ok:
            # Get your data out
            inFlight -= 1
        # err is a list of errored requests
        for c in err:
            # Maybe report an error
            inFlight -= 1
        if numQ == 0:
            break

This is great: we get to make a bunch of requests and they come back quickly. But... there are some problems. When you start using thousands of handles, libcurl itself won't miss a beat, but depending on how it was built, it may not have support for asynchronous DNS requests and may in fact use blocking DNS requests. At this throughput, that means a lot of contention. Libcurl can do asynchronous DNS, but it needs to be compiled against c-ares. Curl provides a quick way of telling you what it has support for:

$> curl --version
curl 7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IPv6 Largefile NTLM SSL libz

If you don't see AsynchDNS listed among those features, DNS unpleasantness can ensue, and you'll want to add support. Unfortunately, building and installing the c-ares library isn't the only concern. You may have gotten curl with your OS or through your package manager, but if you find you need this support, you'll have to build libcurl from source. First, install some generic dependencies you'll probably want to include:

# On Ubuntu, it's pretty simple:
sudo apt-get install zlib1g-dev libssh2-dev libssl-dev libc-ares2 libc-ares-dev libidn11-dev libldap2-dev
# On Mac, you can 1) build from source, 2) use port
sudo port install libidn openssl libssh2 zlib c-ares

Then, you'll just have to download, configure, compile and install libcurl:

# First, get the source:
curl -O http://curl.haxx.se/download/curl-7.21.7.tar.gz
tar xf curl-7.21.7.tar.gz
cd curl-7.21.7
# Now, build everything. This assumes that you installed the dependencies in /usr
./configure --with-{ssl,libssh2,zlib,libidn}=/usr --enable-{ares,ldap}
make
sudo make install

And now, we have asynchronous DNS in libcurl! It's really not a feature that's mission critical for most applications of curl; I have invoked curl hundreds of times, and used libcurl on occasion, and hadn't found it to be an issue until this scale. There are some other issues that a former SEOmoz'er wrote about earlier this year, which we intend to address. For example, we may find that the curl-internal multi perform is based heavily on select, in which case we'll have to adapt our code to use epoll or kqueue. In all likelihood, we'd rather use libevent, which has Python bindings and helps abstract away the differences between BSD's kqueue and epoll on Linux.

Yes(1)

Yes, yes(1) comes standard on Mac and Linux (at least OS X Lion and Ubuntu 11.04). And, as you might guess, it repeatedly prints a string of your choice ('y' by default), followed by a newline, to stdout. Its sole purpose in life is to automate agreeing to prompts. I encountered it recently in a script that was automating RAID array deployment on EC2 ephemeral disks:

# mdadm doesn't let you automate by default, so pipe in 'y'!
yes | mdadm ...

Keeping Build Notes

I initially put off upgrading to Snow Leopard until almost a year after its release because I was worried about rebuilding my development environment. It's amazing how many packages one accumulates over time without thinking about it, and when you have deadlines to meet it can be disastrous to risk your current working setup.

But rebuilding your development environment comes up more often than just when upgrading your OS: when you migrate to that new computer you got (or the one work gave you), or when you help someone else get up and running with a project you're thinking about releasing. Admittedly, it took me a little while to learn this lesson, but it's finally drilled into my head: keep build notes!

A couple weeks ago I was trying to install an internal package whose docs hadn't been updated in a very long time. After struggling and hitting countless snags, I finally got it up and running, when I got an email along the lines of, "Oh, if you could write down what problems you ran into, that would be great." Fortunately, I had made notes of what I'd done in order to get it built, and I was able to whip off a reply with a speed that surprised the recipient.

Even at a system-wide level, I try to make it a habit to record every package I install/build associated with development. It makes it extremely easy to get set up on the next system, even if the instructions have to be updated for a new environment. I call it a manifest and I manage it as a flat file, though I know there are package managers that can do a lot of heavy lifting for me. However, I find that no package manager is perfect and so even if I make use of one for certain libraries, it’s important to me to have everything documented in one place. At a minimum (and you probably don’t need more than this) keep the following:

  1. Package name and version - Maybe you needed readline 6.1 to get your project running, or you know that such-and-such version is buggy for your purposes.
  2. Why you installed it - I find that many libraries I install are used for a particular project, and so it’s useful to have the motivation for getting it.
  3. How you installed it - Whether it was macports or a typical configure, make and install, how did you build it? Did you need special flags to make it go? You will absolutely forget these, so why not write them down? Even just copy and paste from your history!

I can’t stress enough how much easier this has made my development life in a lot of ways, and how little a time investment it is.

Conditional Compilation

Last week I had the (dis)pleasure of porting some code to Mac, and today it came time to merge with the original codebase. As helpful as it was to use macros for different code paths, we needed something in the makefile to optionally add flags when compiling on Mac.

// This is all well and good
#ifndef __APPLE__
    // Do your Linux-y includes here
#else
    // Do your Apple-y includes here
#endif

Apparently, there are a couple conventions for doing this. First, you can inject a configuration step (à la autoconf, for example) which would detect what platform you’re building on in a robust way and build a Makefile for you. Second, if you’re lazy or autoconf would be like hitting a fly with a hammer, you can use make’s conditionals:

# Ensure that this gets declared in time,
# and fill it with the result of `uname`
UNAME := $(shell uname)

# If the environment is Darwin...
ifeq ($(UNAME), Darwin)
    CXXFLAGS = # Something Apple-y
else
    CXXFLAGS = # Something Linux-y
endif

Simple enough!

Type-Conversion Operators’ Unintuitive Behavior

A feature I only recently learned about is type-conversion operators. For any class, if you want to support conversion to some other type, you can do so by merely declaring (and of course defining) operators of the form operator type():

class Widget {
    ...
    operator bool();
    operator thing();
    operator Foo();
    ...
};

While this is fine and dandy (and admittedly attractive in some ways), there is a big problem that SEOmoz co-worker Brandon pointed out: there's no way to determine which code path will be taken.

For a little bit of context, I came across a set of type-conversion operators that seemed reasonable enough. They tried to cover the whole gamut of possible primitive types:

operator unsigned long long() const;
operator long long() const;
operator unsigned long() const { return operator unsigned long long(); }
operator long() const { return operator long long(); }
...

The compiler has absolutely no problem with the above declaration. The class you put that in will happily compile, but the problem arises when you try to use it:

Widget w(...);
// Suddenly, the compiler complains, not knowing which operator to use
unsigned long int foo = w;

At this point, the compiler puts its foot down. What seems unintuitive to me is that even though there is a conversion operator to this exact type, the compiler won't use it. What's even more bizarre to me is that typedefs and in-header definitions can further muddle things up:

operator long long() const;
operator long() const;
operator int() const;
operator short() const;
// For whatever reason, let's say you do this:
operator int32_t() const {
    return operator long long();
}

Even though int32_t will be the same as one of those other types, this will still compile. It makes a certain amount of sense when viewed from the compiler's perspective: it only does so much processing on headers, since they're going to be directly included wherever you use them. You actually don't get duplicate symbols in this case, and thus no "previously-defined" error. In reality, the function definitions are the same, and they actually get mangled to the same name (on my system the operators for int32_t and int both mangle to '_ZNK6WidgetcviEv'):

# See what mangled symbols actually appear
nm -j widget.o
# See what demangled symbols are actually there
nm -j widget.o | sed s/__/_/ | grep -v .eh | c++filt -n

The above (with in-header definitions) is exactly what we encountered in the code. We (well, a co-worker) suspected that this sort of multiple definition was allowed because the names were getting mangled based on the typedef's name string (mangled on int32_t instead of the actual type it maps to), but this is not the case. If you move the in-header definition of the int32_t operator into the .cpp file, the compiler will complain much earlier.

My first inclination when dealing with the "conversion to type long long is ambiguous" error was to ask for an explicit conversion: static_cast<long long int>(myWidget). However, this doesn't work either, so you can't even ask for a specific type-conversion operator. From what I can gather, type-conversion operators are a double-edged lightsaber: few things in C++ were added without a purpose, but it's extremely important to understand that exact purpose and its risks. To require that type conversions be explicit, you should generally use something like:

template <class T>
const T convert() const {
    ...
}

template <>
const bool convert<bool>() const {
    // Your conversion to bool
    ...
}