My Octopress Blog


Things I Wish Software Engineering Pounded Into My Head

Until recently, I bore the burden of a dirty little secret: I didn’t use unit testing. I don’t know if this has been other students’ experience, but my software engineering course could have been a lot better. Perhaps it’s a topic that requires students to determine their own level of involvement, but in particular, I am irked that these topics were not developed as well as they could have been:

  1. Version control - svn, git, something! Anything, really! Just use something!
  2. Unit testing - sure we talked about it, but its utility isn’t clear until you use it in earnest
  3. Documentation - always felt more like a chore than a tool
  4. Issue tracking - Joel on Software said it best: “I was born with only two bug-storing-slots in my brain.”
  5. More best practices - Effective C++ (it’s worth it – get it and read it) alone was worth more than the course.

I concede that these are all things whose utility might be hard to convey to college students who may or may not be working on projects they really believe in. Still, they’re very important.

One of the things I like about documentation is that you can refer to it when you’re uncertain about the side effects, inputs, etc. of a function and don’t really feel like reading through the code and trying to keep it all in your brain-RAM. I usually refer to my own documentation to get an idea of the guarantees a function was offering, and to make sure it still makes those same guarantees after I make changes. Beyond that, articulating functionality crystallizes its implications in your mind. When describing how something works to a friend, it often reminds you of issues you hadn’t considered; writing documentation is the same in this respect.

When working with other people, it’s nice to be able to talk face-to-face, but the issues we decide need changing go in one ear and out the other for me. Not that they weren’t good ideas, but my brain-RAM is not infinite: bugs, features, ideas and issues all get paged out of my head. Write them down instead, which is where issue tracking comes in. Design questions and decisions come out of issues all the time, and the trail of comments and discussion makes a nice reference for documentation.

I find that when I’m fixing a bug, the functionality of that chunk of code gets reduced to that particular feature in my mind. So if, when I try using it, that bug is fixed, then my job is done. But we’ve all experienced unintended side effects, where in fixing one thing we break another. Having a list of features to check is a step up from trying to remember everything that needs to work, but it’s much less powerful than unit testing. I’ve recently been using UnitTest++ for C++ applications and QUnit from the jQuery people, and have been anywhere from pleased to thrilled. Unit tests let you check functionality exhaustively and quickly on a whim, and even the task of writing the tests forces you to define the functional requirements of your code, which I find helps to clarify a lot.

Version control lets you try crazy new approaches and track changes - isn’t that great? I was helping a friend last semester with some of his code, and he was trying to make changes with copy, paste, and the undo buffer. In the end, he introduced more bugs into the code than he fixed. It’s intractable to keep these kinds of changes in your brain-RAM. Use something, use anything. My qualm with svn is that I used it so rarely I could never remember how to set it up, and even a barrier that small was enough to keep me from using version control at all. I love git because it’s so easy to set up that I use it for almost everything: from a report (if using LaTeX), to a weekend project, to my thesis. There are tons of tools out there - just use something.

Python Lesson Learned

I have finally learned a lesson after running into this fact repeatedly: no matter what you are trying to do, python has a module for it. I’m reminded of two xkcd comics in particular.

Sum of Dice

I recently had an interview in which I was asked, “how many ways can 1000 dice make the sum 3000?” I was caught a little off-guard, and the phrasing made me wonder about a closed-form solution which, after some reading, I don’t believe exists. Still, a workable solution isn’t as easy to come by as it might seem.

Let F(n, k) denote the number of ways n dice can make the sum k. Note that the first die can be 1, 2, 3, 4, 5 or 6, so F(1000, 3000) = F(999, 2999) + F(999, 2998) + … + F(999, 2994). While this recurrence relation leads to an easy implementation, it suffers from two big problems: performance and maximum recursion depth. Without any memoization or optimization, this would lead to 6^1000 function calls, which is about 10^780. That’s a big number. If you were able to evaluate one such function call every clock cycle on every computer in the world (about 7 × 10^18 cycles/second), you would need about 10^750 years, or 10^650 lifetimes of the universe (about 10^100 years) to solve it. Get ready to wait.
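To make the recurrence concrete, here is a minimal sketch of that direct recursion in Python (my own illustration, not code from the interview, and not something you would actually run for 1000 dice):

def count_ways_naive(n, k):
    # Number of ways n six-sided dice can sum to k, by brute-force recursion.
    if n == 1:
        return 1 if 1 <= k <= 6 else 0
    return sum(count_ways_naive(n - 1, k - face) for face in range(1, 7))

# count_ways_naive(1000, 3000) would make on the order of 6^1000 calls.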

One optimization is to check if the arguments n and k make sense at each level, and prune accordingly. For example, n dice can only sum to between n and 6n, so if k is outside of that range, you don’t need to branch down any farther. Even using this and instantaneous branch evaluation, your code still wouldn’t complete in the expected lifetime of the universe.

Obviously, this is still a solvable problem (there are about 10^757 ways 1000 dice can sum to 3000). A python implementation figured that out in 3.5 seconds. One way to help crack this egg is to use memoization. Consider the calls to F(999, 2999) and F(999, 2998). Expand these out to get:
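F(999, 2999) = F(998, 2998) + F(998, 2997) + F(998, 2996) + F(998, 2995) + F(998, 2994) + F(998, 2993)
F(999, 2998) = F(998, 2997) + F(998, 2996) + F(998, 2995) + F(998, 2994) + F(998, 2993) + F(998, 2992)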

You’ll notice that both of these expansions call F(998, k) for k = 2997, 2996, 2995, 2994 and 2993, each of which is a lot of work to evaluate. So, once you’ve determined each of these values, you can reuse them in other places and save huge amounts of time.

One thing you’d need in order to implement memoization is a map (or dictionary in python) that stores key-value pairs. So once you figure out F(998, 2997), you could store it like myMap[(998, 2997)] = …; but this adds some complication. Every time you call F, you have to check whether you’ve already evaluated the function for those arguments, and once you’ve figured out and stored a lot of these values, each lookup takes some time. It will still work and will get you the answer you need, but probably not as fast as you’d like, and not as fast as it could be. Using memoization for F(700, 2100) took 8.2 seconds in a python implementation, while a better approach took 1.5 seconds; Python’s maximum recursion depth prevented it from handling bigger numbers.
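As a rough sketch (using the same F as above, with hypothetical names, and not the exact implementation that produced those timings), the memoized version with the range pruning from the previous section might look like:

def count_ways_memo(n, k, memo=None):
    # Number of ways n six-sided dice can sum to k, with pruning and memoization.
    if memo is None:
        memo = {}
    if k < n or k > 6 * n:   # prune: n dice can only sum to between n and 6n
        return 0
    if n == 1:
        return 1
    if (n, k) not in memo:
        memo[(n, k)] = sum(count_ways_memo(n - 1, k - face, memo) for face in range(1, 7))
    return memo[(n, k)]

It is still recursive, so Python’s recursion limit caps how many dice it can handle, which is exactly the problem mentioned above.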

Algorithmically, the best you’ll be able to do (as far as I can tell) is O(n^2), which is not bad in the scheme of things, and certainly not the combinatorial time we saw first. Enter dynamic programming. In particular, we’ll use a bottom-up approach. We’ll build two arrays, each associated with a number of dice. The i‘th index in an array is the number of ways n dice can sum up to (n + i). So for 1 die, myArray[0] is the number of ways you can roll a 1, and for 4 dice, myArray[10] would be the number of ways you can roll 4 dice to get 14.

As I said, we’ll keep track of two arrays, say previous and current. First, initialize previous to be the number of ways 1 die can be rolled: {1, 1, 1, 1, 1, 1}, because 1 die can show each number in only one way. Then, we’ll use that to fill the array current with the number of ways 2 dice can be rolled, based on the array previous. For example, current[0] = previous[0] = 1, because the only way to get a sum of n with n dice is to roll all 1’s. Then, current[1] = previous[0] + previous[1], because if we roll a 2 for the second die, there are previous[0] ways to sum to 1 with the remaining die, and if we roll a 1 for the second die, there are previous[1] ways to sum to 2 with the remaining die.

Each entry current[i] should be the sum of previous[i-5, i-4, i-3, i-2, i-1, i], taking the bounds of the previous array into consideration (this can actually be generalized to any m-sided dice by using previous[i-m+1, …, i]). Once you’ve filled the array current, then you assign current to previous, and move on to the next number of dice, 3. Continue this until you have filled the array for the desired number of dice. If performance is a big issue, this can be implemented as a moving frame. For example, since current[i] uses previous[i-5, …, i] and current[i+1] uses previous[i-4, …, i+1], you can alternatively write current[i+1] = current[i] - previous[i-5] + previous[i+1], and use only two add/subtract operations instead of 5.

A bonus to this bottom-up approach is that once finished, you actually have an array with the number of ways you can make any sum with the n dice. This makes it easy if you wish to find the number of ways n dice can form a sum up to, or greater than, a given number.
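Putting it together, here is a minimal sketch of the bottom-up approach (my own illustration with hypothetical names; the variable names follow the discussion above, and the moving-frame optimization is omitted for clarity):

def ways_per_sum(n, sides=6):
    # Returns a list where index i holds the number of ways n dice can sum to n + i.
    previous = [1] * sides                    # 1 die: one way to show each face
    for dice in range(2, n + 1):
        size = (sides - 1) * dice + 1         # possible sums run from dice to sides * dice
        current = [0] * size
        for i in range(size):
            # current[i] is the sum of previous[i - sides + 1 .. i], clipped to the array bounds
            lo = max(0, i - sides + 1)
            hi = min(i, len(previous) - 1)
            current[i] = sum(previous[j] for j in range(lo, hi + 1))
        previous = current
    return previous

# The interview question: 1000 dice summing to 3000 is index 3000 - 1000 = 2000.
# print(ways_per_sum(1000)[2000])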

Numeric_limits

I’m working on a C++ class that encapsulates changes to a “steerable” variable. It’s part of a system that allows trusted commands from a socket library to change the value of registered variables. For example, a client might:

int sInt(0);
float sFloat(0.3);
registerParam("someSteerableInteger", &sInt);
registerParam("someSteerableFloat", &sFloat);
...
// Library makes these calls after receiving remote commands:
setParam("someSteerableInteger", 5);
setParam("someSteerableFloat", 3.4);

There’s some intelligence in the class to ensure the integrity of the data. For example, it will cast the value passed into setParam(...) correctly, so that a call to setParam("someSteerableInteger", 5.3) results in 5.3 being cast to an integer and then used to set the integer’s value.

Some of the other intelligence needed is that the user might specify the minimum and maximum valid values the variable might take on. Or he or she might not. If not, it would be nice to have a convenient way of specifying a reasonable default minimum and maximum.

Enter numeric_limits. It’s a templated class with static members that can give you some insight into the built-in types. For example:

#include <iostream>
#include <limits>
using namespace std;
...
cout << "Min unsigned int: " << numeric_limits<unsigned int>::min() << endl;
cout << "Max float: " << numeric_limits<float>::max() << endl;

Efficient Anagrams With Python

I love anagrams. I used to play an anagram game so often that I started to have trouble reading, as I would just automatically try to find anagrams of words. One way I’m able to stop playing an addictive game altogether is to write code that will solve the problem at hand for me. For example, at one point I had to do just that with Sudoku so I could reclaim my time.

There are a few thoughts that I’ve seen on finding anagrams with python, including some pretty concise examples. They center around finding all the permutations of a provided set of letters, and then checking these against known words. While this can be extremely concisely done with python’s itertools, it quickly becomes prohibitively slow for longer strings. That’s because the number of permutations it must calculate for a word of length ‘n’ is n!, much too large for reasonable n’s.

To clarify, in this discussion, by ‘anagram’ we will mean any word made from the letters, of any length from 1 to n. So, for the word “foobar,” “fob” or “boo” would be valid.
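For reference, the brute-force approach described above boils down to something like this (my own sketch, not any particular example I’ve seen; dictionary is assumed to be a set of valid words):

from itertools import permutations

def anagrams_brute(letters, dictionary):
    # Try every permutation of every length -- factorial time, unusable for long strings.
    found = set()
    for size in range(1, len(letters) + 1):
        for perm in permutations(letters, size):
            word = ''.join(perm)
            if word in dictionary:
                found.add(word)
    return found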

There is some good news here: pruning. With some intelligence and time up front, we can skip all permutations that begin with “fb,” as no word in our dictionary begins with those letters. We use a tree, with each node representing a string. A node marked as ‘valid’ indicates that its associated string is a word; this string is obtained by the traversal from the root to the node.

[Figure: Anagram tree structure]

So, if we have the letters “uueeq,” we have a set of permutations beginning with “u,” a set beginning with “e” and a set beginning with “q.” We’ll begin with the root node and, for ease, we’ll examine the permutations beginning with “q.” Of the remaining letters, “uuee,” the node for “q” only has a child for “u,” so we can prune all permutations beginning with “qe.” We get words by keeping track of the string we have as we traverse the tree - each time we descend to a child node, we append that character, and when we return up a level, we remove it.

As we traverse in this way, we check all intermediate nodes to see if they are themselves words. For example, in finding “batting,” we should also find that the word “bat” is valid. Or similarly, in finding “quotable,” we should find “quota.”
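To make the traversal concrete, here is a minimal sketch of the idea using nested dictionaries (my own illustration with hypothetical names, not the gramana module mentioned below):

from collections import Counter

def build_trie(words):
    # Nested dicts; the special key 'valid' marks nodes whose path spells a word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['valid'] = True
    return root

def find_anagrams(node, letters, prefix='', found=None):
    # Walk the tree, only descending to children we still have letters for.
    if found is None:
        found = set()
    if node.get('valid'):
        found.add(prefix)
    for ch, child in node.items():
        if ch != 'valid' and letters[ch] > 0:
            letters[ch] -= 1                      # use the letter on the way down
            find_anagrams(child, letters, prefix + ch, found)
            letters[ch] += 1                      # put it back on the way up
    return found

# find_anagrams(build_trie(['bat', 'tab', 'bar']), Counter('foobar')) -> {'bar'}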

If you don’t feel like working out the details yourself, I have some sample code for you. To use:

from gramana import gramana
# Read a list of valid words
words = file('dictionary.txt').read().lower().split('\n')
# Parse it into a tree
root = gramana(words)
# Look for words:
anagrams = root.search("testingsomeverylongstring")

Wrapping Printf(1)

Working on an application that had become a little… verbose, I decided it was finally time to wrap my prints in a function that could easily be switched on or off depending on whether or not I wanted it to be verbose. One approach I had seen before (from my OS professor) that has a certain amount of merit is to wrap statements with a macro:

#ifdef DEBUG
printf(....);
#endif

The nice thing about this approach is that debugging can be turned on or off easily at compile time. However, my experience has been that it leads to a lot of typing, and seeing too many macros in the middle of code makes my brain explode in a fiery rage. So, I figured that if I wrapped my prints in another function (I called mine ‘log’ and ‘error’), I could avoid this whole mess and keep my sanity. I’ve done this with a number of other projects in other languages, but I had to learn some magic to do it in C.

Lesson Learned #1 : Variable arguments. It turns out you can define functions that take a variable number of arguments with va_list (from <stdarg.h>). You define such a function:

void log(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    ...
    va_end(args);
}

Lesson Learned #2 : However, from what I can gather, you can’t just hand these arguments to printf directly. Anticipating cases like this, the standard library provides a set of functions: vfprintf, vprintf, vsnprintf and vsprintf. The ‘v’ stands for va_list (variable-argument list), and you can use them just as you’d expect:

void log(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    fprintf(log_fd, "NOTE : ");
    vfprintf(log_fd, fmt, args);
    va_end(args);
}

The thing I like about this approach is that you have control over how log messages get printed in one place. So, for example, if I provided another function, setLogFD, then I could easily just set the file descriptor where all log messages get printed. So easy! Something I’ve used this for in other instances (especially servers) is to also inject additional information like a timestamp on every message. So, when I call:

log("Some event '%s' just happened.\n", event_name);

Then I automatically get “NOTE : “ and maybe a timestamp prefixed on that message, which makes the code look clean and adds a great deal of functionality. I actually added another function error(…) that prints to a different file descriptor in case I want to suppress debug messages, but not error messages. For additional layers of debugging, you might do something like this:

void log(int level, const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    FILE* fd = my_log_files[level];
    fprintf(fd, "NOTE : ");
    vfprintf(fd, fmt, args);
    va_end(args);
}

This way, at startup, you could easily set some of the file descriptors in my_log_files to stderr and some to point to /dev/null or otherwise dissolve into the ether.

Rename(2)

I found myself needing to systematically rename a bunch of files, and previously the only way I knew to do it was with a loop, passing each filename into sed, and using the output as the destination filename. Tedious and error prone!

However, it turns out that people have anticipated this use case, and there is a utility for this exact purpose, called “rename.” It allows you to provide a perl-style regular expression whose input is the existing filename found by your search expression, and whose output is what that file should be renamed. There’s even an especially useful option to take no action, but only to print out what the new filenames would be, allowing you to perform a dry-run before you rename 100000 files.

Example:

rename -n 's/^/some.file.prefix./' *.jpg

Happy renaming!

SpokePOV

When moving to Saudi Arabia, it is important to bring hobbies. Some have video games. Others learn to dive or windsurf. With this in mind, on my most recent return I bought a SpokePOV kit from the nice folks at adafruit industries.

The short story is that it works by attaching microcontroller-controlled strips of LEDs to the spokes of your bike wheel. These then monitor the rate at which the wheel is moving (they sense a magnet placed on the fork) and, based on that, quickly turn the various LEDs on and off to form a pattern. If the wheel spins fast enough, your eyes perceive it as a static image. A long exposure does the same trick:

[Photo: a long exposure of the SpokePOV in action]

It took maybe 7 or 8 hours of soldering (it was a slow start initially), and worked right off the bat. That is something I was particularly pleased with, as in my chosen field, code very rarely works correctly the first time.

Unfortunately over the summer my soldering iron disappeared from my apartment, and so I went looking for a new one. Our local supermarket, Tamimi (the Saudi Safeway franchise) sells soldering irons, but does not sell solder. Finding suitable solder was the mission of three days of biking back and forth between campus and downtown Thuwal.

Falling to the Earth

Earlier this summer, I agreed to go skydiving with my friend Chris. Yesterday, we finally took the jump.

Mile-Hi Skydiving is only about 20 minutes from where my parents live, and accepts walk-ins. We showed up, signed away our lives and, after a wait (after all, we didn’t make reservations), we were introduced to the instructors we’d be strapped to. I can’t speak for Chris, but the guy helping me out, Dave, was disconcertingly quiet while getting set up. He handed me a suit and helped me put on a harness that loops around each leg and the torso, and finally straps over the shoulders.

After a short wait, a truck with a flatbed trailer drove up, and about 12 people hopped on. There were two other people in our group making a tandem dive, as well as a few others making solo jumps (including one guy wearing a squirrel suit). We drove up to a plane whose roaring propeller made it difficult to hear the guy next to you, and were packed onto two benches facing each other. At this point the reality of the situation really began to hit me, compounded by the fact that the guy I would tumble to the ground with hadn’t yet told me anything about what would be happening!

We were cleared for takeoff and began to climb. A few more experienced guys at the back held the door open for a while during our ascent, and it was rather unsettling to see an open door on a plane like that. Eventually Dave told me to sit on his lap and he began hooking my harness to him and his parachute and tightening straps. “When we go out, don’t jump – I’m going to push you. Hold on here, and tuck your legs back and I’ll let you know when you can let go.” At this point, the solo jumpers were beginning to hop out. Dave handed me goggles, we moved towards the door, and before I realized it, we were tumbling out of the plane.

Instantly the speed of the wind rushing past registered, thick as soup. Trying to concentrate, I remembered to tuck my legs back, and Dave locked legs with mine to position us correctly, turning us every so often to admire the view. It really was akin to zooming in on Google Earth, except for the attenuation of light over the distance, and the freezing air flapping violently at you. From Longmont’s Vance Brand Airport, we could see Boulder and Longmont laid out in front of us. I had expected that during freefall I would feel the same feeling in my stomach as when an airplane hits turbulence or an elevator jolts - the feeling of gravity giving out, and the instinct to grab onto something. But after the first few seconds (during which we were tumbling) there was nothing of the sort.

After about a minute, I felt Dave reach for something and figured he was deploying the parachute. For a split second I wondered what to anticipate, and then I felt a strong tug upwards and suddenly we were upright. He adjusted my straps for this position, which felt a bit like being unharnessed. I settled into a sitting position in the harness and was enjoying the view when Dave began, “let’s talk about the landing.”

It was another couple of minutes before we finally touched down, and it was all a bit too much to take in during one jump. On the flight up, I was certain that I would never try this a second time, but now, I have a feeling it won’t be the last time I jump out of a perfectly good airplane.

I highly recommend the experience to anyone and everyone. Especially as some have recently been suggesting that experiences are more enjoyed than possessions.

MPI_Datatype

It has been a while since I’ve had to work with MPI, but recently I had to learn a new trick with it. MPI provides a number of ways to convey data between processes, from broadcasts to scatters to all-gathers. But obviously you have to provide a certain amount of information about the structure of the data, not the least of which is the datatype.

MPI defines enumerated constants for the basic data types in C: MPI_CHAR, MPI_INT, etc., and for the most part these will suffice. But what if you want to scatter your own struct? This can be done through a number of utility functions, but the most versatile seems to be MPI_Type_struct().

You let MPI know how many different blocks, or chunks of memory, there are, along with their lengths, their offsets and their types, and then what variable to store the resulting datatype handle in. So if we had a struct:

typedef struct {
    int var;
    char string[STRING_LENGTH];
    double foo;
} bar;

We would first indicate that there are three blocks, of lengths 1, STRING_LENGTH and 1:

int count = 3;
int lengths[3] = {1, STRING_LENGTH, 1};

The offsets indicate the byte offsets of each of the members from the base address of the structure. For this example, the “var” variable is first, and thus has offset 0. Then “string” will have an offset of sizeof(int), and “foo” will appear after the int and the string of length STRING_LENGTH. (In practice the compiler may insert padding between members, so computing these offsets with offsetof() from <stddef.h> is the safer route.)

MPI_Aint offsets[3] = {0, sizeof(int), sizeof(int) + STRING_LENGTH};
MPI_Datatype types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE};

To finish up, we ask MPI to fill an MPI_Datatype variable we declare with the handle that will hereafter refer to this struct for the purposes of MPI. Then, we commit it, and it’s ready for use!

/* Corrected by commenter `tinku`: int -> MPI_Datatype */
MPI_Datatype barDatatype;
MPI_Type_struct(count, lengths, offsets, types, &barDatatype);
MPI_Type_commit(&barDatatype);

Now if you have an array of bar structs, you can use scatterv with it and your new datatype, or bcast for that matter. It’s business as usual, as if it were any of the included base datatypes in MPI.