I had the pleasure of serving as a committee member for this year's PLDI Artifact Evaluation Process. After reading Lindsey Kuper's post from the author's point of view, I thought I'd say a little about my perspective from the other side. I had a lot of fun doing this, and it's exciting to think about the implications for our field as artifact evaluation becomes a more common thing at conferences.

The biggest challenge for me was deciding what the expectations for a high-quality artifact should be. Artifact evaluation is still quite new, and many of us have seen few, if any, artifacts from other papers, so we don't yet have a clear set of expectations as a community. With papers, on the other hand, things are clearer. We've read our share of good and bad papers, and as reviewers we generally know what we expect: the paper should make claims about a clearly motivated problem and provide sufficient evidence that its solution works. We also expect papers to be well written, although we are pretty forgiving about this as long as grammatical issues do not seriously hamper understanding.

In the context of artifacts, I was interested first of all in whether I could run the software and verify that it does what the paper claims. Second, I wanted to see whether I could use the code as a starting point if I were to build further research on the paper's claims.

The second point, especially, challenges the common practice for research code. Research code is notorious for being duct-taped and hacked together at the last minute, running just barely well enough to generate the graphs in the paper. In one of my early projects as a grad student, our system involved S-expressions embedded in C++ code, with a Python script to extract them, run them through a compiler written in Scheme, and then stitch the generated C++ code back into the original program. As far as I know, the only programs that ran were the benchmarks in our paper. Even worse, there wasn't a single SVN revision that worked. You'd need some files from one revision, and other files from another revision, depending on which compiler features you wanted to enable. This is exactly the kind of horror that drove Matt Might to create the CRAPL.

This is also a challenge for artifact evaluation. On the one hand, research is about doing whatever is necessary to prove a concept. We should not expect research code to be Version 1.0 production quality. On the other hand, research is about repeatability and being able to build on previous results. In the project I just mentioned, three days after the paper deadline I had forgotten the magic combination of SVN revisions that made the code work. If I couldn't remember how to run my own software, how could I expect anyone else to? This is one of the reasons that we started building an extensive test suite at the very beginning and tried to make the zero-to-tests-running process as simple as possible.
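
As a rough illustration of what "zero-to-tests-running" can look like, here is a minimal sketch of a single entry-point script. The tests/ directory layout and the use of Python's unittest discovery are assumptions made for the example, not a description of that project.

    #!/usr/bin/env python3
    """Run the whole test suite from a fresh checkout with one command."""

    import sys
    import unittest

    def main():
        # Discover every test under tests/ so that "python run_tests.py"
        # is all a new contributor (or reviewer) needs to know.
        suite = unittest.defaultTestLoader.discover("tests")
        result = unittest.TextTestRunner(verbosity=2).run(suite)
        sys.exit(0 if result.wasSuccessful() else 1)

    if __name__ == "__main__":
        main()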

One issue that Lindsey mentioned in her post was that her paper (and several others) relied on some commercial software that they could not redistribute. This creates more tension as a reviewer, because while researchers should be allowed and even encouraged to use whatever tools help them get better results faster, the fact is that this also makes it harder for me to evaluate their work. Fortunately, many of us in the research community are part of institutions that hold licenses to a variety of software, so at least one of the reviewers likely has the necessary software. In spite of these challenges, I don't want us to get to a place where there is a de facto requirement that all research code must be built only with GPL software.

I'll close with a list of things that I found really helpful in the artifacts I saw.

  • Provide a virtual machine. Most projects will have a good number of dependencies. The easiest way for me to get the configuration right is to give me a virtual machine image. A really nice VM will boot up to a screen that makes it obvious what to do next, like a terminal with a welcome message telling me what program to run to repeat the experiment. This makes it easy for me to verify that software exists that does what the paper claims.
  • Put READMEs in every directory. Each directory should have a README that says what I should be looking for there and what I can find in each of the subdirectories. If you release your code through GitHub, these READMEs are rendered nicely in its web interface.
  • Give me a way to install the software on my own machine. While the VM is great for letting me make sure the software actually runs, I'm not going to want to do all my development in a VM if I choose to build off your work. GitHub repos are a great way to do this. Be sure to include a list of dependencies in the README, particularly when specific versions are needed.
  • Include a script to generate the results in the paper. Running each of the experiments and looking at the results can be a pretty tedious process. If you can completely automate this, my job (and your own, as an author) gets a lot easier: I just start the script, and when it's done I can see whether the graphs I ended up with look roughly the same as the ones in the paper. A minimal sketch of such a script appears after this list.
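
To make the last point concrete, here is a minimal sketch of what such a driver script might look like. Everything in it is hypothetical: the benchmark names, the ./run_benchmark binary, and the output file names are placeholders for whatever a particular artifact actually provides.

    #!/usr/bin/env python3
    """Rerun the experiments and regenerate the paper's figure in one shot."""

    import csv
    import subprocess

    import matplotlib.pyplot as plt

    # Hypothetical benchmarks reported in the paper.
    BENCHMARKS = ["fib", "quicksort", "matmul"]

    def run_benchmark(name):
        """Run one benchmark and return its reported time in seconds."""
        # Assumes a ./run_benchmark binary that prints a single number.
        out = subprocess.run(["./run_benchmark", name],
                             capture_output=True, text=True, check=True)
        return float(out.stdout.strip())

    def main():
        results = [(name, run_benchmark(name)) for name in BENCHMARKS]

        # Save the raw numbers so they can be compared against the paper's table.
        with open("results.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["benchmark", "seconds"])
            writer.writerows(results)

        # Regenerate the figure that appears in the paper.
        names, times = zip(*results)
        plt.bar(names, times)
        plt.ylabel("Running time (s)")
        plt.savefig("figure1.png")
        print("Wrote results.csv and figure1.png")

    if __name__ == "__main__":
        main()

If the script finishes cleanly and figure1.png looks roughly like the figure in the paper, the evaluator's job is mostly done.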

With the exception of creating a virtual machine, these suggestions not only make the artifact evaluator's job easier, they also make conducting research easier. READMEs and an easy way to install the software will help bring new people on board the project. Being able to automatically regenerate all the graphs in a paper makes it easy to quickly see the impact of changes to your software. It may seem like a lot of time at first, but I'm convinced that doing this early will more than make up for it in the long run.