Eric Holk

You wouldn't like me when I'm not coding.

Why Write Compilers in Scheme?

One of the questions Klint Finley asked me for the Wired article about Harlan was “Why Scheme?” I wasn’t really satisfied with my answer, so I thought I’d answer it more completely here. Besides the fact that we were already experienced writing compilers in Scheme, Scheme has a lot of features that are very useful to compiler writers.

We were already familiar with Scheme

Indiana University has for years taught its introductory computer science courses in Scheme. The more advanced programming language concepts and compilers courses are also taught in Scheme. This means we were starting with a lot of experience in writing compilers and exploring programming languages with Scheme. In fact, Will Byrd did his dissertation on a logic programming language embedded in Scheme. We were starting off as Scheme experts.

Joel Spolsky advises not to “start a new project without at least one architect with several years of solid experience in the language.” In our case, we knew how to implement languages in Scheme, and were proficient with some of the powerful tools available to Schemers, like the match macro. We considered other tools, like ROSE or Wave, but we found that we were losing too much productivity in the time it took to learn them. Early on, I even tried modifying Clang. While I found the Clang code to be very well designed and easy to modify, it was also clear that they had built a production compiler, which made it less suitable for the kind of experimentation we wanted to do. For example, the Clang code was set up so that failing to implement proper error messages led to compiler errors. Discipline like this is one of the reasons Clang is such a fantastic compiler, but we had different goals.

Scheme naturally supports trees

Lisp, the ancestor of Scheme, is an abbreviation for “list processing.” The natural data structure in Lisp and Scheme is the list, and trees are just lists of lists. Compilers spend a lot of their time manipulating abstract syntax trees, so using a language that’s built to manipulate trees makes a lot of sense.
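For instance, the expression `(+ 1 (* 2 3))` is already an abstract syntax tree, and ordinary list operations are tree operations. Here is a small illustration, a hypothetical constant folder over binary `+` and `*` expressions (not code from Harlan):

```scheme
;; An arithmetic expression is just nested lists:
(define ast '(+ 1 (* 2 3)))

;; Walking the tree is ordinary list processing.
;; Assumes binary (+ e e) and (* e e) forms only.
(define (fold-constants e)
  (if (pair? e)
      (let ((e (map fold-constants e)))
        (if (and (memq (car e) '(+ *))
                 (number? (cadr e))
                 (number? (caddr e)))
            ((if (eq? (car e) '+) + *) (cadr e) (caddr e))
            e))
      e))

(fold-constants ast)  ; => 7
```

No parser, no visitor classes, no AST node definitions; the list structure is the tree.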

Experimentation is easy

We built Harlan using the nanopass philosophy. The compiler is structured as many passes, each responsible for making one fairly small change to the language. This has a lot of benefits. It’s been easy to refactor or rewrite portions of the compiler as necessary. Most passes can be written or rewritten in an afternoon. Each pass is easier to debug because it’s only concerned with a small portion of the language. We can insert optimization steps at the right level of abstraction. For example, kernel fusion happens relatively early because it is a high-level, structural optimization. On the other hand, it would make sense to put more traditional compiler optimizations, such as common subexpression elimination, a bit later, when the language is lower level.
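To give a flavor of what a pass looks like, here is a hypothetical one (not an actual Harlan pass) that does exactly one small thing: rewrite `let*` into nested `let`s. It is written in the style of the match macro mentioned above, with unquoted pattern variables:

```scheme
;; One tiny nanopass-style pass: eliminate let* from the language.
;; A sketch; it assumes quoted data doesn't appear in the input.
(define (remove-let* e)
  (match e
    [(let* () ,body) (remove-let* body)]
    [(let* ((,x ,rhs) ,rest ...) ,body)
     `(let ((,x ,(remove-let* rhs)))
        ,(remove-let* `(let* ,rest ,body)))]
    [(,e* ...) (map remove-let* e*)]
    [,x x]))

;; (remove-let* '(let* ((a 1) (b (+ a 1))) (+ a b)))
;; => (let ((a 1)) (let ((b (+ a 1))) (+ a b)))
```

After this pass, no later pass ever needs to think about `let*` again.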

DSLs in Scheme are the norm

Macros are a favorite feature of any serious Schemer. Basically, they let you build the perfect language for whatever program you happen to be writing. We realized early on that with our many compiler passes, we could run into problems where language forms might leak further through the compiler than they’re meant to, or a poorly written pass might generate outright invalid code. To combat this, Claire wrote an awesome grammar verifier, which checks to make sure that the output of each pass conforms to some grammar. At times it was tedious to have to go through and update the grammars when the language changed, but overall I’m convinced that this has saved us time by forcing us to keep the code quality high.
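The idea behind the verifier can be sketched as a recursive predicate that accepts exactly the terms of one intermediate language, run on the output of each pass. This toy checker is not Claire’s actual code, and the grammar here is invented for illustration:

```scheme
;; A toy grammar checker in the same spirit as the real verifier.
;; Expr ::= integer | symbol | (lambda (symbol) Expr) | (Expr Expr)
(define (expr? e)
  (cond
    [(integer? e) #t]
    [(symbol? e) #t]
    [(and (list? e) (= (length e) 3)
          (eq? (car e) 'lambda)
          (list? (cadr e)) (= (length (cadr e)) 1)
          (symbol? (car (cadr e))))
     (expr? (caddr e))]
    [(and (list? e) (= (length e) 2))
     (and (expr? (car e)) (expr? (cadr e)))]
    [else #f]))

;; Run after a pass to catch forms that leaked through:
;; (assert (expr? (my-pass input)))
```

A form that should have been eliminated three passes ago fails the check immediately, instead of surfacing as a mysterious error much later.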

Andy Keep has pushed this idea to the extreme with his Nanopass Framework, which we’ve started using in the newer Harlan passes. With the Nanopass Framework, you can define a language, and use this to guide the development of compiler passes. Each pass transforms one language into another, and subsequent languages are specified in terms of deltas from the previous language. The Nanopass Framework makes sure that all language terms are well formed. Language terms are represented as records, which are more efficient than plain old lists. The coolest part of the Nanopass Framework, however, is that it can automatically generate many of the transformers. This lets you focus on the portions of the language that change, rather than duplicating language forms that remain the same.
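Roughly, language definitions in the framework look like the sketch below. The languages L0 and L1 are hypothetical, not Harlan’s, and the syntax here is my approximation of the framework’s, so consult its documentation for the real details:

```scheme
;; A base language: terminals plus one nonterminal.
(define-language L0
  (terminals
   (symbol (x))
   (integer (n)))
  (Expr (e)
    x
    n
    (lambda (x) e)
    (e0 e1)))

;; L1 is specified as a delta from L0: plain lambdas are removed,
;; and lambdas annotated with their free variables are added.
(define-language L1
  (extends L0)
  (Expr (e)
    (- (lambda (x) e))
    (+ (lambda (x) (free (x* ...)) e))))
```

Because only the `lambda` form changed, a pass from L0 to L1 only has to handle `lambda`; the framework can generate the boilerplate traversal for every other form.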

You can save parsing for later

Parsing isn’t a terribly interesting problem. One of the great consequences of Scheme’s homoiconicity is that you basically have a universal grammar built in. Rather than writing a parser, we just read in S-expressions from a file and go to town. We originally envisioned a C-like syntax for Harlan, but decided we wanted to focus on the implementation and do parsing later.
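The “parser” then amounts to a loop around Scheme’s built-in `read` procedure. A minimal sketch (Harlan’s actual reader may differ):

```scheme
;; Read a whole program as a list of top-level S-expressions.
(define (read-program port)
  (let loop ((decls '()))
    (let ((decl (read port)))
      (if (eof-object? decl)
          (reverse decls)
          (loop (cons decl decls))))))

;; ("program.scm" is a placeholder filename.)
;; (call-with-input-file "program.scm" read-program)
```

That’s the entire front end: `read` handles tokenizing, nesting, and literals, and hands back the tree directly.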

As I understand it, this is basically how Lisp’s syntax came about. The implementors didn’t want to spend a lot of time on parsing at first, so they decided to write programs in the actual abstract syntax tree. Over time, the syntax stuck. A similar thing happened with Harlan. We intended to design a more traditional syntax for the language, but the fact that we had S-expressions influenced the design of the language. For example, we realized many of the forms we wanted in Harlan could be implemented as source level transformations, and so we added a macro system to enable this.
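Scheme itself shows how far source-level transformations can go. As an illustration (this is plain Scheme, not Harlan’s macro system), a hypothetical `for` form can be defined entirely as a macro that expands into a named `let`, with no compiler support at all:

```scheme
;; A counting loop defined purely by expansion into a named let.
(define-syntax for
  (syntax-rules ()
    ((_ (i start end) body ...)
     (let loop ((i start))
       (when (< i end)
         body ...
         (loop (+ i 1)))))))

;; (for (i 0 3) (display i))  ; prints 012
```

Forms defined this way never have to exist in the compiler’s intermediate languages; they are gone before the first pass runs.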
