call/cc
code, culture, and a life considered

Hython: The simplest possible language

Think back to the first piece of code you wrote. It was probably some form of “Hello World.” Do you remember the joy and fascination you felt? The rush of commanding what felt like magic? The sudden realization of potentiality?

Let’s revisit that time by implementing a simple language that supports Hello World. Over time, it will grow to be a Python interpreter. We will slowly implement larger and larger subsets of the Python syntax and semantics. Along the way, we’ll learn about parsing, interpreters, and Haskell. Note that we will only scratch the surface of compilers! In particular, we won’t worry about optimizing our interpreter for speed.

Ideally, you’ll have a bit of exposure to:

  • Haskell
  • Functional programming
  • Compiler design

However, I realize most people will not. I’ll try to provide as much guidance as I can, but please don’t be afraid to thrash around a bit and play. Beginner’s mind will serve you extremely well, especially when learning Haskell. And don’t be afraid to ask on StackOverflow or Freenode’s #nothaskell channel. Don’t give up! If it helps, here is a similar project I implemented in Ruby to practice this project.

This can be a hard project at times, but it is extremely satisfying. You will learn very quickly when you throw yourself in.

Setup

  1. Follow Chris Allen’s guide to install Haskell on your platform

  2. Use cabal to install parsec:

    $ cabal install parsec
    
  3. Grab the source code, and create a branch pointing at the first commit:

    $ git clone https://github.com/mattgreen/hython.git
    $ cd hython
    $ git checkout -b first 79e34386f5a31ebd6ff7968c645eb485e75e508f
    
  4. Compile the source code:

    $ make
    

You can also follow along by viewing the specific commit.

Theory Crash Course

Compilers are tasked with a complex job: that of understanding, optimizing, emitting, and possibly executing code written by humans. To manage this complexity, they’re typically broken into discrete stages. Each stage is responsible for only one facet of the compilation process, such as parsing or optimization. It is helpful to think of these stages arranged in a pipeline:

# Pseudocode!
$ lex code.c | parse | interpret

Stages typically accept input in a specific format, and may output a different format. This promotes loose coupling between various stages, each of which has it’s own necessary complexity to worry about. We’ll see how the stages connect together shortly.

For Hython, we’ll focus on two stages right now:

  1. Parsing: understanding the code written by the user
  2. Interpreter: executing the code written by the user

Note the separation between understanding and executing. What would this enable?

Specifying our Language

Our language, right now, is composed of print statements. Let’s model that in Haskell:

data Statement = Print String
    deriving(Show)

This line introduces a new type, namely a Statement. It also defines the lone type of statement we support: a Print statement, which contains a String argument containing the message to be printed.

Parsing

Parsing can be an icky, icky process. But Haskell’s Parsec library makes it much nicer! We’ll lean on that to parse Hython’s grammar. The goal of the parser is to transform correct syntax into an Abstract Syntax Tree (AST), which represents a user’s program. In Hython, the AST is represented by the Statement type we defined previously.

So how do we parse a print statement, anyway?

statement = do
    string "print(\""
    content <- many (noneOf "\"")
    string "\")"
    spaces
    return $ Print content

We define a print statement’s syntax as:

  1. The string print("
  2. 1-n of any character, except the closing double-quote. This string is “captured” as the variable content
  3. The string ")
  4. And optional trailing whitespace (such as a newline)

If all four of these rules match in this order, then we create a new Print statement, passing the content along. We use a few of Parsec’s provided functions here: string, many, noneOf, and spaces. The return type of statement is Statement.

But since we’re defining a language, we should really support multiple statements. Parsec makes this really easy:

statements = many statement

We define a statement as many statements, passing the statement function we just defined to the many combinator. The result type of statements is [Statement], as you’d expect.

Execution

This is where the magic of an interpreter happens. The execution stage receives the AST created by the parser. It walks the AST, executing each Statement it encounters along the way. Ponder that for a second: we get to define what these Statements do!

Now, Haskell is so good at this sort of thing. Observe:

eval :: Statement -> IO ()
eval (Print m) = print m

First, we define the type of the function eval: a function taking a single Statement as an argument, and returning nothing, indicated by (). (We’ll come back to IO shortly).

Next, we define what eval does when it is passed a Print statement: we print it to the screen! We’re using Haskell’s pattern-matching capabilities to define the behavior of eval when passed a Print argument. This works because Print is a type of Statement, which is the type of the first argument to eval.

Next, we assign a name to the Print statement’s message argument, and use that name in the call to print. In effect, we’re unpacking the data stored in the AST. If we forget this argument, then the program will not compile.

As for IO, Haskell requires us to annotate functions that have side effects with the IO tag. It serves as a way to alert the reader: “hey, I might affect the outside world!” without them having to remember what all code paths do. Now, printing something to the screen is certainly a side effect, wouldn’t you say? Thus, we mark eval with IO.

Integrating Stages

We’re almost done! We have the two stages defined. Now we need to put them together.

We’ll do that in main:

main = do
    [filename] <- getArgs
    code <- readFile filename

    case parse statements filename code of
        Left e  -> print e
        Right r -> mapM_ eval r

main is a bit of a junk drawer, but that’s OK: it’s the central integration point, for now. We start by getting the first argument passed on the command line, and reading that file in. Next, we start the parsing process by calling Parsec’s parse function, passing our statements function, along with the filename, and the code we just read in.

parse returns an Either type, which is typically used to denote an operation that might fail. We pattern match on the result of the parse call using case...of syntax, which grants access to the actual value stored inside. A Left is typically used to indicate failure, in this case, it contains a string containing an error message, which we assign the name e to, and print to the screen.

A Right indicates success (mnemonic: right/correct), and it carries a value with the return type of the statements function, which is [Statement], which we give the name r. eval, however, expects a Statement, not a [Statement]. What do we do?

What we want to do is run eval on each Statement inside of r. In an imperative language, we’d use a loop here. Haskell provides us the mapM function to accomplish the same thing. mapM applies the function passed as the first argument to each item of the list. This gives us the effect we want! Haskell provides the mapM_ variant which discards the result of the operation, which is suitable for our use.

There’s another discussion about map vs mapM that could be had here, but I’m not sure I can do it justice, nor do I feel it adds to the discussion. (I’ve intentionally avoided using the m-word throughout.) The short of it is: we use mapM here because main’s type is:

main :: IO ()

If main was pure (e.g. it didn’t have IO), we would use map.

Testing

After building, invoke Hython with a test script:

$ ./hython print.py
"Hello"
"This is a test"

Lo and behold, we have defined a language with our bare hands.

Exercises

  1. Get the code compiling
  2. Write some more tests. Based on what you know about the syntax, try introducing some syntax errors into the test script and seeing what happens.
  3. Hython puts quotes around strings, but Python doesn’t. Fix this oversight. (Hint: use something besides print)
  4. Add support for Python’s pass statement:

    1. Add another type of Statement using |
    2. Rename the statement parser into printStatement, write a passStatement parser, create a new statement parser that uses Parsec’s <|> operator to choose between your parsers.
    3. Update eval to handle pass statements. What makes this difficult?