Think back to the first piece of code you wrote. It was probably some form of “Hello World.” Do you remember the joy and fascination you felt? The rush of commanding what felt like magic? The sudden realization of potentiality?
Let’s revisit that time by implementing a simple language that supports Hello World. Over time, it will grow to be a Python interpreter. We will slowly implement larger and larger subsets of the Python syntax and semantics. Along the way, we’ll learn about parsing, interpreters, and Haskell. Note that we will only scratch the surface of compilers! In particular, we won’t worry about optimizing our interpreter for speed.
Ideally, you’ll have a bit of exposure to:
- Functional programming
- Compiler design
However, I realize most people will not. I’ll try to provide as much guidance as I can, but please don’t be afraid to thrash around a bit and play. Beginner’s mind will serve you extremely well, especially when learning Haskell. And don’t be afraid to ask on StackOverflow or Freenode’s
#nothaskell channel. Don’t give up! If it helps, here is a similar project I implemented in Ruby to practice this project.
This can be a hard project at times, but it is extremely satisfying. You will learn very quickly when you throw yourself in.
Follow Chris Allen’s guide to install Haskell on your platform
$ cabal install parsec
Grab the source code, and create a branch pointing at the first commit:
$ git clone https://github.com/mattgreen/hython.git $ cd hython $ git checkout -b first 79e34386f5a31ebd6ff7968c645eb485e75e508f
Compile the source code:
You can also follow along by viewing the specific commit.
Theory Crash Course
Compilers are tasked with a complex job: that of understanding, optimizing, emitting, and possibly executing code written by humans. To manage this complexity, they’re typically broken into discrete stages. Each stage is responsible for only one facet of the compilation process, such as parsing or optimization. It is helpful to think of these stages arranged in a pipeline:
# Pseudocode! $ lex code.c | parse | interpret
Stages typically accept input in a specific format, and may output a different format. This promotes loose coupling between various stages, each of which has it’s own necessary complexity to worry about. We’ll see how the stages connect together shortly.
For Hython, we’ll focus on two stages right now:
- Parsing: understanding the code written by the user
- Interpreter: executing the code written by the user
Note the separation between understanding and executing. What would this enable?
Specifying our Language
Our language, right now, is composed of print statements. Let’s model that in Haskell:
data Statement = Print String deriving(Show)
This line introduces a new type, namely a
Statement. It also defines the lone type of statement we support: a
Parsing can be an icky, icky process. But Haskell’s Parsec library makes it much nicer! We’ll lean on that to parse Hython’s grammar. The goal of the parser is to transform correct syntax into an Abstract Syntax Tree (AST), which represents a user’s program. In Hython, the AST is represented by the
Statement type we defined previously.
So how do we parse a print statement, anyway?
statement = do string "print(\"" content <- many (noneOf "\"") string "\")" spaces return $ Print content
We define a print statement’s syntax as:
- The string
- 1-n of any character, except the closing double-quote. This string is “captured” as the variable
- The string
- And optional trailing whitespace (such as a newline)
If all four of these rules match in this order, then we create a new
content along. We use a few of Parsec’s provided functions here:
spaces. The return type of
But since we’re defining a language, we should really support multiple statements. Parsec makes this really easy:
statements = many statement
We define a statement as
statements, passing the
statement function we just defined to the
many combinator. The result type of
[Statement], as you’d expect.
This is where the magic of an interpreter happens. The execution stage receives the AST created by the parser. It walks the AST, executing each
Statement it encounters along the way. Ponder that for a second: we get to define what these
Now, Haskell is so good at this sort of thing. Observe:
eval :: Statement -> IO () eval (Print m) = print m
First, we define the type of the function
eval: a function taking a single Statement as an argument, and returning nothing, indicated by
(). (We’ll come back to
Next, we define what
eval does when it is passed a
eval when passed a
Statement, which is the type of the first argument to
Next, we assign a name to the
IO, Haskell requires us to annotate functions that have side effects with the
IO tag. It serves as a way to alert the reader: “hey, I might affect the outside world!” without them having to remember what all code paths do. Now, printing something to the screen is certainly a side effect, wouldn’t you say? Thus, we mark
We’re almost done! We have the two stages defined. Now we need to put them together.
We’ll do that in
main = do [filename] <- getArgs code <- readFile filename case parse statements filename code of Left e -> print e Right r -> mapM_ eval r
main is a bit of a junk drawer, but that’s OK: it’s the central integration point, for now. We start by getting the first argument passed on the command line, and reading that file in. Next, we start the parsing process by calling Parsec’s
parse function, passing our
statements function, along with the filename, and the code we just read in.
parse returns an
Either type, which is typically used to denote an operation that might fail. We pattern match on the result of the
parse call using
case...of syntax, which grants access to the actual value stored inside. A
Left is typically used to indicate failure, in this case, it contains a string containing an error message, which we assign the name
e to, and print to the screen.
Right indicates success (mnemonic: right/correct), and it carries a value with the return type of the
statements function, which is
[Statement], which we give the name
eval, however, expects a
Statement, not a
[Statement]. What do we do?
What we want to do is run
eval on each
Statement inside of
r. In an imperative language, we’d use a loop here. Haskell provides us the
mapM function to accomplish the same thing.
mapM applies the function passed as the first argument to each item of the list. This gives us the effect we want! Haskell provides the
mapM_ variant which discards the result of the operation, which is suitable for our use.
There’s another discussion about
mapM that could be had here, but I’m not sure I can do it justice, nor do I feel it adds to the discussion. (I’ve intentionally avoided using the m-word throughout.) The short of it is: we use
mapM here because
main’s type is:
main :: IO ()
main was pure (e.g. it didn’t have
IO), we would use
After building, invoke Hython with a test script:
$ ./hython print.py "Hello" "This is a test"
Lo and behold, we have defined a language with our bare hands.
- Get the code compiling
- Write some more tests. Based on what you know about the syntax, try introducing some syntax errors into the test script and seeing what happens.
- Hython puts quotes around strings, but Python doesn’t. Fix this oversight. (Hint: use something besides
Add support for Python’s
- Add another type of
- Rename the
printStatement, write a
passStatementparser, create a new
statementparser that uses Parsec’s
<|>operator to choose between your parsers.
evalto handle pass statements. What makes this difficult?
- Add another type of