Hython: The simplest possible language
Think back to the first piece of code you wrote. It was probably some form of “Hello World.” Do you remember the joy and fascination you felt? The rush of commanding what felt like magic? The sudden realization of potentiality?
Let’s revisit that time by implementing a simple language that supports Hello World. Over time, it will grow to be a Python interpreter. We will slowly implement larger and larger subsets of the Python syntax and semantics. Along the way, we’ll learn about parsing, interpreters, and Haskell. Note that we will only scratch the surface of compilers! In particular, we won’t worry about optimizing our interpreter for speed.
Ideally, you’ll have a bit of exposure to:
- Haskell
- Functional programming
- Compiler design
However, I realize most people will not. I’ll try to provide as much guidance as I can, but please don’t be afraid to thrash around a bit and play. Beginner’s mind will serve you extremely well, especially when learning Haskell. And don’t be afraid to ask on StackOverflow or Freenode’s #nothaskell
channel. Don’t give up! If it helps, here is a similar project I implemented in Ruby to practice this project.
This can be a hard project at times, but it is extremely satisfying. You will learn very quickly when you throw yourself in.
Setup
Follow Chris Allen’s guide to install Haskell on your platform
Use
cabal
to installparsec
:$ cabal install parsec
Grab the source code, and create a branch pointing at the first commit:
$ git clone https://github.com/mattgreen/hython.git $ cd hython $ git checkout -b first 79e34386f5a31ebd6ff7968c645eb485e75e508f
Compile the source code:
$ make
You can also follow along by viewing the specific commit.
Theory Crash Course
Compilers are tasked with a complex job: that of understanding, optimizing, emitting, and possibly executing code written by humans. To manage this complexity, they’re typically broken into discrete stages. Each stage is responsible for only one facet of the compilation process, such as parsing or optimization. It is helpful to think of these stages arranged in a pipeline:
# Pseudocode!
$ lex code.c | parse | interpret
Stages typically accept input in a specific format, and may output a different format. This promotes loose coupling between various stages, each of which has it’s own necessary complexity to worry about. We’ll see how the stages connect together shortly.
For Hython, we’ll focus on two stages right now:
- Parsing: understanding the code written by the user
- Interpreter: executing the code written by the user
Note the separation between understanding and executing. What would this enable?
Specifying our Language
Our language, right now, is composed of print statements. Let’s model that in Haskell:
data Statement = Print String
deriving(Show)
This line introduces a new type, namely a Statement
. It also defines the lone type of statement we support: a Print
statement, which contains a String argument containing the message to be printed.
Parsing
Parsing can be an icky, icky process. But Haskell’s Parsec library makes it much nicer! We’ll lean on that to parse Hython’s grammar. The goal of the parser is to transform correct syntax into an Abstract Syntax Tree (AST), which represents a user’s program. In Hython, the AST is represented by the Statement
type we defined previously.
So how do we parse a print statement, anyway?
statement = do
string "print(\""
content <- many (noneOf "\"")
string "\")"
spaces
return $ Print content
We define a print statement’s syntax as:
- The string
print("
- 1-n of any character, except the closing double-quote. This string is “captured” as the variable
content
- The string
")
- And optional trailing whitespace (such as a newline)
If all four of these rules match in this order, then we create a new Print
statement, passing the content
along. We use a few of Parsec’s provided functions here: string
, many
, noneOf
, and spaces
. The return type of statement
is Statement
.
But since we’re defining a language, we should really support multiple statements. Parsec makes this really easy:
statements = many statement
We define a statement as many
statement
s, passing the statement
function we just defined to the many
combinator. The result type of statements
is [Statement]
, as you’d expect.
Execution
This is where the magic of an interpreter happens. The execution stage receives the AST created by the parser. It walks the AST, executing each Statement
it encounters along the way. Ponder that for a second: we get to define what these Statement
s do!
Now, Haskell is so good at this sort of thing. Observe:
eval :: Statement -> IO ()
eval (Print m) = print m
First, we define the type of the function eval
: a function taking a single Statement as an argument, and returning nothing, indicated by ()
. (We’ll come back to IO
shortly).
Next, we define what eval
does when it is passed a Print
statement: we print it to the screen! We’re using Haskell’s pattern-matching capabilities to define the behavior of eval
when passed a Print
argument. This works because Print
is a type of Statement
, which is the type of the first argument to eval
.
Next, we assign a name to the Print
statement’s message argument, and use that name in the call to print
. In effect, we’re unpacking the data stored in the AST. If we forget this argument, then the program will not compile.
As for IO
, Haskell requires us to annotate functions that have side effects with the IO
tag. It serves as a way to alert the reader: “hey, I might affect the outside world!” without them having to remember what all code paths do. Now, printing something to the screen is certainly a side effect, wouldn’t you say? Thus, we mark eval
with IO
.
Integrating Stages
We’re almost done! We have the two stages defined. Now we need to put them together.
We’ll do that in main
:
main = do
[filename] <- getArgs
code <- readFile filename
case parse statements filename code of
Left e -> print e
Right r -> mapM_ eval r
main
is a bit of a junk drawer, but that’s OK: it’s the central integration point, for now. We start by getting the first argument passed on the command line, and reading that file in. Next, we start the parsing process by calling Parsec’s parse
function, passing our statements
function, along with the filename, and the code we just read in.
parse
returns an Either
type, which is typically used to denote an operation that might fail. We pattern match on the result of the parse
call using case...of
syntax, which grants access to the actual value stored inside. A Left
is typically used to indicate failure, in this case, it contains a string containing an error message, which we assign the name e
to, and print to the screen.
A Right
indicates success (mnemonic: right/correct), and it carries a value with the return type of the statements
function, which is [Statement]
, which we give the name r
. eval
, however, expects a Statement
, not a [Statement]
. What do we do?
What we want to do is run eval
on each Statement
inside of r
. In an imperative language, we’d use a loop here. Haskell provides us the mapM
function to accomplish the same thing. mapM
applies the function passed as the first argument to each item of the list. This gives us the effect we want! Haskell provides the mapM_
variant which discards the result of the operation, which is suitable for our use.
There’s another discussion about map
vs mapM
that could be had here, but I’m not sure I can do it justice, nor do I feel it adds to the discussion. (I’ve intentionally avoided using the m-word throughout.) The short of it is: we use mapM
here because main
’s type is:
main :: IO ()
If main
was pure (e.g. it didn’t have IO
), we would use map
.
Testing
After building, invoke Hython with a test script:
$ ./hython print.py
"Hello"
"This is a test"
Lo and behold, we have defined a language with our bare hands.
Exercises
- Get the code compiling
- Write some more tests. Based on what you know about the syntax, try introducing some syntax errors into the test script and seeing what happens.
- Hython puts quotes around strings, but Python doesn’t. Fix this oversight. (Hint: use something besides
print
) Add support for Python’s
pass
statement:- Add another type of
Statement
using|
- Rename the
statement
parser intoprintStatement
, write apassStatement
parser, create a newstatement
parser that uses Parsec’s<|>
operator to choose between your parsers. - Update
eval
to handle pass statements. What makes this difficult?
- Add another type of