call/cc
code, culture, and a life considered

Hython: The simplest possible language

Think back to the first piece of code you wrote. It was probably some form of “Hello World.” Do you remember the joy and fascination you felt? The rush of commanding what felt like magic? The sudden realization of potentiality?

Let’s revisit that time by implementing a simple language that supports Hello World. Over time, it will grow to be a Python interpreter. We will slowly implement larger and larger subsets of the Python syntax and semantics. Along the way, we’ll learn about parsing, interpreters, and Haskell. Note that we will only scratch the surface of compilers! In particular, we won’t worry about optimizing our interpreter for speed.

Ideally, you’ll have a bit of exposure to:

  • Haskell
  • Functional programming
  • Compiler design

However, I realize most people will not. I’ll try to provide as much guidance as I can, but please don’t be afraid to thrash around a bit and play. Beginner’s mind will serve you extremely well, especially when learning Haskell. And don’t be afraid to ask on StackOverflow or Freenode’s #nothaskell channel. Don’t give up! If it helps, here is a similar project I implemented in Ruby as practice for this one.

This can be a hard project at times, but it is extremely satisfying. You will learn very quickly when you throw yourself in.

Setup

  1. Follow Chris Allen’s guide to install Haskell on your platform

  2. Use cabal to install parsec:

    $ cabal install parsec
    
  3. Grab the source code, and create a branch pointing at the first commit:

    $ git clone https://github.com/mattgreen/hython.git
    $ cd hython
    $ git checkout -b first 79e34386f5a31ebd6ff7968c645eb485e75e508f
    
  4. Compile the source code:

    $ make
    

You can also follow along by viewing the specific commit.

Theory Crash Course

Compilers are tasked with a complex job: that of understanding, optimizing, emitting, and possibly executing code written by humans. To manage this complexity, they’re typically broken into discrete stages. Each stage is responsible for only one facet of the compilation process, such as parsing or optimization. It is helpful to think of these stages arranged in a pipeline:

# Pseudocode!
$ lex code.c | parse | interpret

Stages typically accept input in a specific format, and may output a different format. This promotes loose coupling between the stages, each of which has its own necessary complexity to worry about. We’ll see how the stages connect together shortly.

For Hython, we’ll focus on two stages right now:

  1. Parsing: understanding the code written by the user
  2. Interpretation: executing the code written by the user

Note the separation between understanding and executing. What would this enable?

Specifying our Language

Our language, right now, is composed of print statements. Let’s model that in Haskell:

data Statement = Print String
    deriving(Show)

This line introduces a new type, namely Statement. It also defines the lone kind of statement we support: a Print statement, which carries a String argument: the message to be printed.
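Deriving Show lets us render AST values as text, which is handy for poking at the parser’s output. A quick sketch (the main wrapper is mine, not part of Hython):

```haskell
-- The Statement type from above, with Show derived so we can inspect AST values.
data Statement = Print String
    deriving (Show)

main :: IO ()
main = putStrLn (show (Print "Hello World"))
```

Running this prints the value’s structure, not the message itself: Print "Hello World".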

Parsing

Parsing can be an icky, icky process. But Haskell’s Parsec library makes it much nicer! We’ll lean on that to parse Hython’s grammar. The goal of the parser is to transform correct syntax into an Abstract Syntax Tree (AST), which represents a user’s program. In Hython, the AST is represented by the Statement type we defined previously.

So how do we parse a print statement, anyway?

statement = do
    string "print(\""
    content <- many (noneOf "\"")
    string "\")"
    spaces
    return $ Print content

We define a print statement’s syntax as:

  1. The string print("
  2. Zero or more of any character, except the closing double-quote (many matches zero or more). This string is “captured” as the variable content
  3. The string ")
  4. And optional trailing whitespace (such as a newline)

If all four of these rules match in this order, then we create a new Print statement, passing the content along. We use a few of Parsec’s provided functions here: string, many, noneOf, and spaces. The result of the statement parser is a Statement.
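If Parsec feels magical, here is a rough hand-rolled equivalent of the same grammar using only the Prelude (my own sketch; real Parsec also tracks source positions, error messages, backtracking, and much more):

```haskell
import Data.List (stripPrefix)

-- A toy parser: given input, either fail (Nothing) or produce a value plus leftover input.
type Parser a = String -> Maybe (a, String)

-- Match an exact string, like Parsec's `string`.
lit :: String -> Parser ()
lit s input = do
    rest <- stripPrefix s input
    return ((), rest)

-- Consume characters up to a double-quote, like `many (noneOf "\"")`.
untilQuote :: Parser String
untilQuote input = Just (span (/= '"') input)

-- The rules from above, in order: print(" then the content then ")
printStatement :: Parser String
printStatement input = do
    (_, r1)       <- lit "print(\"" input
    (content, r2) <- untilQuote r1
    (_, r3)       <- lit "\")" r2
    return (content, r3)

main :: IO ()
main = print (printStatement "print(\"Hello\")")
```

Note how the do-block threads the leftover input from rule to rule by hand; Parsec’s combinators do exactly this plumbing for us.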

But since we’re defining a language, we should really support multiple statements. Parsec makes this really easy:

statements = many statement

We define statements as zero or more occurrences of statement, passing the statement parser we just defined to the many combinator. The result type of statements is [Statement], as you’d expect.

Execution

This is where the magic of an interpreter happens. The execution stage receives the AST created by the parser. It walks the AST, executing each Statement it encounters along the way. Ponder that for a second: we get to define what these Statements do!

Haskell is exceptionally well suited to this sort of thing. Observe:

eval :: Statement -> IO ()
eval (Print m) = print m

First, we define the type of the function eval: a function taking a single Statement as an argument, and returning nothing, indicated by (). (We’ll come back to IO shortly).

Next, we define what eval does when it is passed a Print statement: we print it to the screen! We’re using Haskell’s pattern-matching capabilities to define the behavior of eval when passed a Print argument. This works because Print is a type of Statement, which is the type of the first argument to eval.

Next, we assign a name to the Print statement’s message argument, and use that name in the call to print. In effect, we’re unpacking the data stored in the AST. If we forget this argument, then the program will not compile.

As for IO, Haskell requires us to annotate functions that have side effects with the IO type. It serves as a way to alert the reader: “hey, I might affect the outside world!” without them having to remember what every code path does. Now, printing something to the screen is certainly a side effect, wouldn’t you say? Thus, we mark eval with IO.
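Putting this section’s pieces together in one standalone sketch (the describe function is my own addition, just to contrast a pure function with eval’s IO type):

```haskell
data Statement = Print String
    deriving (Show)

-- eval lives in IO because printing is a side effect.
eval :: Statement -> IO ()
eval (Print m) = print m

-- A pure function over the same type needs no IO annotation.
describe :: Statement -> String
describe (Print m) = "print statement: " ++ m

main :: IO ()
main = do
    eval (Print "Hello")
    putStrLn (describe (Print "Hello"))
```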

Integrating Stages

We’re almost done! We have the two stages defined. Now we need to put them together.

We’ll do that in main:

main = do
    [filename] <- getArgs
    code <- readFile filename

    case parse statements filename code of
        Left e  -> print e
        Right r -> mapM_ eval r

main is a bit of a junk drawer, but that’s OK: it’s the central integration point, for now. We start by getting the first argument passed on the command line, and reading that file in. Next, we start the parsing process by calling Parsec’s parse function, passing our statements function, along with the filename, and the code we just read in.

parse returns an Either type, which is typically used to denote an operation that might fail. We pattern match on the result of the parse call using case...of syntax, which grants access to the actual value stored inside. A Left is typically used to indicate failure; here it contains a parse error describing what went wrong, which we name e and print to the screen.

A Right indicates success (mnemonic: right/correct), and it carries a value with the result type of the statements parser, which is [Statement]; we give it the name r. eval, however, expects a Statement, not a [Statement]. What do we do?

What we want to do is run eval on each Statement inside of r. In an imperative language, we’d use a loop here. Haskell provides us the mapM function to accomplish the same thing. mapM applies the function passed as the first argument to each item of the list. This gives us the effect we want! Haskell provides the mapM_ variant which discards the result of the operation, which is suitable for our use.
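The shape of main can be sketched without Parsec by stubbing out the parser (the stub and its String error type are my own simplification; Parsec’s Left actually carries a ParseError value):

```haskell
data Statement = Print String
    deriving (Show, Eq)

eval :: Statement -> IO ()
eval (Print m) = print m

-- Stub standing in for Parsec's parse: Left on failure, Right on success.
parseStub :: String -> Either String [Statement]
parseStub ""   = Left "no input"
parseStub code = Right [Print code]

main :: IO ()
main =
    case parseStub "Hello" of
        Left e  -> putStrLn ("parse error: " ++ e)
        Right r -> mapM_ eval r   -- run eval over each Statement in turn
```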

There’s another discussion about map vs mapM that could be had here, but I’m not sure I can do it justice, nor do I feel it adds to the discussion. (I’ve intentionally avoided using the m-word throughout.) The short of it is: we use mapM here because main’s type is:

main :: IO ()

If eval were pure (i.e. its type involved no IO), we would use map instead.
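The difference is easy to see in the types (a small aside of my own): map over an IO-returning function only builds a list of actions, while mapM_ actually sequences and runs them.

```haskell
messages :: [String]
messages = ["Hello", "This is a test"]

-- map print merely builds a list of IO actions; none of them run.
unrun :: [IO ()]
unrun = map print messages

main :: IO ()
main = mapM_ print messages   -- mapM_ sequences and runs each action
```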

Testing

After building, invoke Hython with a test script:

$ ./hython print.py
"Hello"
"This is a test"

Lo and behold, we have defined a language with our bare hands.

Exercises

  1. Get the code compiling
  2. Write some more tests. Based on what you know about the syntax, try introducing some syntax errors into the test script and seeing what happens.
  3. Hython puts quotes around strings, but Python doesn’t. Fix this oversight. (Hint: use something besides print)
  4. Add support for Python’s pass statement:

    1. Add another type of Statement using |
    2. Rename the statement parser to printStatement, write a passStatement parser, and create a new statement parser that uses Parsec’s <|> operator to choose between them.
    3. Update eval to handle pass statements. What makes this difficult?

Hython: Introduction

For the first part of 2014, I felt highly unfocused. I was a bit bored and listless. I started a few side projects, but nothing seemed to stick. I chalked it up to winter blues and put them aside.

A few months later, I had an insight: I only attempted side projects I knew I could finish.

Little wonder I felt stagnant! I’d forgotten that half the thrill of a side project is attempting things that might not pan out. Instead, I viewed side projects as something that needed to be released ASAP. I’d sold my craft to the Church of Lean. I’d stopped training. To instill accountability, I resolved to work on my next project out in the open, however imperfect it was.

Thus, I married two nascent interests: compilers and functional programming. Whereas my previous study in programming language design stayed simple, I felt ready to tackle a real language, with the necessary ugliness it entailed. I wanted the fear of getting my ass kicked by a project of this scope. And I wanted a language that was well fit for the problem. In the age of Blub Languages, I wanted a powerful language that expected more of me than I was used to. Haskell fit the bill perfectly.

I began Hython on July 2nd, 2014, severely underestimating the scope of work to be done. This was a blessing: it protected me from the fears and doubts that would have provided an easy way out. This is reflected in my first commit: a simple interpreter which only handles print statements and blank lines.

I’ll be walking through the commit history with a series of posts, explaining as I go. We’ll make some poor assumptions and gross simplifications along the way, but that’s OK: we are learning, after all. I hope to show you that a study of compilers is actually a study of computation itself. And computation, in its purest form, is something very beautiful and creative.

That is my real reason for learning all of this.

A Culture of Easy

I’ve been haunted by the distinction of simple and easy.

Specifically, I’ve wondered why developers confuse these ideas over and over. Wouldn’t they, of all people, understand the pain that comes when things don’t work? After all, their job is to build things, and ensure that they work! Why can’t they see that easy often comes at a price? Are most developers just bad at their jobs?

No. But there’s more to it than that.

You see, we live in a culture of easy. Aspirational advertising promises highly intangible goals (such as respect) as outcomes of buying certain products. Social networking offers the promise of human connection while doing everything to keep us generating ad revenue. Fast food lets us eat without putting much time into preparation. But, all of those things are mere shadows of the actual reality. You fake respect by buying luxury goods, you use social networking to plan to meet someone face to face, and you don’t want to eat fast food every day. Knowing this doesn’t render you immune to the effects, unfortunately.

That consumer mindset is easy to bring into other areas of life, and software is no exception. Following the buzz of the tech community is like injecting a giant dose of marketing into your brain. The worst part about it is: you don’t even realize it’s happening. You see people talking a lot about Node, or Docker, or Go, and some part of you figures, “these must be important technologies, or else nobody would be talking about them!” Except, you don’t really know their technical merits; you’ve only inferred this from the fact that discussion about them was occurring. After all, they wouldn’t be new if they weren’t better than what’s already out there, right? Surely everyone can’t be fooled!

Except, everyone is often fooled. In a consumer culture, we’re trained to be fooled. We’re taught to be passive, and we’re taught to consume shiny things. New technology is marketed to us in exactly the same way as consumer products are. The feedback cycle of effusive blog posts, heavy marketing, corporate backing, and proclamations of “this time it’s different” create hype. It tells us a story we want to believe. And because we want to believe it, we fall for it.

Meanwhile, the Software Problem rages on, as bad as it ever was. Strangely, nobody seems to notice that the Previous Great Thing didn’t really fix it.

Maybe the Next Big Thing will save us? Don’t let us down, Hacker News!

The Lost Art of Denotational Semantics

Erik Meijer seems to have a knack for explaining complex theory in an entertaining way. When I first saw Erik speak in person at Lang.NEXT 2014, I couldn’t help but smile in nerdy glee: he was unafraid to wade deep into the theory while being enjoyable and occasionally irreverent. While I’d love to cover that talk at a later date, for this post I’d like to focus on another talk he gave this year, entitled The Lost Art of Denotational Semantics. Erik does an excellent job explaining the concept and application of denotational semantics, which I’d never heard of before.

The talk serves as a good introduction to continuation-passing style, which is often used in compiler construction. It also shows the beauty and power of functional programming in solving complex problems. A mix of Scala and a more abstract math-like notation are used.

Notes:

  • How do we determine what our programming language actually does? Define it in terms of code!
  • In more concrete terms, denotational semantics is realized by writing an interpreter for a programming language.
  • To start, Erik derives sequential composition (do this, then that) via continuation passing.
  • How do we handle return statements? Easy: pass in another continuation!
  • Exception handling can be expressed succinctly via denotational semantics as well.
  • “Look how deeply nested these lambdas are!”
  • Continuation passing is really powerful, but how do we simplify the machinery involved?
  • Erik backs into a simplified intermediate representation (IR) that is simpler to define the semantics for.
  • The IR ends up being imperative: gotos + hidden global variables!
  • What we’re really doing is defining the semantics of something complicated (try/finally) in terms of something very simple (IR).
  • This desugaring makes it much easier to generate a compiler from an interpreter.
  • Discussion of catamorphism and anamorphism, which I’m unable to appreciate at this time.
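To make continuation passing concrete, here is a tiny sketch of my own in Haskell (the talk uses Scala and a more abstract notation): sequential composition expressed by handing each result to a continuation rather than returning it.

```haskell
-- A CPS computation hands its result to a continuation instead of returning it.
type Cont r a = (a -> r) -> r

-- Lift a plain value into CPS.
ret :: a -> Cont r a
ret x k = k x

-- Sequential composition ("do this, then that") via continuation passing.
bindC :: Cont r a -> (a -> Cont r b) -> Cont r b
bindC m f k = m (\a -> f a k)

-- Add one, then double, with the steps chained through continuations.
pipeline :: Int -> Cont r Int
pipeline n = ret (n + 1) `bindC` (\m -> ret (m * 2))

main :: IO ()
main = pipeline 20 print   -- the final continuation decides what happens last
```

Swapping in a different final continuation changes what “finish” means, which is exactly the trick Erik uses to model return statements and exception handlers.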

Simple Made Easy

To counter the pop-culture of programming, I’m going to be doing an on-going series where I highlight talks and papers that I’ve found worthwhile. We’ll start with Rich Hickey’s famous talk Simple Made Easy: http://www.infoq.com/presentations/Simple-Made-Easy

Rich has a knack for reducing a problem to its purest essence. It is a pleasure being challenged by his vision of radical simplicity.

My notes:

  • Simple is an objective notion: we can determine whether it is entangled with something else
  • Easy seduces us with familiarity: whether it’s easy to obtain (gem install), close to our capabilities, or familiar. It is a relative notion.
  • We focus too much on the experience of constructing something rather than the long-term implications it has.
  • Familiarity (part of easy) makes programmers interchangeable
  • TDD is “guard-rail-oriented programming”, Hickey emphasizes reasoning about the program itself.
  • When evaluating program constructs, what matters to us is the complexity that emerges from their use. Despite some constructs being easy, they may create more complexity that is purely incidental to the problem.
  • Simplicity yields ease of understanding, change, debugging.
  • Composing simple components is the key to robust software.
  • Separating modules doesn’t imply simplicity, nor does code organization.
  • State complects value and time. It entangles everything that it touches (directly or indirectly). It cannot be contained by modules or encapsulation.
  • State cannot be made simple, even when it is explicit as in Clojure or Haskell.
  • “You can’t even begin to talk about how complicated an ORM is”
  • Represent data as data rather than wrapping it up in objects
  • Simplicity is a choice, we live in a culture of complexity
  • Easy is not simple. You must develop sensibilities about entanglement.

Fear of threads

“Threads are hard, so you shouldn’t use them.” — the Internet

Unfortunately, I encounter this sentiment frequently. The subtext is: “threads are dangerous and can corrupt your state, so you shouldn’t use them at all.” While not technically incorrect, I find this belief damaging to hold, if only because it is intellectually stunting.

Threads are a powerful tool, and one that you will inevitably have to use, even if indirectly. They force you to think very deliberately about your data flow. They quickly expose whether your design is solid, because they require you to decouple the arrangement of work from the actual work at hand. Threads, due to their low-level nature, are not something you can drop in and forget about. The realities of your runtime environment (and the quality of its implementation) cannot be ignored.

I believe the majority of the benefit of threads comes from the act of using them. Due to the harsh realities of sharing mutable state between threads, most apps settle on a saner strategy akin to passing messages between threads. This has a powerful effect on abstraction: worker threads more easily conform to the Single Responsibility Principle because they simply don’t have access to the rest of the world. They’re isolated from the rest of the app for reasons of safety.

Currently, our tools for working with threads are extremely primitive. Blocking the run loop should be viewed as an amateur mistake: the UI thread is only for UI work, period. Real work happens in a background thread, with messages flying between threads to notify the user.

For the love of frameworks

Most developers have strong feelings on frameworks.

I’ve noticed that the fervor typically associated with languages has transferred to frameworks. We have conferences and meetups around frameworks now. This was inevitable: given sufficiently powerful languages, frameworks are mini-languages unto themselves. In fact, frameworks can put languages on the map! Ruby made such a big splash because of Rails. Thus, framework development is very similar to programming language development, just at a higher conceptual level.

But as the line between language and framework continues to blur, the tendency is to treat the framework as an equal to the language; namely, that it is required. This is a harmful myopia. My worry stems from the fact that it is easy to use the framework as a design crutch. Instead of making intentional decisions about how to structure an app, you outsource the architecture to a piece of code you don’t control. In the process, you’re also conceding part of your design budget to the framework which will be very costly to regain in the future. More importantly, your own ability to design will atrophy from lack of use.

This doesn’t mean you shouldn’t use frameworks. It means they should be a calculated cost, and seen for what they are: commoditized, rapid application development tools. But they are both volatile and ephemeral. And they do not automatically confer a good architecture on your application. Frameworks merely give it a safe, one-size-fits-all design. Generally, the simpler your problem, the more likely a framework can handle it.