Write A Template Compiler For Erlang

February 21, 2009

Adapted from a talk I gave at the 2009 ORD Camp.

Erlang is an advanced distributed software platform, one that is great for network and multi-core computing. I believe it is almost ready for the "prime time" in web programming, but there is at least one big issue holding the Erlang platform back: the Erlang language is a poor fit for many problems that web programmers face. For example:

Erlang is hard to refactor
There is no built-in syntax for hash maps
String manipulation is hard

I am going to talk about one way to circumvent these problems, which is to implement another language to run on the Erlang platform. If you are ambitious you might implement a full-featured programming language, but here I’m going to talk about writing a compiler for a simple template language.

Know Your Target

The Erlang platform is a virtual machine. Erlang code gets compiled down into binary code (byte code) called BEAM, but it takes some intermediate forms along the way. As a compiler writer you will first need to choose your target platform, which can be one of:

Erlang - highest level, what programmers write
Core Erlang - lower level, but readable by programmers
Intermediate code - even lower level, and not readable by you or me
BEAM - byte code

To target #2, 3, or 4, you will need to learn more about Erlang’s internals. I don’t know much about Erlang’s internals, so I am going to target #1, the Erlang language. That is, my compiler will just turn a template file into Erlang code and then call the standard Erlang compiler. So actually, we’re cheating, but at the end of the day we are still producing compiled modules, and no one has to know we’ve cheated.

Compiler Step #0: The Language

First you need to choose a language to compile. It helps to get a hold of a language specification if there is one. I am implementing the Django Template Language. There’s not really a spec, but there is an official implementation in Python, so I can consult that whenever I’m not sure about something.

The rest of this tutorial takes examples from my and Roberto Saccon’s ErlyDTL project.

Compiler Piece #1: The Scanner

The first step towards implementing a language is to write a scanner. The scanner reads in a file and decides what’s a token. A token can be a string, a bracket, an identifier, a language keyword, or that sort of thing. The scanner just returns a flat list of tokens with some information about which line each token was on. With Erlang, you can use a program called Leex to help you write a scanner, or you can write a scanner in plain Erlang. I didn’t know about Leex when I was writing a scanner, so we’re just going to do it the old-fashioned way.

Here’s the important stuff in ErlyDTL’s scanner module (erlydtl_scanner.erl). First the boring stuff:

-module(erlydtl_scanner).
-export([scan/1]).

scan/1 is going to take in a list which represents the source code we’re compiling.

scan(Template) ->
    scan(Template, [], {1, 1}, in_text).

As you can see, it just calls scan/4, whose arguments are:

List of characters remaining to be scanned
List of tokens we’ve scanned so far
Current position {Row, Column}
Current state

The state is just an atom we pass around to tell us whether where we are in the scanning process, e.g. whether we’re waiting for a string to get closed. It’s just like the state in a Finite-State Machine from your Theory of Computation course. Actually our scanner will be a finite-state machine. Each function clause will define a transition rule. Here’s one such rule:

scan("{{" ++ T, Scanned, {Row, Column}, in_text) ->
    scan(T, [{open_var, {Row, Column}, "{{"} | Scanned], {Row, Column+2}, {in_code, "}}"});

That means “if we’re in state in_text and see {{, then:

push {{ onto the list of tokens
increment the column position by two characters
put us in state in_code and wait to see }}
scan the rest of the input”

Pretty cool how we can accomplish all four things in two lines of Erlang. As it were Erlang is pretty good language for scanning. Here are some scanning rules for dealing with code comments between {# and #}:

scan("{#" ++ T, Scanned, {Row, Column}, in_text) ->
    scan(T, Scanned, {Row, Column + 2}, {in_comment, "#}"});

scan([_ | T], Scanned, {Row, Column}, {in_comment, Closer}) ->
    scan(T, Scanned, {Row, Column + 1}, {in_comment, Closer});

scan("#}" ++ T, Scanned, {Row, Column}, {in_comment, "#}"}) ->
    scan(T, Scanned, {Row, Column + 2}, in_text);

And here are some more scanning rules just for dealing with the non-code parts of the template files.

scan("\n" ++ T, Scanned, {Row, Column}, in_text) ->
    scan(T, append_text_char(Scanned, {Row, Column}, $\n), {Row + 1, 1}, in_text);

scan([H | T], Scanned, {Row, Column}, in_text) ->
    scan(T, append_text_char(Scanned, {Row, Column}, H), {Row, Column + 1}, in_text);

They just increment the row and column positions. Note the use of append_text_char. The current character should be appended to the previous token, so we define a function to jump through the hoops for us:

append_text_char(Scanned, {Row, Column}, Char) ->
    case length(Scanned) of
        0 ->
            [{text, {Row, Column}, [Char]}];
        _ ->
            [Token | Scanned1] = Scanned,
            case element(1, Token) of
                text ->
                    [{text, element(2, Token), [Char | element(3, Token)]} | Scanned1];
                _ ->
                    [{text, element(2, Token), [Char]} | Scanned]
            end
    end.

The following rules deal with characters that have special meaning in Django Template Language:

scan("," ++ T, Scanned, {Row, Column}, {_, Closer}) ->
    scan(T, [{comma, {Row, Column}, ","} | Scanned], {Row, Column + 1}, {in_code, Closer});

scan("|" ++ T, Scanned, {Row, Column}, {_, Closer}) ->
    scan(T, [{pipe, {Row, Column}, "|"} | Scanned], {Row, Column + 1}, {in_code, Closer});

scan("=" ++ T, Scanned, {Row, Column}, {_, Closer}) ->
    scan(T, [{equal, {Row, Column}, "="} | Scanned], {Row, Column + 1}, {in_code, Closer});

scan(":" ++ T, Scanned, {Row, Column}, {_, Closer}) ->
    scan(T, [{colon, {Row, Column}, ":"} | Scanned], {Row, Column + 1}, {in_code, Closer});

scan("." ++ T, Scanned, {Row, Column}, {_, Closer}) ->
    scan(T, [{dot, {Row, Column}, "."} | Scanned], {Row, Column + 1}, {in_code, Closer});

A bit verbose, but whatever. That should give you the main idea for the rules. Finally when we reach the end of the input we need to reverse our list of tokens and mark any language keywords that we have run across.

scan([], Scanned, _, in_text) ->
    {ok, lists:reverse(lists:map(
                fun
                    ({identifier, Pos, String}) ->
                        RevString = lists:reverse(String),
                        Keywords = [ "if", "else", "endif", "not" ], %many others too
                        Type = case lists:member(RevString, Keywords) of
                            true ->
                                list_to_atom(RevString ++ "_keyword");
                            _ ->
                                identifier
                        end,
                        {Type, Pos, RevString};
                    ({Type, Pos, String}) ->
                        {Type, Pos, lists:reverse(String)}
                end, Scanned))};

…and we’re done. The scanner returns {ok, Tokens}. Time to send those tokens over to the parser.

Compiler Piece #2: The Parser

A parser takes a list of tokens and arranges them into a tree structure called a parse tree. You probably did something like this in your first CS class, where you take "1+1" and make a tree with "+" as the parent and "1" and "1" as the children. Same idea here.

We have some help in the form of a parser generator called Yecc. Yecc is like yacc but for Erlang. We’ll tell it how neighboring tokens are related to one another and Yecc will do the rest.

Our Yecc file looks like this. It has three parts: a list of non-terminals, a list of terminals, and a list of rules. Non-terminals are things you can expand into a list of terminals and non-terminals. Terminals cannot be expanded. Using the rules, we expand all the non-terminals into terminals.

Here’s a subset of our Django Template Language Yecc file (erlydtl_parser.yrl), just enough to deal with "if" statements.

Nonterminals
    Elements
    Literal
    Value
    Variable
    IfBlock
    IfBraced
    IfExpression
    ElseBraced
    EndIfBraced.

Terminals
    close_tag
    endif_keyword
    if_keyword
    else_keyword
    dot
    open_tag
    not_keyword
    text.

Rootsymbol
    Elements.

Here are the rules. The format is [thing to be expanded] -> [components of expansion] : [Erlang expression for recombining the components into a list or tree]. At the highest level, we just put things into a list:

Elements -> '$empty' : [].
Elements -> Elements text : '$1' ++ ['$2'].
Elements -> Elements IfBlock : '$1' ++ ['$2'].

Well that’s interesting. Say, what is in an IfBlock?

IfBlock -> IfBraced Elements ElseBraced Elements EndIfBraced : {ifelse, '$1', '$2', '$4'}.
IfBlock -> IfBraced Elements EndIfBraced : {'if', '$1', '$2'}.

Using the tuples, we build a tree. What do all these non-terminal expand to?

IfBraced -> open_tag if_keyword IfExpression close_tag : '$3'.
IfExpression -> not_keyword IfExpression : {'not', '$2'}.
IfExpression -> Value : '$1'.

Finally, some terminals. The rule for Value is a little complicated, and we won’t go into it here.

As you can see below, the part after the colon is strictly optional. In that case, nothing gets added to the parse tree.

ElseBraced -> open_tag else_keyword close_tag.
EndIfBraced -> open_tag endif_keyword close_tag.

We do need to compile the Yecc file into Erlang code. Yecc is built into erlc, so you can just run erlc -o src/ src/myfile.yrl and it should spit out an erl file.

In total the Yecc file for ErlyDTL weighs in at about 200 lines, roughly the same length as the scanner. 400 lines of code and we have 2/3 of the compiler done! Well kind of.

Compiler Piece #3: The Compiler

Now we have a parse tree. What we want to do is turn it into Erlang code. Actually, we will turn it into an abstract syntax tree (AST) with the help of the Erlang module called erl_syntax. Then we can call a special function to compile the AST into byte code. Building an AST instead of Erlang ASCII code makes our code feel cleaner and easier to debug, although at times it will seem verbose.

erl_syntax has a method corresponding to every syntactic structure in Erlang. If you want to build an atom, you call erl_syntax:atom/1. If you want to build a function call, you call erl_syntax:application/3. And so on. Check out the erl_syntax man page for details.

The compiler code can be found in erlydtl_beam_compiler.erl.

The ugliest part of the compiler is making an AST for compiled module’s skeleton. We need an AST for the -module definition, for the -export statement, and for the "render" function that each compiled template will have. In ErlyDTL this is done in a function called erlydtl_beam_compiler:forms/5 (source), which will probably make your head hurt. You can read it on your own time. The key part of it is the call to erl_syntax:revert, which converts the AST into a representation called "forms", which you then send straight to compile:forms/2.

The prettiest part of the ErlyDTL compiler is also the most subtle. It is the function called body_ast/3 (source), which looks at each node on the parse tree and converts it to the appropriate Erlang AST node, and then recursively does the same thing for each of the node’s children. As we descend the parse tree, we need to pass along some context, specifically all of the variables that are currently in scope. We might pick up a few scopes as we descend the parse tree, and shed them as we come back up.

Let’s take a look at a snippet of body_ast. All it does is takes the top level of a parse tree and decides what to do with each parent node. Here’s the bit that deals with "if" statements.

body_ast(DjangoParseTree, Context) ->
 lists:map(
        fun
%...
            ({'if', Variable, Contents}) ->
                ifelse_ast(Variable, body_ast(Contents, Context), empty_ast(), Context);
            ({'if', {'not', Variable}, Contents}) ->
                ifelse_ast(Variable, empty_ast(), body_ast(Contents, Context), Context);
            ({'ifelse', Variable, IfContents, ElseContents}) ->
                ifelse_ast(Variable, body_ast(IfContents, Context), 
                    body_ast(ElseContents, Context), Context);
%...
  end, DjangoParseTree).

Isn’t that code like a cool breeze in May? Well maybe not, but code like that is one of the simple pleasures in life. That’s the basic structure for a recursive descent compiler. The function processes each node and passes the child nodes back to body_ast. For the "if" statements here, it calls a function ifelse_ast that builds an AST for testing whether Variable evaluates to true and running the appropriate code block. If there’s no "else" block then it passes in an empty AST so that no code will be run if the variable is false. Here’s ifelse_ast:

ifelse_ast(Variable, IfContentsAst, ElseContentsAst, Context) ->
    % first resolve the variable to the right scope
    {Ast, VarName} = resolve_ifvariable_ast(Variable, Context),
    % now write an if statement
    erl_syntax:case_expr(erl_syntax:application(
        erl_syntax:atom(erlydtl_runtime), erl_syntax:atom(is_false), [Ast]),
            [erl_syntax:clause([erl_syntax:atom(true)], none,
                    [ElseContentsAst]),
                erl_syntax:clause([erl_syntax:underscore()], none,
                    [IfContentsAst])
        ]).

Well now, you should have the basic idea of the ErlyDTL compiler. The compiler piece weighs in around 750 lines. We cheat a little bit with some run-time code that compiled modules depend on, so all told our full-featured template compiler is about 2000 lines of code. Not something you can code in an afternoon, but not a huge undertaking, either.

Concluding Remarks

Erlang is almost certainly the server platform of the future, but it needs some creative ideas to make web programming for it more pleasant. Implementing template languages is a start, but why not new languages for the Model and Controller as well? They say Rails was possible due to Ruby’s great meta-programming features; if there is to be a Rails equivalent for Erlang, it will be possible due to the ease of writing a compiler for the Erlang platform.

REFERENCES

Erlang/OTP - platform of the future

Core Erlang - the intermediate language that template compilers probably should target

yecc man page - Erlang parser generator

erl_syntax man page - package for building an abstract syntax tree

ErlyDTL - Django Template Language for Erlang, written by Roberto Saccon and me

erlydtl_scanner.erl - scanner code
erlydtl_parser.yrl - parser Yecc file
erlydtl_beam_compiler.erl - compiler code

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Elixir RAM and the Template of Doom
The Joy of Erlang
Why I Program in Erlang

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter