Writing a template language with Clojure
Mustache is a very sweet template system inspired by ctemplate and el. Ever wondered how Mustache is implemented, or how to write your own template system? This post will teach you the basics.
NOTE: although this post discusses abstract syntax trees and lexical analysis, don't be intimidated: nothing beyond a basic knowledge of Clojure and regular expressions is needed to read, understand, and implement your own template language.
A simple example of what our templates will look like:
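Something along these lines (a sketch of the mustache-like syntax used throughout this post; the original example may differ):

```
Hello {{name}}! You have visited: {{#cities}}{{.}}{{/cities}}
```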
Get the code
This blog post is based on gabo. I suggest you take a look at the README, clone the project, or at least play with it a little using the REPL. The code is also available on Clojars.
Introduction
Rendering templates is a three-step process:
Step 1: Lexical Analysis
The purpose of the lexical analysis step is to convert a template string into a list of tokens.
Step 2: Parsing
The second step, parsing, is in charge of converting the list of tokens produced by the lexical analysis step into an abstract syntax tree.
Step 3: Rendering
The final step takes an abstract syntax tree and some input data, and renders that data using the template as its structure.
Now that we have a high-level understanding of the steps required to compile and render templates, let's get into the details.
Step 1: Lexical Analysis
The main purpose of the lexical analysis step is to implement the `tokenize` function, which takes a template as input and converts it into a list of tokens. A token is a very simple data structure (usually a tuple) that splits the template into meaningful lexical elements. A typical programming language will have string tokens, reserved keyword tokens, number tokens, etc.
Let's see how the `tokenize` function works with an example.
Example 1: tokenizing a simple template
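A template containing only literals and a symbol might produce tokens like these (a sketch of the token shapes, not gabo's exact output):

```clojure
;; tokenizing "Hi {{name}}!" yields a vector of tokens:
[[:literal "Hi "]
 [:symbol "name"]
 [:literal "!"]]
```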
Example 2: tokenizing a more complex template
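A template with an iteration might tokenize as follows (again a sketch; the `:default` in the `:iter-init` token is an assumption standing for "no custom separator"):

```clojure
;; tokenizing "{{#cities}} {{.}} {{/cities}}" might yield:
[[:iter-init "cities" :default]
 [:literal " "]
 [:symbol "."]
 [:literal " "]
 [:iter-end "cities"]]
```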
Note: for now let us ignore the `:default` keyword present in the first token; we will get to that later.
Tokens and their syntax:
All tokens are vectors of 2 or more elements. The first one is called the token identifier, the second one corresponds to the token's value.
There are 4 types of tokens:
- `:symbol`: symbols use the following syntax: `{{something}}`.
- `:iter-init`: marks the beginning of an iteration; they have the following syntax: `{{#something}}`.
- `:iter-end`: marks the termination of an iteration as started by an `:iter-init` token. They look like `{{/something}}`.
- `:literal`: literals match anything that is not a `:symbol`, `:iter-init` or `:iter-end`.
Tokenization
Tokenization is the process of converting a string into a list of tokens. The tokenization procedure is implemented using two functions: `tokenize` and `tokenize-chunk`:
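Here is a minimal sketch of the two functions, consistent with the token syntax above. gabo's real implementation uses its `when-match` macro and supports custom separators; plain `if-let` chains are used here to keep the sketch self-contained:

```clojure
(defn tokenize-chunk
  "Matches the beginning of template to a single token.
   Returns a pair [matched-string token]. A sketch, not gabo's code."
  [template]
  (if-let [[m v] (re-find #"\A\{\{([\w\.]+)\}\}" template)]
    [m [:symbol v]]
    (if-let [[m v] (re-find #"\A\{\{#([\w\.]+)\}\}" template)]
      [m [:iter-init v :default]] ; :default = no custom separator given
      (if-let [[m v] (re-find #"\A\{\{/([\w\.]+)\}\}" template)]
        [m [:iter-end v]]
        ;; anything else: consume text up to the next "{" as a literal
        (let [m (re-find #"(?s)\A.[^{]*" template)]
          [m [:literal m]])))))

(defn tokenize
  "Converts a template string into a vector of tokens."
  [template]
  (loop [remaining template
         tokens []]
    (if (empty? remaining)
      tokens
      (let [[matched token] (tokenize-chunk remaining)]
        (recur (subs remaining (count matched))
               (conj tokens token))))))
```

For example, `(tokenize "Hi {{name}}!")` returns `[[:literal "Hi "] [:symbol "name"] [:literal "!"]]`.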
The tokenize function
The `tokenize` function is a very simple one: it takes a template string as input and loops through the template until it reaches the end. On every iteration the `tokenize-chunk` function is called, which tries to match the beginning of the remaining template to a token. When a match is found, the new token is `conj`oined onto the `tokens` vector and the remaining template is advanced past the matched string with `subs`.
As you probably guessed, `tokenize-chunk` is the function that does the actual heavy work of matching strings to tokens.
`when-match`, what is that?
You probably noticed the use of the `when-match` macro. The implementation of the `when-match` macro can be found here.
Example usage of `when-match`:
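gabo's macro is linked above; as an illustration, here is a minimal re-implementation consistent with the description that follows, together with an example usage (the real macro may differ in details):

```clojure
(defmacro when-match
  "Tries each [sym regex] / body pair in order. On the first regex
   that matches (via re-find), binds sym to the match and evaluates
   the corresponding body; otherwise moves on to the next pair."
  [s & pairs]
  (when (seq pairs)
    (let [[[sym regex] body & more] pairs]
      `(if-let [~sym (re-find ~regex ~s)]
         ~body
         (when-match ~s ~@more)))))

(when-match "order #42"
  [m #"#\d+"] (str "order number " m)
  [m #"\w+"]  (str "word " m))
;; => "order number #42"
```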
`when-match` takes a string as its first argument, followed by n pairs of forms. The macro will try to match the regular expression in the first form of each pair. If a match is found, the symbol preceding the regular expression is `let`-bound and the second form is evaluated. Regular expression matching uses `re-find` under the hood.
If there is no match, the same is attempted with the next pair of forms. I suggest you play around with `when-match` in your REPL to get a solid understanding of how it works.
Going back to the `tokenize-chunk` function: the implementation is very simple once you understand `when-match`. Each regex matches one type of token; when a token is matched, a pair `[string token]` is returned, where `string` is the actual string that was matched and `token` is the corresponding token constructed from the matched string.
Example: `tokenize-chunk` at work.
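With a `tokenize-chunk` of that shape, returning `[matched-string token]` pairs, the calls proceed roughly like this (a sketch of the intermediate values):

```clojure
;; remaining template            => [matched-string token]
;; "Hi {{name}}, how are you?"  => ["Hi "      [:literal "Hi "]]
;; "{{name}}, how are you?"     => ["{{name}}" [:symbol "name"]]
;; ", how are you?"             => [", how are you?" [:literal ", how are you?"]]
```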
Notice how this example tokenizes `"Hi {{name}}, how are you?"` by matching it chunk by chunk to a token.
We are done with the lexical analysis step. Let's now move on to the parsing step.
Step 2: Parsing
Let us recall that the purpose of this step is to produce an abstract syntax tree, or AST. Think of an AST as a tree-like representation of our template which we can later use to actually render templates.
Consider a template which renders a country and every city in the country. We will first tokenize it and then manually convert the tokens into an AST.
Tokenization is quite simple using our `tokenize` function.
We will now convert those tokens into a tree. The conversion will use these rules:
- `:literal` and `:symbol` tokens will be left untouched.
- `:iter-init` and `:iter-end` tokens will be grouped into an `:iter` node, and all tokens between the init/end pair will be parsed recursively.
Example of an AST as produced by the `parse` function:
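For the country/cities template, the tokens and the resulting AST might look like this (a sketch; the exact shape of the `:iter` node is an assumption based on the rules above):

```clojure
;; tokens for "{{country}}: {{#cities}}{{.}}{{/cities}}"
[[:symbol "country"] [:literal ": "]
 [:iter-init "cities" :default] [:symbol "."] [:iter-end "cities"]]

;; the corresponding AST: the init/end pair collapses into one :iter
;; node whose last element is the recursively parsed sub-tree
[[:symbol "country"]
 [:literal ": "]
 [:iter "cities" :default [[:symbol "."]]]]
```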
Let's now take a look at the `parse` function. It doesn't do much; let's instead look at `build-ast`, which takes a list of tokens as input and outputs an AST.
The `build-ast` function is a bit complex, but let's break it down into a series of steps:
- It `loop`s through all the tokens, accumulating the `ast`, which will be the result of the function.
- The `if` in line 6 simply asserts that when there are no tokens left we are done, so it just returns the `ast`.
- If there are tokens left, then line 8 `let`s `token` be the first token in the iteration.
- The `cond` has 2 cases and an `else`:
  - Case 1: the "simple" case. If the token is a literal or symbol, simply add the token to the `ast`.
  - Case 2: the "hard" case, evaluated when an `:iter-init` token is found. The `find-iter-sub-list` function will return all tokens inside the `:iter-init` and `:iter-end` pair. The `recur` call will drop all tokens found in the `sub-list` plus 2, which corresponds to the `:iter-init` and `:iter-end` tokens. The `ast` is then `conj`oined with a recursive call of the `build-ast` procedure over the `sub-list` of tokens.
  - `else` case: if something other than an `:iter-init`, `:symbol` or `:literal` is found, then throw an exception because we found an unexpected token.
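The steps above can be sketched as follows. This is a minimal version written from the description, not gabo's actual code, with a hypothetical `find-iter-sub-list` helper included for self-containment:

```clojure
(defn find-iter-sub-list
  "Returns all tokens strictly between the :iter-init token at the
   head of tokens and its matching :iter-end, handling nesting."
  [tokens]
  (loop [remaining (rest tokens), depth 0, acc []]
    (let [[[kind :as token] & more] remaining]
      (cond
        (nil? token) (throw (ex-info "Unterminated iteration" {}))
        (= kind :iter-init) (recur more (inc depth) (conj acc token))
        (and (= kind :iter-end) (pos? depth)) (recur more (dec depth) (conj acc token))
        (= kind :iter-end) acc                      ; matching :iter-end found
        :else (recur more depth (conj acc token))))))

(defn build-ast
  "Converts a flat list of tokens into an AST."
  [tokens]
  (loop [tokens tokens, ast []]
    (if (empty? tokens)
      ast                                           ; no tokens left: done
      (let [[kind value :as token] (first tokens)]
        (cond
          ;; simple case: literals and symbols are added untouched
          (contains? #{:literal :symbol} kind)
          (recur (rest tokens) (conj ast token))

          ;; hard case: collapse an init/end pair into an :iter node
          (= kind :iter-init)
          (let [sub-list (find-iter-sub-list tokens)
                separator (nth token 2)]
            ;; drop the sub-list plus the :iter-init and :iter-end tokens
            (recur (drop (+ 2 (count sub-list)) tokens)
                   (conj ast [:iter value separator (build-ast sub-list)])))

          :else
          (throw (ex-info "Unexpected token" {:token token})))))))
```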
This finishes our explanation of the parsing step. I have tried to write the simplest possible parser; hopefully you agree. A more complex one would, for example, include better error-detection mechanisms.
Step 3: Rendering
We have arrived at the final step in our template rendering system. This step builds upon steps 1 and 2: it takes an AST and a context as arguments and returns a rendered template. A context is simply a map where every key is a keyword and every value is a number, a string, or a collection of contexts.
Let us now look at the implementation of `eval-tree`:
`eval-tree` runs through each node in the AST recursively as follows:
- If a `:literal` token is found, return the token's value.
- If a `:symbol` is found, check whether its value is `"."`:
  - Case `token-val` is `"."`: return the complete context as a string.
  - Case `token-val` is not `"."`: look for `token-val` as a keyword inside the context map and return that.
- If an `:iter` tree is found, then:
  - `let` `coll` be the collection inside the context that matches the `:iter`'s `token-val`.
  - Recursively call `eval-tree` on every node in the `sub-tree`.
  - Interpose the previous result with the separator, which is `","` by default.
  - Finally, concatenate all "sub-templates" and separators.
- The final branch, `(coll? tree)`, is actually only evaluated once, the first time `eval-tree` is called, because ASTs produced by `parse` are not actually trees but lists of trees. This branch is equivalent to the previous one, sans interposing with separators.
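As a sketch, an `eval-tree` following those branches might look like this (written from the description above; gabo's version differs in details):

```clojure
(defn eval-tree
  "Renders an AST (or a single node of it) against a context map."
  [tree context]
  (let [kind (first tree)]
    (cond
      (= kind :literal) (second tree)

      (= kind :symbol)
      (let [token-val (second tree)]
        (if (= token-val ".")
          (str context)                       ; "." renders the whole context
          (str (get context (keyword token-val)))))

      (= kind :iter)
      (let [[_ token-val separator sub-tree] tree
            sep (if (= separator :default) "," separator)
            coll (get context (keyword token-val))]
        (->> coll
             ;; render the sub-tree once per element of the collection
             (map (fn [ctx] (apply str (map #(eval-tree % ctx) sub-tree))))
             (interpose sep)
             (apply str)))

      ;; the top-level AST is a list of trees: render and concatenate
      (coll? tree) (apply str (map #(eval-tree % context) tree))

      :else (throw (ex-info "Unexpected node" {:node tree})))))
```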
Wrapping things up
Now that we have implemented `tokenize`, `parse` and `eval-tree`, the `render` function is very simple:
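A sketch, assuming the functions built in the previous sections are in scope (gabo's actual `render` may be shaped slightly differently):

```clojure
(defn render
  "Compiles the template and evaluates it against the given context."
  [template context]
  (-> template
      tokenize
      build-ast
      (eval-tree context)))
```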
Hooray!
I hope you now have a good understanding of how to write your own template system. If you have any questions or suggestions, or just want to chat, you can e-mail me at fernandohur at gmail.com.
Criticism is always welcome but go easy on me, it’s my first post ;)