Writing a template language with Clojure
Mustache is a very sweet template system inspired by ctemplate and et. Ever wondered how Mustache is implemented, or how to write your own template system? This post will teach you the basics.
NOTE: although this post discusses abstract syntax trees and lexical analysis, don't be intimidated: nothing beyond a basic knowledge of Clojure and regular expressions is needed to read, understand, and implement your own template language.
A simple example of what our templates will look like:
;; First we define the template
(def template
"<p> Hello {{name}}, your friends are: </p>
<ul>
{{#friends}}
<li>{{first-name}} {{last-name}} </li>
{{/friends}}
</ul>")
;; Now we render the template
(render template
        {:name "Bob"
         :friends [{:first-name "Bill" :last-name "Gates"}
                   {:first-name "Chuck" :last-name "Cheese"}]})
;; Output should be something like:
"<p> Hello Bob, your friends are: </p>
<ul>
<li>Bill Gates</li>
<li>Chuck Cheese</li>
</ul>"Get the code
This blog post is based on gabo, I suggest you take a look at the README, clone the project or at least play with it a little using the repl. Code is also available in clojars.
Introduction
Rendering templates is a three-step process:
Step 1: Lexical Analysis
The purpose of the lexical analysis step is to convert a template string into a list of tokens.
Step 2: Parsing
The second step, parsing, is in charge of converting the list of tokens produced by the lexical analysis step into an abstract syntax tree.
Step 3: Rendering
The final step takes an abstract syntax tree and some input data, and renders that data using the template as its structure.
Now that we have a high level understanding of the steps required to compile and render templates let’s get into detail.
Step 1: Lexical Analysis
The main purpose of the lexical analysis step is to implement the
tokenize function, which takes a template as input and converts it into
a list of tokens. A token is a very simple data structure (usually a
tuple) that represents one meaningful lexical element of the template. A
typical programming language will have string tokens, reserved-keyword
tokens, number tokens, etc.
Let's see how the tokenize function works with a couple of examples.
Example 1: tokenizing a simple template
user=> (use 'gabo.lexer)
user=> (tokenize "Hi {{name}}")
[[:literal "Hi "] [:symbol "name"]]Example 2: tokenizing a more complex template
user=> (tokenize "{{#people}}Name: {{name}}, age: {{age}}{{/people}}")
[[:iter-init "people" :default]
[:literal "Name: "]
[:symbol "name"]
[:literal ", age: "]
[:symbol "age"]
 [:iter-end "people"]]
Note: for now, let us ignore the :default keyword present in the first token; we will get to it later.
Tokens and their syntax:
All tokens are vectors of two or more elements: the first is called the token identifier, and the second corresponds to the token's value.
There are 4 types of tokens:
- :symbol: symbols use the following syntax: {{something}}.
- :iter-init: marks the beginning of an iteration. It has the following syntax: {{#something}}.
- :iter-end: marks the termination of an iteration as started by an :iter-init token. It looks like {{/something}}.
- :literal: literals match anything that is not a :symbol, :iter-init or :iter-end.
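For instance, a template exercising all four token types should tokenize to something like this (derived from the tokenizer described below):
user=> (tokenize "a {{b}} {{#c}}d{{/c}}")
[[:literal "a "] [:symbol "b"] [:literal " "]
 [:iter-init "c" :default] [:literal "d"] [:iter-end "c"]]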
Tokenization
Tokenization is the process of converting a string into a list of
tokens. The procedure is implemented using two functions:
tokenize and tokenize-chunk.
The tokenize function
(defn tokenize
  [template]
  (loop [remaining template
         tokens []]
    (if (empty? remaining)
      tokens
      (let [[str-match token] (tokenize-chunk remaining)]
        (recur (.substring remaining (count str-match))
               (conj tokens token))))))
The tokenize function is a very simple one: it takes a template string as
input and loops through the template until it reaches the end.
On every iteration the tokenize-chunk function is called, which tries
to match the beginning of the remaining template to a token. When a match
is found, the new token is conjoined onto the tokens vector and the
matched string is dropped from the front of the remaining template.
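To make the loop concrete (tokenize-chunk is defined just below), one iteration should look roughly like this:
user=> (tokenize-chunk "Hi {{name}}")
["Hi " [:literal "Hi "]]
;; tokenize then drops (count "Hi ") = 3 characters and recurs on the rest:
user=> (.substring "Hi {{name}}" (count "Hi "))
"{{name}}"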
As you probably guessed, tokenize-chunk is the function doing the
actual heavy work of matching strings to tokens.
(defn tokenize-chunk
  "Given a template substring returns a tuple of [match token]
  where match is the exact string that matched and token is
  the token that corresponds to the regex match.
  Example: (tokenize-chunk '{{bar}} foo') will return
  ['{{bar}}' [:symbol 'bar']]"
  [template]
  (when-match template
    ;; match symbols e.g. {{foo}}
    [sym #"\A\{\{\s*([\w-\.]+)\s*\}\}"]
    [(first sym) [:symbol (last sym)]]
    ;; match iter-end e.g. {{/foo}}
    [iter-end #"\A\{\{/\s*(\w+)\s*\}\}"]
    [(first iter-end) [:iter-end (last iter-end)]]
    ;; match iter-init with no args e.g. {{#foo}}
    [iter-init #"\A\{\{#\s*(\w+)\s*\}\}"]
    [(first iter-init) [:iter-init (last iter-init) :default]]
    ;; match iter-init with args e.g. {{#foo 'separator'}}
    [iter-init #"\A\{\{#\s*(\w+)\s+'([\s\S]*?)'\s*\}\}"]
    [(first iter-init) (cons :iter-init (rest iter-init))]
    ;; match literals
    [literal #"\A([\s\S][\s\S]*?)\{\{"]
    [(last literal) [:literal (last literal)]]
    ;; more literals
    [literal #"\A[\s\S]*"]
    [literal [:literal literal]]))
when-match, what is that?
You probably noticed the use of the when-match macro. The implementation
for the when-match macro can be found
here.
Example usage of when-match:
user=> (when-match "a long string"
#_=> [match #"long"] (keyword match)
#_=> [wont-match #"string"] (keyword wont-match))
:long
when-match takes a string as its first argument, followed by n pairs of forms.
The macro tries to match the regular expression in the first form. If a
match is found, the symbol preceding the regular expression is
bound (as if by let) and the second form is evaluated. Regular-expression
matches use re-find under the hood.
If there is no match then the same will be attempted with the next pair
of forms. I suggest you play around with when-match in your REPL to
get a solid understanding of how it works.
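If you are curious, here is a minimal sketch of how a when-match-like macro could be written. This is an assumption on my part, not gabo's actual code (that is linked above), and it re-evaluates the string expression once per clause:
;; Sketch: tries each [binding regex] pair in order with re-find and
;; evaluates the form following the first pair that matches; nil otherwise.
(defmacro when-match
  [string & clauses]
  (when (seq clauses)
    (let [[[sym regex] result & more] clauses]
      `(if-let [~sym (re-find ~regex ~string)]
         ~result
         (when-match ~string ~@more)))))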
Going back to the tokenize-chunk function: the implementation is very
simple once you understand when-match. Each regex matches one type
of token. When a token is matched, a pair of [string token] is
returned, where string is the actual string that was matched and token
is the corresponding token constructed from the matched string.
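This also explains the (first sym) and (last sym) calls above: when a regex contains capture groups, re-find returns the full match followed by the groups. For example, with the symbol regex from tokenize-chunk:
user=> (re-find #"\A\{\{\s*([\w-\.]+)\s*\}\}" "{{name}} rest")
["{{name}}" "name"]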
Example: tokenize-chunk at work.
user=> (tokenize-chunk "Hi {{name}}, how are you?")
["Hi " [:literal "Hi "]]
user=> (tokenize-chunk "{{name}}, how are you?")
["{{name}}" [:symbol "name"]]
user=> (tokenize-chunk ", how are you?")
[", how are you?" [:literal ", how are you?"]]Notice how this example tokenizes the "Hi {{name}}, how are
you?" by matching chunk by chunk to a token.
We are done with the lexical analysis step. Let’s go now to the parsing step.
Step 2: Parsing
Let us recall that the purpose of this step is to produce an abstract syntax tree, or AST. Think of an AST as a tree-like representation of our template, which we can later use to actually render templates.
Consider a template which renders a country and every city in the country. We will first tokenize it and then manually convert the tokens into an AST.
(def template "Countries with their cities:
{{#countries '\n'}}
Country: {{name}}
Cities:
{{#cities '\n'}}
 * {{name}}
{{/cities}}
{{/countries}}")
Tokenization is quite simple using our tokenize function:
user=> (tokenize template)
[[:literal "Countries with their cities:\n"]
[:iter-init "countries" "\n"]
[:literal "\nCountry: "]
[:symbol "name"]
[:literal "\nCities:\n"]
[:iter-init "cities" "\n"]
[:literal " * "]
[:symbol "name"]
[:iter-end "cities"]
[:literal "\n"]
 [:iter-end "countries"]]
We will now convert those tokens into a tree. The conversion uses these rules:
- :literal or :symbol tokens are left untouched.
- :iter-init and :iter-end tokens are grouped into an :iter node, and all tokens between the init/end pair are parsed recursively.
Example of an AST as produced by the parse function
user=> (parse template)
[[:literal "Countries with their cities:\n"]
[:iter "countries" "\n" [[:literal "\nCountry: "]
[:symbol "name"]
[:literal "\nCities:\n"]
[:iter "cities" "\n" [[:literal "\n * "]
[:symbol "name"]
[:literal "\n"]]]
[:literal "\n"]]]]Let’s now take a look at the parse function:
(defn parse
  [string]
  (-> (tokenize string)
      build-ast))
It doesn't do much by itself. Let's take a look at build-ast, which takes a
list of tokens as input and outputs an AST.
(defn- build-ast
  "Builds an abstract syntax tree given a list of tokens as produced by tokenize"
  [tokens]
  (loop [tokens tokens
         ast []]
    (if (empty? tokens)
      ast
      (let [token (first tokens)]
        (cond
          (or (is-literal token) (is-symbol token))
          (recur (rest tokens)
                 (conj ast token))

          (is-iter-init token)
          (let [sub-list (find-iter-sub-list tokens)
                [_ token-val separator] token]
            (recur (drop (+ 2 (count sub-list)) tokens)
                   (conj ast [:iter token-val
                              separator
                              (build-ast sub-list)])))

          :else
          (throw (unexpected-token-exception token)))))))
The build-ast function is a bit complex, but let's break it down into
a series of steps:
- It loops through all the tokens, accumulating the ast, which will be the result of the function.
- The if simply checks whether any tokens are left: when there are none, we are done, so we just return the ast.
- If there are tokens left, the let binds token to the first token in the iteration.
- The cond has 2 cases and an else:
  - Case 1, the "simple" case: if the token is a literal or a symbol, simply add it to the ast.
  - Case 2, the "hard" case, is evaluated when an :iter-init token is found. The find-iter-sub-list function returns all tokens inside the :iter-init/:iter-end pair (a sketch of a possible implementation follows this list). The recur call drops all tokens found in the sub-list, plus 2 for the :iter-init and :iter-end tokens themselves. The ast is then conjoined with an :iter node built from a recursive call of build-ast over the sub-list of tokens.
  - The else case: if anything other than an :iter-init, :symbol or :literal is found, throw an exception because we found an unexpected token.
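The listing above relies on find-iter-sub-list, which is not shown. Here is a minimal sketch of what it could look like; gabo's actual implementation may differ. Given tokens starting with an :iter-init, it returns the tokens strictly between that token and its matching :iter-end, honoring nesting:
(defn- find-iter-sub-list
  "Sketch: returns the tokens between the first token (an :iter-init)
  and its matching :iter-end, keeping nested iterations intact."
  [tokens]
  (loop [remaining (rest tokens) ;; skip the opening :iter-init
         depth 0
         acc []]
    (let [token (first remaining)]
      (cond
        (nil? token)
        (throw (ex-info "Unbalanced iteration tokens" {:tokens tokens}))

        (= :iter-init (first token))
        (recur (rest remaining) (inc depth) (conj acc token))

        (= :iter-end (first token))
        (if (zero? depth)
          acc
          (recur (rest remaining) (dec depth) (conj acc token)))

        :else
        (recur (rest remaining) depth (conj acc token))))))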
This finishes our explanation of the parsing step. I have tried to write the simplest possible parser; hopefully you agree. A more complex one would, for example, include better error-detection mechanisms.
Step 3: Rendering
We have arrived at the final step of our template rendering system. It
builds upon steps 1 and 2: it takes an AST and a context as
arguments and returns a rendered template. A context is simply a map
where every key is a keyword and every value is a number, a string, or a
collection of contexts.
;; The {{.}} refers to the item that is being iterated.
user=> (def tokens (parse "{{#numbers}}Number:{{.}}{{/numbers}}"))
user=> tokens
[[:iter "numbers" :default [[:literal "Number:"]
                            [:symbol "."]]]]
user=> (eval-tree tokens {:numbers [1 2 3 4 5]})
"Number:1,Number:2,Number:3,Number:4,Number:5"
Let us now look at the implementation of eval-tree:
(defn eval-tree
  "Evaluates a compiled template as a tree with the given context"
  [tree ctx]
  (cond (is-literal tree)
        (second tree)

        (is-symbol tree)
        (let [[_ token-val] tree]
          (if (= token-val ".")
            (str ctx)
            (get ctx (keyword token-val) "")))

        (is-iter tree)
        (let [[_ token-val separator sub-tree] tree
              coll (get ctx (keyword token-val) [])]
          (->> (map #(eval-tree sub-tree %) coll)
               (interpose (if (= :default separator) "," separator))
               (apply str)))

        ;; Handles a list of trees rather than a single tree: parse returns
        ;; a list of trees, and an :iter node's sub-tree is one as well.
        (coll? tree)
        (apply str (map #(eval-tree % ctx) tree))))
eval-tree runs through each node in the AST recursively, as follows:
- If a :literal token is found, return the token's value.
- If a :symbol is found, check whether its value is ".":
  - If token-val is ".", return the complete context as a string.
  - If token-val is not ".", look up token-val as a keyword in the context map and return the result.
- If an :iter tree is found:
  - let coll be the collection inside the context that matches the :iter's token-val,
  - recursively call eval-tree on the sub-tree for every element of coll,
  - interpose the results with the separator, which is "," by default,
  - finally, concatenate all "sub-templates" and separators.
- The final branch, (coll? tree), handles a list of trees rather than a single tree: it is taken the first time eval-tree is called, since ASTs produced by parse are not actually trees but lists of trees, and again whenever an :iter node evaluates its sub-tree against an item. It is equivalent to the previous branch, minus the interposing of separators.
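To see the separator handling in action, rendering with a custom separator should, given the interpose logic above, produce something like this:
user=> (eval-tree (parse "{{#numbers ' and '}}{{.}}{{/numbers}}")
                  {:numbers [1 2 3]})
"1 and 2 and 3"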
Wrapping things up
Now that we have implemented tokenize, parse and eval-tree, the
render function is very simple:
(defn render
  "Compiles and evaluates the template with the given context"
  [template ctx]
  (eval-tree (parse template) ctx))
Hooray!
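Putting it all together, a quick REPL check:
user=> (render "Hello {{name}}!" {:name "world"})
"Hello world!"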
I hope you now have a good understanding of how to write your own
template system. If you have any questions or suggestions, or just want to
chat, you can e-mail me at fernandohur at gmail.com.
Criticism is always welcome but go easy on me, it’s my first post ;)