Understanding the algorithm
core.match employs the algorithm described in Luc Maranget's paper [Compiling Pattern Matching to Good Decision Trees](http://www.cs.tufts.edu/~nr/cs257/archive/luc-maranget/jun08.pdf). What follows is a slower-paced description of the algorithm.
Consider the following pattern match:
(let [x true
      y true
      z true]
  (match [x y z]
    [_ false true] 1
    [false true _] 2
    [_ _ false] 3
    [_ _ true] 4
    :else 5))
It's clear that the above expression will return 4, as the values match the fourth clause.
As a visual aid we will show the pattern matrix as it goes through a series of transformations in the following form:
Fig. 1
x y z
-------
[_ f t] 1
[f t _] 2
[_ _ f] 3
[_ _ t] 4
[_ _ _] 5
Note that we've replaced true with t and false with f. Note that the final line is represented as a line of wildcards. Finally we label the columns with the variables (in this context we will use the word "occurrence") and number the rows.
Maranget's algorithm is based on a surprisingly simple insight. The semantics of pattern matching dictate that we evaluate the clauses one at a time from top to bottom. Each clause itself is evaluated left to right. Even so this gives us a surprising amount of leeway because of wildcards.
Observe that for the first line to succeed we need never consider x. This is an important observation! Maranget's algorithm produces optimal decision trees by finding the columns that must be tested and testing them first.
So how do we score a column? Wildcards obviously shouldn't add anything to a column's score. In addition, real patterns below a wildcard should also not add to a column's score. Given these simple rules, the matrix rewritten to show the scoring process looks something like this:
Fig. 2
x y z
-------
[0 1 1] 1
[0 1 0] 2
[0 0 0] 3
[0 0 0] 4
[0 0 0] 5
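To make the scoring rule concrete, here is a small Clojure sketch. The representation and the helper names (a matrix as a vector of rows, :_ standing for a wildcard, necessary-column, and so on) are assumptions for illustration only, not core.match's internals:

(defn column
  "Extract column i of the pattern matrix."
  [matrix i]
  (mapv #(nth % i) matrix))

(defn score-column
  "Count patterns from the top of a column, stopping at the first wildcard.
   Wildcards, and real patterns below a wildcard, contribute nothing."
  [col]
  (count (take-while #(not= :_ %) col)))

(defn necessary-column
  "Index of the highest scoring column - the one to test first."
  [matrix]
  (let [scores (map #(score-column (column matrix %))
                    (range (count (first matrix))))]
    (first (apply max-key second (map-indexed vector scores)))))

(necessary-column [[:_ :f :t]
                   [:f :t :_]
                   [:_ :_ :f]
                   [:_ :_ :t]
                   [:_ :_ :_]])
;; => 1, the y column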
It's now immediately clear that the y column should be tested first. Maranget's algorithm modifies the pattern matrix by swapping the column that should be tested so that it becomes the new first column, giving us:
Fig. 3
y x z
-------
[f _ t] 1
[t f _] 2
[_ _ f] 3
[_ _ t] 4
[_ _ _] 5
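The column swap itself is mechanical. A minimal sketch, with an assumed swap-to-front helper; the occurrences x, y, z are reordered in the same way:

(defn swap-to-front
  "Move the element at index i to the front of the row."
  [row i]
  (let [row (vec row)]
    (vec (cons (nth row i)
               (concat (subvec row 0 i) (subvec row (inc i)))))))

(mapv #(swap-to-front % 1)
      [[:_ :f :t] [:f :t :_] [:_ :_ :f] [:_ :_ :t] [:_ :_ :_]])
;; => [[:f :_ :t] [:t :f :_] [:_ :_ :f] [:_ :_ :t] [:_ :_ :_]], as in Fig. 3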
At this point we can generate the first switch of the decision tree. From this step we will compute two new pattern matrices: the specialized matrix, the matrix that remains after we've made a successful match, and the default matrix, the matrix that remains if we don't find a match.
The specialized matrix will look like the following:
Fig. 4
x z
-----
[_ t] 1
And the default matrix will look like the following:
Fig. 5
x z
-----
[f _] 2
[_ f] 3
[_ t] 4
[_ _] 5
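The split itself can be pictured with a simplified sketch that reproduces Figs. 4 and 5. The function names are assumptions and core.match's actual splitting is richer than this, but the shape of the two matrices is the same: rows whose first pattern is the tested value keep their remaining columns, while everything else falls into the default matrix.

(defn specialize
  "Rows whose first pattern is the tested value, with that column dropped."
  [matrix test]
  (->> matrix
       (filter #(= test (first %)))
       (mapv #(vec (rest %)))))

(defn default-matrix
  "The remaining rows, also with the first column dropped."
  [matrix test]
  (->> matrix
       (remove #(= test (first %)))
       (mapv #(vec (rest %)))))

(def fig3 [[:f :_ :t] [:t :f :_] [:_ :_ :f] [:_ :_ :t] [:_ :_ :_]])

(specialize fig3 :f)     ;; => [[:_ :t]], Fig. 4
(default-matrix fig3 :f) ;; => [[:f :_] [:_ :f] [:_ :t] [:_ :_]], Fig. 5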
At this point it should hopefully be clear that we can now recursively apply the above process of selecting the necessary column, and once again splitting between the specialized matrix and the default matrix.
For completeness we illustrate the remaining steps for Fig. 4. We pick the necessary column.
Fig. 6
z x
-----
[t _] 1
When we specialize this matrix on t we'll be left with:
Fig. 7
x
-----
[_] 1
Note that there is no default matrix and we are left with a pattern matrix that has only one row, a row of wildcards. We have arrived at a terminal node of the decision tree and should return 1.
Every time we specialize a pattern matrix we will produce a SwitchNode. This node will eventually be compiled down to a test and will dispatch either to the result of the specialized matrix or the default matrix. If we get to the case illustrated in Fig. 7 we will produce a LeafNode.
It is possible that we may arrive at a matrix which has no rows. This means no match exists, and for this case we will emit a FailNode, which will be compiled to an exception when we generate source.
The decision tree is a directed acyclic graph produced by the described algorithm and it will be composed of SwitchNodes, LeafNodes, and FailNodes.
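The three node types can be pictured as plain data. The constructor shapes and the emit function below are assumptions for illustration, not core.match's actual internals, but they show how a SwitchNode compiles down to a test:

(defn leaf-node [action] {:node :leaf :action action})
(defn fail-node [] {:node :fail})
(defn switch-node [occurrence test specialized default]
  {:node :switch :occurrence occurrence :test test
   :specialized specialized :default default})

(defn emit
  "Turn a decision tree into source: a LeafNode becomes its result, a
   FailNode an exception, and a SwitchNode an if dispatching between the
   specialized and default subtrees."
  [node]
  (case (:node node)
    :leaf   (:action node)
    :fail   `(throw (ex-info "No matching clause" {}))
    :switch `(if (= ~(:occurrence node) ~(:test node))
               ~(emit (:specialized node))
               ~(emit (:default node)))))

;; The subtree reached once y has tested false (Fig. 3) and z is tested (Fig. 6):
(emit (switch-node 'z true (leaf-node 1) (fail-node)))
;; => an if form that returns 1 when z is true and throws otherwise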
As Maranget shows in his paper - the simple idea of picking the necessary column produces compact decision trees.
Thus far we have only demonstrated matching on literals. Pattern matching derives most of its power and benefit when applied to data structures. Consider the following match expression where we match on a Clojure vector:
(let [v [1 2 4]
      x true]
  (match [v x]
    [[_ 2 2] true] 1
    [[_ 2 3] false] 2
    [[1 2 4] _] 3
    :else 4))
This can be represented with the following matrix:
Fig. 8
v x
-----------
[[_ 2 2] t] 1
[[_ 2 3] f] 2
[[1 2 4] _] 3
[ _ _] 4
How can we match the contents of the vector? We simply need to transform the matrix into a form where Maranget's algorithm can do its work. When we specialize in this case, we are not specializing on a particular value, but on whether the value to be tested is a vector of a certain size.
v in this case is already the necessary column so no swapping is required. The specialized matrix will look like this:
Fig. 9
v0 v1 v2 x
------------
[_ 2 2 t] 1
[_ 2 3 f] 2
[1 2 4 _] 3
And the default matrix will look like:
Fig. 10
v x
-----
[_ _] 4
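The transformation can be sketched as expanding the vector column into one column per element, as in Fig. 9. The helper name below is an assumption; core.match performs this expansion through its own pattern machinery. Rows whose pattern actually is a vector of the right size are expanded into the specialized matrix, while the wildcard row falls into the default matrix of Fig. 10:

(defn expand-vector-row
  "Replace the vector pattern at column i with its element patterns,
   keeping the remaining columns."
  [row i]
  (let [row (vec row)]
    (vec (concat (nth row i) (subvec row 0 i) (subvec row (inc i))))))

;; Row 1 of Fig. 8, with :_ standing for a wildcard:
(expand-vector-row [[:_ 2 2] true] 0)
;; => [:_ 2 2 true], the first row of Fig. 9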
So we can see that for complex patterns we simply try to convert the pattern matrix into a form such that Maranget's approach again applies. A quick glance at Fig. 9 shows that v1 will be the next necessary column.