I had a serendipitous encounter on a plane with a professor returning from SC17, and we ended up talking about how backprop can map onto hardware.
This motivated me to write up a simple example of backprop that captures its similarity to a systolic array architecture.
Suppose you have a multi-layer architecture with n = 3 layers, operating on vectors with m dimensions. Your predictions can then be written as follows:
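The original shows this as an image; with X the input and f the function applied at each layer, a form consistent with the rest of the writeup is:

```latex
Y = f(f(f(X)))
```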
We feed the predictions Y into some loss function l to compute the objective to optimize. To save on brackets, write the objective L using the function composition operator:
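The original shows this as an image; in composition notation it reads:

```latex
L = (l \circ f \circ f \circ f)(X)
```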
Using the chain rule, the derivative of the objective can be written in terms of matrix multiplication as follows:
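The original equation is an image; a form consistent with the composition above, with each Jacobian evaluated at the corresponding intermediate value, is:

```latex
\frac{dL}{dX}
= \frac{\partial l}{\partial Y}\bigg|_{f(f(f(X)))}
  \,\frac{\partial f}{\partial x}\bigg|_{f(f(X))}
  \,\frac{\partial f}{\partial x}\bigg|_{f(X)}
  \,\frac{\partial f}{\partial x}\bigg|_{X}
```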
Each term in this chain product is a Jacobian matrix. To make things concrete, suppose we have the following loss function l:
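The original defines l in an image; a choice consistent with the later claim that the loss Jacobian "simply transposes its input" is half the squared norm:

```latex
l(Y) = \tfrac{1}{2}\, Y^{\top} Y = \tfrac{1}{2} \sum_{i=1}^{m} y_i^2
```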
and our layer function f simply squares individual components:
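The original defines f in an image; componentwise squaring reads:

```latex
f(x) = \begin{pmatrix} x_1^2 \\ \vdots \\ x_m^2 \end{pmatrix}
```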
Their Jacobians can be written as follows:
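The original shows the Jacobians as an image; for the l and f assumed above they are:

```latex
\frac{\partial l}{\partial Y} = Y^{\top},
\qquad
\frac{\partial f}{\partial x}
= \operatorname{diag}(2x_1, \ldots, 2x_m)
= \begin{pmatrix} 2x_1 & & \\ & \ddots & \\ & & 2x_m \end{pmatrix}
```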
To compute the derivative of the objective, we multiply these Jacobian matrices together. Our chain matrix product can be visualized as below:
The first thing to note is that the order in which you compute this chain matrix product matters for efficiency. Finding the most efficient order is known as the Matrix Chain Ordering Problem. In the case above, multiplying left to right is efficient, because each step is then a cheap vector-matrix product rather than a matrix-matrix product.
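To see the cost gap concretely, here is a small NumPy sketch; the size m = 100 and the random dense Jacobians are illustrative stand-ins, not values from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 1 x m loss Jacobian (row vector) followed by
# three m x m layer Jacobians.
m = 100
v = rng.standard_normal((1, m))
J1, J2, J3 = (rng.standard_normal((m, m)) for _ in range(3))

# Left to right: every step is a (1 x m) @ (m x m) vector-matrix product.
left = ((v @ J1) @ J2) @ J3          # ~ 3 * m^2 multiplications

# Right to left: the first steps are full m x m matrix-matrix products.
right = v @ (J1 @ (J2 @ J3))         # ~ 2 * m^3 + m^2 multiplications

assert np.allclose(left, right)      # same result, very different cost
print(3 * m**2, 2 * m**3 + m**2)     # 30000 vs 2010000 multiplications
```

The two orders give the same answer, but at m = 100 the left-to-right order already does roughly 67x fewer multiplications, and the gap grows linearly with m.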
The second thing to note is that you can compute these vector-matrix products without ever constructing the Jacobian matrix explicitly. This is important because m (the size of your representation) can be on the order of tens of thousands, which means the full Jacobian could have billions of entries. In our case, because the Jacobian matrix is diagonal, the vector-Jacobian product is equivalent to pointwise multiplication of vectors, so we can avoid constructing the full m × m Jacobian matrix.
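As a sketch of this shortcut, assuming the squaring layer f(x) = x² from above, the two routes agree while only one ever materializes an m × m matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
x = rng.standard_normal(m)   # point at which we evaluate the layer f(x) = x**2
v = rng.standard_normal(m)   # incoming backward (row) vector

# Explicit route: build the full m x m Jacobian of f and multiply.
J = np.diag(2 * x)
explicit = v @ J

# Efficient route: the Jacobian is diagonal, so the vector-Jacobian
# product is just pointwise multiplication -- no m x m matrix is formed.
pointwise = v * (2 * x)

assert np.allclose(explicit, pointwise)
```

The pointwise route costs O(m) work and memory instead of O(m²), which is what makes backprop feasible at m in the tens of thousands.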
A typical layer function used in a neural network admits a similarly efficient implementation of this operation. This vector-Jacobian product is the key operation of any backprop implementation. Theano calls it “Lop” (left operator), in PyTorch it’s the “backward” method, and TensorFlow calls it “grad” or “grad_func”.
To simplify notation, label the intermediate values produced by our computation, aka activations, as “A”:
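The original shows this labeling as an image; indexing the activations from the input gives:

```latex
A_0 = X, \qquad A_1 = f(A_0), \qquad A_2 = f(A_1), \qquad A_3 = f(A_2) = Y
```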
Using this, we can write our target derivative as follows:
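The original equation is an image; rewriting the chain rule in terms of the activations gives:

```latex
\frac{dL}{dX}
= \frac{\partial l}{\partial Y}(A_3)
  \,\frac{\partial f}{\partial x}(A_2)
  \,\frac{\partial f}{\partial x}(A_1)
  \,\frac{\partial f}{\partial x}(A_0)
```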
Note that our loss Jacobian simply transposes its input, so we get:
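The original equation is an image; substituting the Jacobians assumed above gives:

```latex
\frac{dL}{dX}
= A_3^{\top}
  \,\operatorname{diag}(2A_2)
  \,\operatorname{diag}(2A_1)
  \,\operatorname{diag}(2A_0)
```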
Let’s define our first backward value as
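The original definition is an image; consistent with the loss Jacobian above (keeping the backward values as plain vectors and deferring the transpose to the end), it reads:

```latex
b_3 = A_3
```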
Now we can write our computation as a sequence of steps:
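The original steps are shown as images; because each Jacobian is diagonal, every step reduces to a pointwise product:

```latex
b_2 = b_3 \circ 2A_2, \qquad
b_1 = b_2 \circ 2A_1, \qquad
b_0 = b_1 \circ 2A_0, \qquad
\frac{dL}{dX} = b_0^{\top}
```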
Each transformation in this series of steps is a backprop step. We can write this series down formally by adding some notation: let b denote our vector-Jacobian product (aka backward(), left operator, grad_func), and use the Hadamard product symbol to denote pointwise multiplication:
We can finally write our forward/backward equations explicitly:
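The forward/backward recurrences can be sketched end-to-end in NumPy; the function names and the finite-difference check are illustrative additions, and the loss is assumed to be l(Y) = ½YᵀY as above:

```python
import numpy as np

def f(x):
    """Layer function: square each component."""
    return x ** 2

def b(x, v):
    """Vector-Jacobian product of f at x: the Jacobian is diag(2x),
    so v @ diag(2x) is just pointwise multiplication."""
    return v * (2 * x)

def forward_backward(X, n=3):
    # Forward pass: A_0 = X, A_{i+1} = f(A_i).
    A = [X]
    for _ in range(n):
        A.append(f(A[-1]))
    # Backward pass: b_n = A_n (the Jacobian of l(Y) = 0.5 * Y.T @ Y
    # transposes its input), then b_i = b_{i+1} ∘ 2 A_i.
    bvec = A[-1]
    for i in reversed(range(n)):
        bvec = b(A[i], bvec)
    return A[-1], bvec  # predictions Y and the gradient dL/dX

# Sanity check against a numerical finite-difference gradient.
X = np.array([0.9, 1.1, 0.8])
Y, grad = forward_backward(X)

def loss(X):
    Yv = X
    for _ in range(3):
        Yv = f(Yv)
    return 0.5 * Yv @ Yv

eps = 1e-6
numeric = np.array([(loss(X + eps * e) - loss(X - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad, numeric, rtol=1e-4)
```

Note that the backward loop touches each activation exactly once in reverse order, which is the access pattern that the systolic-array picture below exploits.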
The resulting computation graph can now be visualized:
Redrawing this graph more compactly:
If you look up a diagram of a 2D systolic array, e.g. here, you’ll see that this kind of hardware architecture is a perfect fit.