
## Re: [Neurostat-develop] API proposal

From: Fabrice Rossi
Subject: Re: [Neurostat-develop] API proposal
Date: Fri, 22 Mar 2002 12:14:49 +0100

Ok, I'm replying to myself: Joseph and I have just started yet another discussion outside the list, and since I think we've hit on some important points, back to the list (and in English).

Obviously, I've got my own idea of backpropagation, mostly based on my thesis. It is not the only point of view, and as my API proposal is strongly biased toward my way of seeing things, I guess some formulae are needed.

To simplify the discussion, I will hide the layered structure behind a Directed Acyclic Graph (DAG) structure. The output of neuron j is denoted o^j. We have o^j = T(v^j), where v^j = sum_{i before j} w_{ji} o^i + w_{j0}. The DAG is hidden in the "before" predicate.
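To make the notation concrete, here is a minimal sketch of this forward pass in Python. The dict-based network representation and the function name are mine, just for illustration; they are not part of the proposed API:

```python
import math

def forward(inputs, preds, w, w0, T=math.tanh):
    """Forward pass on the DAG: o^j = T(v^j), with
    v^j = sum_{i before j} w_{ji} o^i + w_{j0}.
    'inputs' maps input-node names to their values (they play the role
    of o^i for edges leaving them); 'preds' maps each neuron, listed in
    topological order, to its predecessors (the "before" predicate)."""
    o = dict(inputs)              # input nodes: o^i is just the input value
    v = {}
    for j in preds:               # assumed given in topological order
        v[j] = w0[j] + sum(w[j, i] * o[i] for i in preds[j])
        o[j] = T(v[j])
    return o, v
```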

My way of backpropagating is to differentiate with respect to o^j. Assume we have an output function (a generalisation of an error function), i.e. a function of the output neurons, and let's denote it E. The most interesting derivatives are the dE/dw_{ji}. My way of calculating them is to write dE/dw_{ji} = dE/do^j do^j/dw_{ji}.

The second factor is a local derivative, and we obviously have do^j/dw_{ji} = o^i T'(v^j) (with the convention o^0 = 1 for the bias).

For the back propagation, we note that E depends on o^j only through the o^k where k is directly after j in the DAG. That is: dE/do^j = sum_{k after j} dE/do^k do^k/do^j.

As in the previous formula, the second factor is a local one, and we have do^k/do^j = w_{kj} T'(v^k).
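Continuing the toy dict representation from before, this back-propagation can be sketched as follows. The function names echo the proposal below but the signatures and representation are mine:

```python
def back_propagate(grad_out, succs, w, dact):
    """dE/do^j = sum_{k after j} dE/do^k * w_{kj} * T'(v^k).
    'grad_out' holds dE/do^k for the output neurons; 'succs[j]' lists
    the neurons directly after j; 'dact[k]' is T'(v^k)."""
    g = dict(grad_out)
    for j in reversed(list(succs)):    # reverse topological order
        if j not in g:                 # output neurons keep their dE/do^k
            g[j] = sum(g[k] * w[k, j] * dact[k] for k in succs[j])
    return g

def weights_derivative(g, preds, o, dact):
    """dE/dw_{ji} = dE/do^j * T'(v^j) * o^i."""
    return {(j, i): g[j] * dact[j] * o[i] for j in preds for i in preds[j]}
```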

We have of course special cases. For input nodes, local derivatives use input values rather than outputs of preceding nodes in the DAG. For output nodes, the back-propagated derivatives (i.e. the dE/do^k) are in fact the derivatives of the output function.

My idea for the API is to separate the calculation of back-propagated quantities from the calculation of actual derivatives (with respect to the weights, for instance). That is, the API should be used like this:

1) call propagate, so as to obtain output, act and dact. act contains the o^j (output is a subvector of act) and dact contains the T'(v^j)

2) call an output function, so as to obtain E and also dE/do^k for the output neurons (this is the error matrix needed by back_propagate)

3) call back_propagate, so as to obtain jacobian, which contains dE/do^j for all neurons. If you look at the formula, you see that the weights are needed, as well as dact (but not act)

4) call weights_derivative in order to calculate the dE/dw_{ji} (w_jacobian). The formulae show that you need jacobian, but also dact and act.

Alternatively, you can call inputs_derivatives, which is based on similar formulae restricted to input neurons. act is not needed for this calculation.
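The four-step call sequence can be sketched end-to-end on the smallest possible case, a single linear neuron o = w1*x + w0. The function names follow the proposal; everything else (signatures, the squared-error output function) is an assumption for illustration only:

```python
def propagate(w, x):
    """Step 1: forward pass. T is the identity here, so dact = T'(v) = 1."""
    v = w[1] * x + w[0]
    return v, v, 1.0              # output, act, dact (output == act here)

def output_function(output, target):
    """Step 2: squared error E and its derivative dE/do for the output neuron."""
    return 0.5 * (output - target) ** 2, output - target

def back_propagate(error, dact):
    """Step 3: with a single neuron there is nothing to propagate back."""
    return error

def weights_derivative(jacobian, dact, x):
    """Step 4: dE/dw_{ji} = dE/do^j * T'(v^j) * o^i, with o^0 = 1 for the bias."""
    return [jacobian * dact * 1.0, jacobian * dact * x]

w, x, target = [0.1, 0.5], 2.0, 0.0
output, act, dact = propagate(w, x)                 # 1) forward pass
E, error = output_function(output, target)          # 2) error and its gradient
jacobian = back_propagate(error, dact)              # 3) back-propagated quantities
w_jacobian = weights_derivative(jacobian, dact, x)  # 4) weight derivatives
```

The point of the sketch is the data flow: each step consumes only what the previous steps produced (dact from step 1, the error from step 2, the jacobian from step 3).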

There is another way of doing back-propagation (Joseph's way), which is specific to MLP neurons (clearly our first targets). Well, it's not exactly specific to this kind of neuron, but I think it works well only for this kind of neuron. The idea is to differentiate with respect to v^j rather than o^j. We therefore have:

dE/dw_{ji} = dE/dv^j dv^j/dw_{ji}
dv^j/dw_{ji} = o^i (again with o^0 = 1)
dE/dv^j = sum_{k after j} dE/dv^k dv^k/dv^j
dv^k/dv^j = w_{kj} T'(v^j)
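In the same toy style as before (names mine, not the proposed API), the second method looks like this; note that T'(v^j) factors out of the sum, and that the weight derivative no longer needs dact:

```python
def back_propagate_v(grad_out, succs, w, dact):
    """Second method: dE/dv^j = T'(v^j) * sum_{k after j} dE/dv^k * w_{kj}.
    'grad_out' holds dE/dv^k for the output neurons."""
    d = dict(grad_out)
    for j in reversed(list(succs)):    # reverse topological order
        if j not in d:                 # output neurons keep their dE/dv^k
            d[j] = dact[j] * sum(d[k] * w[k, j] for k in succs[j])
    return d

def weights_derivative_v(d, preds, o):
    """dE/dw_{ji} = dE/dv^j * o^i -- no T' term here."""
    return {(j, i): d[j] * o[i] for j in preds for i in preds[j]}
```

On a network with linear activations both methods give, as expected, identical weight derivatives.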

The derivatives with respect to the pre-activations v^j of the output neurons are equal to the derivatives of the output function when the last layer uses a linear activation function T. When that is not the case, additional calculations are needed.

Now, we need to compare both methods. Let us consider for instance a two-layer perceptron with a linear activation function in the last layer. As this is a very common case, I think we will have an optimized implementation for it, avoiding unnecessary multiplications. In this case, for output neurons, we have dE/dv^j = dE/do^j and this is a local derivative, so we only need to back-propagate to the first layer.

With the first method, we calculate dE/do^j = sum_{k in output layer} dE/do^k w_{kj} T'(v^k). But this formula can be simplified (and in fact it MUST be) into dE/do^j = sum_{k in output layer} dE/do^k w_{kj}, because T is here the activation function of the SECOND layer, which is linear. Then, we write dE/dw_{ji} = dE/do^j do^j/dw_{ji} = dE/do^j o^i T'(v^j) (again, T' is removed for the second-layer weights).

With the second method, we calculate dE/dv^j = sum_{k in output layer} dE/dv^k w_{kj} T'(v^j). The main difference between this formula and the previous one is that T is here the activation function of the FIRST layer and therefore cannot be removed from the calculation. Then, we write dE/dw_{ji} = dE/dv^j dv^j/dw_{ji} = dE/dv^j o^i.

For the second-layer derivatives with respect to the weights, the formulae behave the same. For the first layer, things are slightly more complex. In fact, the efficiency of the calculation is strongly related to how the actual formulae are implemented. With the second method, we have additional multiplications for each dE/dv^j (compared to dE/do^j): exactly one additional multiplication per first-layer neuron, because T'(v^j) can be factorized outside the sum. With the first method, we have additional multiplications for each do^j/dw_{ji} (compared to dv^j/dw_{ji}). Again, we can factorize the calculation so as to have exactly one additional multiplication per first-layer neuron, but at the expense of additional memory (to store dE/do^j T'(v^j) for each j in the first layer).

Basically, I think the situation is the same for multiple layers (ok, I admit I'm currently too lazy to perform the actual comparison). So the second approach is better than the first one because it needs less storage. We still have the problem of the derivatives with respect to the inputs, but in fact it is quite obvious that we just have to calculate dv^j/dx_k for the first-layer neurons, and this derivative is equal to w_{jk}, so no problem at all.
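In the same illustrative style (the name follows the proposal, the signature is mine), the input derivatives under the second method reduce to:

```python
def inputs_derivatives(d, first_layer, w, input_names):
    """dE/dx_k = sum_{j in first layer} dE/dv^j * w_{jk},
    since dv^j/dx_k = w_{jk}; neither act nor dact is needed."""
    return {k: sum(d[j] * w[j, k] for j in first_layer) for k in input_names}
```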

This has a small impact on the API: weights_derivative and inputs_derivatives do not need dact anymore. As a side note, I just noticed that both methods need the input (this was also the case with the first way of doing the back-prop).

Ok, I'm not sure everything is clear and error-proof, so...