Re: [Neurostat-develop] first ideas...
Mon, 17 Dec 2001 15:10:08 +0100
Joseph Rynkiewicz wrote:
> >I agree that the speed up is not so important, but still for massive
> >applications such as genetic optimization of the network architecture, even
> >a 20% speed up is important. Nevertheless, I don't think there is any
> >problem in working both with a dense storage and with a sparse one, mainly
> >thanks to the "object" approach.
> I have used this approach in my C++ program (hence, with an "object"
> approach). I wrote an object layer with both a sparse representation and a
> dense one. The dense representation was used for the propagation (and the
> back-propagation) if the proportion of zeros in the matrix was less than 20%.
> I can say that 99% of the vicious bugs were caused by this ambiguous
> representation.
> Namely, the BIG problem is the synchronization of the two representations.
> If you update the weights of the MLP, you have to copy these new parameters
> into the dense representation
> (for the sparse representation this is automatic, since it points directly
> to the weights).
> I know that it seems obvious never to forget to call a synchronization
> routine as soon as you update the weights, but one day, in a
> complex optimization algorithm in a mixture-of-experts model, you will
> forget to do it, and you will spend one month tracking down the bug...
> Because I have already used this optimization "technique", I really think
> that it is a not so small overhead, for a very small gain.
> To reinforce my opinion, I will cite God:
> "We should forget about small efficiencies, say about 97% of the time:
> premature optimization is the root of all evil." -- Donald Knuth
> At least we can keep for later this double representation.
> (Yes, I know I seem to have an obsession with sparse matrices, but I have
> already written a prunable MLP; it is really not so obvious to manage the
> holes in the architecture, and I think that a sparse notation like
> compressed row or column storage helps a lot.)
My post was not very clear, so let me rephrase. I completely agree with most of
what you've answered, basically because I don't think having both
representations AT THE SAME TIME is a good idea. My idea was simply to assume
that when the network is dense, it uses a dense representation, whereas it
uses a sparse representation when it is not. The idea is simply to have
a union (or something similar). At the beginning of the (back)propagation
function, you switch on the type of the MLP to call either a sparse calculation
or a dense one. You have only one representation at a given time. When you
prune the network, the representation can change, but you NEVER use a double
representation. My view is that pruning is useful, but soft pruning
(regularization) is also useful. So I think we should provide room for an
optimized sparse MLP as well as for an optimized dense MLP.
> >Ok, so here is how I see a possible header:
> >int propagate(MLP *m, double *param, int param_size, double *input, int
> >in_size, double *result, int out_size)
> >The return value is an error code or something similar (maybe it is not
> >needed). (double *, int) pairs are "vectors". Everything else is embedded in
> >the MLP struct. It contains of course a description of the architecture,
> >which is basically a way to interpret the parameter vector (i.e. param),
> >which can contain dense matrices and vectors, or sparse ones. We also embed
> >in the struct all the needed caches (pre-output of each layer, etc.).
> >I don't know if we really need to have all the sizes, because they are
> >already in the MLP struct. But it allows at least to test for adequacy.
> >Maybe it is not needed in the low-level API.
> I think that we can embed the param in the MLP too,
> essentially because a lot of decisions about this parameter vector are made
> inside the MLP.
> For example, when you decide to prune one weight, you have to verify the
> architecture of the MLP (to seek unused hidden units) and recalculate the
> architecture, and hence the parameter dimension. Then you have to construct a
> new parameter vector, because some coefficients may have become useless.
> This interaction between the architecture and the parameters advocates for
> the integration of the parameters inside the MLP.
I don't think so. I completely agree that for MLP-related functions, there is
a strong interaction between architecture and weights. But this is not the
case during gradient descent. If we want to be able to use normal
multidimensional minimizers for training, I think we should keep things
separated. Indeed, the classical way to represent a function in C is to have a
struct with an eval function pointer and a void * params. The evaluation type
is something like this:
double (* f) (const double * x, const int n, void * params)
The void * is a placeholder for any parameters needed by the function. The
rationale behind this kind of representation is that the function does not deal
with memory-related issues. It is given an input vector (i.e., const
double * x, const int n) and returns a double value.
The easiest way to translate an MLP into this kind of representation is to use
the MLP struct (as well as the training data and the error function) as the
parameters. The input vector (which is at this point the parameter vector of
the MLP) is not included into the params. If you put the parameter vector
inside the MLP, you run into endless memory-management problems. When do you
decide whether or not to trust the pointer? If it has been submitted by a minimization
algorithm, how long does it remain valid? If you cannot keep it (because it
might be freed by its owner), what is the point in storing it inside the MLP?
I think that the training of an MLP should use the following algorithm:
1) create an MLP and an initial weight vector w
2) reduce the modelling error with a gradient descent algorithm starting from w
3) modify the MLP architecture using w_opt, the result of the gradient descent
-> you obtain a new w (possibly smaller, maybe sparse)
4) go back to 2
During step 2, you don't care about the MLP architecture; you don't even know
you are working with an MLP. And the gradient descent algorithm does
whatever is needed to w (allocation, freeing, etc.).
At the end of step 2, you end up with a new optimized parameter vector, which
can be used by step 3. There is no reason for step 3 to keep this vector. It
can be replaced by a sparse one, a smaller one, etc.
I'm not saying that I don't want to use specifically designed MLP training
algorithms (in fact I don't, but I don't want to stop other people from using
such things), but I don't see any problem with separating architecture and
numerical parameters, whereas I do see problems with a mixed representation.
> Finally, the sizes can be kept, and we can test the adequacy of the sizes
> for debugging purposes
> (with #ifdef DEBUG ... #endif).
Right, but I still wonder whether it's needed. I mean that the sizes are
already specified in the MLP struct...