
Re: [Help-gsl] gsl performance


From: onefire
Subject: Re: [Help-gsl] gsl performance
Date: Tue, 8 Oct 2013 17:27:10 -0400

I know that it's all "just RAM". What I wanted to say (but did not) was
that I've seen code become much faster when things are allocated on the
stack, but that is because compilers can sometimes do better "magic" in
such cases.

I did measure running time with a loop that reuses a minimizer, and the
results were nothing to write home about. I also tried reusing the vectors
of guesses and step sizes, and it made very little difference. I modified
GSL's code to use a static array as a buffer for the internal objects (I
did this in a hurry and probably made mistakes) and got only very small
improvements. Thus I think that allocation is not the main issue (though
my library becomes very slow when I use gsl_vectors, and I have no idea
why...).
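
For concreteness, here is roughly the reuse pattern I benchmarked (a
sketch, not my actual code; my_f, x0, step, nproblems and the tolerances
are placeholders):

#include <gsl/gsl_errno.h>
#include <gsl/gsl_multimin.h>

/* Allocate the minimizer workspace once, then reuse it for many
 * problems by calling gsl_multimin_fminimizer_set() each time. */
void solve_many(gsl_multimin_function *my_f, size_t n,
                gsl_vector *x0, gsl_vector *step, int nproblems)
{
    gsl_multimin_fminimizer *s =
        gsl_multimin_fminimizer_alloc(gsl_multimin_fminimizer_nmsimplex2, n);

    for (int p = 0; p < nproblems; p++) {
        /* rebind the same workspace; no malloc/free per problem */
        gsl_multimin_fminimizer_set(s, my_f, x0, step);
        for (int iter = 0; iter < 100; iter++) {
            if (gsl_multimin_fminimizer_iterate(s) != GSL_SUCCESS)
                break;
            double size = gsl_multimin_fminimizer_size(s);
            if (gsl_multimin_test_size(size, 1e-8) == GSL_SUCCESS)
                break;
        }
    }
    gsl_multimin_fminimizer_free(s);
}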

However, I did make some progress! gprof told me that my program was
spending almost 50% of its time in nmsimplex_size, the function that
computes the "size" of the simplex (a measure of how close we are to the
solution; it is computed on every iteration). It also seemed that a lot of
that time was spent in gsl_blas_dnrm2 and gsl_blas_daxpy. These are fairly
simple BLAS routines, and I did try three different implementations
(gslcblas, MKL and OpenBLAS), so how could they be the problem? Simple:
they were barely doing any work, and they were not inlined.
For example, I was minimizing a function of three variables. As far as I
understand, the call to gsl_blas_daxpy (here with alpha = -1) would do
nothing more than:
y[0] += -x[0];
y[1] += -x[1];
y[2] += -x[2];

I don't see how any cblas implementation could help here. So I wrote
inline versions of the two problematic functions and was able to reduce
the running time by about 25%.
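
Roughly like this (a sketch of the idea rather than my exact code; it
assumes stride-1 vectors, whereas the real GSL/BLAS routines also handle
strides, and the real dnrm2 additionally rescales to avoid overflow):

#include <math.h>
#include <gsl/gsl_vector.h>

/* For tiny n, a plain loop the compiler can inline and unroll beats
 * a call into a tuned BLAS library.  Assumes v->stride == 1. */
static inline void my_daxpy(double alpha, const gsl_vector *x, gsl_vector *y)
{
    for (size_t i = 0; i < x->size; i++)
        y->data[i] += alpha * x->data[i];
}

static inline double my_nrm2(const gsl_vector *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < x->size; i++)
        sum += x->data[i] * x->data[i];
    return sqrt(sum);
}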

But there's more. gsl_matrix_get_row and gsl_matrix_set_row were also
being called an insane number of times, so I wrote inline versions of
those as well and gained a further 10% (approximately).
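
The same trick, sketched (again assuming a stride-1 destination vector,
and skipping the bounds checks that the real functions perform):

#include <string.h>
#include <gsl/gsl_matrix.h>

/* Rows of a gsl_matrix are contiguous runs of m->size2 doubles
 * starting at m->data + i * m->tda, so a row copy is one memcpy. */
static inline void my_get_row(gsl_vector *v, const gsl_matrix *m, size_t i)
{
    memcpy(v->data, m->data + i * m->tda, m->size2 * sizeof(double));
}

static inline void my_set_row(gsl_matrix *m, size_t i, const gsl_vector *v)
{
    memcpy(m->data + i * m->tda, v->data, m->size2 * sizeof(double));
}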

So in the end the running time was cut almost in half. The profiler's
output suggests that inlining gsl_vector_min_index may also improve
performance noticeably (though probably by less than 10%), but I haven't
tried that.
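
If anyone wants to try it, the obvious inline replacement would be
something like this (a sketch; assumes stride-1 data and no NaNs, unlike
the real gsl_vector_min_index):

#include <gsl/gsl_vector.h>

/* Index of the smallest element; assumes v->stride == 1. */
static inline size_t my_min_index(const gsl_vector *v)
{
    size_t imin = 0;
    for (size_t i = 1; i < v->size; i++)
        if (v->data[i] < v->data[imin])
            imin = i;
    return imin;
}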

Gilberto

PS: Sam, when you mentioned Fortran optimizations, were you talking about
pointer aliasing or something else? I have never seen an example of
Fortran being significantly faster than C or C++. In the end the compiler
makes much more difference than the language... By the way, I am not a
Fortran programmer (but I have nothing against it, I just prefer other
languages).


On Tue, Oct 8, 2013 at 6:09 AM, Sam Mason <address@hidden> wrote:

> On 7 October 2013 22:02, onefire <address@hidden> wrote:
> > Sam Mason wrote:
> > "I tried this technique with the ODE solvers in the GSL and it gave me
> > about 5% overall performance improvement so I dropped it from my code,
> > it was a fiddle to maintain and the user would barely notice the
> > difference.  I was doing quite a lot of other things, so maybe if your
> > overall time is dominated by malloc/free it may help."
> >
> > I am not surprised by your results because, contrary to what my previous
> > messages might suggest, I think that the main problem is not the
> allocations
> > themselves but memory location. At least for certain problems, the
> machine
> > is just much more efficient at accessing stack memory.
>
> No, all accesses that actually hit RAM (i.e. compiler optimisations
> and caches are important) will go at about the same speed.  I've seen
> comments like this before from users of Fortran; however, Fortran has
> better-defined semantics that allow its compilers to do more aggressive
> optimization than a C compiler could.  This often gives the impression
> of a speed difference between the stack and the heap, but to the CPU
> it's all just RAM.
>
> I'd recommend benchmarking with a profiler; it shouldn't take long and
> may help you understand where the CPU is spending its time.  I've just
> tried testing your hypothesis above and get statistically
> indistinguishable results for the stack and the heap across a variety
> of block sizes that fit within L1 cache, fit within L2 cache, or fall
> through to RAM.
>
>   Sam
>

