From: Gordan Bobic
Subject: Re: [Help-gsl] Re: C/C++ speed optimization bible/resources/pointers needed, and about using GSL...
Date: Fri, 27 Jul 2007 11:31:16 +0100 (BST)

On Fri, 27 Jul 2007, Lionel B wrote:

I am in the middle of programming to solve an engineering problem where
speed is a huge concern. The project involves lots of numerical
integration, and then there are several loops/levels of optimization on
top of the function evaluation engine.

A few general rules I've found to help a lot:

- Don't use more precision than you need. Don't use a double if a float
will do. This is particularly important in code that the compiler can
vectorize. Even if your SIMD vector unit can handle doubles, it can
typically process twice as many floats as doubles for the same operation
in the same amount of time.

Can you provide evidence to back this up? Although it might once have
been true, my understanding is that on most modern (esp. 64 bit)
processors all floating point maths - vectorised or not - is likely to be
performed internally in double precision and that forcing single
precision may actually result in *slower* code...

Whether or not the internals work at 64 bits is totally immaterial. There are three aspects that are _FAR_ more important:

1) Your vector registers are only 128 bits wide. You can crunch 4x 32-bit floats or 2x 64-bit doubles in parallel. Regardless of what the FPU does internally, you get twice as many operations per instruction with floats as with doubles (see the sketch after this list).

2) Some CPUs can vectorize floats but not doubles (e.g. SSE1 on the P3 supports floats only), so double operations won't vectorize at all.

3) Data size. When your arrays no longer fit in the CPU cache(s), you have to make a round trip to RAM to fetch them, which involves wait states and latencies. On an ageing P3, this typically adds up to 20-50x slower performance. Fast RAM on newer machines helps, but RAM is always going to be an order of magnitude slower than cache. Floats are half the size of doubles, so you can work with twice as much data before you have to take a trip to RAM to fetch more.
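To make point 1) concrete, here is a minimal sketch using the SSE intrinsics directly (the function names are hypothetical, SSE2 is assumed for the double case, and n is assumed to be a multiple of the vector width):

#include <xmmintrin.h>  /* SSE1: packed single precision */
#include <emmintrin.h>  /* SSE2: packed double precision */

void mul_floats(const float *a, const float *b, float *c, int n)
{
        int i;

        for (i = 0; i < n; i += 4)      /* 4 floats per instruction */
        {
                __m128 va = _mm_loadu_ps(&a[i]);
                __m128 vb = _mm_loadu_ps(&b[i]);
                _mm_storeu_ps(&c[i], _mm_mul_ps(va, vb));
        }
}

void mul_doubles(const double *a, const double *b, double *c, int n)
{
        int i;

        for (i = 0; i < n; i += 2)      /* only 2 doubles per instruction */
        {
                __m128d va = _mm_loadu_pd(&a[i]);
                __m128d vb = _mm_loadu_pd(&b[i]);
                _mm_storeu_pd(&c[i], _mm_mul_pd(va, vb));
        }
}

Same number of instructions either way, but half as many elements per instruction in the double case.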


But since you asked, here's a quick and dirty benchmark:

All compiled on a 1.4 GHz P3 with 512KB of cache:
icc -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -msse -xK
-vec-report3 -fp-model fast=2 -o test test.c

#include <stdio.h>

int main ()
{
        const float foo = 29.123f;

        unsigned int    j;
        unsigned int    i;
        float           ii;
        float           Data[65536];

        for (j = 0; j < 100000; j++)
        {
                for (i = 0, ii = 0; i < 65536; i++)
                {
                        /* the int-to-float conversion of i is what
                           stops SSE1 from vectorizing this loop */
                        Data[i] = foo * i;
                        ii++;
                }
        }

        /* i == 65536 after the loop, so print the last valid element
           (this also stops the compiler optimizing the loops away) */
        printf("%f\n", Data[65535]);
        return 0;
}

Takes 39.52 seconds.

Same code but with Data[i] = foo * ii++; so that it vectorizes using SSE1 on the P3: 5.68 seconds - 7 _TIMES_ faster.
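That is, the inner loop becomes:

                for (i = 0, ii = 0; i < 65536; i++)
                {
                        Data[i] = foo * ii++;
                }

Since ii is already a float, there is no int-to-float conversion to stop the vectorizer.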

Finally, changing foo, ii and Data[] to double (which doesn't vectorize on the P3 at all): 85.72 seconds.

Using floats instead of doubles can lead to quite significant performance differences.

[...]

- Use a good compiler for your target platform. Sadly, GCC isn't great.
It's not even good. When it comes to tight vectorized loops, unless you
want to reach for hand-crafted assembler, I have seen performance boosts
of between 4.5x (on P3 / Core2) and 8.5x (on P4) when using Intel's ICC
compiler instead of GCC.

[...]

My personal experience (on linux x86_64) is that recent versions of GCC
(4.1.x, 4.2.x) have closed the gap quite a lot on ICC (9.x, 10.x) when it
comes to optimisation/vectorisation.

Not that I noticed. ICC 9.1 on x86-64 still makes my code run on the order of 4.5x faster than GCC's best efforts. For a start, GCC doesn't know how to vectorize anything but the simplest operations like adds and multiplies. Give it a trig function and things start to take a lot longer.
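The classic case is a loop like this (a hypothetical sketch; ICC vectorizes the sinf() call through its short vector math library, whereas GCC of this vintage falls back to one libm call per element):

#include <math.h>

void table_sin(const float *x, float *out, int n)
{
        int i;

        for (i = 0; i < n; i++)
                out[i] = sinf(x[i]);    /* one transcendental per element */
}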

Last example with doubles on a 1.9GHz Core2/x86-64:
ICC v9.1.051
icc -msse3 -xP -fp-model fast=2
(Note: this is optimizing for Core1 because ICC v9.1 doesn't have Core2 optimizations)
5.295 seconds (doubles vectorize on P4 and later CPUs)

GCC v4.1.1
gcc -march=nocona -mcpu=nocona -mmmx -msse3 -mfpmath=sse -mmni -m64
-floop-optimize2 -ftree-vectorize
(Note: optimizing for Nocona because there don't appear to be any Core2 flags, but still using SSE3 like ICC was told to)
37.26 seconds - again, about 7 times slower than ICC. If anything, the gap is now bigger than on the P3.

If you can provide an example of GCC being faster on anything like this order of magnitude, please do.

It also seems to depend *a lot* on
the style of code; as a rough generalisation, ICC, as you point out,
still seems to have a significant lead on vectorisable floating point-
intensive number-crunchers, while for heavy pointer-chasing with complex
data structures (lists, trees, maps, ...) there is less of a difference.

Sure, on non-vectorizable code there is less difference. But there is still a difference. The obvious example is the result caching I was describing in the other thread. Namely, if the L1 cache misses, the L2 cache has to try to catch the look-up. That L2 look-up is overhead. If that misses too, then the loop has to run to re-calculate the data.

If the said loop is vectorized, then the L2 cache look-up, with its pointer chasing, may well be more expensive than just re-crunching the data array. That is why ICC's performance advantage doesn't show as much with such caching.

On GCC, OTOH, with its poorer loop optimizations, the pointer chasing is more often cheaper than re-calculating the data. But that's not because GCC's pointer chasing is fast - it's because ICC's data crunching is even faster.

On my Intel machine (dual Xeon) ICC will typically outperform GCC on the
same number crunching code by a factor of 2 - 3, while on pointer-chasing
code there is frequently little difference.

Sure - it depends on what your code is doing. There is little to be saved on pointer chasing, because the CPU will be waiting for the data most of the time. That has nothing to do with compiler performance; it's down to the basic design of the computer the code is running on. Memory look-ups are slow. :-(

Oh, and on my AMD64 dual-core machine (with appropriate flags) there is
virtually *no* difference - if anything GCC fares better than ICC (ok,
maybe it's understandable that Intel are not going to be that bothered
about optimising for the competition...).

Don't AMD have their own compiler? I could have sworn I saw it, but I didn't look very hard because I don't have any AMD machines.

[BTW is it just me, or is ICC 10.0 kind of buggy? It seems to ICE on me
quite a lot.]

Yes, ICC v10 is quite badly broken. It doesn't vectorize a lot of code that 9.1 vectorizes, and it produces broken code in certain cases. Have a look on Intel's Software Community forum for more details. I don't use it at the moment.

[...]

Could anybody give some advice/pointers on how to improve the speed of a
C/C++ program? How to arrange code? How to make it highly efficient and
super fast? What options do I have if I don't have the luxury of
multi-threaded, multi-core or distributed computing? But I do have a P4
at least. Please recommend some good bibles and resources! Thank you!

On a P4, ICC will utterly annihilate GCC in terms of performance of the
resulting code, especially when it comes to heavy maths.

If you use (recent versions of) GCC make sure you use the -ftree-
vectorize flag.

Which does very little, as I demonstrated above. :-(
Don't get me wrong, I'm not knocking GCC. I'm just a little disappointed that 10 years after SSE was introduced we still don't have a vectorizer to make use of it. That's a long time.

Get a copy and try it. Enable -vec-report3 and see which of your loops
don't vectorize.

For GCC the equivalent is -ftree-vectorizer-verbose=1 (or 2,3,...).

I prefer level 5, but it makes me cry when I see that it cannot vectorize even very simple things. :'(

Where possible, re-arrange them so that they do vectorize. The compiler
often needs a hand with assorted little hacks to help it vectorize the
code, but they are generally not ugly, and are most of the time limited
to:

1) Copy the object property into a local variable. This will help
persuade the compiler that there is no vector dependence it needs to
worry about.

There is also the __restrict__ attribute (I think that works for both ICC
and GCC) to tell the compiler that there is no aliasing of an array (but
don't lie to your compiler!).

Indeed there is - I forgot about that. :-)
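For the record, here is a minimal sketch combining both tricks (the struct and function names are hypothetical):

struct state
{
        float   scale;
        int     n;
};

void apply(struct state *s, const float *__restrict__ in,
           float *__restrict__ out)
{
        /* copy the members into locals so the compiler doesn't have to
           assume that writing to out[] might change s->scale or s->n;
           __restrict__ promises that in and out don't overlap */
        const float     scale = s->scale;
        const int       n = s->n;
        int             i;

        for (i = 0; i < n; i++)
                out[i] = scale * in[i];
}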

[...]

4) Keep your data sizes in mind. If your frequently used data doesn't
fit in the CPU caches, you are likely to start experiencing slow-downs
on the order of 20x or so due to latencies. Use a float when you don't
need a double, as they are half the size.

Again, on modern (especially 64-bit) machines I suspect this may not be
good advice.

Can you demonstrate a case when taking a trip to RAM won't utterly kill performance compared to working in cache?
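To see it for yourself, here's a rough sketch (the sizes are hypothetical and the exact ratio depends on the hardware and its prefetcher; the idea is to do the same total number of element updates on a small array that fits in cache and a large one that doesn't):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* touch every element repeatedly, keeping the total number of
   updates constant regardless of the array size */
static double walk(float *data, size_t num, size_t touches)
{
        size_t  i, p, passes = touches / num;
        clock_t start = clock();

        for (p = 0; p < passes; p++)
                for (i = 0; i < num; i++)
                        data[i] += 1.0f;

        return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main ()
{
        size_t  touches = 1 << 28;
        /* 64K floats = 256KB (fits in a 512KB L2);
           16M floats = 64MB (nowhere near fitting) */
        float   *small = calloc(1 << 16, sizeof(float));
        float   *big   = calloc(1 << 24, sizeof(float));

        printf("in cache:     %.2fs\n", walk(small, 1 << 16, touches));
        printf("out of cache: %.2fs\n", walk(big,   1 << 24, touches));

        free(small);
        free(big);
        return 0;
}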

Also, I guess alignment may be worth bothering about too. There's
__attribute__((__aligned__)) which I think works on ICC as well as GCC.

Or on ICC, you can say -align -Zp16 to align all objects, structs and arrays to a 16-byte boundary. Be careful, though: some code breaks silently with this. I saw a version of rsync built that way that seemed to run but never did anything - it just sat there at 100% CPU usage. Use with care.
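For the attribute route, the sketch is one line (16 bytes matches the SSE register width):

/* align to a 16-byte boundary so the compiler can use aligned
   SSE loads/stores on this array */
float Data[65536] __attribute__((__aligned__(16)));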

5) Write the optimized code yourself. GSL and similar libraries are
great for a rapid proof of concept prototype, but there is a price to
pay in terms of performance when using generic code vs. bespoke code
specifically optimized for a particular task.

That has to be balanced against the possibility (certainly in my case)
that the library writer may be smarter than you... ;)

LOL! I have that problem, too. :-)

GSL seems to descend to the solution of my problem in fewer iterations. But even though my library does 2-3 times as many iterations, it runs about 2x as fast. Mind you, the algorithms are different: GSL uses gradient descent via partial derivatives, whereas I do a sampled, annealed steepest descent with caching (less sensitive to local minima on a jagged errorscape).
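For reference, the GSL side of that comparison is the multimin interface; a minimal sketch along the lines of the example in the GSL manual (it assumes you provide your own my_f/my_df/my_fdf callbacks in a gsl_multimin_function_fdf - see the manual for their signatures):

#include <gsl/gsl_errno.h>
#include <gsl/gsl_multimin.h>

/* run a conjugate gradient descent from starting point x and
   return the minimum value found */
double minimize(gsl_multimin_function_fdf *fdf, gsl_vector *x)
{
        gsl_multimin_fdfminimizer *s =
                gsl_multimin_fdfminimizer_alloc(
                        gsl_multimin_fdfminimizer_conjugate_fr, fdf->n);
        double  min;
        int     status, iter = 0;

        /* initial step size 0.01, line search tolerance 1e-4 */
        gsl_multimin_fdfminimizer_set(s, fdf, x, 0.01, 1e-4);

        do
        {
                iter++;
                status = gsl_multimin_fdfminimizer_iterate(s);
                if (status)
                        break;
                /* converged when the gradient norm drops below 1e-3 */
                status = gsl_multimin_test_gradient(s->gradient, 1e-3);
        } while (status == GSL_CONTINUE && iter < 1000);

        min = s->f;
        gsl_multimin_fdfminimizer_free(s);
        return min;
}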

6) Learn the compiler switches for your compiler. Test the accuracy of
your resulting numbers. When you start cutting corners (e.g.
"-ffast-math -mfpmath=sse,387" on GCC, "-fp-model fast=2 -rcd" on ICC)
you may get more speed, but sometimes the precision of your floats will
suffer. This may or may not be acceptable for your application.
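One quick way to test that accuracy is to carry a double-precision reference alongside the float computation and watch the relative error; a sketch (sqrtf here stands in for whatever your hot function computes - build it once with and once without the corner-cutting flags and compare):

#include <math.h>
#include <stdio.h>

int main ()
{
        double  worst = 0.0;
        int     i;

        for (i = 1; i <= 1000000; i++)
        {
                float   fast = sqrtf((float)i) / (float)i;
                double  ref  = sqrt((double)i) / (double)i;
                double  err  = fabs((double)fast - ref) / ref;

                if (err > worst)
                        worst = err;
        }

        printf("worst relative error: %g\n", worst);
        return 0;
}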

[...]

There are hundreds of little hacks you can do to speed your code up. It
is impossible to summarize them all; they differ from project to
project, and they won't all be appropriate all the time. I hope this
gets you started on the right path, though. :-)

But be aware too that "micro optimisation" at the level we are talking
about here can often be quite unrewarding in terms of the speed-up vs.
effort/code transparency trade-off.

That depends. If they are in the inner loops and frequently travelled paths, the micro-optimizations add up. Things like computing 1 / divisor once and then multiplying instead of dividing in the loops make a massive difference.
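As a sketch of that particular trick (hypothetical function; one division up front replaces n divisions inside the loop, and multiplication is far cheaper than division on most FPUs):

void normalize(float *data, int n, float divisor)
{
        const float     recip = 1.0f / divisor;
        int             i;

        for (i = 0; i < n; i++)
                data[i] *= recip;       /* was: data[i] /= divisor */
}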

Ultimately the best optimisation technique is to use the most efficient
algorithm for your problem.

Never have truer words been spoken! :-)

Another part of the problem is that students at universities these days are taught that "the compiler will optimize for you". I suffered the same
lectures on compilers when doing my degree. Unfortunately, while this may
be _partially_ true in theory, it is mostly wrong in practice. Even when code is compiled with the latest and greatest optimizing compilers, the code "hacks" still frequently make a huge difference. And the differences get bigger as the hardware gets more embedded.

Gordan



