From: Gordan Bobic
Subject: Re: [Help-gsl] Re: C/C++ speed optimization bible/resources/pointers needed, and about using GSL...
Date: Fri, 27 Jul 2007 11:31:16 +0100 (BST)

On Fri, 27 Jul 2007, Lionel B wrote:

I am in the middle of programming to solve an engineering problem where
speed is a huge concern. The project involves lots of numerical
integration, and then there are several loops/levels of optimization on
top of the function evaluation engine.

A few general rules I've found to help a lot:

- Don't use more precision than you need. Don't use a double if a float
will do. This is particularly important in code that the compiler can
vectorize. Even if your SIMD vector unit can handle doubles, it can
typically process twice as many floats as doubles for the same operation
in the same amount of time.

Can you provide evidence to back this up? Although it might once have
been true, my understanding is that on most modern (esp. 64 bit)
processors all floating point maths - vectorised or not - is likely to be
performed internally in double precision and that forcing single
precision may actually result in *slower* code...

Whether or not the internals work at 64 bits is totally immaterial. There are three aspects that are _FAR_ more important:

1) Your vector registers are only 128 bits wide. You can crunch 4x 32-bit floats or 2x 64-bit doubles in parallel. Regardless of what the FPU does internally, you get twice as many operations per instruction with floats as with doubles (see the sketch after this list).

2) Some CPUs can vectorize floats but not doubles (e.g. SSE1 on the P3 supports floats only), so double operations won't vectorize at all.

3) Data size. When your arrays no longer fit in the CPU cache(s), you have to make a round trip to RAM to fetch them, which involves wait states and latencies. On an ageing P3, this typically adds up to 20-50x slower performance. Fast RAM on newer machines helps, but RAM is always going to be an order of magnitude slower than cache. Floats are half the size of doubles, so you can work with twice as much data before you have to take a trip to RAM to fetch more.
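To make point 1) concrete, here is a minimal sketch using the SSE intrinsics directly (the function names are hypothetical, SSE2 is assumed for the double case, and n is assumed to be a multiple of the vector width):

#include <xmmintrin.h>  /* SSE1: packed single precision */
#include <emmintrin.h>  /* SSE2: packed double precision */

void mul_floats(const float *a, const float *b, float *c, int n)
{
        int i;

        for (i = 0; i < n; i += 4)      /* 4 floats per instruction */
        {
                __m128 va = _mm_loadu_ps(&a[i]);
                __m128 vb = _mm_loadu_ps(&b[i]);
                _mm_storeu_ps(&c[i], _mm_mul_ps(va, vb));
        }
}

void mul_doubles(const double *a, const double *b, double *c, int n)
{
        int i;

        for (i = 0; i < n; i += 2)      /* only 2 doubles per instruction */
        {
                __m128d va = _mm_loadu_pd(&a[i]);
                __m128d vb = _mm_loadu_pd(&b[i]);
                _mm_storeu_pd(&c[i], _mm_mul_pd(va, vb));
        }
}

Same number of instructions either way, but half as many elements per instruction in the double case.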


But since you asked, here's a quick and dirty benchmark:

All compiled on a 1.4 GHz P3 with 512KB of cache:
icc -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -msse -xK
-vec-report3 -fp-model fast=2 -o test test.c

#include <stdio.h>

int main ()
{
        const float foo = 29.123f;

        unsigned int    j;
        unsigned int    i;
        float           ii;
        float           Data[65536];

        for (j = 0; j < 100000; j++)
        {
                for (i = 0, ii = 0; i < 65536; i++)
                {
                        /* the int-to-float conversion of i is what
                           stops SSE1 from vectorizing this loop */
                        Data[i] = foo * i;
                        ii++;
                }
        }

        /* i == 65536 after the loop, so print the last valid element
           (this also stops the compiler optimizing the loops away) */
        printf("%f\n", Data[65535]);
        return 0;
}

Takes 39.52 seconds.

Same code but with Data[i] = foo * ii++; so that it vectorizes using SSE1 on the P3: 5.68 seconds - 7 _TIMES_ faster.
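That is, the inner loop becomes:

                for (i = 0, ii = 0; i < 65536; i++)
                {
                        Data[i] = foo * ii++;
                }

Since ii is already a float, there is no int-to-float conversion to stop the vectorizer.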

Finally, changing foo, ii and Data[] to double (which doesn't vectorize on the P3 at all): 85.72 seconds.

Using floats instead of doubles can lead to quite significant performance differences.

[...]

- Use a good compiler for your target platform. Sadly, GCC isn't great.
It's not even good. When it comes to tight vectorized loops, unless you
want to reach for hand-crafted assembler, I have seen performance boosts
of between 4.5x (on P3 / Core2) and 8.5x (on P4) when using Intel's ICC
compiler instead of GCC.

[...]

My personal experience (on linux x86_64) is that recent versions of GCC
(4.1.x, 4.2.x) have closed the gap quite a lot on ICC (9.x, 10.x) when it
comes to optimisation/vectorisation.

Not that I noticed. ICC 9.1 on x86-64 still makes my code run on the order of 4.5x faster than GCC's best efforts. For a start, GCC doesn't know how to vectorize anything but the simplest operations like adds and multiplies. Give it a trig function and things start to take a lot longer.
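The classic case is a loop like this (a hypothetical sketch; ICC vectorizes the sinf() call through its short vector math library, whereas GCC of this vintage falls back to one libm call per element):

#include <math.h>

void table_sin(const float *x, float *out, int n)
{
        int i;

        for (i = 0; i < n; i++)
                out[i] = sinf(x[i]);    /* one transcendental per element */
}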

Last example with doubles on a 1.9GHz Core2/x86-64:
ICC v9.1.051
icc -msse3 -xP -fp-model fast=2
(Note: this is optimizing for Core1 because ICC v9.1 doesn't have Core2 optimizations)
5.295 seconds (doubles vectorize on P4 and later CPUs)

GCC v4.1.1
gcc -march=nocona -mcpu=nocona -mmmx -msse3 -mfpmath=sse -mmni -m64
-floop-optimize2 -ftree-vectorize
(Note: optimizing for Nocona because there don't appear to be any Core2 flags, but still using SSE3 like ICC was told to)
37.26 seconds - again, about 7 times slower than ICC. If anything, the gap is now bigger than on the P3.

If you can provide an example of GCC being faster on anything like this order of magnitude, please do.

It also seems to depend *a lot* on
the style of code; as a rough generalisation, ICC, as you point out,
still seems to have a significant lead on vectorisable floating point-
intensive number-crunchers, while for heavy pointer-chasing with complex
data structures (lists, trees, maps, ...) there is less of a difference.

Sure, on non-vectorizable code there is less difference. But there is still a difference. The obvious example is the result caching I was describing in the other thread. Namely, if the L1 cache misses, the L2 cache has to try to catch the look-up. That L2 look-up is overhead. If that misses too, then the loop has to run to re-calculate the data.

If the said loop is vectorized, then the L2 cache look-up, with its pointer chasing, may well be more expensive than just re-crunching the data array. That is why ICC's performance advantage doesn't show as much with such caching.

On GCC, OTOH, with its poorer loop optimizations, the pointer chasing is more often cheaper than re-calculating the data. But that's not because GCC's pointer chasing is fast - it's because ICC's data crunching is even faster.

On my Intel machine (dual Xeon) ICC will typically outperform GCC on the
same number crunching code by a factor of 2 - 3, while on pointer-chasing
code there is frequently little difference.

Sure - it depends on what your code is doing. There is little to be saved on pointer chasing, because the CPU will be waiting for the data most of the time. That has nothing to do with compiler performance; it's down to the basic design of the computer the code is running on. Memory look-ups are slow. :-(

Oh, and on my AMD64 dual-core machine (with appropriate flags) there is
virtually *no* difference - if anything GCC fares better than ICC (ok,
maybe it's understandable that Intel are not going to be that bothered
about optimising for the competition...).

Don't AMD have their own compiler? I could have sworn I saw it, but I didn't look very hard because I don't have any AMD machines.

[BTW is it just me, or is ICC 10.0 kind of buggy? It seems to ICE on me
quite a lot.]

Yes, ICC v10 is quite badly broken. It doesn't vectorize a lot of code that 9.1 vectorizes, and it produces broken code in certain cases. Have a look on Intel's Software Community forum for more details. I don't use it at the moment.

[...]

Could anybody give some advice/pointers on how to improve the speed of a
C/C++ program? How to arrange code? How to make it highly efficient and
super fast? What options do I have if I don't have the luxury of
multi-threaded, multi-core or distributed computing? But I do have a P4
at least. Please recommend some good bibles and resources! Thank you!

On a P4, ICC will utterly annihilate GCC in terms of performance of the
resulting code, especially when it comes to heavy maths.

If you use (recent versions of) GCC make sure you use the -ftree-
vectorize flag.

Which does very little, as I demonstrated above. :-(
Don't get me wrong, I'm not knocking GCC. I'm just a little disappointed that 10 years after SSE was introduced we still don't have a vectorizer to make use of it. That's a long time.

Get a copy and try it. Enable -vec-report3 and see which of your loops
don't vectorize.

For GCC the equivalent is -ftree-vectorizer-verbose=1 (or 2,3,...).

I prefer level 5, but it makes me cry when I see that it cannot vectorize even very simple things. :'(

Where possible, re-arrange them so that they do vectorize. The compiler
often needs a hand with assorted little hacks to help it vectorize the
code, but they are generally not ugly, and are most of the time limited
to:

1) Copy the object property into a local variable. This will help
persuade the compiler that there is no vector dependence it needs to
worry about.

There is also the __restrict__ attribute (I think that works for both ICC
and GCC) to tell the compiler that there is no aliasing of an array (but
don't lie to your compiler!).

Indeed there is - I forgot about that. :-)
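For the record, here is a minimal sketch combining both tricks (the struct and function names are hypothetical):

struct state
{
        float   scale;
        int     n;
};

void apply(struct state *s, const float *__restrict__ in,
           float *__restrict__ out)
{
        /* copy the members into locals so the compiler doesn't have to
           assume that writing to out[] might change s->scale or s->n;
           __restrict__ promises that in and out don't overlap */
        const float     scale = s->scale;
        const int       n = s->n;
        int             i;

        for (i = 0; i < n; i++)
                out[i] = scale * in[i];
}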

[...]

4) Keep your data sizes in mind. If your frequently used data doesn't
fit in the CPU caches, you are likely to start experiencing slow-downs
on the order of 20x or so due to latencies. Use a float when you don't
need a double, as they are half the size.

Again, on modern (especially 64-bit) machines I suspect this may not be
good advice.

Can you demonstrate a case when taking a trip to RAM won't utterly kill performance compared to working in cache?
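To see it for yourself, here's a rough sketch (the sizes are hypothetical and the exact ratio depends on the hardware and its prefetcher; the idea is to do the same total number of element updates on a small array that fits in cache and a large one that doesn't):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* touch every element repeatedly, keeping the total number of
   updates constant regardless of the array size */
static double walk(float *data, size_t num, size_t touches)
{
        size_t  i, p, passes = touches / num;
        clock_t start = clock();

        for (p = 0; p < passes; p++)
                for (i = 0; i < num; i++)
                        data[i] += 1.0f;

        return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main ()
{
        size_t  touches = 1 << 28;
        /* 64K floats = 256KB (fits in a 512KB L2);
           16M floats = 64MB (nowhere near fitting) */
        float   *small = calloc(1 << 16, sizeof(float));
        float   *big   = calloc(1 << 24, sizeof(float));

        printf("in cache:     %.2fs\n", walk(small, 1 << 16, touches));
        printf("out of cache: %.2fs\n", walk(big,   1 << 24, touches));

        free(small);
        free(big);
        return 0;
}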

Also, I guess alignment may be worth bothering about too. There's
__attribute__((__aligned__)) which I think works on ICC as well as GCC.

Or on ICC, you can say -align -Zp16 to align all objects, structs and arrays to a 16-byte boundary. Be careful, though: some code breaks silently with this. I saw a version of rsync built that way that seemed to run but never did anything - it just sat there at 100% CPU usage. Use with care.
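For the attribute route, the sketch is one line (16 bytes matches the SSE register width):

/* align to a 16-byte boundary so the compiler can use aligned
   SSE loads/stores on this array */
float Data[65536] __attribute__((__aligned__(16)));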

5) Write the optimized code yourself. GSL and similar libraries are
great for a rapid proof of concept prototype, but there is a price to
pay in terms of performance when using generic code vs. bespoke code
specifically optimized for a particular task.

That has to be balanced against the possibility (certainly in my case)
that the library writer may be smarter than you... ;)

LOL! I have that problem, too. :-)

GSL seems to descend to the solution of my problem in fewer iterations. But even though my library does 2-3 times as many iterations, it runs about 2x as fast. Mind you, the algorithms are different: GSL uses gradient descent via partial derivatives, whereas I do a sampled, annealed steepest descent with caching (less sensitive to local minima on a jagged errorscape).
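For reference, the GSL side of that comparison is the multimin interface; a minimal sketch along the lines of the example in the GSL manual (it assumes you provide your own my_f/my_df/my_fdf callbacks in a gsl_multimin_function_fdf - see the manual for their signatures):

#include <gsl/gsl_errno.h>
#include <gsl/gsl_multimin.h>

/* run a conjugate gradient descent from starting point x and
   return the minimum value found */
double minimize(gsl_multimin_function_fdf *fdf, gsl_vector *x)
{
        gsl_multimin_fdfminimizer *s =
                gsl_multimin_fdfminimizer_alloc(
                        gsl_multimin_fdfminimizer_conjugate_fr, fdf->n);
        double  min;
        int     status, iter = 0;

        /* initial step size 0.01, line search tolerance 1e-4 */
        gsl_multimin_fdfminimizer_set(s, fdf, x, 0.01, 1e-4);

        do
        {
                iter++;
                status = gsl_multimin_fdfminimizer_iterate(s);
                if (status)
                        break;
                /* converged when the gradient norm drops below 1e-3 */
                status = gsl_multimin_test_gradient(s->gradient, 1e-3);
        } while (status == GSL_CONTINUE && iter < 1000);

        min = s->f;
        gsl_multimin_fdfminimizer_free(s);
        return min;
}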

6) Learn the compiler switches for your compiler. Test the accuracy of
your resulting numbers. When you start cutting corners (e.g.
"-ffast-math -mfpmath=sse,387" on GCC, "-fp-model fast=2 -rcd" on ICC)
you may get more speed, but sometimes the precision of your floats will
suffer. This may or may not be acceptable for your application.
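One quick way to test that accuracy is to carry a double-precision reference alongside the float computation and watch the relative error; a sketch (sqrtf here stands in for whatever your hot function computes - build it once with and once without the corner-cutting flags and compare):

#include <math.h>
#include <stdio.h>

int main ()
{
        double  worst = 0.0;
        int     i;

        for (i = 1; i <= 1000000; i++)
        {
                float   fast = sqrtf((float)i) / (float)i;
                double  ref  = sqrt((double)i) / (double)i;
                double  err  = fabs((double)fast - ref) / ref;

                if (err > worst)
                        worst = err;
        }

        printf("worst relative error: %g\n", worst);
        return 0;
}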

[...]

There are hundreds of little hacks you can do to speed your code up. It
is impossible to summarize them all; they differ from project to
project, and they won't all be appropriate all the time. I hope this
gets you started on the right path, though. :-)

But be aware too that "micro optimisation" at the level we are talking
about here can often be quite unrewarding in terms of the speed-up vs.
effort/code transparency trade-off.

That depends. If they are in the inner loops and frequently travelled paths, the micro-optimizations add up. Things like computing 1 / divisor once and then multiplying instead of dividing in the loops make a massive difference.
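As a sketch of that particular trick (hypothetical function; one division up front replaces n divisions inside the loop, and multiplication is far cheaper than division on most FPUs):

void normalize(float *data, int n, float divisor)
{
        const float     recip = 1.0f / divisor;
        int             i;

        for (i = 0; i < n; i++)
                data[i] *= recip;       /* was: data[i] /= divisor */
}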

Ultimately the best optimisation technique is to use the most efficient
algorithm for your problem.

Never have truer words been spoken! :-)

Another part of the problem is that students at universities these days are taught that "the compiler will optimize for you". I suffered the same
lectures on compilers when doing my degree. Unfortunately, while this may
be _partially_ true in theory, it is mostly wrong in practice. Even when code is compiled with the latest and greatest optimizing compilers, the code "hacks" still frequently make a huge difference. And the differences get bigger as the hardware gets more embedded.

Gordan



