help-gsl

From: Lionel B
Subject: [Help-gsl] Re: C/C++ speed optimization bible/resources/pointers needed, and about using GSL...
Date: Fri, 27 Jul 2007 09:18:11 +0000 (UTC)
User-agent: Pan/0.131 (Ghosts: First Variation)

On Fri, 27 Jul 2007 08:51:19 +0100, Gordan Bobic wrote:

> On Fri, 27 Jul 2007, Michael wrote:
> 
>> I am in the middle of programming to solve an engineering problem where
>> speed is a huge concern. The project involves lots of numerical
>> integration, and there are several loops/levels of optimization on
>> top of the function evaluation engine.
> 
> A few general rules I've found to help a lot:
> 
> - Don't use unnecessary precision if you don't need it. Don't use a
> double if a float will do. This is particularly important in code that
> the compiler can vectorize. Even if your SIMD vector unit can handle
> doubles, it can typically handle twice as many floats as doubles for the
> same operation in the same amount of time.

Can you provide evidence to back this up? Although it might once have 
been true, my understanding is that on most modern (esp. 64 bit)
processors all floating point maths - vectorised or not - is likely to be 
performed internally in double precision and that forcing single 
precision may actually result in *slower* code...

[...]

> - Use a good compiler for your target platform. Sadly, GCC isn't great.
> It's not even good. When it comes to tight vectorized loops and you
> don't want to reach for the hand crafted assembler, I have seen
> performance boosts of between 4.5x (on P3 / Core2) and 8.5x (P4) when
> using Intel's ICC compiler instead of GCC.

[...]

My personal experience (on linux x86_64) is that recent versions of GCC 
(4.1.x, 4.2.x) have closed the gap quite a lot on ICC (9.x, 10.x) when it 
comes to optimisation/vectorisation. It also seems to depend *a lot* on 
the style of code; as a rough generalisation, ICC, as you point out,  
still seems to have a significant lead on vectorisable floating point-
intensive number-crunchers, while for heavy pointer-chasing with complex 
data structures (lists, trees, maps, ...) there is less of a difference. 
On my Intel machine (dual Xeon) ICC will typically outperform GCC on the 
same number crunching code by a factor of 2 - 3, while on pointer-chasing 
code there is frequently little difference.

Oh, and on my AMD64 dual-core machine (with appropriate flags) there is 
virtually *no* difference - if anything GCC fares better than ICC (ok, 
maybe it's understandable that Intel are not going to be that bothered 
about optimising for the competition...).

[BTW is it just me, or is ICC 10.0 kind of buggy? It seems to ICE (hit an 
internal compiler error) on me quite a lot.]

[...]

>> Could anybody give some advice/pointers on how to improve the speed of
>> C/C++ program? How to arrange code? How to make it highly efficient and
>> super fast? What options do I have if I don't have the luxury of using
>> multi-threaded, multi-core or distributed computing? But I do have a P4
>> at least. Please recommend some good bibles and resources! Thank you!
> 
> On a P4, ICC will utterly annihilate GCC in terms of performance of the
> resulting code, especially when it comes to heavy maths.

If you use (recent versions of) GCC make sure you use the 
-ftree-vectorize flag.

> Get a copy and try it. Enable -vec-report3 and see which of your loops
> don't vectorize.

For GCC the equivalent is -ftree-vectorizer-verbose=1 (or 2,3,...).
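
As a rough illustration of the kind of loop these reports are about, here 
is a minimal sketch. The function name scale_add is made up for the 
example, and the command lines in the trailing comment use only the flags 
already named in this thread (plus -std=c99 for the loop syntax).

```c
#include <stddef.h>

/* y[i] += a * x[i]: every iteration is independent of the others,
   which is what a vectorizer looks for. scale_add is a hypothetical
   name used only for this sketch. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Possible invocations, to see what each compiler reports:
     gcc -std=c99 -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -c scale_add.c
     icc -std=c99 -O3 -vec-report3 -c scale_add.c */
```

Whether the loop is actually vectorized depends on the compiler version 
and target, so the report is the thing to check.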

> Where possible, re-arrange them so that they do vectorize. The compiler
> often needs a hand with assorted little hacks to help it vectorize the
> code, but they are generally not ugly, and are most of the time limited
> to:
> 
> 1) Copy the object property into a local variable. This will help
> persuade the compiler that there is no vector dependence it needs to
> worry about.

There is also the __restrict__ qualifier (I think that works for both ICC 
and GCC) to tell the compiler that there is no aliasing of an array (but 
don't lie to your compiler!).
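
A minimal sketch combining the two hints, the local copy and 
__restrict__. The struct and function names are hypothetical, and the 
assert is just one possible way of catching a broken no-aliasing promise 
in debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct filter {          /* hypothetical type, for illustration only */
    float  gain;
    size_t n;
};

void apply_gain(const struct filter *f,
                float *__restrict__ out,
                const float *__restrict__ in)
{
    /* Copy the members into locals so the compiler need not assume
       that a store through out could change f->gain or f->n. */
    const float  gain = f->gain;
    const size_t n    = f->n;

    /* Debug-build check that the two arrays really do not overlap,
       i.e. that the __restrict__ promise is true. */
    assert((uintptr_t)out + n * sizeof *out <= (uintptr_t)in ||
           (uintptr_t)in  + n * sizeof *in  <= (uintptr_t)out);

    for (size_t i = 0; i < n; ++i)
        out[i] = gain * in[i];
}
```

Compiling with -DNDEBUG removes the check, so it costs nothing in the 
optimised build.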

[...]

> 4) Keep your data sizes in mind. If your frequently used data doesn't
> fit in the CPU caches, you are likely to start experiencing slow-downs
> on the order of 20x or so due to latencies. Use a float when you don't
> need a double, as they are half the size.

Again, on modern (especially 64-bit) machines I suspect this may not be 
good advice.
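
Whatever the verdict on float vs. double speed, the size side of the 
argument is plain arithmetic. A small sketch, with made-up helper names, 
that takes the cache size as a parameter since it differs from CPU to CPU:

```c
#include <stddef.h>

/* Bytes occupied by n elements of the given element size. */
size_t footprint_bytes(size_t n, size_t elem_size)
{
    return n * elem_size;
}

/* Nonzero if n elements fit in a cache of cache_bytes bytes.
   The cache size is an input, not a constant, because it is
   machine-specific. */
int fits_in_cache(size_t n, size_t elem_size, size_t cache_bytes)
{
    return footprint_bytes(n, elem_size) <= cache_bytes;
}
```

For example, fits_in_cache(100000, sizeof(float), 512 * 1024) asks 
whether 100000 floats fit in an assumed 512 KiB cache. It says nothing 
about how much slower a miss actually is on a given machine.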

Also, I guess alignment may be worth bothering about too. There's 
__attribute__((__aligned__)) which I think works on ICC as well as GCC.
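
A minimal sketch of what that might look like. The 16-byte figure 
(matching 128-bit SSE registers), the array name and the helper are all 
assumptions for the example.

```c
#include <stdint.h>

/* Ask for 16-byte alignment. The value 16 and the name samples are
   assumptions for this sketch. */
float samples[1024] __attribute__((aligned(16)));

/* Nonzero if p is aligned to a multiple of alignment bytes. */
int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A check like is_aligned(samples, 16) is handy for pointers that come from 
elsewhere, where the declaration cannot guarantee anything.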

> 5) Write the optimized code yourself. GSL and similar libraries are
> great for a rapid proof of concept prototype, but there is a price to
> pay in terms of performance when using generic code vs. bespoke code
> specifically optimized for a particular task.

That has to be balanced against the possibility (certainly in my case)
that the library writer may be smarter than you... ;)

> 6) Learn the compiler switches for your compiler. Test the accuracy of
> your resulting numbers. When you start cutting corners (e.g.
> "-ffast-math -mfpmath=sse,387" on GCC, "-fp-model fast=2 -rcd" on ICC)
> you may get more speed, but sometimes the precision on your floats will
> reduce. This may or may not be acceptable for your application.
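
On testing accuracy: one simple approach is to run the same computation 
in float and in double and compare both against a known answer. The 
sketch below does this for a repeated sum. The function names are made 
up, and how much error is acceptable is for the application to decide.

```c
#include <stddef.h>

/* Add x to itself n times, accumulating in single precision. */
float sum_repeated_float(float x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x;
    return s;
}

/* The same sum, accumulating in double precision. */
double sum_repeated_double(float x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x;
    return s;
}

/* |approx - exact| / |exact|, for a nonzero exact value. */
double relative_error(double approx, double exact)
{
    double diff = approx - exact;
    if (diff < 0.0)
        diff = -diff;
    if (exact < 0.0)
        exact = -exact;
    return diff / exact;
}
```

Comparing relative_error() for the two variants against n * (double)x, 
with and without the aggressive switches, gives a quick idea of what the 
extra speed costs.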

[...]

> There are hundreds of little hacks you can do to speed your code up. It
> is impossible to summarize them all, and they will differ from project
> to project and they won't all be appropriate all the time. I hope this
> gets you started on the right path, though. :-)

But be aware too that "micro optimisation" at the level we are talking 
about here can often be quite unrewarding in terms of the speed-up vs. 
effort/code transparency trade-off.

Ultimately the best optimisation technique is to use the most efficient 
algorithm for your problem.
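
A textbook illustration, not taken from the original poster's project: 
evaluating a polynomial term by term versus with Horner's rule. Both 
functions (the names are made up) return the same value for coefficients 
c[0] + c[1]*x + ... + c[n-1]*x^(n-1), but the work differs.

```c
#include <stddef.h>

/* Naive evaluation: rebuilds x^i from scratch for every term, so the
   number of multiplications grows roughly as n^2. */
double poly_naive(const double *c, size_t n, double x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double term = c[i];
        for (size_t k = 0; k < i; ++k)
            term *= x;
        sum += term;
    }
    return sum;
}

/* Horner's rule: one multiplication and one addition per coefficient. */
double poly_horner(const double *c, size_t n, double x)
{
    double sum = 0.0;
    for (size_t i = n; i-- > 0; )
        sum = sum * x + c[i];
    return sum;
}
```

No compiler flag turns the first version into the second, which is the 
point.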

Regards,

-- 
Lionel B




