Re: [lmi] Contradictory performance measurements
From: Greg Chicares
Subject: Re: [lmi] Contradictory performance measurements
Date: Thu, 8 Apr 2021 15:52:05 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0
On 4/8/21 1:10 PM, Vadim Zeitlin wrote:
> On Wed, 7 Apr 2021 23:28:11 +0000 Greg Chicares <gchicares@sbcglobal.net>
> wrote:
>
> GC> Is it faster to divide vectors by a common divisor,
> GC> or to multiply them by that divisor's reciprocal?
> GC> Only by measuring can we tell. But my measurements
> GC> seem to contradict each other (though I know which
> GC> has to be the correct one).
>
> I would definitely expect multiplication to be faster. The exact timings
> depend on whether x87 instructions or SSE ones are used, but division is
> supposed to be ~3 times slower, I believe, although it of course depends on
> the data used, so measuring is still a good idea.
Agreed. Soon I plan to post unit tests that measure that directly.
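A minimal sketch of such a measurement, not the planned unit tests themselves: the function names, the in/out vector layout, and the use of std::chrono are all assumptions made for illustration. Timings from a loop like this vary with compiler flags and hardware, so it only shows the shape of the comparison.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; not lmi code.

// Divide every element of 'in' by a common divisor, writing to 'out'.
void divide_all(std::vector<double> const& in, std::vector<double>& out, double divisor)
{
    for(std::size_t i = 0; i < in.size(); ++i) {out[i] = in[i] / divisor;}
}

// Multiply every element by the divisor's reciprocal, computed once.
void multiply_all_by_reciprocal(std::vector<double> const& in, std::vector<double>& out, double divisor)
{
    double const reciprocal = 1.0 / divisor;
    for(std::size_t i = 0; i < in.size(); ++i) {out[i] = in[i] * reciprocal;}
}

// Time 'repetitions' calls of f(in, out, divisor); return elapsed seconds.
template<typename F>
double seconds_taken(F f, std::vector<double> const& in, double divisor, int repetitions)
{
    std::vector<double> out(in.size());
    auto const start = std::chrono::steady_clock::now();
    for(int i = 0; i < repetitions; ++i) {f(in, out, divisor);}
    auto const stop = std::chrono::steady_clock::now();
    // Read one result so the work is less likely to be discarded as dead code.
    volatile double sink = out.empty() ? 0.0 : out.front();
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling seconds_taken(divide_all, v, 100.0, n) and seconds_taken(multiply_all_by_reciprocal, v, 100.0, n) on the same vector gives two numbers to compare; an optimizer may still transform either loop, so the generated code is worth inspecting too.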
> GC> Great--I thought--now I can use 'perf' to find out exactly
> GC> what's going on;
>
> I'm not sure I see the appeal of using perf here, don't we already know
> what's going on? Considering that the patch does one small change (well, 2
> small changes), it seems like we already have all the information we need.
I guess I'm chasing a will-o'-the-wisp. I imagine that the CPU
executes instructions that are more or less literal translations
of my C code (C++, but it may as well be C), and that 'perf' is
a tool that can give me a three-column listing {C, asm, timing}
for each instruction.
I know computers have changed a lot since the 1980s. It's just
hard to let go of old ideas.
> I think it might be because it's getting fully inlined and perf doesn't
> attribute all the time spent in the instructions corresponding to it to
> the function itself.
Yes, I'm convinced.
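One way to make a profiler charge that time to the function itself is to forbid inlining in a profiling-only build. The sketch below assumes GCC or Clang, whose [[gnu::noinline]] attribute does this; cents_to_dollars and total_dollars are hypothetical stand-ins, not lmi functions. Suppressing inlining changes the very timings being measured, so this helps with attribution only.

```cpp
#include <cassert>

// Hypothetical stand-in for a small, hot function; not lmi's currency::d().
// [[gnu::noinline]] (GCC/Clang) keeps it out of line, so that samples
// taken inside it are reported under its own name.
[[gnu::noinline]] double cents_to_dollars(double cents)
{
    return cents / 100.0;
}

// Hypothetical caller: without the attribute, the division would likely
// be inlined here and its cost reported as part of this loop.
double total_dollars(double const* cents, int n)
{
    double total = 0.0;
    for(int i = 0; i < n; ++i) {total += cents_to_dollars(cents[i]);}
    return total;
}
```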
I'm still inclined to suppose that some perf-like tool could
ingeniously reverse the inlining, deducing which divisions are
due to rounding and attributing them to 'round_to', e.g.:
# x = rint(x * y) / z;
MUL x,y # 1 clock
RINT x # 1 clock
DIV x,z # 30 clocks
# alternative (replaces the DIV above)
# x = rint(x * y) * recip;
MUL x,recip # 1 clock
so that I can just read the output and know I can save (30-1) clocks
by using reciprocal-multiplication (in a theoretical world where
wall-clock time is just the sum of CPU cycles without regard to
matters like pipelining or cache locality).
And such a tool would attribute so many costly divisions to
currency::d() that it would rise to the top of the chart as
an unmistakable hotspot (as your investigation below finds).
But no such tool is likely to be created, or to be as useful
with today's hardware as it would have been half a century ago.
> GC> How can I resolve this apparent contradiction?
>
> By concluding that it's only apparent :-)
Yup.
> Just for the reference, my absolute numbers are
[...of a very similar complexion...]
> for a newer (but still pretty old) i7-4712HQ one. I think we've discussed
> this in the past, but it seems clear that Xeon is not ideal for running lmi
> if a 7 year old notebook CPU can beat it so significantly.
Yes, we have. Fewer but faster cores, vs. more but slower.
BTW, I had thought that as RAM increased to multiple GB, ECC would
become imperative. But I guess cosmic-ray bit flips aren't actually
crashing your machine every few hours, or you'd have mentioned it.
> GC> Indeed round_to is implemented in terms of the
> GC> round_{down,up} functions on the first two lines above.
>
> These functions together seem to take ~0.75% of the total running time for
> the "naic, no solve" scenario for me, with the lion's share for round_up().
BTW, IIRC, the intention of helper functions like round_up() was to
round without needing to change the hardware rounding direction, in
a pre-C99 world. Their design purpose may be invalid today.
> [...] you can get something
> really useful from examining the annotated listing because, from just a
> very superficial look at it, you can see that a lot of time is taken by
> dividing by cents_per_dollar in currency::d(). There is, of course, already
> a comment there about this, and I can't answer the question about
> correctness there, but applying
> ---------------------------------- >8 --------------------------------------
> diff --git a/currency.hpp b/currency.hpp
> index ce04d75a2..a73a2b627 100644
> --- a/currency.hpp
> +++ b/currency.hpp
> @@ -40,6 +40,7 @@ class currency
>
> static constexpr int cents_digits = 2;
> static constexpr double cents_per_dollar = 100.0;
> + static constexpr double cents_per_dollar_inv = 0.01;
>
> public:
> using data_type = double;
> @@ -58,8 +59,7 @@ class currency
>
> data_type cents() const {return m_;}
> // CURRENCY !! add a unit test for possible underflow
> - // CURRENCY !! is multiplication by reciprocal faster or more accurate?
> - double d() const {return m_ / cents_per_dollar;}
> + double d() const {return m_ * cents_per_dollar_inv;}
>
> private:
> explicit currency(data_type z, raw_cents) : m_ {z} {}
> ---------------------------------- >8 --------------------------------------
> does result in a dramatic speed gain in the selftest, which now gives
[...]
> i.e. a ~20% speedup.
I'm surprised it runs for you. Here, I get:
Database key 'MinIssSpecAmt: value 50000 not preserved in conversion to 5e+06
cents.
The reason is that, given a double D whose value is an exact
integer congruent to 0 (mod 100),
D / 100.0
is an exact integer, but
D * 0.01
need not be, because 0.01 has no exact binary representation.
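A minimal sketch for checking this round trip on a given build. The two helper functions are hypothetical and are not lmi's currency class. Division by 100.0 recovers the original value whenever the quotient is exactly representable. Whether the reciprocal form does so for a particular value can depend on how the compiler and hardware evaluate floating-point expressions, so no result is claimed for it here.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not lmi's currency class.

// Dollars -> cents -> dollars, converting back by division.
bool preserved_by_division(double dollars)
{
    double const cents = dollars * 100.0;
    return cents / 100.0 == dollars;
}

// Same round trip, converting back by multiplying by 0.01, which is
// only the nearest double to one hundredth, not one hundredth itself.
bool preserved_by_reciprocal(double dollars)
{
    double const cents = dollars * 100.0;
    return cents * 0.01 == dollars;
}
```

Running preserved_by_reciprocal over the values that actually occur (for example, database entries like the one in the error message) shows whether a particular build is affected.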