[lmi] Contradictory performance measurements


From: Greg Chicares
Subject: [lmi] Contradictory performance measurements
Date: Wed, 7 Apr 2021 23:28:11 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

Is it faster to divide vectors by a common divisor,
or to multiply them by that divisor's reciprocal?
Only by measuring can we tell. But my measurements
seem to contradict each other (though I know which
has to be the correct one).

I used this experimental patch:

--8<----8<----8<----8<----8<----8<----8<----8<--
diff --git a/round_to.hpp b/round_to.hpp
index 308a91f8e..fce6bf44e 100644
--- a/round_to.hpp
+++ b/round_to.hpp
@@ -359,7 +359,8 @@ inline RealType round_to<RealType>::operator()(RealType r) const
 {
     return static_cast<RealType>
         ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
-        * scale_back_
+//      * scale_back_
+        / scale_fwd_
         );
 }
 
@@ -402,7 +403,8 @@ inline currency round_to<RealType>::c(RealType r) const
 {
     RealType const z = static_cast<RealType>
         ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
-        * scale_back_cents_
+//      * scale_back_cents_
+        / scale_fwd_cents_
         );
     // CURRENCY !! static_cast: possible range error
     return currency(static_cast<currency::data_type>(z), raw_cents {});
--8<----8<----8<----8<----8<----8<----8<----8<--

First, I ran a complete system test, which proved that this
patch has no observable effect on 1500 or so test cases that
fairly represent real-world illustrations. By that top-level
standard, the patch doesn't increase accuracy. (I thought
that it might, because the scaling factors are powers of ten,
so a cached reciprocal like 1/100 entails some representation
error, which dividing by 100 would avoid.)
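
The representation error mentioned above can be seen in
isolation with a sketch like the following. It assumes IEEE 754
double arithmetic with round-to-nearest (as with SSE2 on
x86_64); the function names are hypothetical, and the constant
0.01 merely stands in for a cached reciprocal such as
scale_back_:

```cpp
// True when multiplying n by the cached reciprocal 0.01 gives a different
// double than dividing n by 100.
bool reciprocal_differs(int n)
{
    double const scale_back = 0.01; // nearest double to 1/100, not exact
    double const x = n;
    return x * scale_back != x / 100.0;
}

// Count the integers in 1..limit for which the two results differ.
int count_mismatches(int limit)
{
    int count = 0;
    for(int n = 1; n <= limit; ++n)
        {
        if(reciprocal_differs(n)) {++count;}
        }
    return count;
}
```

For instance, 35 * 0.01 yields 0.35000000000000003, whereas
35 / 100.0 yields the double nearest to 0.35, while 7 gives
the same result either way. Such last-bit differences can
exist without surviving a subsequent rounding to cents, which
would be consistent with the system test showing no change.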

Now I run
  gwc/speed_test.sh
which rebuilds every architecture and runs
  make cli_timing
for each one. With the patch, speed decreases by several
percent in general, affecting all architectures. That's
broadly consistent with this ca. 2000 comment:
  // Profiling shows that inlining this member function makes a
  // realistic application that performs a lot of rounding run about
  // five percent faster with gcc.
(that "realistic application" was lmi's predecessor).

[edit: I found the outcome so surprising that I first took
a nap, and then reran the entire experiment de novo; here
are relative errors for the customary lmi timings, before
adding the patch above, versus after reverting it, so that
negative values represent degradation due to the patch:

                        mean least
  naic, no solve      :  -2%  -1%
  naic, specamt solve :  -3%  -2%
  naic, ee prem solve :  -2%  -2%
  finra, no solve     :  -1%  -1%
  finra, specamt solve:  -1%  -1%
  finra, ee prem solve:  -1%  -1%

  naic, no solve      :  -1%  -4%
  naic, specamt solve :  -5%  -5%
  naic, ee prem solve :  -5%  -5%
  finra, no solve     :  -3%   0%
  finra, specamt solve:  -4%  -4%
  finra, ee prem solve:  -4%  -4%

  naic, no solve      :  -3%  -3%
  naic, specamt solve :  -3%  -3%
  naic, ee prem solve :  -3%  -3%
  finra, no solve     :  -1%  -1%
  finra, specamt solve:  -4%  -2%
  finra, ee prem solve:  -3%  -2%

The effect is greatest for scenarios that spend more of their
time in the rounding-intensive monthiversary loop.]

Great--I thought--now I can use 'perf' to find out exactly
what's going on; however, with the patch, measuring the same
operation that 'make cli_timing' performs, with these commands...

$ LD_LIBRARY_PATH=.:/opt/lmi/bin:/opt/lmi/local/gcc_x86_64-pc-linux-gnu/lib/:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 record --freq=max --call-graph=lbr \
    /opt/lmi/bin/lmi_cli_shared --accept --data_path=/opt/lmi/data --selftest

$ LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 report

...and filtering for 'round_to', I see:

     0.01%     0.01%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c
     0.01%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c@plt
     0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to
     0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to@plt

...which seems to suggest that, even with the patch above,
'round_to' should have virtually no effect on lmi's speed.

How can I resolve this apparent contradiction? I figure I must
be doing something wrong with 'perf', but I carefully confirmed
the experiment by uninstalling and then rebuilding and installing
from scratch. When I run 'perf record', the timings print on the
screen [and they accord with those added in the 'edit' above]:

with patch:
  naic, no solve      : 2.064e-02 s mean;      20516 us least of  49 runs
  naic, specamt solve : 3.791e-02 s mean;      37598 us least of  27 runs
  naic, ee prem solve : 3.452e-02 s mean;      34250 us least of  29 runs
  finra, no solve     : 5.830e-03 s mean;       5611 us least of 100 runs
  finra, specamt solve: 2.154e-02 s mean;      21226 us least of  47 runs
  finra, ee prem solve: 1.981e-02 s mean;      19484 us least of  51 runs

without patch:
  naic, no solve      : 1.983e-02 s mean;      19725 us least of  51 runs
  naic, specamt solve : 3.639e-02 s mean;      36141 us least of  28 runs
  naic, ee prem solve : 3.314e-02 s mean;      33016 us least of  31 runs
  finra, no solve     : 5.772e-03 s mean;       5537 us least of 100 runs
  finra, specamt solve: 2.084e-02 s mean;      20510 us least of  48 runs
  finra, ee prem solve: 1.924e-02 s mean;      19084 us least of  52 runs

The "without patch" perf run, filtered for "round_to",
has fewer lines:
     0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c
     0.00%     0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c@plt
showing that only round_to::c() scored high enough to
show up at all, whereas above the ctor also got on the
scoreboard.

Looking harder, I used this command:

$ LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 diff

and looked for "round" as opposed to "round_to"; that's
interesting (here it is after passing through 'grep round'):

     0.41%     +0.09%  liblmi.so               [.] detail::round_down<double>
     0.52%     -0.01%  liblmi.so               [.] detail::round_up<double>
     0.01%     -0.00%  liblmi.so               [.] round_to<double>::c
               +0.00%  liblmi.so               [.] member_cast<rounding_parameters, rounding_rules>
     0.01%     -0.00%  liblmi.so               [.] MemberSymbolTable<rounding_rules>::operator[]
     0.00%     +0.00%  libc-2.31.so            [.] round_and_return
               +0.00%  liblmi.so               [.] member_cast<rounding_parameters, rounding_rules>@plt
     0.00%     +0.00%  liblmi.so               [.] (anonymous namespace)::set_rounding_rule
     0.00%             liblmi.so               [.] fegetround@plt
     0.00%             liblmi.so               [.] round_to<double>::round_to
     0.00%             libc-2.31.so            [.] round_away
     0.00%             liblmi.so               [.] mc_enum<rounding_style>::value
     0.00%             liblmi.so               [.] rounding_rules::datum
     0.00%             liblmi.so               [.] any_member<rounding_rules>::exact_cast<rounding_parameters>
     0.00%             liblmi.so               [.] rounding_parameters::style@plt
     0.00%             liblmi.so               [.] rounding_parameters::decimals@plt
     0.00%             liblmi.so               [.] rounding_parameters::style

Indeed round_to is implemented in terms of the
round_{down,up} functions on the first two lines above.
Still, the first dozen data lines of unfiltered
'perf diff' output don't mention rounding:

# Baseline  Delta Abs  Shared Object           Symbol
# ........  .........  ......................  ...................................................
#
     6.33%     -0.52%  liblmi.so               [.] AccountValue::TxSetDeathBft
     4.35%     +0.37%  liblmi.so               [.] Irc7702A::DetermineLowestBft
    14.01%     +0.34%  liblmi.so               [.] AccountValue::ChangeSpecAmtBy
     3.46%     -0.22%  liblmi.so               [.] AccountValue::DecrementAVProportionally
     4.78%     -0.22%  liblmi.so               [.] AccountValue::DoMonthDR
     1.48%     -0.15%  liblmi.so               [.] Irc7702A::MaxNecessaryPremium
     0.36%     +0.15%  liblmi.so               [.] AccountValue::SurrChg@plt
     2.90%     -0.13%  liblmi.so               [.] AccountValue::SurrChg
     0.62%     -0.12%  liblmi.so               [.] AccountValue::TxAcceptPayment
     0.77%     +0.10%  liblmi.so               [.] AccountValue::TxSetBOMAV
     0.99%     +0.10%  libc-2.31.so            [.] _int_free
     1.21%     +0.10%  libc-2.31.so            [.] _int_malloc

and all of the "round"-filtered measurements together
account for only about one percent of the 'perf' total
and a tenth of a percent in 'perf diff'.

I have only one theory to explain this: rounding involves
a great number of function calls, each of which is so
inexpensive that it doesn't get counted by 'perf' even
with '--freq=max'; but in the aggregate those calls add
up to an appreciable fraction of lmi's total work.
That's kind of like saying that in code like
  x = trunc(erf(a) + atan(b) / expm1(c));
even if we sample the instruction pointer ten times during
that statement, we might expect to be in the trunc() call
zero times, because trunc() is cheap compared to the other
functions. Does that seem reasonable?
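
To make that analogy concrete, here is a toy model of a
sampling profiler. Everything in it is hypothetical: the
function name, the four consecutive "phases" standing for the
three expensive calls and then trunc(), and the durations one
would pass in. It only counts where evenly spaced samples land
within a single execution of the statement:

```cpp
#include <array>
#include <cstddef>

// Count how many of n_samples evenly spaced samples land in each of four
// consecutive phases whose durations are given in arbitrary time units.
std::array<int,4> sample_hits
    (std::array<double,4> const& durations
    ,int                         n_samples
    )
{
    double total = 0.0;
    for(double d : durations) {total += d;}

    std::array<int,4> hits {{0, 0, 0, 0}};
    for(int k = 0; k < n_samples; ++k)
        {
        // Sample at the midpoint of the k-th of n_samples equal intervals.
        double const t = (k + 0.5) * total / n_samples;
        double end = 0.0;
        for(std::size_t i = 0; i < durations.size(); ++i)
            {
            end += durations[i];
            if(t < end) {++hits[i]; break;}
            }
        }
    return hits;
}
```

With made-up durations of 40, 40, 40, and 1 units and ten
samples, the three expensive phases receive 3, 4, and 3 hits
and the cheap last phase receives none. In this model the
expected share of samples equals the share of time, so a phase
taking 1 unit of 121 would expect fewer than one hit in ten
samples.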

And is there any useful thing I can do with 'perf' here,
or is it just not the right tool for this job?

