[lmi] Contradictory performance measurements

From: Greg Chicares
Subject: [lmi] Contradictory performance measurements
Date: Wed, 7 Apr 2021 23:28:11 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

Is it faster to divide vectors by a common divisor,
or to multiply them by that divisor's reciprocal?
Only by measuring can we tell. But my measurements
seem to contradict each other (though I know which
has to be the correct one).
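(For concreteness, here's the kind of standalone microbenchmark
I have in mind. It's a sketch of my own, not lmi code, and a toy
loop like this is easily optimized into meaninglessness, so the
system test and 'make cli_timing' runs below are the measurements
that actually matter.)
--8<----8<----8<----8<----8<----8<----8<----8<--
// Standalone sketch: time dividing each element by a divisor
// versus multiplying by its cached reciprocal.
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> v(1000000);
    for(int i = 0; i < 1000000; ++i) {v[i] = 1.0 + 0.001 * i;}

    double const divisor    = 100.0;
    double const reciprocal = 1.0 / divisor;

    auto time_sum = [&v](auto f)
        {
        auto const t0 = std::chrono::steady_clock::now();
        double sum = 0.0;
        for(double x : v) {sum += f(x);}
        auto const t1 = std::chrono::steady_clock::now();
        long long const us = std::chrono::duration_cast
            <std::chrono::microseconds>(t1 - t0).count();
        std::printf("sum %f in %lld us\n", sum, us);
        };

    time_sum([=](double x) {return x / divisor;});    // divide each time
    time_sum([=](double x) {return x * reciprocal;}); // multiply by reciprocal
}
--8<----8<----8<----8<----8<----8<----8<----8<--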
I used this experimental patch:
--8<----8<----8<----8<----8<----8<----8<----8<--
diff --git a/round_to.hpp b/round_to.hpp
index 308a91f8e..fce6bf44e 100644
--- a/round_to.hpp
+++ b/round_to.hpp
@@ -359,7 +359,8 @@ inline RealType round_to<RealType>::operator()(RealType r) const
 {
     return static_cast<RealType>
         ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
-        * scale_back_
+//      * scale_back_
+        / scale_fwd_
         );
 }
@@ -402,7 +403,8 @@ inline currency round_to<RealType>::c(RealType r) const
 {
     RealType const z = static_cast<RealType>
         ( rounding_function_(static_cast<RealType>(r * scale_fwd_))
-        * scale_back_cents_
+//      * scale_back_cents_
+        / scale_fwd_cents_
         );
     // CURRENCY !! static_cast: possible range error
     return currency(static_cast<currency::data_type>(z), raw_cents {});
--8<----8<----8<----8<----8<----8<----8<----8<--
First, I ran a complete system test, which proved that this
patch has no observable effect on 1500 or so test cases that
fairly represent real-world illustrations. By that top-level
standard, the patch doesn't increase accuracy. (I thought
that it might, because the scaling factors are powers of ten,
so a cached reciprocal like 1/100 entails some representation
error, which dividing by 100 would avoid.)
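(The representation point is easy to see in isolation. Here's a
minimal standalone sketch of my own, not lmi code; the names echo
the members in the patch, and the sample values are arbitrary. It
counts how often multiplying by a stored reciprocal of 100 gives
a different double than dividing by 100:)
--8<----8<----8<----8<----8<----8<----8<----8<--
#include <cstdio>

int main()
{
    double const scale_fwd  = 100.0;           // exactly representable
    double const scale_back = 1.0 / scale_fwd; // nearest double to 1/100, inexact
    int differ = 0;
    for(int i = 0; i < 1000000; ++i)
        {
        double const x = 1.0 + 0.001 * i;      // arbitrary sample values
        if(x * scale_back != x / scale_fwd)
            {
            ++differ;
            }
        }
    std::printf("%d of 1000000 samples differ\n", differ);
}
--8<----8<----8<----8<----8<----8<----8<----8<--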
Now I run
gwc/speed_test.sh
which rebuilds every architecture and runs
make cli_timing
for each one. With the patch, speed decreases by several
percent in general, affecting all architectures. That's
broadly consistent with this ca. 2000 comment:
// Profiling shows that inlining this member function makes a
// realistic application that performs a lot of rounding run about
// five percent faster with gcc.
(that "realistic application" was lmi's predecessor).
[edit: I found the outcome so surprising that I first took
a nap, and then reran the entire experiment de novo; here
are relative differences in the customary lmi timings, with
the patch above versus without it, so that negative values
represent degradation due to the patch:
                       mean   least
naic, no solve      :   -2%    -1%
naic, specamt solve :   -3%    -2%
naic, ee prem solve :   -2%    -2%
finra, no solve     :   -1%    -1%
finra, specamt solve:   -1%    -1%
finra, ee prem solve:   -1%    -1%

naic, no solve      :   -1%    -4%
naic, specamt solve :   -5%    -5%
naic, ee prem solve :   -5%    -5%
finra, no solve     :   -3%     0%
finra, specamt solve:   -4%    -4%
finra, ee prem solve:   -4%    -4%

naic, no solve      :   -3%    -3%
naic, specamt solve :   -3%    -3%
naic, ee prem solve :   -3%    -3%
finra, no solve     :   -1%    -1%
finra, specamt solve:   -4%    -2%
finra, ee prem solve:   -3%    -2%
The effect is greatest for scenarios that spend more of their
time in the rounding-intensive monthiversary loop.]
Great--I thought--now I can use 'perf' to find out exactly
what's going on; however, with the patch, measuring the same
operation that 'make cli_timing' performs, with these commands...
$ LD_LIBRARY_PATH=.:/opt/lmi/bin:/opt/lmi/local/gcc_x86_64-pc-linux-gnu/lib/:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 record --freq=max --call-graph=lbr \
    /opt/lmi/bin/lmi_cli_shared --accept --data_path=/opt/lmi/data --selftest

$ LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 report
...and filtering for 'round_to', I see:
     0.01%  0.01%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c
     0.01%  0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c@plt
     0.00%  0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to
     0.00%  0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::round_to@plt
...which seems to suggest that, even with the patch above,
'round_to' should have virtually no effect on lmi's speed.
How can I resolve this apparent contradiction? I figure I must
be doing something wrong with 'perf', but I carefully confirmed
the experiment by uninstalling and then rebuilding and installing
from scratch. When I run 'perf record', the timings print on the
screen [and they accord with those added in the 'edit' above]:
with patch:

naic, no solve      : 2.064e-02 s mean; 20516 us least of  49 runs
naic, specamt solve : 3.791e-02 s mean; 37598 us least of  27 runs
naic, ee prem solve : 3.452e-02 s mean; 34250 us least of  29 runs
finra, no solve     : 5.830e-03 s mean;  5611 us least of 100 runs
finra, specamt solve: 2.154e-02 s mean; 21226 us least of  47 runs
finra, ee prem solve: 1.981e-02 s mean; 19484 us least of  51 runs

without patch:

naic, no solve      : 1.983e-02 s mean; 19725 us least of  51 runs
naic, specamt solve : 3.639e-02 s mean; 36141 us least of  28 runs
naic, ee prem solve : 3.314e-02 s mean; 33016 us least of  31 runs
finra, no solve     : 5.772e-03 s mean;  5537 us least of 100 runs
finra, specamt solve: 2.084e-02 s mean; 20510 us least of  48 runs
finra, ee prem solve: 1.924e-02 s mean; 19084 us least of  52 runs
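(To connect those raw timings with the percentages in the 'edit'
block: for 'naic, no solve', for example, the least-of-runs figures
come to about (19725 - 20516) / 20516, i.e. roughly -4%, in line
with the few-percent degradation tabulated above. The choice of
denominator is my own assumption; a trivial check:)
--8<----8<----8<----8<----8<----8<----8<----8<--
#include <cstdio>

int main()
{
    // 'naic, no solve', least-of-runs timings in microseconds, from above:
    double const with_patch    = 20516.0;
    double const without_patch = 19725.0;
    // Negative means the patch costs time.
    std::printf("%+.1f%%\n", 100.0 * (without_patch - with_patch) / with_patch);
}
--8<----8<----8<----8<----8<----8<----8<----8<--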
The "without patch" perf run, filtered for "round_to",
has fewer lines:
     0.00%  0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c
     0.00%  0.00%  lmi_cli_shared  liblmi.so  [.] round_to<double>::c@plt
showing that only round_to::c() scored high enough to
show up at all, whereas above the ctor also got on the
scoreboard.
Looking harder, I used this command:
$ LD_LIBRARY_PATH=.:/srv/cache_for_lmi/perf_ln \
    /srv/cache_for_lmi/perf_ln/perf_4.19 diff
and looked for "round" as opposed to "round_to"; that's
interesting (here it is after passing through 'grep round'):
     0.41%  +0.09%  liblmi.so     [.] detail::round_down<double>
     0.52%  -0.01%  liblmi.so     [.] detail::round_up<double>
     0.01%  -0.00%  liblmi.so     [.] round_to<double>::c
            +0.00%  liblmi.so     [.] member_cast<rounding_parameters, rounding_rules>
     0.01%  -0.00%  liblmi.so     [.] MemberSymbolTable<rounding_rules>::operator[]
     0.00%  +0.00%  libc-2.31.so  [.] round_and_return
            +0.00%  liblmi.so     [.] member_cast<rounding_parameters, rounding_rules>@plt
     0.00%  +0.00%  liblmi.so     [.] (anonymous namespace)::set_rounding_rule
     0.00%          liblmi.so     [.] fegetround@plt
     0.00%          liblmi.so     [.] round_to<double>::round_to
     0.00%          libc-2.31.so  [.] round_away
     0.00%          liblmi.so     [.] mc_enum<rounding_style>::value
     0.00%          liblmi.so     [.] rounding_rules::datum
     0.00%          liblmi.so     [.] any_member<rounding_rules>::exact_cast<rounding_parameters>
     0.00%          liblmi.so     [.] rounding_parameters::style@plt
     0.00%          liblmi.so     [.] rounding_parameters::decimals@plt
     0.00%          liblmi.so     [.] rounding_parameters::style
Indeed, round_to is implemented in terms of the
round_{down,up} functions on the first two lines above.
Still, the top data lines of unfiltered
'perf diff' output don't mention rounding:
# Baseline  Delta Abs  Shared Object  Symbol
# ........  .........  .............  ..........................................
#
     6.33%     -0.52%  liblmi.so      [.] AccountValue::TxSetDeathBft
     4.35%     +0.37%  liblmi.so      [.] Irc7702A::DetermineLowestBft
    14.01%     +0.34%  liblmi.so      [.] AccountValue::ChangeSpecAmtBy
     3.46%     -0.22%  liblmi.so      [.] AccountValue::DecrementAVProportionally
     4.78%     -0.22%  liblmi.so      [.] AccountValue::DoMonthDR
     1.48%     -0.15%  liblmi.so      [.] Irc7702A::MaxNecessaryPremium
     0.36%     +0.15%  liblmi.so      [.] AccountValue::SurrChg@plt
     2.90%     -0.13%  liblmi.so      [.] AccountValue::SurrChg
     0.62%     -0.12%  liblmi.so      [.] AccountValue::TxAcceptPayment
     0.77%     +0.10%  liblmi.so      [.] AccountValue::TxSetBOMAV
     0.99%     +0.10%  libc-2.31.so   [.] _int_free
     1.21%     +0.10%  libc-2.31.so   [.] _int_malloc
and all of the "round"-filtered measurements together
account for only about one percent of the 'perf' total
and a tenth of a percent in 'perf diff'.
I have only one theory to explain this: rounding involves
a great number of function calls, each of which is so
inexpensive that it doesn't get counted by 'perf' even
with '--freq=max'; but in the aggregate those calls add
up to an appreciable fraction of lmi's total work.
That's kind of like saying that in code like
x = trunc(erf(a) + atan(b) / expm1(c));
even if we sample the instruction pointer ten times during
that statement, we might expect to be in the trunc() call
zero times, because trunc() is cheap compared to the other
functions. Does that seem reasonable?
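(To put rough numbers on that analogy, with figures I'm assuming
purely for illustration: if trunc() were, say, 2% of that
statement's cost, then ten instruction-pointer samples give an
expected 0.2 hits in trunc(), so observing zero hits is the most
likely outcome.)
--8<----8<----8<----8<----8<----8<----8<----8<--
#include <cstdio>

int main()
{
    double const trunc_share = 0.02; // assumed share of the statement's cost
    int    const samples     = 10;   // instruction-pointer samples taken
    // Expected number of samples that land inside trunc():
    std::printf("expected hits in trunc(): %.2f\n", trunc_share * samples);
}
--8<----8<----8<----8<----8<----8<----8<----8<--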
And is there any useful thing I can do with 'perf' here,
or is it just not the right tool for this job?