Re: [lmi] A use case for long double
From: Greg Chicares
Subject: Re: [lmi] A use case for long double
Date: Sun, 1 May 2022 19:23:15 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0
On 5/1/22 01:36, Vadim Zeitlin wrote:
> On Sat, 30 Apr 2022 17:46:05 +0000 Greg Chicares <gchicares@sbcglobal.net> wrote:
>
> GC> In preparation for migrating lmi releases from 32- to 64-bit binaries,
> GC> I've been reconsidering lmi's use of type 'long double'. I postulate
> GC> that 'long double' should not be used in place of 'double' without a
> GC> convincing rationale, because it's less common in practice and because
> GC> it's presumably slower for x86_64.
>
> FWIW, this is exactly what I thought too...
I still cling to the principle, but now I would 's/because.*$//'.
But a principle without a rationale is merely a tenet of faith,
which we workers should question on this holy day.
> GC> I had anticipated that IRR calculations would be faster, though
> GC> somewhat (but perhaps tolerably) less accurate using binary64.
> GC> However, see:
> GC>
> https://git.savannah.nongnu.org/cgit/lmi.git/commit/?h=odd/eraseme_long_double_irr
> GC> It looks like we should keep the existing binary80 IRR code, because
> GC> it's no slower, and achieves an extra two digits of precision in a
> GC> not-implausible test case.
> GC>
> GC> The apparent lack of a speed penalty came as a surprise to me,
> GC> but we follow the evidence wherever it may lead.
>
> Yes, but it would be really nice to understand how this is possible: I
> just don't understand how the legacy x87 part of the hardware could be
> faster (even in absolute terms, not just "per bit of precision") than the
> much more recent SSE instructions.
Speculation: The x87 hardware is an extra FPU core, normally
unused for x86_64; offloading some instructions to it brings it
into use, so we're "firing on all cylinders", and why couldn't
that be faster? In 8086 days, it was possible to gain some
speed by doing a few integer math operations on an 8087, as
long as the synchronization worked out all right (i.e., by using
the FN-prefixed "no-wait" 8087 instructions).
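That hypothesis is testable in isolation. Here's a minimal sketch
(my own illustration, not lmi code) that keeps a 'double' dependency
chain on the SSE unit and a 'long double' chain on the x87 unit in
the same loop; if the two units really do run concurrently, the
combined loop should take little longer than either chain alone:

    #include <chrono>
    #include <iostream>

    int main()
    {
        // On x86_64, gcc compiles 'double' arithmetic to SSE2
        // instructions and 'long double' arithmetic to x87
        // instructions, so these two dependency chains exercise
        // the two units independently.
        double      const d  = 1.0000001;
        long double const ld = 1.0000001L;
        double      acc_d  = 1.0;
        long double acc_ld = 1.0L;
        auto const t0 = std::chrono::steady_clock::now();
        for(int j = 0; j < 100000000; ++j)
            {
            acc_d  *= d;  // SSE chain
            acc_ld *= ld; // x87 chain
            }
        auto const t1 = std::chrono::steady_clock::now();
        std::cout
            << acc_d << ' ' << acc_ld << ' '
            << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            ;
    }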
> I wonder if the results could be better
> if we used some more aggressive code generation options, e.g. if you could
> perhaps rerun the benchmarks with the -march=native compilation option? This
> seems unlikely, but maybe the compiler generates some very suboptimal SSE
> code because it keeps compatibility with some very old micro-architectures
> by default?
Architecture: x86_64-pc-linux-gnu; compiler: gcc-11.2.0
command:
$make $coefficiency unit_tests unit_test_targets=financial_test 2>&1 |grep form
baseline (HEAD):
iterator form: 1.440e-02 s mean; 13512 us least of 70 runs
container form: 1.361e-02 s mean; 13387 us least of 74 runs
make-clean, emphatically:
$make raze
change options:
$sed -i workhorse.make -e's/-O2/-O3/' -e's/-frounding-math/-ffast-math -march=native/'
result:
iterator form: 1.412e-02 s mean; 13102 us least of 71 runs
container form: 1.338e-02 s mean; 13105 us least of 75 runs
switch to branch with double in lieu of long double:
$git switch odd/eraseme_long_double_irr
M workhorse.make
Switched to branch 'odd/eraseme_long_double_irr'
Your branch is up to date with 'origin/odd/eraseme_long_double_irr'.
verify that no 'long double' remains--yes, this command finds nothing:
$grep 'long double' financial.hpp
raze again:
$make raze
result:
iterator form: 1.383e-02 s mean; 13207 us least of 73 runs
container form: 1.373e-02 s mean; 13179 us least of 73 runs
Thus, TL;DR: "no".
> GC> It would be conceivable to do IRR calculations using expm1() and
> GC> log1p(), but that doesn't seem attractive. The principal part of
> GC> the calculation is evaluation of NPV, the inner product of a stream
> GC> of values ("cash flows") and a vector of powers (1+i)^n, n=0,1,2...
>
> Shouldn't this calculation be vectorizable then? I'm sorry, I haven't looked
> at the details yet, but if there is any chance of being able to vectorize a
> loop, it would be worth doing, as this could result in really spectacular
> gains.
You needn't look at the details in lmi; the question is simply
whether a polynomial evaluation can be vectorized, whether using
Horner's rule or some other algorithm.
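Concretely, with v = 1/(1+i), NPV is the polynomial
c[0] + c[1]*v + c[2]*v^2 + ... evaluated by Horner's rule, and
that recurrence is strictly serial. A minimal sketch (hypothetical
names, not lmi's actual interface):

    #include <vector>

    // NPV of cash flows c[0..n] at rate i, as the polynomial in
    // v = 1/(1+i) with coefficients c[k], evaluated by Horner's
    // rule. Each step depends on the previous one, so the loop
    // cannot be auto-vectorized without reassociating the
    // arithmetic.
    double npv(std::vector<double> const& c, double i)
    {
        double const v = 1.0 / (1.0 + i);
        double z = 0.0;
        for(auto k = c.rbegin(); k != c.rend(); ++k)
            {
            z = z * v + *k;
            }
        return z;
    }

Vectorizing that would mean splitting the polynomial into, say,
even and odd parts in powers of v^2 and combining at the end,
which changes the rounding behavior--exactly the reassociation
that gcc refuses to perform without -ffast-math (specifically
-fassociative-math).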
I haven't looked for my copy of TAOCP yet, but I have tried
FMA, and that will be the topic of a separate message.
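In the meantime, for concreteness: the Horner recurrence above maps
one-for-one onto std::fma, so an experiment might take this shape
(again my own sketch, not the code behind that forthcoming message):

    #include <cmath>
    #include <vector>

    // Same Horner loop with the multiply and add fused: one
    // rounding per step instead of two. With -mfma (implied by
    // -march=native on FMA-capable CPUs), gcc emits a single FMA
    // instruction per step; without it, std::fma falls back to a
    // slow library call.
    double npv_fma(std::vector<double> const& c, double i)
    {
        double const v = 1.0 / (1.0 + i);
        double z = 0.0;
        for(auto k = c.rbegin(); k != c.rend(); ++k)
            {
            z = std::fma(z, v, *k);
            }
        return z;
    }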