
Re: [lmi] A use case for long double


From: Greg Chicares
Subject: Re: [lmi] A use case for long double
Date: Sun, 1 May 2022 19:23:15 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0

On 5/1/22 01:36, Vadim Zeitlin wrote:
> On Sat, 30 Apr 2022 17:46:05 +0000 Greg Chicares <gchicares@sbcglobal.net> 
> wrote:
> 
> GC> In preparation for migrating lmi releases from 32- to 64-bit binaries,
> GC> I've been reconsidering lmi's use of type 'long double'. I postulate
> GC> that 'long double' should not be used in place of 'double' without a
> GC> convincing rationale, because it's less common in practice and because
> GC> it's presumably slower for x86_64.
> 
>  FWIW, this is exactly what I thought too...

I still cling to the principle, but now I would 's/because.*$//'.
But a principle without a rationale is merely a tenet of faith,
which we workers should question on this holy day.

> GC> I had anticipated that IRR calculations would be faster, though
> GC> somewhat (but perhaps tolerably) less accurate using binary64.
> GC> However, see:
> GC>   
> https://git.savannah.nongnu.org/cgit/lmi.git/commit/?h=odd/eraseme_long_double_irr
> GC> It looks like we should keep the existing binary80 IRR code, because
> GC> it's no slower, and achieves an extra two digits of precision in a
> GC> not-implausible test case.
> GC> 
> GC> The apparent lack of a speed penalty came as a surprise to me,
> GC> but we follow the evidence wherever it may lead.
> 
>  Yes, but it would be really nice to understand how this is possible: I
> just don't understand how the legacy x87 part of the hardware could be
> faster (even in absolute terms, not just "per bit of precision") than the
> much more recent SSE instructions.

Speculation: The x87 hardware is an extra FPU core, normally
unused for x86_64; offloading some instructions to it brings it
into use, so we're "firing on all cylinders", and why couldn't
that be faster? In 8086 days, it was possible to gain some
speed by doing a few integer math operations on an 8087, as
long as the synchronization worked out all right (i.e., using
the FN-prefixed no-wait 8087 instructions).

> I wonder if the results could be better
> if we use some more aggressive code generation options, e.g. if you could
> perhaps rerun the benchmarks with -march=native compilation option? This
> seems unlikely, but maybe the compiler generates some very suboptimal SSE
> code because it keeps compatibility with some very old micro-architectures
> by default?

Architecture: x86_64-pc-linux-gnu; compiler: gcc-11.2.0

command:
  $make $coefficiency unit_tests unit_test_targets=financial_test 2>&1 |grep form

baseline (HEAD):
  iterator  form: 1.440e-02 s mean;      13512 us least of  70 runs
  container form: 1.361e-02 s mean;      13387 us least of  74 runs

make-clean, emphatically:
  $make raze
change options:
  $sed -i workhorse.make -e's/-O2/-O3/' -e's/-frounding-math/-ffast-math -march=native/'
result:
  iterator  form: 1.412e-02 s mean;      13102 us least of  71 runs
  container form: 1.338e-02 s mean;      13105 us least of  75 runs

switch to branch with double in lieu of long double:
  $git switch odd/eraseme_long_double_irr
  M       workhorse.make
  Switched to branch 'odd/eraseme_long_double_irr'
  Your branch is up to date with 'origin/odd/eraseme_long_double_irr'.

verify--yes, this command fails (no matches, so 'long double' is gone):
  $grep 'long double' financial.hpp
raze again:
  $make raze
result:
  iterator  form: 1.383e-02 s mean;      13207 us least of  73 runs
  container form: 1.373e-02 s mean;      13179 us least of  73 runs

Thus, TL;DR: "no".

> GC> It would be conceivable to do IRR calculations using expm1() and
> GC> log1p(), but that doesn't seem attractive. The principal part of
> GC> the calculation is evaluation of NPV, the inner product of a stream
> GC> of values ("cash flows") and a vector of powers (1+i)^n, n=0,1,2...
> 
>  Shouldn't this calculation be vectorizable then? I'm sorry, I didn't look
> at the details yet, but if there is any chance of being able to vectorize a
> loop, it would be worth doing it as this could result in really spectacular
> gains.

You needn't look at the details in lmi; the question is simply
whether a polynomial evaluation can be vectorized, whether using
Horner's rule or some other algorithm.

I haven't looked for my copy of TAOCP yet, but I have tried
FMA, and that will be the topic of a separate message.

