From: Greg Chicares
Subject: Re: [lmi] [lmi-commits] master f1ec209 1/9: Demonstrate that PETE has a non-zero overhead
Date: Sun, 4 Apr 2021 13:12:56 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

On 4/4/21 11:30 AM, Vadim Zeitlin wrote:
> On Sun,  4 Apr 2021 07:00:19 -0400 (EDT) Greg Chicares 
> <gchicares@sbcglobal.net> wrote:
> 
> GC> branch: master
> GC> commit f1ec2099b587c101b95a3062ac2601acf77e7af0
> GC> Author: Gregory W. Chicares <gchicares@sbcglobal.net>
> GC> Commit: Gregory W. Chicares <gchicares@sbcglobal.net>
> GC> 
> GC>     Demonstrate that PETE has a non-zero overhead
> GC>     
> GC>     For each_equal(), it's about half as fast as plain C++.

Earlier revisions in that unit test had seemed to suggest that PETE
had no real overhead, for more complex cases such as asserting that
a vector<double> has only boolean-valued elements. That seemed too
good to be true, so I wondered whether this simpler test would have
any overhead. It did, and that was important to memorialize.

>  This seems amazingly poor, but I guess the vectors here might be small
> enough that there is a non-negligible overhead from just the extra
> function calls involved when using PETE?

One hundred nanoseconds versus fifty: the ratio is poor, but the
absolute difference is still only 50 nsec. (Here, I'm measuring
only 64-bit pc-linux-gnu.)

You're right: the crucial parameter here is the length of the
vectors 'iv1' and 'iv2'--call it N, which is fifty in that test.
With N=200, the speed ratio is about 1.6:1 instead of 2:1.
With N=10 or N=2, it's only a little worse than 2:1.

Veldhuizen's blitz++ documentation that used to live at oonumerics.org
showed extensive performance graphs that were broadly in accord with
our findings here, IIRC: a significant overhead for small N, which
vanishes asymptotically for N >> 100.

>  I guess it's not worth looking into this further, as you would have
> probably indicated if it were, but I didn't expect the abstraction penalty
> to be so high here.

'expression_template_0_test' tests this expression:
  Z += X - 2.1 * Y;
for a wide range of array lengths. Filtered results, with the ratio
of the "mean" values added as a percentage (below), seem broadly
consistent with what we saw above: for small N, PETE's overhead
might be fifty or a hundred percent, but for N>1000 it seems to
vanish.

  Speed tests: array length 1
  C               : 5.879e-07 s mean;          0 us least of 17009 runs 100%
  valarray        : 4.040e-07 s mean;          0 us least of 24755 runs  69%
  PETE            : 8.307e-07 s mean;          0 us least of 12038 runs 141%

  Speed tests: array length 10
  C               : 1.008e-06 s mean;          1 us least of 9920 runs  100%
  valarray        : 1.063e-06 s mean;          1 us least of 9405 runs  105%
  PETE            : 1.958e-06 s mean;          1 us least of 5107 runs  194%

  Speed tests: array length 100
  C               : 1.027e-05 s mean;         10 us least of 974 runs   100%
  valarray        : 1.042e-05 s mean;         10 us least of 960 runs   101%
  PETE            : 1.269e-05 s mean;         12 us least of 789 runs   124%

  Speed tests: array length 1000
  C               : 9.639e-05 s mean;         96 us least of 104 runs   100%
  valarray        : 7.291e-05 s mean;         72 us least of 138 runs    76%
  PETE            : 8.319e-05 s mean;         82 us least of 121 runs    86%

  Speed tests: array length 10000
  C               : 7.900e-04 s mean;        763 us least of 100 runs   100%
  valarray        : 7.734e-04 s mean;        763 us least of 100 runs    98%
  PETE            : 8.377e-04 s mean;        829 us least of 100 runs   106%

