From: Greg Chicares
Subject: Re: [lmi] [lmi-commits] master f1ec209 1/9: Demonstrate that PETE has a non-zero overhead
Date: Sun, 4 Apr 2021 13:12:56 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0
On 4/4/21 11:30 AM, Vadim Zeitlin wrote:
> On Sun, 4 Apr 2021 07:00:19 -0400 (EDT) Greg Chicares
> <gchicares@sbcglobal.net> wrote:
>
> GC> branch: master
> GC> commit f1ec2099b587c101b95a3062ac2601acf77e7af0
> GC> Author: Gregory W. Chicares <gchicares@sbcglobal.net>
> GC> Commit: Gregory W. Chicares <gchicares@sbcglobal.net>
> GC>
> GC> Demonstrate that PETE has a non-zero overhead
> GC>
> GC> For each_equal(), it's about half as fast as plain C++.
Earlier revisions in that unit test had seemed to suggest that PETE
had no real overhead, for more complex cases such as asserting that
a vector<double> has only boolean-valued elements. That seemed too
good to be true, so I wondered whether this simpler test would have
any overhead. It did, and that was important to memorialize.
> This seems amazingly poor, but I guess the vectors here might be small
> enough that there is a non-negligible overhead from just the extra
> function calls involved when using PETE?
One hundred nanoseconds versus fifty: the ratio is poor, but the
absolute difference is still only 50 ns. (Here, I'm measuring
only 64-bit pc-linux-gnu.)
You're right: the crucial parameter here is the length of the
vectors 'iv1' and 'iv2'--call it N, which is fifty in that test.
With N=200, the speed ratio is about 1.6 : 1 instead of double.
With N=10 or N=2, it's only a little worse than double.
Veldhuizen's blitz++ documentation that used to live at oonumerics.org
showed extensive performance graphs that were broadly in accord with
our findings here, IIRC: a significant overhead for small N, which
vanishes asymptotically for N >> 100.
> I guess it's not worth looking into this further, as you would have
> probably indicated if it were, but I didn't expect the abstraction penalty
> to be so high here.
'expression_template_0_test' tests this expression:
Z += X - 2.1 * Y;
for a wide range of array lengths. Filtered results, with each
"mean" value's ratio to the C baseline added as a percentage
(below), seem broadly consistent with what we saw above: for
small N, PETE's overhead might be fifty or a hundred percent,
but for N>1000 it seems to vanish.
Speed tests: array length 1
C : 5.879e-07 s mean; 0 us least of 17009 runs 100%
valarray : 4.040e-07 s mean; 0 us least of 24755 runs 69%
PETE : 8.307e-07 s mean; 0 us least of 12038 runs 141%
Speed tests: array length 10
C : 1.008e-06 s mean; 1 us least of 9920 runs 100%
valarray : 1.063e-06 s mean; 1 us least of 9405 runs 105%
PETE : 1.958e-06 s mean; 1 us least of 5107 runs 194%
Speed tests: array length 100
C : 1.027e-05 s mean; 10 us least of 974 runs 100%
valarray : 1.042e-05 s mean; 10 us least of 960 runs 101%
PETE : 1.269e-05 s mean; 12 us least of 789 runs 124%
Speed tests: array length 1000
C : 9.639e-05 s mean; 96 us least of 104 runs 100%
valarray : 7.291e-05 s mean; 72 us least of 138 runs 76%
PETE : 8.319e-05 s mean; 82 us least of 121 runs 86%
Speed tests: array length 10000
C : 7.900e-04 s mean; 763 us least of 100 runs 100%
valarray : 7.734e-04 s mean; 763 us least of 100 runs 98%
PETE : 8.377e-04 s mean; 829 us least of 100 runs 106%