From: Vadim Zeitlin
Subject: Re: [lmi] [lmi-commits] master 7dd2680 14/14: Add and use a forward-summation function template
Date: Sun, 21 Feb 2021 00:00:00 +0100
On Sat, 20 Feb 2021 21:38:00 +0000 Greg Chicares <gchicares@sbcglobal.net> wrote:

GC> On 2/20/21 2:49 PM, Vadim Zeitlin wrote:
GC> > On Thu, 18 Feb 2021 12:03:42 -0500 (EST) Greg Chicares <gchicares@sbcglobal.net> wrote:
GC> >
GC> > GC> branch: master
GC> > GC> commit 7dd2680044d48d794d1e68e087d0795ea70b2525
GC> > GC> Author: Gregory W. Chicares <gchicares@sbcglobal.net>
GC> > GC> Commit: Gregory W. Chicares <gchicares@sbcglobal.net>
GC> > GC>
GC> > GC> Add and use a forward-summation function template
GC> > GC>
GC> > GC> Incidentally, this commit will make it simpler to
GC> > GC> s/partial_sum/inclusive_scan/
GC> > GC> once gcc is upgraded from version 8 for lmi production.
GC> >
GC> > I'm curious if using inclusive_scan() rather than partial_sum() can
GC> > change the results in practice. I know that in theory it could, due to
GC> > the floating point addition operation not being associative, but I
GC> > haven't actually tried confirming this experimentally and so I don't
GC> > really know if there could be a noticeable effect.
GC>
GC> Suppose I use a std::inclusive_scan overload that doesn't specify
GC> an execution policy. Is that equivalent to specifying the
GC> "sequenced" policy?
To take the question literally, the answer is "definitely no": using any
parallel algorithm differs from using the corresponding non-parallel
algorithm without any policy, if only because of a drastically different
approach to exceptions. An exception thrown during parallel algorithm
execution results in a call to std::terminate(), which is, of course, not
the case for the non-parallel algorithms.
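FWIW, here is a minimal sketch (not lmi code; names and data are made up)
contrasting the two call forms; note that with libstdc++ the policy
overloads may additionally require the TBB backend:

    #include <execution>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<double> v{1.0, 2.0, 3.0};
        std::vector<double> out(v.size());

        // No policy: sequential, and an exception thrown during the
        // scan propagates to the caller as usual.
        std::inclusive_scan(v.begin(), v.end(), out.begin());

        // Any policy, even std::execution::seq: an uncaught exception
        // escaping during execution calls std::terminate() instead.
        std::inclusive_scan(std::execution::seq, v.begin(), v.end(), out.begin());
    }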
But answering the question you probably wanted to ask, i.e. which
execution policy is used by the non-parallel algorithm, is more difficult,
and I couldn't find a definitive answer to it either. I _believe_ that
the overload not taking an execution policy might be using the unsequenced
policy by default, i.e. enabling auto-vectorization, but I'm not at all
sure about this. It seems "obvious" that the parallel, let alone parallel
unsequenced, policy can't be used implicitly by default, because this could
trivially break code that only works with the sequenced policy, but I
couldn't find where this is written either.
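Whatever the answer, the overload lmi would actually use is the one without
a policy, so the change itself stays mechanical. A hypothetical before/after
sketch (the variable names are invented, not taken from lmi):

    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<double> payments{100.25, 99.75, 101.50}; // invented data
        std::vector<double> cumulative(payments.size());

        // Today: strictly left-to-right summation.
        std::partial_sum(payments.begin(), payments.end(), cumulative.begin());

        // Candidate replacement: same interface, but the standard permits
        // the additions to be regrouped (it assumes associativity).
        std::inclusive_scan(payments.begin(), payments.end(), cumulative.begin());
    }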
GC> If (as I infer, but cannot substantiate by citing the standard)
GC> they are equivalent, then inclusive_scan with an elided (default)
GC> execution policy should be equivalent to partial_sum.
Just to be clear, the theoretical problem I was referring to was the
non-associativity of floating point addition: changing the order of the
operations may change the result, although I still don't know whether it
can change it significantly.
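A tiny self-contained illustration of that non-associativity, using the
classic 0.1 + 0.2 + 0.3 example rather than anything lmi-specific:

    #include <cstdio>

    int main()
    {
        double a = 0.1, b = 0.2, c = 0.3;
        double left  = (a + b) + c; // left-to-right, as partial_sum() sums
        double right = a + (b + c); // a regrouping inclusive_scan() may use
        // Prints two values differing in the last ulp:
        // 0.60000000000000009 vs 0.59999999999999998
        std::printf("%.17g\n%.17g\n", left, right);
    }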
GC> And in that case, we should s/partial_sum/inclusive_scan/
GC> everywhere: inclusive_scan is preferable because it offers the
GC> option of using parallelism, which we can later investigate, and
GC> then exploit or not.
[...]
GC> The relevant vectors in lmi are almost always of length [0,100].
[...]
GC> Even for such a small length, parallelism might provide a
GC> worthwhile speed improvement.
Thanks for answering my question about the vector sizes. Now I'm certain
that using parallel policies is never worthwhile for vectors of such
size. There might be a tiny benefit to auto-vectorization for them,
but they're far too small for it to matter. IOW I don't think it's
interesting to explore using inclusive_scan() here or to worry about its
execution policies: speed-wise this is never going to change anything in
practice (but it might still conceivably change the result...).
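If the only remaining concern is whether the result could change, that is
cheap to check directly. A sketch (with synthetic data, not lmi's) comparing
the two algorithms on a 100-element vector:

    #include <cassert>
    #include <numeric>
    #include <random>
    #include <vector>

    int main()
    {
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> dist(0.0, 1000.0);
        std::vector<double> v(100);
        for(auto& x : v) x = dist(gen);

        std::vector<double> a(v.size()), b(v.size());
        std::partial_sum   (v.begin(), v.end(), a.begin());
        std::inclusive_scan(v.begin(), v.end(), b.begin());
        // With a serial, left-to-right library implementation these should
        // match bit for bit; any mismatch would flag a regrouping.
        assert(a == b);
    }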
Regards,
VZ