
Re: [lmi] Using auto-vectorization


From: Greg Chicares
Subject: Re: [lmi] Using auto-vectorization
Date: Wed, 25 Jan 2017 11:07:51 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

On 2017-01-24 17:14, Vadim Zeitlin wrote:
> On Tue, 24 Jan 2017 04:11:10 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> On 2017-01-24 02:49, Vadim Zeitlin wrote:
> GC> [...]
> GC> > ET-based code seems to profit from auto-vectorization just
> GC> > as well as everything else, so I don't see any reason to use
> GC> > anything else, especially if the code clarity and simplicity are
> GC> > the most important criteria.
> GC> > 
> GC> >  Now, whether using the particular PETE library is the best choice
> GC> > in 2017 is another question and I suspect that it isn't, but I'm
> GC> > not aware of any critical problems with it either.
> GC> 
> GC> It seems that there was a flurry of interest around the turn of the
> GC> century, but almost none since then. The audience for ET libraries is
> GC> relatively small, and I'd guess that most potential users chose a
> GC> library long ago and aren't interested in changing.
> 
>  There are still a few actively developed libraries built on ET, e.g. Eigen
> (http://eigen.tuxfamily.org/) or Armadillo (http://arma.sourceforge.net/)

Both seem to be MPL2:
  https://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses
Maybe we should take a look at them someday.
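If we did, I'd expect the syntax to be essentially what we already have
with PETE. For illustration only (I haven't actually tried Eigen, so
treat this as a sketch rather than tested code), the benchmark
expression written with Eigen's coefficient-wise ArrayXd would
presumably be:

    // Sketch only: Eigen::ArrayXd is Eigen's coefficient-wise dynamic
    // array of doubles. The right-hand side becomes one expression
    // template, evaluated in a single pass when operator+= runs.
    #include <Eigen/Core>

    void update
        (Eigen::ArrayXd      & v2
        ,Eigen::ArrayXd const& v0
        ,Eigen::ArrayXd const& v1
        )
    {
        v2 += v0 - 2.1 * v1;
    }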

> and, of course, some older libraries such as Boost.uBLAS are still much
> newer than PETE.

Our universe is vectors of length one to one hundred, the most typical
length being about fifty; in that range, uBLAS is slow:

Running expression_template_0_test:
  Speed tests: array length 1
  C               : 3.109e-007 s =        311 ns, mean of 32169 iterations
  valarray        : 1.686e-007 s =        169 ns, mean of 59310 iterations
  uBLAS           : 3.273e-007 s =        327 ns, mean of 30553 iterations
  PETE            : 1.680e-007 s =        168 ns, mean of 59539 iterations

  Speed tests: array length 10
  C               : 1.764e-007 s =        176 ns, mean of 56685 iterations
  valarray        : 1.782e-007 s =        178 ns, mean of 56134 iterations
  uBLAS           : 3.407e-007 s =        341 ns, mean of 29357 iterations
  PETE            : 1.779e-007 s =        178 ns, mean of 56222 iterations

  Speed tests: array length 100
  C               : 2.786e-007 s =        279 ns, mean of 35892 iterations
  valarray        : 2.799e-007 s =        280 ns, mean of 35728 iterations
  uBLAS           : 4.564e-007 s =        456 ns, mean of 21912 iterations
  PETE            : 2.776e-007 s =        278 ns, mean of 36027 iterations

It's about half the speed of valarray or PETE; it's even slower than the
"STL fancy" test:

///    v2 += v0 - 2.1 * v1;
    // First pass: tmp0 = v0 - 2.1 * v1
    std::transform
        (sv0b.begin(),sv0b.end(),sv1b.begin(),tmp0.begin()
        ,std::bind
            (std::minus<double>()
            ,std::placeholders::_1
            ,std::bind(std::multiplies<double>(),std::placeholders::_2,2.1) ) );
    // Second pass: v2 += tmp0
    std::transform
        (sv2b.begin(),sv2b.end(),tmp0.begin(),sv2b.begin()
        ,std::plus<double>() );

For N=10000, uBLAS beats "STL fancy", but not by much. It might be
great for linear algebra, but for our purposes it's not suitable.
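For contrast, the valarray (and PETE) figures above come from writing
the same update directly; a minimal valarray sketch (the variable names
here are made up; the test driver's own names differ):

    #include <valarray>

    void update
        (std::valarray<double>      & v2
        ,std::valarray<double> const& v0
        ,std::valarray<double> const& v1
        )
    {
        // Same computation as the two std::transform calls above,
        // in one line and without the explicit temporary.
        v2 += v0 - 2.1 * v1;
    }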

> GC> >  So, I guess, I'm still not sure what, if anything, should be done
> GC> > here? I can spend a lot of time profiling/benchmarking/debugging
> GC> > and it probably will result in at least some useful insights, but
> GC> > I can't propose any syntax better than the current ET-based one and
> GC> > so I'm still not sure what is my goal here.
> GC> 
> GC> I think we're done for now. We aren't likely to find anything that
> GC> outperforms PETE. We can make greater use of it as time permits.
> 
>  Yes, I agree with this. However I think that you might still want to
> consider switching to -O3 (or adding just -ftree-vectorize?) as it seems to
> result in a "free" performance gain.

Maybe I should add a makefile target for the special purpose of testing
lmi's overall speed. Until then, this is probably a good test: run a
single census with 184 cells. Here, I used '--emit=emit_nothing' to
emphasize calculations, which are more likely than report generation to
be helped by '-O3'. The first set of timings uses the '-O2' binary we
would distribute today; the second set uses '-O3' instead, but otherwise
all flags are the same.

/opt/lmi/src/lmi[0]$time wine \
  /opt/lmi/src/build/lmi/Linux/gcc/ship/lmi_cli_shared.exe \
  --file=/opt/lmi/test/sample.cns --accept --ash_nazg \
  --data_path=/opt/lmi/data --emit=emit_nothing >/dev/null
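If I do add such a target someday, it might be little more than this
(a sketch; 'cli_timing' is a hypothetical name, and the recipe is just
the command above):

    # Hypothetical makefile target for timing a full census run.
    # (In a real makefile the recipe lines must begin with a tab.)
    cli_timing:
        time wine /opt/lmi/src/build/lmi/Linux/gcc/ship/lmi_cli_shared.exe \
          --file=/opt/lmi/test/sample.cns --accept --ash_nazg \
          --data_path=/opt/lmi/data --emit=emit_nothing >/dev/null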

"-O2"
12.36s user 0.30s system 96% cpu 13.059 total
11.72s user 0.34s system 96% cpu 12.447 total
12.35s user 0.30s system 96% cpu 13.050 total

/opt/lmi/src/lmi[0]$time wine \
  /opt/lmi/src/build/lmi/Linux/gcc/fastest/lmi_cli_shared.exe \
  --file=/opt/lmi/test/sample.cns --accept --ash_nazg \
  --data_path=/opt/lmi/data --emit=emit_nothing >/dev/null

"-O3"
11.60s user 0.30s system 96% cpu 12.264 total
11.53s user 0.31s system 96% cpu 12.231 total
12.32s user 0.30s system 96% cpu 13.026 total

It doesn't seem to make a large difference. Perhaps it did for you
with 64-bit builds, or at least with SSE?
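If we ever want to test the SSE hypothesis on the 32-bit build without
going to '-O3' across the board, I suppose the flags would be roughly
these (a sketch only, not flags we ship; '-mfpmath=sse' in particular
changes which floating-point unit we use, which matters for
reproducibility of results):

    # Sketch: SSE2-based auto-vectorization at -O2 (hypothetical, untested).
    CXXFLAGS += -O2 -ftree-vectorize -msse2 -mfpmath=sse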

I tried increasing the priority, but...

/opt/lmi/src/lmi[127]$time sudo nice --10 wine \
  /opt/lmi/src/build/lmi/Linux/gcc/fastest/lmi_cli_shared.exe \
  --file=/opt/lmi/test/sample.cns --accept --ash_nazg \
  --data_path=/opt/lmi/data --emit=emit_nothing >/dev/null
wine: created the configuration directory '/root/.wine'
No protocol specified
Application tried to create a window, but no driver could be loaded.
Make sure that your X server is running and that $DISPLAY is set correctly.
/opt/lmi/src/lmi[53]$sudo rm -rf /root/.wine

...apparently 'wine' needs to create a hidden window.

Repeating the tests as above, but a few hours later and with a
slightly different phase of the moon:

"-O2"
12.22s user 0.31s system 97% cpu 12.909 total
11.98s user 0.32s system 96% cpu 12.683 total
12.14s user 0.30s system 96% cpu 12.872 total

"-O3"
12.32s user 0.31s system 97% cpu 13.018 total
11.97s user 0.34s system 96% cpu 12.712 total
12.22s user 0.31s system 97% cpu 12.913 total

Now, comparing the "total" vectors element by element,
"-O2" is faster than "-O3" in each of the three pairs,
but the differences are not significant.



