Re: performance improvements


From: Rik
Subject: Re: performance improvements
Date: Tue, 3 Sep 2019 17:14:55 -0700

On 09/03/2019 03:39 PM, John W. Eaton wrote:
> On 8/30/19 12:42 AM, Rik wrote:
>> On 08/29/2019 06:43 PM, John W. Eaton wrote:
>
>>> I spent a lot of time over the last week looking at the interpreter. It's
>>> a bunch of different things, but the biggest factor seems to be using the
>>> tree walker pattern instead of just doing virtual dispatch to evaluation
>>> methods in the parse tree objects themselves.  I think I can speed it up
>>> considerably, possibly even get back much closer to the 3.4.3 performance
>>> just by going back to something more like what we used to have for the
>>> evaluator (but keeping evaluator state in an object instead of as global
>>> data).  Another option is to implement a byte code interpreter instead of
>>> directly evaluating the parse tree, but that is more work.  OTOH, it
>>> might be instructive and helpful for JIT compiling.
>>
>> This sounds exciting.  I would probably proceed conservatively, i.e.,
>> recover past performance by reverting to the previous code pattern,
>> before embarking on JIT, which could be a very big project indeed.
>
> I pushed the following changes:
>
>   http://hg.savannah.gnu.org/hgweb/octave/rev/5bf76ab4cce3
>   http://hg.savannah.gnu.org/hgweb/octave/rev/a2d3fa82b730
>   http://hg.savannah.gnu.org/hgweb/octave/rev/fcaecdbc8d8a
>
> The last changeset is the most significant and reverts the changes made
> for version 4.4 to use the visitor pattern for expression evaluation.
> As I note in the commit message, although it is desirable to have all
> parse tree evaluation functions grouped together in a single file, the
> visitor pattern can be inefficient, especially when the visitor function
> is small, because the extra levels of indirection and virtual function
> resolution can take more time than the evaluation function itself
> (evaluation of constants, for example).  That's a big reason for the
> huge performance hit when the loop body contains only a simple
> assignment of a constant value.
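
To make the comparison concrete, here's a stripped-down illustration of
the two dispatch styles (all class names are invented for the example;
this is not Octave's actual code):

  // Visitor style (4.4 era): two virtual hops per node -- accept () on
  // the node, then visit_xxx () on the walker.

  struct tree_walker;

  struct expr_node
  {
    virtual ~expr_node (void) = default;
    virtual void accept (tree_walker& tw) = 0;         // first virtual call
  };

  struct constant_node : public expr_node
  {
    double value;
    void accept (tree_walker& tw);
  };

  struct tree_walker
  {
    virtual ~tree_walker (void) = default;
    virtual void visit_constant (constant_node&) = 0;  // second virtual call
  };

  void
  constant_node::accept (tree_walker& tw)
  {
    tw.visit_constant (*this);
  }

  // Direct style (pre-4.4, restored by the changesets above): one
  // virtual call to the node's own evaluation method, with evaluator
  // state passed in an object instead of stored globally.

  class evaluator;   // holds what used to be global interpreter state

  struct expr_node_direct
  {
    virtual ~expr_node_direct (void) = default;
    virtual double evaluate (evaluator& ev) = 0;       // single virtual call
  };

  struct constant_node_direct : public expr_node_direct
  {
    double value;
    double evaluate (evaluator&) { return value; }     // trivial body
  };

When the per-node work is as trivial as returning a constant, the second
hop is nearly pure overhead.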

This seems like a classic performance versus readability/maintainability
tradeoff.  I agree that in this case it is better to give up the
"elegance" of the visitor pattern for the increase in speed.  Using the
original bm.toeplitz.orig.m benchmark, the dev branch used to take 13.3
seconds and now takes just 7.7 seconds, a 42% reduction.  For historical
comparison, this puts us between versions 3.8.2 (5.6 s) and 4.0.3
(10.1 s).  The winner is still 3.4.3 at 4.5 s.

>
> I'm not proposing a JIT compiler as an immediate fix.  My suggestion of a
> byte compiler was to walk the parse tree once and emit a list of simple
> instructions that could be evaluated without having to recursively walk
> the parse tree again.  I suspect that would be more efficient, but it
> would also require a significant amount of work.  And if we are going
> this route, maybe we should be looking for an existing virtual machine to
> use (JVM, even) instead of creating our own from scratch.
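
If we do explore the byte-compilation route, I picture something along
these lines: compile the tree once into a flat instruction list, then
re-execute that list on every loop iteration with no tree walking and no
per-node virtual dispatch.  Purely a sketch, with every name invented:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  enum class opcode : std::uint8_t { push_const, add, mul };

  struct instruction
  {
    opcode op;
    std::size_t operand;   // constant-pool index (used by push_const)
  };

  // Executing the flat list touches no parse tree and makes no virtual
  // calls; each step is one switch on a small enum.
  double
  execute (const std::vector<instruction>& code,
           const std::vector<double>& constants)
  {
    std::vector<double> stack;

    for (const instruction& insn : code)
      {
        switch (insn.op)
          {
          case opcode::push_const:
            stack.push_back (constants[insn.operand]);
            break;

          case opcode::add:
            {
              double b = stack.back ();
              stack.pop_back ();
              stack.back () += b;
            }
            break;

          case opcode::mul:
            {
              double b = stack.back ();
              stack.pop_back ();
              stack.back () *= b;
            }
            break;
          }
      }

    return stack.back ();
  }

  // Example: "2 + 3 * 4" with constants = {2, 3, 4} compiles once to
  //
  //   push_const 0, push_const 1, push_const 2, mul, add
  //
  // and can then be re-evaluated cheaply inside a loop body.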
>
> Another simpler thing that would provide immediate benefit would be to
> improve or replace the unwind_protect objects we use to save and restore
> the interpreter state.  Currently, the unwind_protect class uses a stack
> to store actions to perform when the unwind_protect object goes out of
> scope.  It also uses virtual functions to choose the specific type of
> action to perform.  That's nice and generic, but in many cases, we just
> want to save and restore a single value and the overhead is huge.  I
> added some counting to the unwind_protect class and found that when the
> actions are performed, we often have just a few objects to clean up or
> values to restore.  Here are the stats I collected when running "make
> check":
>
>   size   instances
>     9           8
>     7          15
>     6         101
>     5     1358693
>     3      736350
>     2       40139
>     1    12775752
>     0     8170259
>

Normalizing the results, size 0 is 35% of the total and size 1 is 55%,
so handling just those two cases would cover 90% of the code paths.  At
least it's nice that the data follows the Pareto principle.  I wonder
whether there is any commonality in the single value we store so
frequently; if so, maybe we should just always save and restore it.
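
For reference, the arithmetic behind those percentages, from the table
above:

  total  = 8 + 15 + 101 + 1358693 + 736350 + 40139 + 12775752 + 8170259
         = 23081317
  size 0:  8170259 / 23081317 ~= 35.4%
  size 1: 12775752 / 23081317 ~= 55.4%

so sizes 0 and 1 together account for about 90.7% of all instances.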

Finally, I suspect you're right about the unwind_protect blocks.  When I
ran with perf, there were indications in the stack traces that
unwind_protect was consuming some time, but it was hard to pinpoint
exactly where.

> So in the vast majority of cases we have at most one action to perform,
> and creating a stack and using virtual function dispatch doesn't make
> much sense.  We could speed this up a lot just by defining a few
> special-case classes that save and restore single values or call a
> single function (and with lambda capture, we just need a single simple
> interface).  I'll take a look at doing that soon.
>
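
Just to check that I'm picturing the same thing, I assume you mean
guards roughly like these: one templated class that saves and restores a
single value, and one that runs a single callable at scope exit, neither
of which needs an action stack or virtual dispatch (names invented for
illustration):

  #include <utility>

  // Save one value at construction, restore it at destruction: no
  // action stack, no heap allocation, no virtual dispatch.
  template <typename T>
  class restore_var
  {
  public:

    explicit restore_var (T& ref) : m_ref (ref), m_saved (ref) { }

    ~restore_var (void) { m_ref = m_saved; }

    // non-copyable, like other RAII guards
    restore_var (const restore_var&) = delete;
    restore_var& operator = (const restore_var&) = delete;

  private:

    T& m_ref;
    T m_saved;
  };

  // Run a single callable at scope exit; lambda capture carries any
  // context, so one non-virtual template covers the "one action" case.
  template <typename F>
  class scope_action
  {
  public:

    explicit scope_action (F fcn) : m_fcn (std::move (fcn)) { }

    ~scope_action (void) { m_fcn (); }

    scope_action (const scope_action&) = delete;
    scope_action& operator = (const scope_action&) = delete;

  private:

    F m_fcn;
  };

  // Usage sketch (the guarded names here are hypothetical; the second
  // line relies on C++17 class template argument deduction -- with
  // C++11 we'd spell the template argument or add a factory function):
  //
  //   restore_var<bool> restore_flag (some_interpreter_flag);
  //   scope_action cleanup ([&] () { some_cleanup_fcn (); });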

I'll re-run under perf to see what the new code hotspots are.  Now that
the overall running time is lower, the remaining issues will stand out
more clearly in the profile.

--Rik


