guile-devel
Re: Inlining calls to primitives


From: Ludovic Courtès
Subject: Re: Inlining calls to primitives
Date: Tue, 05 Sep 2006 18:21:29 +0200
User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Hi Neil,

Neil Jerram <address@hidden> writes:

> Interesting piece of work.
>
> It seems to me, though, that there are 3 things going on here.
>
> 1. Memoization of global variable references that yield one of a
>    particular subset of common procedures.  (I call this part
>    memoization because it seems similar to the memoization that we
>    already do for syntax like let, begin, and, etc.)
>
> 2. Inlining of the code for these procedures within CEVAL.
>
> 3. Changing IM_SYMs to be dynamic instead of fixed constants, plus the
>    macrology and GCC jump table stuff.
>
> Do you know what the relative contributions of these 3 changes are?

Thanks Neil for clarifying this.  The measurements you propose are
indeed a good idea and the results are not exactly as I was expecting
(which confirms that I'm not very good at predicting performance ;-)).
BTW, imsyms are not assigned dynamically: they are assigned statically
by the `extract-imsyms.sh' script.

I made a series of measurements with Guile compiled with `-pg -O0'.
Then I tried different configurations switching on and off each of these
3 features.  The first table below summarizes the execution time
improvement, looking at the execution time of `every' itself as well as
the execution time of the whole program.

                                       `every'     overall
------------------------------------+----------------------
jump table vs. switch               |    0.8%      -1.4% (worse!)
inlining in `CEVAL ()' vs. funcall  |   11.0%       4.7%

The second table shows the improvement compared to the non-memoizing +
jump table version (i.e., with `(eval-disable 'inline)'):

                                  `every'    overall
--------------------------------+----------------------
memoization + jt + inline       | 32.4%      22.1%
memoization + switch + inline   | 31.9%      23.2%
memoization + jt + funcall      | 24.0%      18.3%

(Beware: I only ran each test case 3 times or so, so these figures should
not be considered an ultimate benchmark!  I'm attaching the full
results for the record.)
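
The percentages in both tables appear consistent with computing
(before - after) / before on the attached raw timings, using the
`time taken' figure for `every' and the `real' time for the whole
program.  That formula is an assumption, not something stated in this
message, and the helper names below are made up for illustration.  A
minimal sketch:

```c
#include <assert.h>
#include <stdio.h>

/* Relative improvement, in percent, when a run that took `before' now
   takes `after'.  Positive means the new configuration is faster.  */
static double
improvement (double before, double after)
{
  assert (before > 0.0);
  return 100.0 * (before - after) / before;
}

/* Recompute the second table from the attached raw timings:
   `time taken' for `every', and `real' seconds for the whole run.  */
static void
print_second_table (void)
{
  static const struct { const char *name; double every, real; } runs[] = {
    { "memoization + jt + inline    ", 352.0, 7.910 },
    { "memoization + switch + inline", 355.0, 7.803 },
    { "memoization + jt + funcall   ", 396.0, 8.299 },
  };
  /* Baseline: no memoization + jump table.  */
  const double base_every = 521.0, base_real = 10.163;
  size_t i;

  for (i = 0; i < sizeof runs / sizeof runs[0]; i++)
    printf ("%s | %5.1f%%  %5.1f%%\n", runs[i].name,
            improvement (base_every, runs[i].every),
            improvement (base_real, runs[i].real));
}
```

Calling `print_second_table ()' should reproduce the table to within
rounding of the last digit.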

In short, the effect of using a jump table is negligible in this
context (it's really a micro-optimization compared to the two other
things).
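
For readers unfamiliar with the two dispatch styles being compared,
here is a rough sketch on a toy accumulator machine.  The opcodes and
function names are hypothetical and have nothing to do with Guile's
actual IM_SYMs or with the patch itself; the jump-table variant relies
on GCC's "labels as values" extension, so it needs GCC or a compatible
compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy machine, unrelated to Guile's IM_SYMs.  */
enum toy_op { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Dispatch with a plain `switch'.  */
static long
run_switch (const enum toy_op *code, long acc)
{
  assert (code != NULL);
  for (;;)
    switch (*code++)
      {
      case OP_INC:    acc += 1; break;
      case OP_DOUBLE: acc *= 2; break;
      case OP_HALT:   return acc;
      default:        return -1;   /* unknown opcode */
      }
}

/* Dispatch with a jump table: each handler jumps straight to the next
   opcode's label instead of going back through the `switch'.  */
static long
run_jump_table (const enum toy_op *code, long acc)
{
  /* Indexed by opcode; the order must match `enum toy_op'.  */
  static void *const table[OP_COUNT] = { &&do_inc, &&do_double, &&do_halt };

  assert (code != NULL);
#define NEXT() goto *table[*code++]
  NEXT ();
 do_inc:    acc += 1; NEXT ();
 do_double: acc *= 2; NEXT ();
 do_halt:   return acc;
#undef NEXT
}
```

Both functions compute the same result for the same program; only the
way control reaches each handler differs.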

Function call overhead, however, _is_ important, though it is only the
second-largest source of improvement.  Repeatedly using function calls
to execute a handful of instructions is costly.  Plus it probably
increases cache misses, things like that.

Now, if we generalized the memoization thing, as you suggested, so that
any procedure could be memoized (based on user annotations), then things
may be a bit different because we would be using indirect function calls
(i.e., like `SCM (*func) () = xxx; return (func (arg));') while in my
measurements I was using immediate function calls (as in `scm_car
(op)').  I should compare indirect and immediate function calls, but I
presume that there is a slight performance difference.
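
The two kinds of call being contrasted can be sketched as follows.
The types and functions are toy stand-ins, not libguile's definitions,
and whether a measurable difference shows up will depend on the
compiler and optimization level (at `-O0' neither call is inlined).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for a pair and a one-argument primitive.  These are
   hypothetical and NOT libguile's `SCM' or `scm_car'.  */
struct toy_pair { long car; struct toy_pair *cdr; };
typedef long (*toy_prim) (const struct toy_pair *);

static long
toy_car (const struct toy_pair *p)
{
  assert (p != NULL);
  return p->car;
}

/* Immediate call: the callee is known at compile time.  */
static long
call_direct (const struct toy_pair *p)
{
  return toy_car (p);
}

/* Indirect call: the callee is only known at run time, as it would be
   if any user-annotated procedure could be memoized.  */
static long
call_indirect (toy_prim func, const struct toy_pair *p)
{
  assert (func != NULL);
  return func (p);
}
```

Timing `call_direct' against `call_indirect' in a tight loop would be
one way to make the comparison mentioned above.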

Finally, memoization does indeed play an important role.  I suspect that
it's mostly because, for instance, the argument count is checked only at
memoization time, and not when the "inlinable" is actually executed.
Plus the memoization code is pretty local (unlike when `CEVAL ()' has to
go through `evap0', then `evap1', etc.).
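
The idea of checking the argument count once, at memoization time, can
be illustrated with a toy evaluator.  Everything here (node layout,
names, the single `negate' primitive) is hypothetical and much simpler
than what `CEVAL ()' does; it is only meant to show the shape of the
technique.

```c
#include <assert.h>
#include <string.h>

enum toy_kind { TOY_CALL, TOY_INLINE_NEGATE };

/* A call node: `(NAME ARG)' before memoization, an inlinable after.  */
struct toy_node
{
  enum toy_kind kind;
  const char *name;   /* procedure name, used only while TOY_CALL */
  int argc;           /* number of arguments written in the source */
  long arg;           /* the (already evaluated) argument */
};

static int toy_arity_checks;   /* how many times arity was validated */

/* Validate the call once and rewrite the node in place.
   Returns 0 on success, -1 for an unknown procedure or a bad arity.  */
static int
toy_memoize (struct toy_node *node)
{
  toy_arity_checks++;
  if (strcmp (node->name, "negate") != 0 || node->argc != 1)
    return -1;
  node->kind = TOY_INLINE_NEGATE;
  return 0;
}

/* Evaluate a node.  Only the first evaluation pays for the name
   comparison and the arity check; later ones take the fast path.  */
static long
toy_eval (struct toy_node *node)
{
  if (node->kind == TOY_CALL && toy_memoize (node) != 0)
    return 0;   /* error handling elided in this sketch */
  return -node->arg;
}
```

Evaluating the same node repeatedly increments `toy_arity_checks' only
once, which is the effect described above.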

I'm afraid this is kind of a dirty report, but I hope it sheds some
light on the issue.


Also, Rob mentioned on IRC that he was concerned about the global
switch.  I believe this can be fixed using fluids or something like that
so that inlining can be enabled/disabled on a per-module basis (as we
did with `current-reader').  But that will be the topic of another
thread maybe.  ;-)

Thanks,
Ludovic.


-*- Outline -*-

* Summary

                                       `every'     overall
------------------------------------+----------------------
jump table vs. switch               |    0.8%      -1.4% (worse!)
inlining in `CEVAL ()' vs. funcall  |   11.0%       4.7%


Compared to no-memoization + jump table (521 ; 10.16):

                                  `every'    overall
--------------------------------+----------------------
memoization + jt + inline       | 32.4%      22.1%
memoization + switch + inline   | 31.9%      23.2%
memoization + jt + funcall      | 24.0%      18.3%


* memoization + jump table + inlined in `CEVAL ()'

** Raw Execution Time

$ time ./pre-inst-guile -l ,,every.scm
ready.
time taken: 352.0

real    0m7.910s
user    0m7.720s
sys     0m0.028s

** Gprof Top

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 46.95      3.54     3.54     1205     0.00     0.01  deval
 11.67      4.42     0.88    30436     0.00     0.00  scm_i_sweep_card
  7.96      5.02     0.60 16794945     0.00     0.00  scm_is_pair
  7.82      5.61     0.59  7114789     0.00     0.00  scm_cell
  7.43      6.17     0.56   185286     0.00     0.00  scm_gc_mark_dependencies
  6.50      6.66     0.49  5530481     0.00     0.00  scm_ilookup
  3.71      6.94     0.28     8737     0.00     0.00  scm_i_init_card_freelist
  1.33      7.04     0.10      723     0.00     0.00  scm_i_mark_weak_vector_non_weaks


* memoization + switch + inlined in `CEVAL ()'

** Raw Execution Time

$ time ./pre-inst-guile -l ,,every.scm
ready.
time taken: 355.0

real    0m7.803s
user    0m7.724s
sys     0m0.044s

** Gprof Top

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 47.08      3.71     3.71     1205     0.00     0.01  deval
 12.56      4.70     0.99    30436     0.00     0.00  scm_i_sweep_card
  8.50      5.37     0.67 16792815     0.00     0.00  scm_is_pair
  7.30      5.95     0.57  5530481     0.00     0.00  scm_ilookup
  7.23      6.51     0.57   185466     0.00     0.00  scm_gc_mark_dependencies
  6.60      7.04     0.52  7114789     0.00     0.00  scm_cell
  3.55      7.32     0.28     8737     0.00     0.00  scm_i_init_card_freelist
  1.02      7.39     0.08      723     0.00     0.00  scm_i_mark_weak_vector_non_weaks


* memoization + jump table + no inlining in `CEVAL ()'

** Raw Execution Time

$ time ./pre-inst-guile -l ,,every.scm
ready.
time taken: 396.0

real    0m8.299s
user    0m8.233s
sys     0m0.032s

** Gprof Top

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 44.90      3.44     3.44     1205     0.00     0.01  deval
  9.80      4.18     0.75    30436     0.00     0.00  scm_i_sweep_card
  8.76      4.86     0.67 18293669     0.00     0.00  scm_is_pair
  8.04      5.47     0.61   185273     0.00     0.00  scm_gc_mark_dependencies
  7.32      6.03     0.56  7114789     0.00     0.00  scm_cell
  6.34      6.51     0.48  5530481     0.00     0.00  scm_ilookup
  2.61      6.71     0.20     8737     0.00     0.00  scm_i_init_card_freelist
  1.44      6.83     0.11  2519996     0.00     0.00  scm_list_1

* no memoization + jump table

** Raw Execution Time

$ time ./pre-inst-guile -l ,,every.scm
ready.
time taken: 521.0

real    0m10.163s
user    0m10.097s
sys     0m0.020s

** Gprof Top

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 42.06      4.00     4.00     1206     0.00     0.01  deval
 12.41      5.18     1.18    44288     0.00     0.00  scm_i_sweep_card
  8.62      6.00     0.82 11115394     0.00     0.00  scm_cell
  8.31      6.79     0.79   197701     0.00     0.00  scm_gc_mark_dependencies
  7.47      7.50     0.71 20362535     0.00     0.00  scm_is_pair
  7.41      8.21     0.70  5530482     0.00     0.00  scm_ilookup
  3.15      8.51     0.30    13609     0.00     0.00  scm_i_init_card_freelist
  2.10      8.71     0.20  5520322     0.00     0.00  scm_list_1
