[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gcl-devel] Re: gzipped tar file on profiling

From: Matt Kaufmann
Subject: Re: [Gcl-devel] Re: gzipped tar file on profiling
Date: Fri, 2 Jan 2004 12:33:05 -0600

Hi, Camm --

Thanks for your email.  Comments are interspersed below.

   Cc: address@hidden, address@hidden, address@hidden,
   From: Camm Maguire <address@hidden>
   Date: 02 Jan 2004 13:15:14 -0500
   User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
   Content-Type: text/plain; charset=us-ascii


   Matt Kaufmann <address@hidden> writes:

   > Thank you, Camm.  We already do some fixnum declaration and proclaiming in 
   > for GCL, but I can imagine that we could benefit from doing some more.  

   OK, I was under the impression that the code you were running for AMD
   was not stock ACL2 but something custom.  If there is an analogous
   benchmark to what you've reported that can be run under stock ACL2,
   that of course would facilitate the performance analysis greatly.

I was running stock ACL2 (well, actually a development version, but I don't
think the differences were significant) but with ACL2 input files that I
unfortunately can't send outside AMD.  By the way, at one time I timed our
rather extensive regression suite in GCL vs. Allegro and GCL won (though it was
close), so there seems to be something about those AMD input files in
particular that are causing lots of GC; maybe the sheer size.  I'll keep my
eyes open for publicly available examples where GCL is slow.

   > try to do that at some point.  We can't just proclaim all integers to be
   > fixnums, however, in case that is what sys-proclaim.lsp is intended to do. 

   No, the sys-proclaim.lsp basically collects certain information
   gathered by the compiler in the form of a function proclamation.  Even
   where you have declared a fixnum, the compiler will promote it to a
   full lisp object if its passed as an argument to a function without a
   proclamation (stating that the argument is a fixnum).

Do you expect that using sys-proclaim.lisp would speed up performance even with
our custom proclaiming?  If so, is there some tradeoff -- if not, then why
doesn't GCL do the sys-proclaim.lisp stuff automatically?

   > do our own proclaiming automatically; roughly speaking, we use declare 
forms of
   > a defun for input types and THE forms for output types (but with 
   > as necessary).  Our top-level function that does this is function 
   > in source file acl2-fns.lisp, in case you're interested.

   I am indeed interested and will check it out -- thanks!

Please let me know if I can help you make sense out of our code.

   > Is there some way that make_fixnum could avoid creating new fixnum objects 
   > than once for each fixnum up to some limit, and also avoids garbage 
   > such objects?  Maybe there is already such a mechanism in GCL and the limit
   > merely needs to be increased, in order to fix most of this performance 
   > (since you say that GC time is the problem, presumably rather than
   > fixnum-building time).  What do you think?

   Right now, make_fixnum returns a constant non-GC'ed fixnum for
   integers between +- 1024.  If you would like to experiment with this
   setting, you can change the value in object.h defined currently as

   object.h:111:#define SMALL_FIXNUM_LIMIT      1024

   However, I really suspect that something like a heavily used loop
   counter is not being compiled to a machine integer when it might in
   principle be trivially so represented.  It would be very helpful if a
   representative such function could be identified, in either stock acl2
   or your custom stuff, (i.e. a heavily-used function which shows calls
   to make_fixnum when run through (disassemble...)) -- its possible that
   the compiler's determination of machine fixnum representations can be
   improved, given the difference in performance with allegro.

The problem is that our ACL2 sources are over 7 megabytes, and I don't know how
to narrow the search.  Any suggestions would be welcome.

   Of course, we could also (eventually) take the ecl approach -- halve
   or quarter the fixnum range and represent *all* fixnums as machine
   integers with a flag bit set.  I'm not sure what possible negative
   consequences could arise from restricting the fixnum range
   (i.e. increasing the bignum range).

Of course, there may be some current GCL users who would scream at losing some

   Take care,

   > Thanks --
   > -- Matt
   >    Cc: address@hidden, address@hidden, address@hidden,
   >       address@hidden
   >    From: Camm Maguire <address@hidden>
   >    Date: 30 Dec 2003 17:11:42 -0500
   >    User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
   >    Content-Type: text/plain; charset=us-ascii
   >    Greetings!  Just a preliminary note here -- the overwhelming majority
   >    of the gbc cost with or without sgc is in collecting fixnum garbage.
   >    Furthermore, the vast majority of this is generated in your compiled
   >    lisp code.  
   >    As you probably know, if GCL can figure out that an integer is a
   >    fixnum, usually via proclaims or declares, it will compile it directly
   >    to a machine long integer avoiding any memory allocation.  Otherwise,
   >    it calls the allocating routine make_fixnum.  
   >    At present, the only known (to me) system macro which often
   >    unnecessarily allocates fixnums is dotimes -- if your code uses this a
   >    lot let me know and I'll send you a replacement to try.  
   >    Otherwise these massive calls to make_fixnum are likely in code
   >    written by the compiler.  You likely know which functions of yours
   >    take the most time.  You can look for calls to make_fixnum via the
   >    (disassemble...)  function which will output the generated C code.  A
   >    simple (declare (fixnum i)) or similar where appropriate will likely
   >    eliminate most of the runtime.  
   >    To be a bit more specific, to avoid any allocation, the compiler needs
   >    to know both that the bounds of the integer are appropriate (which can
   >    be gleaned from the (declare...) statement), and that any functions
   >    using the variable as an argument have been compiled to take machine
   >    integer arguments (which information typically comes from
   >    (proclaims...)).  The compiler will "promote" a machine integer to a
   >    lisp object via allocation if the integer is passed to a function
   >    taking lisp objects as arguments.  You also probably know about the
   >    automatic way of making a sys-proclaim.lsp for use by the compiler
   >    in this regard.  Briefly, just load gcl_collectfn.o and execute
   >    (compiler::emit-fn t) before compiling (generating .fn files for each
   >    .lisp file), and then (compiler::make-all-proclaims "*.fn").  Then
   >    load sys-proclaim.lsp before recompiling, or put this load command in
   >    a gcl_cmpinit.lsp file in the source directory.  All this needs to be
   >    automated in my opinion.
   >    The only current drawback of gprof profiling is that compiled lisp
   >    code as conventionally loaded into the program's .data section is all
   >    lumped together in one entry labeled <hicore>.  I'll be working on
   >    improving this at some point.  For now, you can build your final
   >    executable in an alternate way which will retain the identities of the
   >    compiled lisp functions for the purposes of profiling should you be
   >    unable to identify the culprit yourself.  Basically you set the
   >    variable compiler::*default-system-p* to t before compiling, and
   >    instead of using (si::save-system...), do
   >    (compiler::link
   >       (list "module1.o" "module2.o" ...)
   >       "new_image_name"
   >       "(progn (lisp-code-to-run-after-loading-and-before-saving))")
   >    Here is how I do this for acl2 (this method is alas required on some
   >    machines:) 
   >    BINARIES:=acl2-fns.o axioms.o basis.o translate.o type-set-a.o 
type-set-b.o rewrite.o simplify.o \
   >         bdd.o other-processes.o induct.o history-management.o prove.o 
defuns.o proof-checker-a.o\
   >         defthm.o other-events.o ld.o proof-checker-b.o tutorial.o 
interface-raw.o \
   >         linear-a.o linear-b.o non-linear.o TMP1.o
   >    BIN:=$(patsubst %,"%",$(BINARIES))
   >    INIT:=(initialize-acl2 (quote include-book) *acl2-pass-2-files* t t)
   >    POST:=(load \"foo.lsp\")\
   >     (setq compiler::*default-system-p* nil)\
   >     (in-package \"acl2\")\
   >     (save-acl2 (quote (initialize-acl2 (quote include-book) 
*acl2-pass-2-files* t)) \"saved_acl2\"))
   >    nsaved_acl2: foo.lsp
   >       echo '(load "foo.lsp")(in-package "acl2")(compile-acl2)' | gcl
   >       echo '(load "foo.lsp")(in-package "acl2")(load-acl2)$(INIT)' | gcl
   >       echo '(compiler::link (list $(BIN)) "$@" "$(POST)" "" nil)' |gcl
   >    I'd be interested in knowing any results, as my current guess is that
   >    Allegro can infer a fixnum where we cannot at present.  If you can
   >    give me the details of such misinference, I can try to fix it.
   >    I would have guessed that you were doing a lot of array indexing, but
   >    this appears not to be the case.
   >    Take care, 
   >    Matt Kaufmann <address@hidden> writes:
   >    > Camm --
   >    > 
   >    > The run has completed.  It seemed to take much longer without SGC.  
I'll send
   >    > you the uuencoded gzipped tar file (509K) now (see the README after 
you expand
   >    > it).  I'll spare those CCed on this email from that file.
   >    > 
   >    > -- Matt
   >    >    cc: address@hidden, address@hidden, address@hidden,
   >    >          address@hidden
   >    >    From: Camm Maguire <address@hidden>
   >    >    Date: 17 Dec 2003 17:11:44 -0500
   >    >    User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
   >    >    Content-Type: text/plain; charset=us-ascii
   >    > 
   >    >    Thanks Matt!  I got it now.  I wonder if you had also saved the gbc
   >    >    output -- if I recall you usually run with *notify-gbc* turned on.
   >    >    Also, a (room) before and after would be of interest, as well as 
   >    >    effect of (si:sgc-on nil).
   >    > 
   >    >    On the one hand, GBC is clearly the performance culprit, and that 
   >    >    very good news, as it is relatively easy to address with a better
   >    >    memory layout.  What is odd about your profiling though is that
   >    >    marking is taking up the lion's share of the time.  In normal GBC, 
   >    >    the sweeper that costs, at least in the tests I've run so far.  
And I
   >    >    think I know why, it is the following extra pass that's required 
   >    >    sgc marking:
   >    > 
   >    >      /* mark all non recent data on writable pages */
   >    >      {
   >    >        int t,i=page(heap_end);
   >    >        struct typemanager *tm;
   >    >        char *p;
   >    > 
   >    >        while (--i >= 0) {
   >    >        if (WRITABLE_PAGE_P(i)
   >    >            && (t=type_map[i]) < (int) t_end)
   >    >          ;
   >    >        else 
   >    >          continue;
   >    >        tm=tm_of(t);
   >    >        p=pagetochar(i);
   >    >        if ( t == t_cons) 
   >    >          for (j = tm->tm_nppage; --j >= 0; p += 
tm_table[t_cons].tm_size/*  sizeof(struct cons) */) {
   >    >            object x = (object) p; 
   >    >            if (SGC_OR_M(x)) 
   >    >              continue;
   >    >            if (x->d.t==t_cons) {
   >    >              x->d.m = TRUE; 
   >    >              sgc_mark_cons(x);
   >    >            } else
   >    >              sgc_mark_object1(x);
   >    >          }
   >    >        else {
   >    >          int size=tm->tm_size;
   >    >          for (j = tm->tm_nppage; --j >= 0; p += size) {
   >    >            object x = (object) p; 
   >    >            if (SGC_OR_M(x)) continue;
   >    >            sgc_mark_object1(x);
   >    >          }
   >    >        }
   >    >        }
   >    >      }
   >    > 
   >    > 
   >    >    I'll have to think a bit about why this was put in, but in short, 
   >    >    need not be a performance boost if most of your memory is writable
   >    >    anyway, and might (speculation) even be less efficient.  This 
   >    >    to that note I sent earlier about the enabling of SGC in ACL2, 
   >    >    you asked me to clarify, and I never did, as I don't yet have a 
   >    >    answer as to a metric to determine when SGC helps and when it does
   >    >    not.  One thing should be clear, and that is many heap expansions,
   >    >    which require an (si::sgc-on nil)(gbc t)(si::sgc-on t), are very
   >    >    expensive, as these steps require that the read-only subset of 
   >    >    be determined all over again.
   >    > 
   >    >    To be a bit more verbose, SGC works by taking the memory set at 
   >    >    of (si:sgc-on t), making it read-only, and allocating new writable
   >    >    pages for use in subsequent allocation -- these are "SGC" pages,
   >    >    i.e. the memory set is divided into two.  Write attempts on the old
   >    >    memory make those pages writable, but they are not "SGC" pages as 
   >    >    as the algorithm goes, and must be garbage collected separately.  
   >    >    idea is to enable sgc at the point where the most memory that will
   >    >    essentially have no further modifications can be set aside.  
   >    >    aside a small amount of truly read-only memory, or a lot of memory
   >    >    which has significant subsequent modifications, both suffer from
   >    >    inefficiencies.
   >    > 
   >    >    >From looking at b.out, for example, half the time is spend in
   >    >    sgc_mark_phase itself, and half in its main child process,
   >    >    sgc_mark_object1.  This suggests to me that the ostensibly 
   >    >    memory which was actually modified and thus made writable (writable
   >    >    non-SGC pages) is substantial.  The loop shown above is similar to
   >    >    that in the sweeper which is the main cost in normal GBC, so the 
   >    >    memory it has to process, the worse performance will be.
   >    > 
   >    >    Anyway, these are just initial thoughts.  I'd like Vadim's input if
   >    >    possible to construct for this case an analog of the maxima init 
   >    >    he's just put together, so we can get a performance comparison with
   >    >    ACL short of GBC.  This in turn will help focus our efforts on 
   >    >    our out of the box algorithms and parameter settings can be 
   >    > 
   >    >    Take care, 
   >    > 
   >    > 
   >    >    Matt Kaufmann <address@hidden> writes:
   >    > 
   >    >    > Camm --
   >    >    > 
   >    >    > In case email from AMD to you is problematic, I've also just now 
put the
   >    >    > profile results on the web as a gzipped tar file:
   >    >    > 
   >    >    > /u/www/users/moore/acl2/seminar/temp/camm.tar.gz
   >    >    > or
   >    >    > 
   >    >    > 
   >    >    > When you let me know you've received it, I'll delete this file.
   >    >    > 
   >    >    > -- Matt
   >    >    > 
   >    >    > 
   >    >    > 
   >    > 
   >    >    -- 
   >    >    Camm Maguire                                               
   >    >    
   >    >    "The earth is but one country, and mankind its citizens."  --  
   >    > 
   >    > 
   >    > _______________________________________________
   >    > Gcl-devel mailing list
   >    > address@hidden
   >    > http://mail.gnu.org/mailman/listinfo/gcl-devel
   >    > 
   >    > 
   >    > 
   >    -- 
   >    Camm Maguire                                            address@hidden
   >    "The earth is but one country, and mankind its citizens."  --  

   Camm Maguire                                         address@hidden
   "The earth is but one country, and mankind its citizens."  --  Baha'u'llah

Thanks --
-- Matt

reply via email to

[Prev in Thread] Current Thread [Next in Thread]