[MIT-Scheme-devel] Rework amd64 ABI -- flag day


From: Taylor R Campbell
Subject: [MIT-Scheme-devel] Rework amd64 ABI -- flag day
Date: Sat, 5 Jan 2019 08:15:57 +0000

Summary:

I would like to merge some incompatible changes to the amd64 compiled
code ABI that substantially improve system performance all around,
especially for closure-heavy code.  They shave about 20% off the test
suite time (with FAST=y), and some workloads run 5x or 10x faster.  The
branch is riastradh-20181220-closentry-v2.

But...this will require a flag day -- you'll have to cross-compile a
new system from an old one, which is a pain, though less of a pain now
that cross-fasdump-v2 is merged.

When would be a good time to do this?


Full story:

Back when I wrote the amd64 back end a decade ago, I made it work, but
I ran out of steam to make it work well; then we all started using it,
so it became a pain to change, and some bad decisions stayed embedded
in it.

For example, the x86 representation of closures -- with embedded CALL
instructions to jump to the real code and push the pointer to the data
on the stack -- might have made sense in 1990, but it's about the
worst way imaginable to abuse modern CPUs:

- consing a closure entails writing instruction memory that the first
  invocation will have to load into the instruction cache (and reload
  after every GC),
- the CALL is never paired with a RET so it breaks return address
  branch target prediction,
- and to top it off nearly every unknown procedure call shares a
  single JMP instruction (per arity) in cmpauxmd.s (x86-64.m4) so the
  CPU can't discriminate to refine indirect branch target prediction.
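
For concreteness, here's the kind of closure-heavy code that hits all
three problems at once (my illustration, not a benchmark from the
branch):

        ;; Every call to make-adder conses a fresh closure; under the
        ;; old representation that means writing a CALL instruction
        ;; into the heap, which the first invocation must pull into
        ;; the instruction cache (and again after every GC).
        (define (make-adder n)
          (lambda (x) (+ x n)))

        ;; Invoking a different fresh closure on each iteration also
        ;; funnels every call through the one shared per-arity JMP in
        ;; cmpauxmd.s, so the indirect branch target predictor never
        ;; settles on a target.
        (define (sum-of-adders count)
          (let loop ((i 0) (acc 0))
            (if (= i count)
                acc
                (loop (+ i 1) ((make-adder i) acc)))))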

Recently this was bugging me, so I took some time to fix some of the
mistakes I'd made and to confirm that the fixes give various workloads
substantial performance improvements.

Merging all these changes will require a flag day: we'll have to build
a new system with a cross-compiler, and old coms won't work.  The
cross-fasdump-v2 branch will make that process less painful, but it's
still a flag day.

Here are the changes I'm sitting on that require a flag day:

- Split the compiled-entry type code into compiled-entry and
  compiled-return, so that the calling conventions for heap closures
  and stack continuations no longer have to satisfy the same
  engineering constraints.

  Either tag can still refer to any kind of compiled code:
  expressions, continuations, open or global procedures.  This is
  convenient for, e.g., interrupt checks, where we may need to
  restart a procedure from the top or return to a continuation.

  This change affects all architectures, though only amd64 takes
  advantage of it at the moment by choosing the following
  representations:

  . Compiled-entry is a pointer to an _offset_ to instructions.

    => This way a compiled closure is just an array of offsets, and
       consing a closure entails writing no new instruction memory and
       invoking a newly consed closure entails no instruction cache
       reloads.

    => This makes a big difference for closure-heavy code -- 5x speed
       improvements in some of my microbenchmarks.  And execute caches
       (`uuo-links') can optimize away the indirection for anything
       except closures (and maybe could optimize those away too with
       some more help from the GC).

    => There is a cost: unknown calls to fixed, long-lived, often-used
       closures or to open procedures are marginally costlier because
       of the additional memory indirection, but I haven't measured
       this to be over 5% even in microbenchmarks that are nothing
       _but_ these cases.  So I think this tradeoff is well worth it.

  . Compiled-return is a pointer to a _jump instruction_.

    => This way we can use CALL to push a continuation, and a paired
       RET to return to it, which means we can take advantage of the
       CPU's return address branch target predictor.

    => The jump instruction has to go to a compiled entry with zero
       offset so we can still parse continuations just like before.
       It'll look like this:

        ;; Procedure call.
        (CALL (@PCR pushed))
        (JMP (@PCR continuation))          ; Return address points here.
pushed:
        (OR (@R ,rsp) (& ,tag))            ; Tag the continuation.
        (PUSH Q (& 123))                   ; Push additional arguments.
        ...
        (JMP (@PCR uuo-link))              ; Call procedure.
        (WORD ...)                         ; Continuation entry metadata.
        (BLOCK-OFFSET continuation)
        (QUAD U 0)                         ; Zero offset to instructions.
continuation:
        (PUSH Q (R 0))                     ; Start of continuation code.

        ;; Procedure return (ignoring interrupt checks).
        (AND Q (@R ,rsp) (R ,rbp))         ; Detag continuation.
        (RET)                              ; Pop and return.
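
  At the source level, any non-tail call produces this pattern.  A
  hypothetical example of mine (not code from the branch):

        ;; The unknown call (g x) is not in tail position, so the
        ;; compiler pushes a continuation for it with CALL, and the
        ;; RET that returns g's value pairs with that CALL, keeping
        ;; the CPU's return address predictor in sync.  The outer
        ;; call to f is a tail call and pushes no new continuation.
        (define (compose f g)
          (lambda (x)
            (f (g x))))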

- For INVOCATION:APPLY unknown procedure calls, call an assembly hook
  subroutine to set up an application, but generate a JMP instruction:

        (POP Q (R ,rbx))                   ; Pop procedure into rbx.
        (CALL (@RO ,regs #x128))           ; Call assembly hook.
        ;; The assembly hook sets rax to the address of the
        ;; procedure's starting instruction if it can, or to another
        ;; assembly hook that defers to the interpreter if it can't.
        (JMP (R ,rax))

  This way the CPU can train each call site with a different branch
  target prediction, rather than merging them all in an assembly hook
  that executes the JMP itself.  This gave about a 2x speed increase
  on some microbenchmarks.
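
  The effect is easiest to see with two call sites that each tend to
  call a different procedure -- a sketch of mine, not from the branch:

        ;; With a JMP generated inline at each site, the CPU can
        ;; learn one branch target per site -- f's entry at site A,
        ;; g's entry at site B -- instead of merging both histories
        ;; at a single shared JMP in the assembly hook.
        (define (walk f g items)
          (for-each (lambda (item)
                      (f item)    ; call site A
                      (g item))   ; call site B
                    items))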

- Tweak some other compiler utilities to use CALL/RET responsibly.
  Interrupt routines don't because they're always going to wreck the
  return address branch prediction stack by returning to microcode
  anyway, and it's more convenient for them to have the entry address.
  On architectures where compiled entries and compiled returns have
  the same representation this makes no difference.

- Use rax, not a slot in the registers block in memory, as the return
  value register -- but still let the machine register allocator hand
  out rax.

  . Requires some changes to the RTL generator to make sure we read
    the value register first thing in a continuation, or write it last
    thing before a return.
    => As it happens, there was already a bug here with interpreter
       call results (which go in rax too) and procedures with dynamic
       links.
    => This is just a matter of inserting some temporaries, which
       evaporate into aliases for the RTL register allocator (see the
       sketch at the end of this list).

  . Requires some changes to the RTL optimizer to avoid propagating
    available machine registers in CSE and code compression so that
    the read and write stay at the beginning or end of a block.

  . Requires some changes to the LAP generator to avoid offering rax
    for allocation when it is the target of the RTL instruction we are
    working on right now.

  . Requires some minor changes to the assembly hooks to avoid
    clobbering the return value, e.g. trampolines now use r9b instead
    of al for the trampoline code.
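
  Roughly, the temporary-insertion trick looks like this in RTL terms
  (a hypothetical sketch of the idea, not actual generator output):

        ;; Before: the continuation reads the value register wherever
        ;; it happens to use the result, keeping rax live across the
        ;; whole body:
        ;;
        ;;     ... uses of (register value) ...
        ;;
        ;; After: a pseudo-register copy is inserted first thing, so
        ;; rax is read immediately; the RTL register allocator turns
        ;; the copy into a mere alias, so no move is actually emitted:
        ;;
        ;;     (assign (register 73) (register value))
        ;;     ... uses of (register 73) ...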

- Open-code with-interrupt-mask and with-interrupts-reduced (in the
  LAP generator, not in the RTL generator) so that we don't have to go
  through reflect-to-interface, which breaks the return address branch
  target predictor.  This gave about a 2x performance improvement on
  promise operations (after I sped them up a lot by removing one of
  the without-interruptses already).
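
  For reference, this is the ordinary with-interrupt-mask idiom that
  now compiles inline; the interface is the standard one, but the
  particular procedure is my own illustration:

        ;; Atomically swap a cell's contents: the body runs with
        ;; interrupts masked down to GC-only, and the previous mask
        ;; is restored when with-interrupt-mask returns.  Calls like
        ;; this used to reflect through reflect-to-interface; now the
        ;; mask manipulation is emitted inline by the LAP generator.
        (define (cell-swap! cell value)
          (with-interrupt-mask interrupt-mask/gc-ok
            (lambda (old-mask)
              old-mask                  ; unused; restored on exit
              (let ((previous (car cell)))
                (set-car! cell value)
                previous))))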

My branch seems to be pretty stable now and I'm satisfied with all the
ABI changes (much as it might be cute -- and not too hard now with the
portable fasdumper -- to also try migrating to a NaN-tagging scheme),
so I'd like to find a convenient time to merge it.  Thoughts?

(Review also welcome before I merge!)


