
[epsilon-devel] [Jitter] Fast tag checking and overflow handling


From: Luca Saiu
Subject: [epsilon-devel] [Jitter] Fast tag checking and overflow handling
Date: Sat, 04 May 2019 13:27:21 +0200
User-agent: Gnus (Gnus v5.13), GNU Emacs 25.1.50.2, x86_64-unknown-linux-gnu

Hello.

I am leaving this message to update you on my recent progress in
Jitter, as I expect to be unavailable for the next couple of weeks.
In fact I have been extremely busy with other things and have
neglected to answer a few private messages from people here on the
list, which I plan to handle now -- I apologize.

Still, yesterday I finally managed to clean up and commit a very
important change set that I had been working on at night for some
time, and which contains several bits I am particularly proud of.  Here
it is:

  
http://ageinghacker.net/git/cgit.cgi/jitter/commit/?h=jitterlisp&id=f53a07a9796577b48b0780e2b64c43a9d67048ee

The short version of it: now a Jittery VM has a very efficient way of:

- handling integer overflow

- checking the type of tagged data

Both operations are critical in many virtual machines, particularly
those dealing with dynamically typed languages.

I believe both features compile to "optimal" code, in the sense that no
better solution comes to my mind within the limits of the
platform-specific assembly support.  The generated machine code is
exactly as it should be on the best supported platforms, which right
now are x86_64 and MIPS.
Speaking of non-user-visible changes, I completely reorganized the way
fast branches are implemented, and porting will be much easier now.  I
expect to add full fast-branch support soon to PowerPC, SPARC and SH --
SH remaining problematic; and other ports will finally come.  RISC-V
should be particularly easy, but others such as ARM are also important.

All the performance improvements are primarily focused on the
no-threading dispatch, which should be the most widely used in practice.
Ideally there will be at least some architecture-specific support for
every mainstream architecture plus a few more exotic ones.  I know that
no-threading dispatch is still unreliable for production use -- along
with minimal-threading it is now disabled by configure unless explicitly
selected.  A fix for "the bug" will come soon, and the need for the
recent fast-branch reorganization stemmed in part from that idea.  Of
course the other slower dispatches (switch, direct-threading) remain
completely compatible with no-threading and minimal-threading, and
include all the new features.

In order to play with the new changes, let us add the following
instructions to the Uninspired VM, with two r-class fast registers:

instruction addo (?Rn 1 -1, ?Rn 0, !R, ?f)
  code
    JITTER_PLUS_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN2,
                                         JITTER_ARGN0, JITTER_ARGN1,
                                         JITTER_ARGF3);
  end
end

instruction subo (?Rn 1 0 -1, ?Rn 1 0 -1, !R, ?f)
  code
    JITTER_MINUS_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN2,
                                          JITTER_ARGN0, JITTER_ARGN1,
                                          JITTER_ARGF3);
  end
end

instruction nego (?Rn 1 0 -1 2, !R, ?f)
  code
    JITTER_NEGATE_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN1,
                                           JITTER_ARGN0,
                                           JITTER_ARGF2);
  end
end

instruction mulo (?Rn 0 1 -1 2, ?Rn 0 1 -1 2, !R, ?f)
  code
    JITTER_TIMES_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN2,
                                          JITTER_ARGN0, JITTER_ARGN1,
                                          JITTER_ARGF3);
  end
end

instruction divo (?Rn -1 0 1 2, ?Rn 0 1 -1 2 -2 4 -4, !R, ?f)
  code
    JITTER_DIVIDED_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN2,
                                            JITTER_ARGN0, JITTER_ARGN1,
                                            JITTER_ARGF3);
  end
end

instruction modo (?Rn -1 0 1 2, ?Rn 0 1 -1 2 -2 4 -4, !R, ?f)
  code
    JITTER_REMAINDER_BRANCH_FAST_IF_OVERFLOW (JITTER_ARGN2,
                                              JITTER_ARGN0, JITTER_ARGN1,
                                              JITTER_ARGF3);
  end
end

The specialized literals here are not all useful in practice, but it is
fun to see how GCC behaves when an argument is known.

There are new user macros for operating on (untagged) word-sized signed
integers, either performing an operation and writing the result to the
given destination, or fast-branching away in case of overflow; on
overflow no result is written.  The order of the operands is output,
inputs, label.
(Another variant is also available which performs the conditional branch
without computing the operation; and there are macros for checking
overflow in a C expression.)

Instructions such as these above are meant to be used in this way:

  addo %r0, %r1, %r2, $overflow   # Sum %r0 and %r1 into %r2, or fast-branch
  # use %r2
  # ...
$overflow:
  # Raise exception

(I will deal below with the problem of promoting to a different numeric
type.)

The idea is that the fast path of the code should be straight-line,
control falling through as long as there are no errors.  The generated
code should contain fast checks, and non-taken branches leading to the
uncommon execution path.  It is possible, as will be shown below, to
branch from the slow execution path back into the fast path, for
example to the instruction right after the one which detected the
overflow.

Now the fun part.

In order to make the disassembly easier to follow let us first print out
the mapping from VM resources to hardware resources (the way this works
internally is another interesting hack):

address@hidden ~/repos/jitter/_build/native-gcc-8]$ 
bin/uninspired--no-threading --print-locations /dev/null
 0.                     base: %rbx         (register)
 1.               residual 0: %r12         (register)
 2.            stack top ptr: 40(%rsp)     (memory)
 3.                      %r0: %r14         (register)
 4.                      %r1: %r13         (register)
Register ratio: 80%

Here I am on x86_64.  So, the VM register %r0 goes to the hardware
register %r14, and %r1 goes to %r13.  Good.  The rest is not
interesting here.

The (unspecialized) Uninspired VM instruction

  addo %r0, %r1, %r0, $somewhere

compiles to

# 0x78ec38: addo/%r0/%r1/%r0/fR 0x78ec50 (15 bytes):
    0x00007f019e0ff810 4c 89 f0                 movq   %r14,%rax
    0x00007f019e0ff813 4c 01 e8                 addq   %r13,%rax
    0x00007f019e0ff816 0f 80 09 00 00 00        jo     0x00007f019e0ff825
    0x00007f019e0ff81c 49 89 c6                 movq   %rax,%r14

This is, I believe, the best that can be done on x86_64.  Notice that
overflow is definitely possible here, depending on the (unknown) values
of %r0 and %r1, and therefore the overflow check is not optimized away
and survives into run time.

Most architectures do not have a "jump-on-overflow" instruction like
x86.  This is MIPS, and I expect that RISC-V will look essentially
identical, minus the delay slot:

address@hidden ~/repos/jitter/_build/cross-nanonote]$ ./scripts/emulator 
bin/uninspired--no-threading --print-locations /dev/null
 0.                     base: $16          (register)
 1.                  scratch: $23          (register)
 2.               residual 0: $17          (register)
 3.               residual 1: $18          (register)
 4.               residual 2: $19          (register)
 5.               residual 3: $20          (register)
 6.            stack top ptr: 4060($sp)    (memory)
 7.                      %r0: $22          (register)
 8.                      %r1: $fp          (register)
Register ratio: 88%
($fp is just another name for $30 .)
# 0x4b8a24: addo/%r0/%r1/%r0/fR 0x4b8a30 (32 bytes):
    0x7f6ff820 02de1826         xor     $3,$22,$30
    0x7f6ff824 02de1021         addu    $2,$22,$30
    0x7f6ff828 00031827         nor     $3,$0,$3
    0x7f6ff82c 00561026         xor     $2,$2,$22
    0x7f6ff830 00621024         and     $2,$3,$2
    0x7f6ff834 04400004         bltz    $2,0x7f6ff848
    0x7f6ff838 00000000         sll     $0,$0,0x0
    0x7f6ff83c 02deb021         addu    $22,$22,$30

A little less exciting at first glance, but this is no worse than
what GCC does with __builtin_add_overflow_p , which is not used in this
case.  There are also more favorable cases, for example where the two
input operands are equal:

  addo %r0, %r0, %r0, $somewhere

# 0x4b8a24: addo/%r0/%r0/%r0/fR 0x4b8a30 (20 bytes):
    0x7f6ff820 00161040         sll     $2,$22,0x1
    0x7f6ff824 00561026         xor     $2,$2,$22
    0x7f6ff828 04400004         bltz    $2,0x7f6ff83c
    0x7f6ff82c 00000000         sll     $0,$0,0x0
    0x7f6ff830 0016b040         sll     $22,$22,0x1

This is only one instruction more than x86_64, including the delay slot.
Not bad; in fact I believe that this (including the previous version
addo %r0, %r1, %r0, $somewhere) may also be optimal.  If you want to
look at how this works internally you will have fun.

The logic is not at all MIPS-specific, and in fact will be used
unchanged, and automatically, on every architecture without specialized
hardware support for checking overflow.  The only part that required
assembly for this is the bltz (branch on less than zero) instruction,
and the delay slot immediately following it.

The actual predicate checking for overflow is in a C macro of mine,
implementing a bitwise trick I learned from Hacker's Delight.  See
jitter/jitter-arithmetic.h for the real magic.  That is where the
bitwise xor and and instructions above come from.

GCC's builtin for checking overflow is used when available on slower
dispatches, but this solution is faster in practice when employed from
inline assembly.  A problem I am constantly fighting with is the
communication between C and inline assembly, particularly when the C
part is computing a Boolean: there is always the risk of using hardware
conditional instructions to compute a Boolean value into a register, and
then performing another conditional in assembly over its value.  That is
unacceptable in performance-critical code such as this, hence the need
for internal macros like
  JITTER_WOULD_PLUS_OVERFLOW_SIGNED_WORD
.  Checking for sum overflow requires computing
JITTER_WOULD_PLUS_OVERFLOW_SIGNED_WORD over the two operands, and then
checking the sign of the result.  The trick here is doing the
computation in C and the sign check in assembly, using a conditional
fast branch -- which compiles to bltz on MIPS.  Of course with
patch-ins and the usual no-threading machinery.

Just for fun, these are PowerPC and SPARC, for which I have no
implementation of conditional fast branches yet:

address@hidden ~/repos/jitter/_build/cross-knuth]$ ./scripts/emulator 
bin/uninspired--no-threading --print-locations /dev/null
 0.                     base: %r14         (register)
 1.                  scratch: %r29         (register)
 2.               residual 0: %r15         (register)
 3.               residual 1: %r16         (register)
 4.               residual 2: %r17         (register)
 5.               residual 3: %r18         (register)
 6.            link register: %r22         (register)
 7.            stack top ptr: %r23         (register)
 8.                      %r0: %r28         (register)
 9.                      %r1: %r27         (register)
Register ratio: 100%

The unfavorable case:

# 0x100a3b84: addo/%r0/%r1/%r0/fR 0x100a3b90 (32 bytes):
    0xff77f820 7d 3c da 14      add     r9,r28,r27
    0xff77f824 7f 88 da 38      eqv     r8,r28,r27
    0xff77f828 7d 2a e2 78      xor     r10,r9,r28
    0xff77f82c 7d 0a 50 38      and     r10,r8,r10
    0xff77f830 2f 8a 00 00      cmpwi   cr7,r10,0
    0xff77f834 40 bc 00 08      bge     cr7,0xff77f83c
    0xff77f838 48 00 00 10      b       0xff77f848
    0xff77f83c 7d 3c 4b 78      mr      r28,r9

And the favorable case:

# 0x100a3b84: addo/%r0/%r0/%r0/fR 0x100a3b90 (24 bytes):
    0xff77f820 57 89 08 3c      rlwinm  r9,r28,1,0,30
    0xff77f824 7d 2a e2 78      xor     r10,r9,r28
    0xff77f828 2f 8a 00 00      cmpwi   cr7,r10,0
    0xff77f82c 40 bc 00 08      bge     cr7,0xff77f834
    0xff77f830 48 00 00 10      b       0xff77f840
    0xff77f834 7d 3c 4b 78      mr      r28,r9

Since Jitter does not know of a good way of implementing a fast
branch-on-less-than-zero on PowerPC, it generated a C conditional
similar to this:
  if (temporary < 0)
    unconditional-fast-branch
This explains why there is a conditional branch going right past an
unconditional branch.  Without specific assembly support PowerPC
and SPARC will have *taken* conditional branches in the common case,
rather than not taken.  Other than this the code is good.  rlwinm
is the weird (in the good sense) PowerPC rotate-left-and-mask
instruction, which here just serves to do a left shift by one bit
to compute the sum of the VM register %r0 (hardware r28) and itself.

And the same VM instruction on SPARC:

address@hidden ~/repos/jitter/_build/cross-sparc]$ ./scripts/emulator 
bin/uninspired--no-threading --print-locations /dev/null
 0.                     base: %l7          (register)
 1.                  scratch: %l6          (register)
 2.               residual 0: %i0          (register)
 3.               residual 1: %i1          (register)
 4.               residual 2: %i2          (register)
 5.               residual 3: %i3          (register)
 6.            link register: [%fp-4040]   (memory)
 7.            stack top ptr: [%fp-4036]   (memory)
 8.                      %r0: %l0          (register)
 9.                      %r1: %l1          (register)
Register ratio: 80%

# 0xa4744: addo/%r0/%r1/%r0/fR 0xa4750 (32 bytes):
    0x00000000ff4ff820 86 04 00 11      add  %l0, %l1, %g3
    0x00000000ff4ff824 84 3c 00 11      xnor  %l0, %l1, %g2
    0x00000000ff4ff828 82 18 c0 10      xor  %g3, %l0, %g1
    0x00000000ff4ff82c 80 88 80 01      btst  %g2, %g1
    0x00000000ff4ff830 36 40 00 04      bge,a,pn   %icc, 0x00000000ff4ff840
    0x00000000ff4ff834 a0 10 00 03      mov  %g3, %l0
    0x00000000ff4ff838 02 c8 00 04      brz  %g0, 0x00000000ff4ff848
    0x00000000ff4ff83c 01 00 00 00      nop 

# 0xa4744: addo/%r0/%r0/%r0/fR 0xa4750 (24 bytes):
    0x00000000ff4ff820 82 04 00 10      add  %l0, %l0, %g1
    0x00000000ff4ff824 80 98 40 10      xorcc  %g1, %l0, %g0
    0x00000000ff4ff828 36 40 00 04      bge,a,pn   %icc, 0x00000000ff4ff838
    0x00000000ff4ff82c a0 10 00 01      mov  %g1, %l0
    0x00000000ff4ff830 02 c8 00 04      brz  %g0, 0x00000000ff4ff840
    0x00000000ff4ff834 01 00 00 00      nop 

Essentially identical to PowerPC; here the unconditional branch requires
a delay slot.  Again the branch is taken in the common case but the code
is otherwise good.  (Yes, José, I know you disapprove of brz here.  I
will change it to a proper unconditional branch.)

Conditional fast branches, including on overflow, are much more subtle
than you probably think.

Let us multiply two registers.  In the Uninspired VM:

  mulo %r0, %r1, %r1, $somewhere

# 0xb95c38: mulo/%r0/%r1/%r1/fR 0xb95c50 (16 bytes):
    0x00007fcd682ff810 4c 89 f0                 movq   %r14,%rax
    0x00007fcd682ff813 49 0f af c5              imulq  %r13,%rax
    0x00007fcd682ff817 0f 80 09 00 00 00        jo     0x00007fcd682ff826
    0x00007fcd682ff81d 49 89 c5                 movq   %rax,%r13

Awesome.  On MIPS I added a hand-coded assembly solution, mostly because
it was fun to write:

# 0x4b8a24: mulo/%r0/%r1/%r1/fR 0x4b8a30 (28 bytes):
    0x7f6ff820 02de0018         mult    $22,$30
    0x7f6ff824 00002012         mflo    $4
    0x7f6ff828 00005010         mfhi    $10
    0x7f6ff82c 000417c3         sra     $2,$4,0x1f
    0x7f6ff830 144a0004         bne     $2,$10,0x7f6ff844
    0x7f6ff834 00000000         sll     $0,$0,0x0
    0x7f6ff838 0080f025         or      $30,$4,$0

The default would be a little worse, in particular performing the
multiplication twice in case of non-overflow.

So, this is good but not fancy.  Let us multiply by two instead of
computing a square:

mulo %r0, 2, %r1, $somewhere

# 0x2330c38: mulo/%r0/n2/%r1/fR 0x2330c50 (15 bytes):
    0x00007fddd82ff810 4c 89 f0                 movq   %r14,%rax
    0x00007fddd82ff813 4c 01 f0                 addq   %r14,%rax
    0x00007fddd82ff816 0f 80 09 00 00 00        jo     0x00007fddd82ff825
    0x00007fddd82ff81c 49 89 c5                 movq   %rax,%r13

This is an addq, not an imulq!  Notice that here there is inline
assembly taking two operands, and that GCC cannot do the optimization
itself.

Of course it works on the other architectures as well:

# 0x4b8a24: mulo/%r0/n2/%r1/fR 0x4b8a30 (20 bytes):
    0x7f6ff820 00161040         sll     $2,$22,0x1
    0x7f6ff824 00561026         xor     $2,$2,$22
    0x7f6ff828 04400004         bltz    $2,0x7f6ff83c
    0x7f6ff82c 00000000         sll     $0,$0,0x0
    0x7f6ff830 0016f040         sll     $30,$22,0x1

# 0x100a3b84: mulo/%r0/n2/%r1/fR 0x100a3b90 (24 bytes):
    0xff77f820 57 89 08 3c      rlwinm  r9,r28,1,0,30
    0xff77f824 7d 2a e2 78      xor     r10,r9,r28
    0xff77f828 2f 8a 00 00      cmpwi   cr7,r10,0
    0xff77f82c 40 bc 00 08      bge     cr7,0xff77f834
    0xff77f830 48 00 00 10      b       0xff77f840
    0xff77f834 7d 3b 4b 78      mr      r27,r9

No multiplications here.

Let us try a subtraction.  A register minus itself, written into another
register:

subo %r0, %r0, %r1, $somewhere

# 0xb00c38: subo/%r0/%r0/%r1/fR 0xb00c50 (3 bytes):
    0x00007f4d30b7f810 45 31 ed                 xorl   %r13d,%r13d

# 0x4b8a24: subo/%r0/%r0/%r1/fR 0x4b8a30 (4 bytes):
    0x7f6ff820 0000f025         or      $30,$0,$0

# 0x100a3b84: subo/%r0/%r0/%r1/fR 0x100a3b90 (4 bytes):
    0xff77f820 3b 60 00 00      li      r27,0

Here the check for overflow has been completely optimized away, and the
computation has been rewritten as well.  No subtraction.

Adding zero never overflows:

addo %r0, 0, %r0, $somewhere

# 0x4b8a24: addo/%r0/n0/%r0/fR 0x4b8a30 (0 bytes):

# 0x100a3b84: addo/%r0/n0/%r0/fR 0x100a3b90 (0 bytes):

# 0xa4744: addo/%r0/n0/%r0/fR 0xa4750 (0 bytes):

This is not a rewrite rule replacing entire instructions.  It is the
internal implementation of branch primitives, which is now quite
involved and layered.  There is now a complex machine-independent logic
which rewrites, at C compile time, uses of machine-specific "low-level
fast branches" into simpler versions, or eliminates them altogether when
needed.

This extends to almost all of the other operators, such as a simple
less-than.  JITTER_BRANCH_FAST_IF_LESS_SIGNED with a known-zero operand
is rewritten into a sign check.
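As a sketch of that rewrite (the helper name is hypothetical):
comparing a signed word against a known zero reduces to testing its
sign bit.

```c
#include <limits.h>
#include <stdbool.h>

/* x < 0 for a signed word is just its sign bit, so a signed less-than
   fast branch with a known-zero operand becomes a single sign test. */
static bool
less_than_zero (long x)
{
  return ((unsigned long) x >> (sizeof (long) * CHAR_BIT - 1)) != 0;
}
```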

Now, there are a lot of comparison operators, including two new ones
which are important for performance and which I will discuss below.
The logic for optimizing them easily becomes quite complex, and it
involves algebraic reasoning such as this:

# Pairs of each binary relation R(_,_) and its associated unary relation S(_)
# such that for every y : R(0,y) iff S(y).
# For example:
#   0 != y   iff  y is nonzero
#   0 < y    iff  y is positive
#   0 & y    iff  never (y)
#   ~(0 & y) iff  always (y)
zerolefts="equal/zero notequal/nonzero less/positive notless/nonpositive 
greater/negative notgreater/nonnegative and/never nand/always"

# Pairs of each binary relation and its symmetrical.
# For example, for every x,y : x <= y iff y >= x.
symmetries="equal/equal notequal/notequal less/greater notless/notgreater 
and/and nand/nand"

# Pairs of each relation, unary or binary, and its opposite.
# For example:
#   for every x    : x is nonzero iff ! (x is zero).
#   for every x, y : x <= y iff ! (x > y).
oppositions="never/always zero/nonzero positive/nonpositive 
negative/nonnegative equal/notequal less/notless greater/notgreater and/nand"

Yes, this is shell syntax.  And yes, José, in the end I did it.  I am
now using a shell script (preprocessed by M4sh and config.local) to
machine-generate a complicated and repetitive header file.
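A minimal sketch of the scheme (the variable and the generated macro
names here are hypothetical, not the ones in the real script): each
pair in such a table expands mechanically into one CPP definition.

```shell
# Hypothetical sketch: expand relation pairs into CPP definitions.
symmetries="equal/equal notequal/notequal less/greater notless/notgreater"
for pair in $symmetries; do
  left=${pair%/*}      # the part before the slash
  right=${pair#*/}     # the part after the slash
  printf '#define JITTER_SYMMETRIC_OF_%s %s\n' "$left" "$right"
done
```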

I find it sad that CPP is so terrible at expressing declarative
knowledge such as the example above that a programming language as crude
as the Unix shell can beat it, but this is the reality we have to live
in.

A consequence of this machine-generated code is that I can move some of
the logic away from C constant expressions, which make compilation
heavy, and into preprocessing, by generating many variant copies of
similar code instead of choosing at run time.  A script generating CPP
macro definitions is better at abstraction than CPP itself.

Jittery VMs remain very heavyweight to compile.  Some improvements are
possible, but the tradeoff between compilation time and run-time
efficiency remains.  I believe that this is a good compromise.


Now, for the last exciting improvement, let us switch to JitterLisp as
an example.

Dynamically typed languages such as Lisps rely heavily on type checking
and type dispatch.  I am now focusing on type checking, making the
hypothesis, certainly correct in many cases, that the patterns of use of
late-binding operations are very skewed in favor of one particular
combination of type arguments.  For example, + in Lisp will almost
always take fixnum arguments.  Being efficient in handling fixnums is
crucial: all the other cases are less prioritary, and we can sacrifice
performance in the less common cases if in exchange we make fixnum
operations very fast.

I have implemented, in dirty experimental code which is not on git and
probably will never be, these macros.  I am publishing them here since
they are not available anywhere else, but their exact working is not
essential and you can safely skip them if you are not curious.

#define JITTERLISP_TAG_BITMASK(_tag_bit_no) ((1 << (_tag_bit_no)) - 1)
#define JITTERLISP_BRANCH_FAST_IF_HAS_TAG(_x, _tag, _tag_bit_no, tgt) \
  do \
    { \
      JITTER_BRANCH_FAST_IF_NAND ((jitterlisp_object) (_x) - (_tag), \
                                  JITTERLISP_TAG_BITMASK(_tag_bit_no), \
                                  tgt); \
    }\
  while (false)
#define JITTERLISP_BRANCH_FAST_IF_HASNT_TAG(_x, _tag, _tag_bit_no, tgt) \
  do \
    { \
      JITTER_BRANCH_FAST_IF_AND ((jitterlisp_object) (_x) - (_tag), \
                                 JITTERLISP_TAG_BITMASK(_tag_bit_no), \
                                 tgt); \
    }\
  while (false)
#define JITTERLISP_BRANCH_FAST_IF_FIXNUM(_x, tgt) \
  JITTERLISP_BRANCH_FAST_IF_HAS_TAG((_x), JITTERLISP_FIXNUM_TAG, \
                                    JITTERLISP_FIXNUM_TAG_BIT_NO, (tgt))
#define JITTERLISP_BRANCH_FAST_IF_NONFIXNUM(_x, tgt) \
  JITTERLISP_BRANCH_FAST_IF_HASNT_TAG((_x), JITTERLISP_FIXNUM_TAG, \
                                      JITTERLISP_FIXNUM_TAG_BIT_NO, (tgt))
#define JITTERLISP_BRANCH_FAST_IF_CONS(_x, tgt) \
  JITTERLISP_BRANCH_FAST_IF_HAS_TAG((_x), JITTERLISP_CONS_TAG, \
                                    JITTERLISP_CONS_TAG_BIT_NO, (tgt))
#define JITTERLISP_BRANCH_FAST_IF_NONCONS(_x, tgt) \
  JITTERLISP_BRANCH_FAST_IF_HASNT_TAG((_x), JITTERLISP_CONS_TAG, \
                                      JITTERLISP_CONS_TAG_BIT_NO, (tgt))

These rely on the new fast branches on the AND and NAND predicates,
which turn out to be very important in practice for fast tag checking.

I want this kind of VM code (I am keeping it stack-based for the time
being, but this is not important):

                          primitive-one-plus
$right_after_primitive:
                          # Use the result of 1+
                          # ...

$non-fixnum:
                          branch-on-non-number $error
                          slow-one-plus
                          goto $right_after_primitive
$one-plus-overflow:
                          promote-to-bignum
                          goto $right_after_primitive

$error:
                          # raise exception

Again, the fast path of the code must be straight-line, with no taken
branches.  If 1+ has a fixnum argument (and there is no overflow), no
branch is taken.  If an argument type is unexpected (either an invalid
type, or an *uncommon* operand type), we branch off to out-of-line
code, where
every case can be dealt with; even calling a C function in this slow
path should be perfectly acceptable.  The slow path will either fail, or
promote the operand to some other type as needed.  Another separate slow
path deals with overflow, which is performance-critical enough (checks
for overflows are *everywhere* in safe runtimes) to deserve special
support in Jitter.

So, how do we test, rapidly, if a certain object (say on the top of the
main stack) has a given tag?  In JitterLisp, and in the forthcoming
Jitter-generated code for tagged data specified by the user, I tag
objects in their low bits.  If their tag bitmask is zero then the tag
check looks like this:

  JITTER_BRANCH_FAST_IF_AND (object, TAG_BIT_MASK, label);

If the result of a bitwise and gives non-zero, then the tag check
failed.  This can be done in two hardware instructions on every machine
I have thought of (maybe SH requires one move to r0; maybe not; I
forgot.  It's not essential), using some variant of compare then
conditional-branch, or and then branch-on-nonzero.
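In portable C, under an illustrative layout with three tag bits (the
real widths live in the JitterLisp headers), the zero-tag check is
just:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative tag layout: three low tag bits; the common type (for
   JitterLisp, fixnums) gets the all-zero tag. */
#define TAG_BIT_NO   3
#define TAG_BIT_MASK ((1UL << TAG_BIT_NO) - 1)

/* The whole type check: one bitwise and, then a test against zero. */
static bool
has_zero_tag (uintptr_t object)
{
  return (object & TAG_BIT_MASK) == 0;
}
```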

Now, fixnums should be tagged with zero, because this also makes several
arithmetic operations faster -- see
  example-vms/jitterlisp/jitterlisp-operations.h
for conditional code.  And yes, files such as this should be and will be
generated automatically by Jitter, at least for the part about
arithmetic operations.
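The reason a zero tag helps arithmetic, sketched under the same
illustrative three-bit layout: a tagged fixnum is just its value
shifted left, so sums and differences work directly on the tagged
representation, with no untagging or retagging.

```c
#include <stdint.h>

#define TAG_BIT_NO 3   /* illustrative tag width */

/* With a zero tag a fixnum is its value shifted left by the tag width.
   (The decoding right shift is arithmetic on GCC and friends.) */
static uintptr_t fixnum_encode (intptr_t n)  { return (uintptr_t) n << TAG_BIT_NO; }
static intptr_t  fixnum_decode (uintptr_t o) { return (intptr_t) o >> TAG_BIT_NO; }

/* Tagged addition is a single hardware add: the result is already
   correctly tagged. */
static uintptr_t fixnum_plus (uintptr_t a, uintptr_t b) { return a + b; }
```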

Fixnums tagged with zero will also make overflow checking faster:
remember that overflow checking is implemented at a low level, and
relies on assembly.  I cannot do overflow checks on, say, 29-bit
numbers.  That can be done in C, and in fact I have recently added
support for this in Jitter, but it will never be as efficient.
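The C-level check on narrow numbers that this alludes to can be
sketched like this (a hypothetical helper; the shifted range bounds are
what make it slower than the word-sized assembly check):

```c
#include <stdbool.h>

/* A signed value fits in n bits iff it lies in [-2^(n-1), 2^(n-1)-1];
   a narrow operation overflows iff the full-word result falls outside
   that range.  Assumes 1 < bits < the width of long. */
static bool
fits_in_signed_bits (long x, unsigned bits)
{
  long lo = - (1L << (bits - 1));
  long hi = (1L << (bits - 1)) - 1;
  return x >= lo && x <= hi;
}
```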

Still, other tags need to be checked for.

Forget that in Common Lisp car and cdr also accept nil as an argument.
JitterLisp does not.  Then, car (like cdr) becomes another interesting
test case.  car takes one argument, fast-branches away if it is not a
cons, and does its thing.  Conses will be tagged, in their pointer, with
a non-zero value.  So the code for a car operation will look like this:

  JITTER_BRANCH_FAST_IF_AND (object - CONS_TAG, TAG_BIT_MASK, label);
  result = * (object_t *) (object - CONS_TAG);

I first subtract the value of the tag from the tagged object: if the
argument really was a tagged pointer to a cons, this leaves the tag
bits set to zero.  If masking those bits yields nonzero (the AND
predicate) I have to branch away for error handling.

(The next line setting the result just shows that loading the car from a
tagged pointer can be done with one load instruction, using an offset to
compensate for the tag bits.  The idea works for cdr as well, and I have
always used this style in JitterLisp.)
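The same two lines in self-contained C, under an illustrative layout
with three tag bits and a cons tag of 1 (the real values live in the
JitterLisp headers):

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_BIT_NO   3
#define TAG_BIT_MASK ((1UL << TAG_BIT_NO) - 1)
#define CONS_TAG     1UL   /* illustrative non-zero tag */

/* Subtracting the tag zeroes the low bits of a well-tagged cons
   pointer, so the type check is subtract, and, test. */
static bool
is_cons (uintptr_t object)
{
  return ((object - CONS_TAG) & TAG_BIT_MASK) == 0;
}

/* One load: the tag doubles as a compensating addressing offset. */
static uintptr_t
car (uintptr_t object)
{
  return * (uintptr_t *) (object - CONS_TAG);
}

/* Demo: tag an aligned two-word cell as a cons and read its car back. */
static uintptr_t
demo_car_of_tagged_pair (void)
{
  static _Alignas (8) uintptr_t cell[2] = { 42, 0 };
  uintptr_t tagged = (uintptr_t) cell + CONS_TAG;
  return is_cons (tagged) ? car (tagged) : 0;
}
```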

Back to the non-zero tag check, this costs one more assembly instruction
than the zero tag check, because of the subtraction.  As unfortunate as
this is, I cannot think of any better solution.  Notice that one
instruction suffices even on the two-operand x86: a leaq (leal in 32-bit
code) instruction is enough to add a small constant to a register,
writing into another register.

So this is a good, fast implementation of car:

instruction primitive-car (?f)
  code
    JITTERLISP_BRANCH_FAST_IF_NONCONS (JITTER_TOP_MAINSTACK(), JITTER_ARGF0);
    JITTERLISP_CAR_(JITTER_TOP_MAINSTACK(), JITTER_TOP_MAINSTACK());
  end
end

And this is what the code looks like:
jitterlisp> (dump-data-locations)
 0.                     base: %rbx         (register)
 1.               residual 0: %r12         (register)
 2.            mainstack top: %rbp         (register)
 3.   mainstack undertop ptr: %r13         (register)
 4.      returnstack top ptr: %r14         (register)
Register ratio: 100%

# 0x6478b8: primitive-car/fR 0x6478c0 (20 bytes):
    0x00007fa71fffeea4 48 8d 45 fe              leaq   -0x2(%rbp),%rax
    0x00007fa71fffeea8 48 a9 07 00 00 00        testq  $0x7,%rax
    0x00007fa71fffeeae 0f 85 11 00 00 00        jne    0x00007fa71fffeec5
    0x00007fa71fffeeb4 48 8b 6d fe              movq   -0x2(%rbp),%rbp

Untag, test for zero, conditional branch.  Then do the operation.

The same on MIPS:

jitterlisp> (dump-data-locations)
 0.                     base: $16          (register)
 1.                  scratch: $23          (register)
 2.               residual 0: $17          (register)
 3.               residual 1: $18          (register)
 4.               residual 2: $19          (register)
 5.            mainstack top: $3           (register)
 6.   mainstack undertop ptr: $22          (register)
 7.      returnstack top ptr: $fp          (register)
# 0x4d6504: primitive-car/fR 0x4d6508 (20 bytes):
    0x7f5feebc 2462fffe         addiu   $2,$3,-2
    0x7f5feec0 30420007         andi    $2,$2,0x7
    0x7f5feec4 14400008         bnez    $2,0x7f5feee8
    0x7f5feec8 00000000         sll     $0,$0,0x0
    0x7f5feecc 8c63fffe         lw      $3,-2($3)

And finally here is the fast path of 1+, checking for the type and for
overflow:

# 0x13078b8: primitive-one-plus/fR 0x13078c0 (29 bytes):
    0x00007fcbde47eea4 48 f7 c5 0f 00 00 00     testq  $0xf,%rbp
    0x00007fcbde47eeab 0f 85 1d 00 00 00        jne    0x00007fcbde47eece
    0x00007fcbde47eeb1 48 89 e8                 movq   %rbp,%rax
    0x00007fcbde47eeb4 48 83 c0 10              addq   $0x10,%rax
    0x00007fcbde47eeb8 0f 80 10 00 00 00        jo     0x00007fcbde47eece
    0x00007fcbde47eebe 48 89 c5                 movq   %rax,%rbp

# 0x4d6d24: primitive-one-plus/fR 0x4d6d28 (44 bytes):
    0x7f5fe3c4 3062000f         andi    $2,$3,0xf
    0x7f5fe3c8 1440000f         bnez    $2,0x7f5fe408
    0x7f5fe3cc 00000000         sll     $0,$0,0x0
    0x7f5fe3d0 24640010         addiu   $4,$3,16
    0x7f5fe3d4 2402ffef         addiu   $2,$0,-17
    0x7f5fe3d8 00621026         xor     $2,$3,$2
    0x7f5fe3dc 00642826         xor     $5,$3,$4
    0x7f5fe3e0 00451024         and     $2,$2,$5
    0x7f5fe3e4 04400008         bltz    $2,0x7f5fe408
    0x7f5fe3e8 00000000         sll     $0,$0,0x0
    0x7f5fe3ec 00801825         or      $3,$4,$0

I am interested in feedback, as usual.  Please leave comments.

Back soon,

-- 
Luca Saiu
* GNU epsilon:           http://www.gnu.org/software/epsilon
* My personal web site:  http://ageinghacker.net

I support everyone's freedom of mocking any opinion or belief, no
matter how deeply held, with open disrespect and the same unrelenting
enthusiasm of a toddler who has just learned the word "poo".
