guile 3 update

From: Andy Wingo
Subject: guile 3 update
Date: Sun, 22 Oct 2017 12:24:33 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)


I have been thinking on how to get from where we are to good native
compilation.  I think the next thing is instruction explosion.  As
described here:

    Currently in Guile's VM there are instructions like vector-ref.
    This is a little silly: there are also instructions to branch on the
    type of an object (br-if-tc7 in this case), to get the vector's
    length, and to do a branching integer comparison.  Really we should
    replace vector-ref with a combination of these test-and-branches,
    with real control flow in the function, and then the actual ref
    should use some more primitive unchecked memory reference
    instruction.  Optimization could end up hoisting everything but the
    primitive unchecked memory reference, while preserving safety, which
    would be a win.  But probably in most cases optimization wouldn't
    manage to do this, which would be a lose overall because you have
    more instruction dispatch.

    Well, this transformation is something we need for native
    compilation anyway.  I would accept a patch to do this kind of
    transformation on the master branch, after version 2.2.0 has forked.
    In theory this would remove almost all high-level instructions from
    the VM, making the bytecode closer to a virtual CPU, and likewise
    making it easier for the compiler to emit native code as it's
    working at a lower level.

I started looking at this and stumbled when figuring out how to
explode struct-ref.  In Guile 2.2, our "struct" data type is a bit
complicated, and when you go to inline it you realize that it
should be simpler.  So the first thing I did was make some changes
to structs.  I landed some changes in stable-2.2, as described by NEWS:

    * New interfaces
    ** `struct-ref/unboxed' and `struct-set!/unboxed'
    These procedures should be used when accessing struct fields with type
    `u' (unboxed).  See "Structure Basics" in the manual, for full details.
    * New deprecations
    ** Struct tail arrays deprecated
    Guile's structures used to have a facility whereby each instance of a
    vtable can contain a variable-length tail array of values.  The length
    of the tail array was stored in the structure.  This facility was
    originally intended to allow C code to expose raw C structures with
    word-sized tail arrays to Scheme.
    However, the tail array facility was confusing and didn't work very
    well.  It was very rarely used, but it insinuated itself into all
    invocations of `make-struct'.  For this reason the clumsily-named
    `make-struct/no-tail' procedure can actually be more elegant in
    practice, because it doesn't have a random `0' argument stuck in the middle.
    Tail arrays also inhibit optimization by allowing instances to affect
    their shapes.  In the absence of tail arrays, all instances of a given
    vtable have the same number and kinds of fields.  This uniformity can be
    exploited by the runtime and the optimizer.  The presence of tail arrays
    makes some of these optimizations more difficult.
    Finally, the tail array facility is ad-hoc and does not compose with the
    rest of Guile.  If a Guile user wants an array with user-specified
    length, it's best to use a vector.  It is more clear in the code, and
    the standard optimization techniques will do a good job with it.
    For all of these reasons, tail arrays are deprecated in Guile 2.2 and
    will be removed from Guile 3.0.  Likewise, `make-struct' /
    `scm_make_struct' is deprecated in favor of `make-struct/no-tail' /
    `scm_make_struct_no_tail'.  Perhaps one day we will be able to reclaim
    the `make-struct' name!
    ** Struct "self" slots deprecated
    It used to be that you could make a structure vtable that had "self"
    slots.  Instances of that vtable would have those slots initialized to
    the instance itself.  This can be useful in C code where you might have
    a pointer to the data array, and want to get the `SCM' handle for the
    structure.  However this was a little used complication without any use
    vin Scheme code.  To replace it, just use "p" slots and initialize the
    slot values manually on initialization.
    ** Struct fields with opaque ("o") protection deprecated
    Struct fields are declared with a "protection", meaning read-only ('r'),
    read-write ('w'), or opaque ('o').  There is also "hidden" ('h') which
    is read-write but which isn't initialized by arguments passed to
    `make-struct/no-tail', but that's a detail.  Opaque struct fields were
    used to allocate storage in a struct that could only be accessed by C.
    This facility was very rarely used (unused in Guile itself) but now that
    we are implementing more and more in Scheme, it is completely useless.
    To enforce permissions on struct fields, instead layer on an abstraction
    at a higher level, in the same way that immutable record fields are
    simply those which don't have an accessor.
    ** Using `struct-ref' and `struct-set!' on unboxed fields is deprecated
    Use the new `struct-ref/unboxed' and `struct-set!/unboxed' instead.

The master branch then removes support for deprecated behavior, and on
top of that does this:

    ** By default, GOOPS classes are not redefinable
    It used to be that all GOOPS classes were redefinable, at least in
    theory.  This facility was supported by an indirection in all "struct"
    instances, even though only a subset of structs would need redefinition.
    We wanted to remove this indirection, in order to speed up Guile
    records, allow immutable Guile records to eventually be described by
    classes, and allow for some optimizations in core GOOPS classes that
    shouldn't be redefined anyway.
    Thus in GOOPS now there are classes that are redefinable and classes
    that aren't.  By default, classes created with GOOPS are not
    redefinable.  To make a class redefinable, it should be an instance of
    `<redefinable-class>'.  See "Redefining a Class" in the manual for more
    details.
    ** Remove "self" field from vtables and "redefined" field from classes
    These fields were used as part of the machinery for class redefinition
    and are no longer needed.

Anyway, structs are simpler now.  OK now on to the heart of this
mail, instruction explosion.  I have gone through all instructions in
Guile 2.2's VM and written down "exploded" variants of them.  The goal
is to make the VM lower-level: closer to a CPU instruction set.  The
current work moves in that direction but doesn't quite get there.

Here's an annotated example of how `(vector-ref V I)' might compile.
Note that this instruction receives I as an untagged 64-bit number.

    ;; Compare the 3 lowest bits (the itag) of V to 0b000, and set the flags
    ;; register.  The flags register is internal to the VM, and has 3
    ;; bits: less-than, equal, and greater-than.  This comparison
    ;; operation will set less-than and greater-than to 0, and set the
    ;; equal bit to 1 if the itag matches the given value, and 0
    ;; otherwise.
    ;; Since 000 is the itag of heap objects (non-immediates), this
    ;; comparison is just, "is V a heap object"?
    cmp-itag3 V 000

    ;; If the equal bit is not set, go to the "not-vector" label.  This
    ;; instruction is "Jump if Not Equal", and gets its name from that.
    ;; The available conditional jumps are JL, JLE, JE, JGE, and JG,
    ;; which branch if the less-than, less-than or equal, equal,
    ;; greater-than or equal, or greater-than bits are set,
    ;; respectively.  Then there are the corresponding JNL, JNLE, JNE,
    ;; JNGE, and JNG instructions, which branch if no corresponding bit
    ;; is set.
    jne -> not-vector

    ;; Incidentally, this split between comparisons and jumps is a
    ;; departure from 2.2, where there are instructions like br-if-< and
    ;; so on.  This attempt at a 3.0 instruction set will privilege
    ;; orthogonality: there are N types of jumps that one might want to
    ;; make, and M types of comparisons, and better to have N + M rather
    ;; than N * M.  Of course some comparisons are invertible: (< N M)
    ;; may be the same as (not (>= N M)), but not all are (notably
    ;; comparisons with NaN).  So, like ARM and x86, the 3.0 VM adds a
    ;; flags register.  RISC-V incidentally uses an approach more like
    ;; 2.2 with fused compare-and-branch instructions, but sadly their
    ;; floating-point support writes to a general-purpose register, then
    ;; you have to test-and-branch on that -- so it seems that a flags
    ;; register may be best.

    ;; OK, now we compare the low bits of the first word of V to the
    ;; vector typecode (scm_tc7_vector).  All heap objects have a type
    ;; code in their low bits.
    cmp/tc7 V vector

    ;; If the tc7 matched, the equal bit of the flags register will be
    ;; set, and unset otherwise.  If it wasn't a vector, branch to the
    ;; error case.
    jne -> not-vector

    ;; If we got here, V is a vector.  To get its length, first we fetch
    ;; its first word, then we right-shift by 8 bits.
    ;; Here we fetch the first word of V and store it to register TMP as
    ;; an unsigned 64-bit integer.  Four things to note:
    ;; One, this VM still has unlimited registers, which are basically
    ;; stack slots.  Each function ensures that the stack has enough
    ;; space for its slots, in the same way as Guile 2.2.  I will figure
    ;; out register allocation later.
    ;; Two, as in Guile 2.2, the instructions are typed.  This one takes
    ;; a SCM value and a U64 index and writes a U64 value.
    ;; Thirdly, this fetch performs no checking.  Its safety should be
    ;; guaranteed by construction.  In this case, safety is ensured
    ;; because the object has an itag of 000, indicating that it's a
    ;; heap object and thus that there is at least one word addressable
    ;; at the pointer encoded by V.
    ;; Finally, you may have noted that this fetch is redundant: we
    ;; already had to fetch the first word for the cmp/tc7.  Indeed.
    ;; However it's easier for the current compiler to remove the tc7
    ;; test if it takes the vector as an argument, instead of taking the
    ;; first word as an argument.  More on this later.
    word-ref TMP V 0

    ;; Right-shift the first word by 8 bits and write the U64 result to
    ;; LENGTH.
    ursh/imm LENGTH TMP 8

    ;; Now compare the U64 value in I (the index) to LENGTH.
    cmp-u64 I LENGTH

    ;; If I was greater than or equal to LENGTH, then error.
    jge -> index out of range for vector-ref

    ;; Otherwise we're good!  The value for index I will be the word
    ;; stored at offset I+1 in V, so add 1 to I and dereference V.
    uadd/imm I* I 1
    scm-ref V I*
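
For illustration, here is a rough C sketch of that same exploded
sequence, assuming the layout described above (itag in the low 3 bits of
the SCM word, tc7 in the low bits of a heap object's first word, length
in the bits above the 8 tag bits).  The TC7_VECTOR value and all names
here are illustrative, not Guile's actual constants or API:

```c
#include <stdint.h>

typedef uintptr_t SCM;

#define ITAG_MASK    0x7   /* low 3 bits; 000 means heap object */
#define TC7_MASK     0x7f  /* 7-bit typecode in a heap object's first word */
#define TC7_VECTOR   13    /* illustrative typecode, not Guile's actual value */
#define LENGTH_SHIFT 8     /* length lives above the 8 tag bits */

/* Returns 0 on success, -1 if V is not a vector, -2 if I is out of range. */
int exploded_vector_ref(SCM v, uint64_t i, SCM *out) {
  if ((v & ITAG_MASK) != 0)                  /* cmp-itag3 V 000; jne */
    return -1;
  uintptr_t word0 = *(uintptr_t *)v;         /* word-ref TMP V 0 */
  if ((word0 & TC7_MASK) != TC7_VECTOR)      /* cmp/tc7 V vector; jne */
    return -1;
  uintptr_t length = word0 >> LENGTH_SHIFT;  /* ursh/imm LENGTH TMP 8 */
  if (i >= length)                           /* cmp-u64 I LENGTH; jge */
    return -2;
  *out = ((SCM *)v)[i + 1];                  /* uadd/imm I* I 1; scm-ref V I* */
  return 0;
}
```

Each branch in the C corresponds to one compare-and-jump pair in the
bytecode; the final line is the lone unchecked memory reference.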

Now, some bigger picture things to note.

One advantage of instruction explosion is optimizability.  If we know
that V is a vector (perhaps because we already ran vector-length on it,
or because we already saw that (vector? V) => #t, or we saw its
constructor, or something like that), then the compiler can elide the
first two checks (respectively because it knows that the itag3 of a
vector is indeed 0, and because the tc7 of a vector is scm_tc7_vector).
If we know the length of the vector (or perhaps the minimum length of
the vector), the length test may fold as well, leaving us with a lone
memory reference.
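
In C terms, that best case (type and bounds both proven) leaves nothing
but the unchecked load; names are again illustrative:

```c
#include <stdint.h>

typedef uintptr_t SCM;

/* What remains of vector-ref once the compiler has proven that V is a
   vector and that I is in range: a lone memory reference at offset I+1,
   i.e. the scm-ref at the end of the exploded sequence. */
SCM vector_ref_unchecked(SCM v, uint64_t i) {
  return ((SCM *)v)[i + 1];
}
```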

Another advantage is that because this instruction set is more
orthogonal, it's easier to construct specialized variants.  For example
right now in Guile 2.2's VM, there is vector-ref and vector-ref/imm, the
second of which is when we know the index.  It's easy to see how we'd
specialize the above instruction sequence to include an immediate index;
small orthogonal instructions allow easier specialization.

Clearly an exploded instruction sequence is going to mean more
instructions, unless the optimizer has an especially good day ;) So as
long as we are using a bytecode interpreter, that's going to be a risk.
Even though each bytecode is doing less work, there is a constant
overhead to dispatch of each instruction, and this overhead will play a
larger role in the performance of a Guile 3.0 bytecode interpreter.
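
To make that dispatch cost concrete, here is a toy switch-dispatched
loop; the opcodes and layout are made up for the sketch.  Every
instruction, however thin, pays the opcode fetch and branch at the top,
so exploding one fat instruction into five thin ones multiplies that
fixed cost:

```c
#include <stdint.h>

enum { OP_ADD_IMM, OP_HALT };

/* Each iteration pays a fixed dispatch cost (opcode fetch plus indirect
   branch) regardless of how little work the instruction itself does. */
int64_t run(const uint8_t *code, int64_t acc) {
  for (;;) {
    switch (*code++) {
    case OP_ADD_IMM: acc += *code++; break;
    case OP_HALT:    return acc;
    }
  }
}
```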

There are a couple of things that mitigate this overhead, however,
besides the possibility that parts of an exploded instruction sequence
may be elided by the optimizer.  One is that in some near future,
instead of emitting bytecode to be interpreted by a VM running on top of
the CPU, we will emit machine code instead to be interpreted directly by
the CPU.  I think that's close, although we need to figure out some
other issues (for example, how to link to runtime routines written in
C, and how to keep portability).

However I think there's also room for a simple template JIT in the very
short term.  Because these instructions correspond directly to machine
instructions, it will be relatively little work to write a simple
template JIT, perhaps using lightning.  I am really concerned about JIT
complexity, but am happy going this way because we already do all our
optimization in the general compiler path and the JIT would really only
have to emit per-instruction machine code.  Because it doesn't need to
optimize and because it doesn't change the operational model (e.g. from
Scheme's perspective we are still working on bytecodes that are
available, can be parsed, have debugging information, etc), the
complexity is bounded.  Having a simple JIT would allow us to ensure
that even though we explode the instructions, we're still getting
reasonable performance.
I should also note about GC -- with exploded instructions, there are
more risks.  For example, instead of incrementing the vector index by
one, we could have extracted the base pointer for the vector (V+1).
However it's V that's protected by GC, not V+1; maybe having a loose V+1
reference would not be enough to keep the vector's storage alive.  So,
I've been careful to have any instruction that dereferences memory take
a SCM value as an input, assuming that SCM value protects its memory.

However with exploded instructions we might see more intermediate states
in some interrupt.  Of course proper interrupts are handled by the
handle-interrupts opcode, but there are also cross-thread GC interrupts
which currently work by sending all active threads a SIGPWR or something
like that.  Ideally in the future we could avoid signals and instead
wait for remote threads to come to safe points -- that would be best.
But until then we have to be ready for functions to be paused and GC'd
at any moment.  I think the strategy of conservatively marking the
innermost frame will continue to work fine, but we have seen some
reports of cross-thread GC problems right now... I don't know.

OK, that's all the things I can think of.  Over the next month I will be
incrementally adding new instructions, changing the compiler to lower to
the new instructions (probably after closure conversion, though who
knows), and removing old instructions.  Hopefully things remain
manageably fast until we can get the JIT in place.

Thoughts welcome if I have forgotten something.  Cheers :)

