This includes:
- implementing SHA and CMPccXADD instruction extensions
- introducing a new mechanism for flags writeback that avoids a
  tricky failure (a sketch of the idea follows this list)
- converting the more orthogonal parts of the one-byte opcode
map, as well as the CMOVcc and SETcc instructions.
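
Roughly, the new writeback mechanism has the gen_* emitters record the
flag values in the decode structure instead of storing to the cpu_cc_*
globals directly; the common decode loop then flushes them only after
the operand writeback has completed, so a faulting memory store cannot
leave the flags half-updated.  A simplified sketch of the flush step
(illustrative field names, not necessarily the exact code):

    /* Common decode loop, after the emitter and the register/memory
     * writeback have run.  Assumes cc_op == -1 marks "flags not
     * touched by this instruction". */
    if (decode.cc_op != -1) {
        if (decode.cc_dst) {
            tcg_gen_mov_tl(cpu_cc_dst, decode.cc_dst);
        }
        if (decode.cc_src) {
            tcg_gen_mov_tl(cpu_cc_src, decode.cc_src);
        }
        set_cc_op(s, decode.cc_op);
    }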
Tested by booting several 32-bit and 64-bit guests.
The new decoder produces roughly 2% more ops, but after optimization there
are just 0.5% more, and almost all of them come from cmp instructions.
For some reason that I have not investigated, these end up with an extra
mov even after optimization:
    new decoder:                      old decoder:
    sub_i64 tmp0,rax,$0x33            sub_i64 cc_dst,rax,$0x33
    mov_i64 cc_src,$0x33              mov_i64 cc_src,$0x33
    mov_i64 cc_dst,tmp0
    discard cc_src2                   discard cc_src2
    discard cc_op                     discard cc_op
It could easily be fixed by not reusing gen_SUB for cmp instructions
(a sketch of that option follows), or by debugging what goes on in the
optimizer.  However, the extra mov does not result in larger generated
assembly.
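
For the first option, a dedicated emitter could subtract straight into
the value recorded for the flags destination, instead of going through
the result temporary that gen_SUB needs for its writeback (that copy is
where the extra mov comes from).  A rough sketch, with the signature
and field names only loosely following the new decoder, and details
such as the cc_srcT cache omitted:

    static void gen_CMP(DisasContext *s, CPUX86State *env,
                        X86DecodedInsn *decode)
    {
        MemOp ot = decode->op[0].ot;

        /* cmp discards its result, so there is nothing to write back:
         * the subtraction can go directly into the flags value, with
         * no temporary for the optimizer to clean up. */
        decode->cc_src = s->T1;
        decode->cc_dst = tcg_temp_new();
        tcg_gen_sub_tl(decode->cc_dst, s->T0, s->T1);
        decode->cc_op = CC_OP_SUBB + ot;
    }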