Re: [RFC]: port of embedded x86-mini disassembler to QEMU
From: Michael Clark
Subject: Re: [RFC]: port of embedded x86-mini disassembler to QEMU
Date: Sat, 11 Jan 2025 09:09:09 +1300
User-agent: Mozilla Thunderbird
On 1/11/25 05:05, Paolo Bonzini wrote:
On Fri, 10 Jan 2025, 14:03 Michael Clark <michael@anarch128.org> wrote:
On 1/11/25 00:07, Paolo Bonzini wrote:
On Fri, 10 Jan 2025, 10:52 Michael Clark <michael@anarch128.org> wrote:
a note to announce a port of the x86-mini disassembler to QEMU.
- https://github.com/michaeljclark/qemu/tree/x86-mini
I assume the huge .h files are autogenerated? If so, QEMU cannot use them
without including the human-readable sources in the tree.
yes indeed. there is an x86_tablegen.py python script in the other repo
but it is not in the current patch. it would be somewhat easy to read
the tables from CSV files directly into arrays at the expense of several
more milliseconds during startup. the revised operand formats map
relatively strictly to enum definitions with string tables in the source
so a reader in C would not be impossible
Building the tables at compile time is fine, only leaving out the script is
not.
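As an illustration of the compile-time option discussed above, here is a minimal sketch of a generator that turns a CSV opcode listing into a C array initializer. It is not the x86_tablegen.py script from the other repo; the CSV columns, the `90+r` opcode notation, the `opc_entry` struct and the `x86_opc_table` array name are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical CSV layout; the real x86-mini metadata files may differ.
SAMPLE_CSV = """mnemonic,opcode,operands
nop,90,
xchg,90+r,"eax,r32"
pause,F3 90,
"""


def c_string(value):
    """Quote a value as a C string literal (backslash and quote escaped)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def csv_to_c_table(text, array_name="x86_opc_table"):
    """Render CSV rows as a C array initializer, one entry per row."""
    rows = csv.DictReader(io.StringIO(text))
    lines = ["static const struct opc_entry %s[] = {" % array_name]
    for row in rows:
        fields = (row["mnemonic"], row["opcode"], row["operands"])
        lines.append("    { %s }," % ", ".join(c_string(f) for f in fields))
    lines.append("};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(csv_to_c_table(SAMPLE_CSV), end="")
```

A script of this shape could be run by the build system to produce the header, so that only the CSV files and the script need to live in the tree.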
fair enough. I wanted to test the disassembler and I figured out how to
do that with both QEMU host and target. I haven't learned how to create
generative dependencies in meson yet but it can't be as bad as CMake.
QEMU running openssl is a pretty good torture test. I am going to spend
time analyzing the -d in_asm,out_asm logs for openssl. I don't yet have
a pseudo alias translation step so NOP still shows as XCHG eax,eax.
and fuzzing x86_64 was extremely interesting as it uncovered some
hardware bugs that led to historic findings inside the QEMU translator.
so I know that the level of accuracy is somewhat good. for example:
NOP -> XCHG eax,eax
REX.B XCHG eax,eax -> XCHG eax,r8d
PAUSE -> REP NOP -> REP XCHG eax,eax
REX.B PAUSE -> REP REX.B XCHG eax,eax -> REP XCHG eax,r8d
it seems Intel filters out REX.B for NOP but not REP NOP. and I know
what QEMU does. it does what one expects. unused REP is undefined but
typically is ignored for non string instructions with the exception of
0F, 0F38, 0F3A where REP/F3 is interpreted as part of the opcode. but
Intel has made REP XCHG eax,r8d act like REP NOP. I haven't tested this
out on AMD hardware but I consider it a silicon bug on Intel. there is a
test case on this binutils issue. in any case, this is in QEMU history.
- https://sourceware.org/bugzilla/show_bug.cgi?id=32462
- https://www.blackhat.com/docs/us-17/thursday/us-17-Domas-Breaking-The-x86-ISA.pdf
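The four decodings listed above, and the missing pseudo alias step, can be sketched as follows. This is an illustrative toy, not code from x86-mini or QEMU: `decode_90` and `apply_alias` are hypothetical names, only the byte patterns `[F3] [REX] 90` are handled, and REX.W and other prefixes are ignored. The alias table deliberately covers only the two literal forms that have architectural names; it makes no claim about how hardware treats `REP XCHG eax,r8d`.

```python
# 32-bit register names indexed by register number (0-15).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [
    "r%dd" % n for n in range(8, 16)
]


def decode_90(code):
    """Literally decode [F3] [REX] 90 with no alias translation.

    Opcode 90+r is XCHG eax,r32; REX.B supplies bit 3 of the register number.
    """
    i = 0
    rep = False
    rex_b = False
    if i < len(code) and code[i] == 0xF3:          # REP/REPE prefix
        rep = True
        i += 1
    if i < len(code) and 0x40 <= code[i] <= 0x4F:  # REX prefix (64-bit mode)
        rex_b = bool(code[i] & 0x01)
        i += 1
    if i >= len(code) or code[i] != 0x90:
        raise ValueError("only the 0x90 opcode is handled in this sketch")
    reg = REG32[8 if rex_b else 0]
    return ("rep " if rep else "") + "xchg eax," + reg


# Pseudo alias step: map literal decodings to their conventional names.
ALIASES = {
    "xchg eax,eax": "nop",
    "rep xchg eax,eax": "pause",
}


def apply_alias(text):
    """Return the conventional mnemonic if one exists, else the literal form."""
    return ALIASES.get(text, text)


if __name__ == "__main__":
    for raw in (b"\x90", b"\x41\x90", b"\xf3\x90", b"\xf3\x41\x90"):
        literal = decode_90(raw)
        print(raw.hex(), "->", literal, "->", apply_alias(literal))
```

Keeping the alias step separate from the decoder means the literal form stays available for fuzzing comparisons, while the log output can still show the familiar mnemonics.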
I can see how that might be interesting for x86 virtualization where you
have only one target and therefore you can get rid of the capstone
dependency. At the same time, other virtualization targets like arm64 and
RISC-V are going to become more and more important, not less, and not having
to maintain a disassembler ourselves as part of QEMU is also a big plus...
yes indeed. but in an ideal world the encoders and decoders are matched
pairs. I would like to work on a translator or interpreter that uses the
same codec as the disassembler
Ok, that makes sense. QEMU already has a decoder that is very table-based
though the tables are hand written. I am not wed to it though—as long as
the code generators remain more or less unmodified, I would love to only
keep "this is how the operands are prepared for use in the IR emitters"
and make the details of x86 decoding Someone Else's Problem. So if you can
kill most (certainly not all) of the tables in
target/i386/tcg/decode-new.c.inc that would be interesting.
(I am sure you'd find some underspecified and/or wrong parts of the x86
spec, too :) For example many VEX classes are bollocks, plus some more
examples hinted at at the top of that file).
yes indeed. the metadata in the Intel SDM is littered with mistakes such
as field transpositions, typos and missing data. I would hazard a guess
that maybe ~71% of the metadata is usable in a machine readable manner.
given that LLVM tablegen has its own format, I consider x86-mini the
source of truth for metadata derived from the Intel format. I haven't
fuzz tested against NASM yet, but I found a small number of errors
in LLVM, albeit mostly in instructions that are not used in anger.
Michael.
Paolo
anyway, in fact it is just yet another disassembler at this point, but
the codec emitter works. it doesn't yet have an arch-neutral TCG-like
API and IR to drive it.