"Julian Ganz" <neither@nut.email> writes:
(Add Richard to CC)
Hi Pierrick,
December 5, 2024 at 12:33 AM, "Pierrick Bouvier" wrote:
On 12/2/24 11:41, Julian Ganz wrote:
+static void insn_exec(unsigned int vcpu_index, void *userdata)
+{
+ struct cpu_state *state = qemu_plugin_scoreboard_find(states, vcpu_index);
+ uint64_t pc = (uint64_t) userdata;
+ GString* report;
+
+ if (state->has_next) {
+ if (state->next_pc != pc) {
+ report = g_string_new("Trap target PC mismatch\n");
+ g_string_append_printf(report,
+ "Expected: %"PRIx64"\nEncountered: %"
+ PRIx64"\n",
+ state->next_pc, pc);
+ qemu_plugin_outs(report->str);
+ if (abort_on_mismatch) {
+ g_abort();
+ }
+ g_string_free(report, true);
+ }
+ state->has_next = false;
+ }
+}
When booting an arm64 vm, I get this message:
Trap target PC mismatch
Expected: 23faf3a80
Encountered: 23faf3a84
A colleague of mine went to great lengths trying to track and reliably
reproduce this. We think that it's something amiss with the existing
instruction exec callback infrastructure. So... it's not something I'll
be addressing with the next iteration as it's out of scope. We'll
probably continue looking into it, though.
The mismatch is reported perfectly normal and boring exceptions and
interrupts with no indication of any differences to other (not reported)
events that fire on a regular basis. Apparently, once in a blue moon
(relatively speaking), for the first instruction of a handler (even
though it is definitely executed and qemu does print a trace-line for
that instruction):
| Trace 0: 0x7fffa0b03900 [00104004/000000023fde73b4/00000021/ff020200]
| Trace 0: 0x7fffa02d9580 [00104004/000000023fde72b8/00000021/ff020200]
| Trace 0: 0x7fffa02dfc40 [00104004/000000023fde7338/00000021/ff020200]
| Trace 0: 0x7fffa0b03d00 [00104004/000000023fde73d4/00000021/ff020200]
| Trace 0: 0x7fffa0b03e80 [00104004/000000023fde73d8/00000021/ff020200]
| Trace 0: 0x7fffa0b04140 [00104004/000000023fde7408/00000021/ff020200]
| Trace 0: 0x7fffa02dd6c0 [00104004/000000023fde70b8/00000021/ff020200]
| Trace 0: 0x7fffa02dd800 [00104004/000000023fde7b90/00000021/ff020200]
| cpu_io_recompile: rewound execution of TB to 000000023fde7b90
So this happens when an instruction that is not the last instruction of
the block does some IO. As IO accesses can potentially change system
state we can't allow more instructions to run in the block that might
not have that change of state captured
cpu_io_recompile exits the loop and forces the next TranslationBlock to
be only one (or maybe two instructions). We have to play games with
instrumentation to avoid double counting execution:
/*
* Exit the loop and potentially generate a new TB executing the
* just the I/O insns. We also limit instrumentation to memory
* operations only (which execute after completion) so we don't
* double instrument the instruction.
*/
cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | n;
The instruction is in a weird state having both executed (from the
plugin point of view) but not changed any state (stopped from doing MMIO
until the next instruction).
| Taking exception 5 [IRQ] on CPU 0
| ...from EL1 to EL1
| ...with ESR 0x0/0x3800000
| ...with SPSR 0x20000305
| ...with ELR 0x23fde7b90
| ...to EL1 PC 0x23fd77a80 PSTATE 0x23c5
I guess before we re-executed the new block an asynchronous interrupt
came in?
Does changing the above to:
cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
make the problem go away? It should ensure the next 1/2 instruction
block execute without checking for async events. See gen_tb_start() for
the gory details.