[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH v3 11/11] tests: add plugin asserting correctness of disc
From: |
Alex Bennée |
Subject: |
Re: [RFC PATCH v3 11/11] tests: add plugin asserting correctness of discon event's to_pc |
Date: |
Thu, 09 Jan 2025 16:35:54 +0000 |
User-agent: |
mu4e 1.12.8; emacs 29.4 |
Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:
> Hi Julian,
>
> On 12/20/24 03:47, Julian Ganz wrote:
>> Hi Pierrick,
>> December 5, 2024 at 12:33 AM, "Pierrick Bouvier" wrote:
>>> On 12/2/24 11:41, Julian Ganz wrote:
>>>> +static void insn_exec(unsigned int vcpu_index, void *userdata)
>>>> +{
>>>> + struct cpu_state *state = qemu_plugin_scoreboard_find(states,
>>>> vcpu_index);
>>>> + uint64_t pc = (uint64_t) userdata;
>>>> + GString* report;
>>>> +
>>>> + if (state->has_next) {
>>>> + if (state->next_pc != pc) {
>>>> + report = g_string_new("Trap target PC mismatch\n");
>>>> + g_string_append_printf(report,
>>>> + "Expected: %"PRIx64"\nEncountered: %"
>>>> + PRIx64"\n",
>>>> + state->next_pc, pc);
>>>> + qemu_plugin_outs(report->str);
>>>> + if (abort_on_mismatch) {
>>>> + g_abort();
>>>> + }
>>>> + g_string_free(report, true);
>>>> + }
>>>> + state->has_next = false;
>>>> + }
>>>> +}
>>>>
>>> When booting an arm64 vm, I get this message:
>>> Trap target PC mismatch
>>> Expected: 23faf3a80
>>> Encountered: 23faf3a84
>> A colleague of mine went to great lengths trying to track and
>> reliably
>> reproduce this. We think that it's something amiss with the existing
>> instruction exec callback infrastructure. So... it's not something I'll
>> be addressing with the next iteration as it's out of scope. We'll
>> probably continue looking into it, though.
>> The mismatch is reported perfectly normal and boring exceptions and
>> interrupts with no indication of any differences to other (not reported)
>> events that fire on a regular basis. Apparently, once in a blue moon
>> (relatively speaking), for the first instruction of a handler (even
>> though it is definitely executed and qemu does print a trace-line for
>> that instruction):
>> | Trace 0: 0x7fffa0b03900
>> [00104004/000000023fde73b4/00000021/ff020200]
>> | Trace 0: 0x7fffa02d9580 [00104004/000000023fde72b8/00000021/ff020200]
>> | Trace 0: 0x7fffa02dfc40 [00104004/000000023fde7338/00000021/ff020200]
>> | Trace 0: 0x7fffa0b03d00 [00104004/000000023fde73d4/00000021/ff020200]
>> | Trace 0: 0x7fffa0b03e80 [00104004/000000023fde73d8/00000021/ff020200]
>> | Trace 0: 0x7fffa0b04140 [00104004/000000023fde7408/00000021/ff020200]
>> | Trace 0: 0x7fffa02dd6c0 [00104004/000000023fde70b8/00000021/ff020200]
>> | Trace 0: 0x7fffa02dd800 [00104004/000000023fde7b90/00000021/ff020200]
>> | cpu_io_recompile: rewound execution of TB to 000000023fde7b90
>> | Taking exception 5 [IRQ] on CPU 0
>> | ...from EL1 to EL1
>> | ...with ESR 0x0/0x3800000
>> | ...with SPSR 0x20000305
>> | ...with ELR 0x23fde7b90
>> | ...to EL1 PC 0x23fd77a80 PSTATE 0x23c5
>> | Trace 0: 0x7fffa13a8340 [00104004/000000023fd77a80/00000021/ff021201]
>> | Trace 0: 0x7fffa13a8480 [00104004/000000023fd77a84/00000021/ff020200]
>> | Trap target PC mismatch CPU 0
>> | Expected: 23fd77a80
>> | Encountered: 23fd77a84
>> | warning: 44 ./nptl/pthread_kill.c: No such file or directory
>> | Couldn't get registers: No such process.
>> It does show up with both single-core and multi-core VMs, so that at
>> least eliminates some possibilities. Maybe :/
>> The issue is nasty to reproduce in a way that allows any meaningful
>> investigation. It usually involves sifting through many GBs of Qemu logs
>> for maybe one occurance. We could add another testing/dummy plugin that
>> just prints the PC for _any_ instruction executed and have a skript
>> check for non-alternating Trace-lines from Qemu and that Plugin. But
>> then we're talking nearly double the amount of Lines to look through
>> with probably little additional information.
>>
>
> Thanks for the investigation.
> I could reproduce this with this command line:
> ./build/qemu-system-aarch64 -M virt -plugin
> ./build/tests/tcg/plugins/libdiscons.so,abort=on -m 8G -device
> virtio-blk-pci,drive=root -drive
> if=none,id=root,file=/home/user/.work/images/debianaarch64.img -M virt
> -cpu max,pauth=off -drive
> if=pflash,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd -drive
> if=pflash,file=/home/user/.work/images/AAVMF_VARS.fd -d
> plugin,in_asm,op -D crash.log
>
> # -d plugin,in_asm,op allows to dump asm of every translated block,
> plugin output (for discon plugin), and tcg op generated.
>
> It reliably crashes with a single address.
> Looking at the debug output (crash.log):
> ----------------
> IN:
> 0x23faf3a80: d108c3ff sub sp, sp, #0x230
> # => This bb has a single instruction as input
>
> OP:
> # this is the TB instrumentation
> ld_i32 loc0,env,$0xfffffffffffffff8
> brcond_i32 loc0,$0x0,lt,$L0
> st8_i32 $0x1,env,$0xfffffffffffffffc
>
> ---- 0000000000000a80 0000000000000000 0000000000000000
> # => we can see that there is no call_plugin, looks like
> instrumentation # is not applied
That is expected. We have previously called the instrumentation for the
instruction in the block that triggered cpu_io_recompile() so we suppress
everything apart from memory instrumentation for the re-execution so we
don't double count.
> sub_i64 sp,sp,$0x230
> add_i64 pc,pc,$0x4
> goto_tb $0x1
> exit_tb $0x7f7eedd355c1
> set_label $L0
> exit_tb $0x7f7eedd355c3
>
> ----------------
> IN:
> 0x23faf3a84: a9b007e0 stp x0, x1, [sp, #-0x100]!
> 0x23faf3a88: a9010fe2 stp x2, x3, [sp, #0x10]
> ...
>
> OP:
> ld_i32 loc0,env,$0xfffffffffffffff8
> brcond_i32 loc0,$0x0,lt,$L0
> st8_i32 $0x0,env,$0xfffffffffffffffc
>
> ---- 0000000000000a84 0000000000000000 0000000000000000
> # instruction is correctly applied
> call plugin(0x7f7eec96d530),$0x1,$0,$0x0,$0x23faf3a84
> mov_i64 loc2,sp
> ...
>
> Trap target PC mismatch
> Expected: 23faf3a80
> Encountered: 23faf3a84
>
> The thing interesting here is that we can notice that 23faf3a80 is a
> translation block with a single instruction, and we can see that
> instrumentation is not applied for this instruction (call_plugin is
> not present).
>
> Overall, it really looks like a bug on QEMU side, where we miss
> instrumenting something. I'll take a look. You can ignore this for
> now.
See my other email. I think the bug was allowing an async IRQ to
interfere with our "special" single instruction block.
>
>> Regards,
>> Julian
--
Alex Bennée
Virtualisation Tech Lead @ Linaro