From: Nicholas Piggin
Subject: Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Date: Thu, 14 Mar 2024 15:19:08 +1000

On Wed Mar 13, 2024 at 7:03 AM AEST, Alex Bennée wrote:
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
> >> >
> >> > That commit causes reverse_debugging.py test failures, and does
> >> > not seem to solve the root cause of the problem x86-64 still
> >> > hangs in record/replay tests.
> >>
> >> I'm still finding the reverse debugging tests failing with this series.
> >
> > :(
> >
> > In gitlab CI or your own testing? What are you running exactly?
>
> My own - my mistake I didn't get a clean build because of the format
> bug. However I'm seeing new failures:
>
> env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado
> run ./tests/avocado/reverse_debugging.py
> Fetching asset from
> ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt
> JOB ID : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4
> JOB LOG :
> /home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log
> (1/5)
> ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc:
> PASS (4.49 s)
> (2/5)
> ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35:
> PASS (4.50 s)
> (3/5)
> ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt:
> FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s)

Okay, this is the new test I added. It runs for 1 second then
reverse-steps from the end of the trace. aarch64 is flaky: after the
reverse-step (which is basically the second replay), pc is at a
different place at the same icount. This indicates some non-determinism
in execution, or that something in machine reset or migration is not
restoring the state exactly.
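
As a rough illustration of that check (a sketch only, not the actual test
code), the comparison boils down to finding the first icount at which two
runs disagree on pc. The helper name first_divergence and the icount values
are made up for the example; only the two pc values are taken from the
failure message quoted above.

```python
# Hypothetical helper for illustration; not part of QEMU or the avocado tests.
def first_divergence(first_run, second_run):
    """Compare two runs given as iterables of (icount, pc) pairs.

    Returns (icount, pc_first, pc_second) for the earliest icount present in
    both runs where the pc differs, or None if the runs agree everywhere.
    """
    second = dict(second_run)
    for icount, pc in sorted(first_run):
        other = second.get(icount)
        if other is not None and other != pc:
            return icount, pc, other
    return None


# Made-up icount values; the last pc pair mirrors the "Invalid PC" message.
first_replay = [
    (100, 0xFFFF2D941E4D7F20),
    (101, 0xFFFF2D941E4D7F24),
    (102, 0xFFFF2D941E4D7F2C),
]
second_replay = [
    (100, 0xFFFF2D941E4D7F20),
    (101, 0xFFFF2D941E4D7F24),
    (102, 0xFFFF2D941E4D7F28),
]

result = first_divergence(first_replay, second_replay)
if result is not None:
    icount, expected, got = result
    print(f"icount {icount}: read {got:x} instead of {expected:x}")
```

A deterministic replay with exact state restore would make this return None
for any pair of runs over the same trace.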

aarch64 ran okay a few times, including in gitlab CI, before I posted
the series, but it turns out it does break quite often too.

x86 has a problem with this too, so I disabled it there. I'll disable it
for aarch64 too for now.

x86 and aarch64 can run the replay_linux.py test quite well (after this
series), which is much longer and more complicated. The difference there
is that it is only a single replay: it never resets the machine or
loads the initial snapshot for reverse-debugging. So to me that
indicates that execution is probably deterministic, and it's the
reset/reload that has the problem.

Thanks,
Nick