qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado gree


From: Alex Bennée
Subject: Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green
Date: Wed, 13 Dec 2023 10:57:39 +0000
User-agent: mu4e 1.11.26; emacs 29.1

Alex Bennée <alex.bennee@linaro.org> writes:

> As I'm a glutton for punishment I thought I'd have a go at fixing the
> slowly growing number of record/replay bugs. The two fixes are:
>
>  replay: stop us hanging in rr_wait_io_event
>  chardev: force write all when recording replay logs
>
> I think we are beyond 8.2 material but it would be nice to get this
> functionality stable again. We have a growing number of bugs under the
> icount label on gitlab:
>
>   https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
>
> Changes
> -------
>
> v2
>
> Apart from addressing tidy ups and tags I've been investigating the
> failures in replay_linux.py which are the more exhaustive tests which
> boot the kernel and user-space. The "fix":
>
>   replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
>
> triggers around the time of the hang in the logs and despite the
> rather hairy EXCP->INT transitions around cpu_exec_loop() I think
> points to a genuine problem. I added the tracing to cputlb to verify
> the page tables are the same and started detecting divergence between
> record and replay a lot earlier on that when the replay_sync_error()
> catches things. I see patterns like this:
>
>    1878 tlb_fill 0x4770c000/1 1 2                                       
> tlb_fill 0x4770c000/1 1 2
>    1879 tlb_fill 0x4770d000/1 1 2                                       
> tlb_fill 0x4770d000/1 1 2
>    1880 tlb_fill 0x59000/1 0 2                                          
> tlb_fill 0x59000/1 0 2
>    1881                                                               > 
> tlb_fill 0x476dd116/1 0 2
>    1882 tlb_fill 0x4770e000/1 1 2                                       
> tlb_fill 0x4770e000/1 1 2
>    1883 tlb_fill 0x476dd527/1 0 2                                     | 
> tlb_fill 0x476dfb17/1 0 2
>    1884                                                               > 
> tlb_fill 0x476de0fd/1 0 2
>    1885                                                               > 
> tlb_fill 0x476dce2e/1 0 2
>    1886 tlb_fill 0x4770f000/1 1 2                                       
> tlb_fill 0x4770f000/1 1 2
>    1887 tlb_fill 0x476df939/1 0 2                                     <
>    1888 tlb_fill 0x47710000/1 1 2                                       
> tlb_fill 0x47710000/1 1 2
>    1889 tlb_fill 0x47711000/1 1 2                                       
> tlb_fill 0x47711000/1 1 2
>
> These don't seem to affect the overall program flow but are concerning
> because the memory access patterns should be the same. My
> investigations with rr seem to indicate the difference is due to
> behaviour of the victim_tlb_cache which again AFAICT should be
> deterministic.
>
> Anyway I can't spend any time debugging it this week so I thought I'd
> post the current state in case anyone is curious enough to want to go
> diving into record/replay.

Just for the actual failure report, on replay:

  [  144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0 
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 
msg='unit=systemd-tmpfiles-setup comm="
  systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
  [  145.579135] audit: type=1305 audit(1702415470.620:46): op=set 
audit_enabled=1 old=1 auid=4294967295 ses=4294967295 
subj=system_u:system_r:auditd_t:s0 res=1
  qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left, 
event 2848747 is EVENT_INSTRUCTION)
  fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT 
(Abort)

I've made some more fixes to the replay_dump script and the next event
expected is a CLOCK_WARP followed by some more instructions before an
exception. However there are a lot of exceptions/interrupts firing off
in the stream:

  2848736:EVENT_EXCEPTION(2) no data
  2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170
  2848738:EVENT_EXCEPTION(2) no data
  2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303
  2848740:EVENT_EXCEPTION(2) no data
  2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042
  2848742:EVENT_EXCEPTION(2) no data
  2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907
  2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848746:EVENT_INTERRUPT(1)
  2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561

  we are here...

  2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745
  2848750:EVENT_EXCEPTION(2) no data
  2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577
  2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583
  2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543
  2848756:EVENT_EXCEPTION(2) no data
  2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108
  2848758:EVENT_EXCEPTION(2) no data
  2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104
  2848760:EVENT_EXCEPTION(2) no data
  2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936
  2848762:EVENT_ASYNC(9)
    REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes

That said given the complexity of the exception/interrupt processing
back and forth I can see there is potential for things getting confused.
Of course I still haven't explained why there is a divergence in the tlb
processing so early on despite seemingly having no effect on the replay
until this very late stage.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



reply via email to

[Prev in Thread] Current Thread [Next in Thread]