qemu-devel

RE: Possible race condition in aspeed ast2600 smp boot on TCG QEMU


From: Troy Lee
Subject: RE: Possible race condition in aspeed ast2600 smp boot on TCG QEMU
Date: Mon, 15 Jan 2024 08:36:41 +0000

Hi Stephen and Cedric,

This issue hasn't been seen on a real platform, but it sometimes happens
in emulators, e.g. Simics.

> Adding Aspeed Engineers. This reminds me of a discussion a while ago.
> 
> On 1/11/24 18:38, Stephen Longfield wrote:
> > We’ve noticed inconsistent behavior when running a large number of
> > aspeed ast2600 executions, that seems to be tied to a race condition
> > in the smp boot when executing on TCG-QEMU, and were wondering what a
> > good mediation strategy might be.
> >
> > The problem first shows up as part of SMP boot. On a run that’s likely
> > to later run into issues, we’ll see something like:
> >
> > ```
> > [    0.008350] smp: Bringing up secondary CPUs ...
> > [    1.168584] CPU1: failed to come online
> > [    1.187277] smp: Brought up 1 node, 1 CPU
> > ```
> >
> > Compared to the more likely to succeed:
> >
> > ```
> > [    0.080313] smp: Bringing up secondary CPUs ...
> > [    0.093166] smp: Brought up 1 node, 2 CPUs
> > [    0.093345] SMP: Total of 2 processors activated (4800.00 BogoMIPS).
> > ```
> >
> > It’s somewhat reliably reproducible by running the ast2600-evb with an
> > OpenBMC image, using ‘-icount auto’ to slow execution and make the race
> > condition more frequent (it happens without this, just easier to debug
> > if we can reproduce):
> >
> >
> > ```
> > ./aarch64-softmmu/qemu-system-aarch64 -machine ast2600-evb -nographic \
> >   -drive file=~/bmc-bin/image-obmc-ast2600,if=mtd,bus=0,unit=0,snapshot=on \
> >   -nic user -icount auto
> > ```

Have you tried running QEMU with "-smp 2"?

> >
> > Our current hypothesis is that the problem comes up in the platform
> > uboot.  As part of the boot, the secondary core waits for the smp
> > mailbox to get a magic number written by the primary core:
> >
> > https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L168
> >
> > However, this memory address is cleared on boot:
> >
> > https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L146
> >
> > The race condition occurs if the primary core runs far ahead of the
> > secondary core: if the primary core gets to the point where it signals
> > the secondary core’s mailbox before the secondary core gets past the
> > point where it does the initial reset and starts waiting, the reset
> > will clear the signal, and then the secondary core will never get past
> > the point where it’s looping in `poll_smp_mbox_ready`.
> >
> > We’ve observed this race happening by dumping all SCU reads and
> > writes, and validated that this is the problem by using a modified
> > `platform.S` that doesn’t clear the =SCU_SMP_READY mailbox on reset,
> > but would rather not have to use a modified version of SMP boot just
> > for QEMU-TCG execution.

To prevent the race condition described, both CPU#0 and CPU#1 zero SCU188
as early as possible. After that, CPU#0 has to execute at least 100
instructions before it gets the chance to set SCU188 to 0xbabecafe. On
real, parallel hardware it is unusual for CPU#1 to lag behind CPU#0 by
100 instruction cycles.
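
The timing argument above can be sketched as a toy model. This is not
QEMU or u-boot code: the function name, its parameters and the one-step
granularity are assumptions made for illustration only. The model steps
both cores against a shared mailbox standing in for SCU188 and reports
whether CPU#1 ever sees the magic value.

```python
MAGIC = 0xBABECAFE  # value CPU#0 writes to SCU188, as stated above


def secondary_comes_online(primary_gap, secondary_lag, poll_budget=1000):
    """Toy model of the mailbox handshake (hypothetical, for illustration).

    primary_gap:   steps between CPU#0's clear (at step 0) and its magic
                   write; the text above puts this at 100 or more.
    secondary_lag: step at which CPU#1 performs its own clear.
    poll_budget:   how many extra steps CPU#1 is allowed to keep polling.
    """
    if primary_gap < 1:
        raise ValueError("primary_gap must be at least 1")

    mailbox = 0
    last_step = max(primary_gap, secondary_lag) + poll_budget
    for step in range(last_step + 1):
        # CPU#0: clear the mailbox first, write the magic value later.
        if step == 0:
            mailbox = 0
        elif step == primary_gap:
            mailbox = MAGIC

        # CPU#1: clear the mailbox once, then poll it on every later step.
        if step == secondary_lag:
            mailbox = 0  # wipes the signal if CPU#0 has already written it
        elif step > secondary_lag and mailbox == MAGIC:
            return True  # CPU#1 leaves its poll loop
    return False  # CPU#1 never saw the magic value


if __name__ == "__main__":
    for lag in (0, 99, 100, 500):
        print(lag, secondary_comes_online(primary_gap=100, secondary_lag=lag))
```

In this model CPU#1 comes online whenever its clear happens before
CPU#0's write (a lag below 100 steps here) and is stuck polling
otherwise, which matches the "failed to come online" case when one
emulated core runs far ahead of the other.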

> 
> you could use '-trace aspeed_scu*' to collect the MMIO accesses on the SCU
> unit. A TCG plugin also.
> 
> > Is there a way to have QEMU insert a barrier synchronization at some
> > point in the bootloader?  I think getting both cores past the
> > =SCU_SMP_READY reset would get rid of this race, but I’m not aware of
> > a way to do that kind of thing in QEMU-TCG.
> >
> > Thanks for any insights!
> 
> Could we change the default value to registers 0x180 ... 0x18C in
> hw/misc/aspeed_scu.c to make sure the SMP regs are immune to the race ?
> 
> Thanks,
> 
> C.

Thanks,
Troy Lee
