On 6/29/22 16:14, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:
On 6/24/22 18:50, Cédric Le Goater wrote:
On 6/23/22 20:43, Peter Delevoryas wrote:
On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
On 6/23/22 12:26, Peter Delevoryas wrote:
Signed-off-by: Peter Delevoryas <pdel@fb.com>
Let's start simple without flash support. We should be able to
load FW blobs in each CPU address space using loader devices.
Actually, I was unable to do this, perhaps because the fb OpenBMC
boot sequence is a little weird. I specifically _needed_ to have
a flash device which maps the firmware in at 0x2000_0000, because
the fb OpenBMC U-Boot SPL jumps to that address to start executing
from flash? I think this is also why fb OpenBMC machines can be so slow.
$ ./build/qemu-system-arm -machine fby35 \
-device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
-d int -drive file=fby35.mtd,format=raw,if=mtd
Ideally we should be booting from the flash device directly using
the machine option '-M ast2600-evb,execute-in-place=true' like HW
does. Instructions are fetched using SPI transfers. But the amount
of code generated is tremendous.
Yeah, because there is a potential race when reading from HW, we throw
away TBs after executing them: we have no way of knowing whether the
code has changed under our feet. See 873d64ac30 (accel/tcg: re-factor
non-RAM execution code), which cleaned up this handling.
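To illustrate why discarding translations is so costly, here is a toy
model in C. It is not QEMU code: the names toy_tb_cache, toy_exec_insn
and toy_run_loop are hypothetical, and the only thing the model does is
count how often an instruction has to be translated when the result may
or may not be kept.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, NOT QEMU code: every name here is hypothetical.
 * It only counts how often a "translation" has to happen.
 */
#define TOY_MAX_INSNS 64

struct toy_tb_cache {
    bool translated[TOY_MAX_INSNS]; /* one slot per guest instruction */
    unsigned long translations;     /* number of translations performed */
};

/*
 * "Execute" the instruction at index pc. With cacheable == true (code in
 * RAM/ROM) a previous translation is reused. With cacheable == false
 * (code fetched through a device) the translation is never kept, so the
 * next visit pays for it again.
 */
static void toy_exec_insn(struct toy_tb_cache *c, size_t pc, bool cacheable)
{
    if (cacheable && c->translated[pc]) {
        return; /* hit: reuse the cached translation */
    }
    c->translations++;
    if (cacheable) {
        c->translated[pc] = true;
    }
}

/*
 * Run a loop body of len instructions iters times and return how many
 * translations were needed. len is clamped to TOY_MAX_INSNS.
 */
unsigned long toy_run_loop(size_t len, unsigned long iters, bool cacheable)
{
    struct toy_tb_cache c = { .translations = 0 };

    if (len > TOY_MAX_INSNS) {
        len = TOY_MAX_INSNS;
    }
    for (unsigned long i = 0; i < iters; i++) {
        for (size_t pc = 0; pc < len; pc++) {
            toy_exec_insn(&c, pc, cacheable);
        }
    }
    return c.translations;
}
```

For a 16-instruction loop run 1000 times, toy_run_loop(16, 1000, true)
returns 16 while toy_run_loop(16, 1000, false) returns 16000: the work
scales with the number of executed instructions rather than with the
size of the code.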
See some profiling below for a
run which barely reaches DRAM training in U-Boot.
Some more profiling on both ast2500 and ast2600 machines shows :
* ast2600-evb,execute-in-place=true :
Type               Object  Call site                 Wait Time (s)      Count  Average (us)
---------------------------------------------------------------------------------------------
BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365        14.21443   32909927          0.43
This is unavoidable as a HW access needs the BQL held so we will go
through this cycle every executed instruction.
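As a sanity check on the sync-profile figures, the Average column is
just the total wait divided by the number of acquisitions. A minimal
sketch, where avg_wait_us is a hypothetical helper and not part of QEMU:

```c
/*
 * Hypothetical helper, not part of QEMU: average wait per lock
 * acquisition in microseconds, given the total wait in seconds and
 * the number of acquisitions reported by "info sync-profile".
 */
double avg_wait_us(double total_wait_s, unsigned long long count)
{
    if (count == 0) {
        return 0.0; /* avoid dividing by zero for an unused lock */
    }
    return total_wait_s * 1e6 / (double)count;
}
```

With the cputlb.c:1365 rows quoted in this mail, avg_wait_us(14.21443,
32909927) is about 0.43 and avg_wait_us(0.25311, 12346237) is about
0.02, which matches the reported averages. Taken at face value, each
BQL acquisition waits roughly twenty times longer in the ast2600 run,
on top of there being more acquisitions.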
Did I miss why the flash contents are not mapped into the physical
address space? Isn't that how it appears to the processor?
There are two modes :
if (ASPEED_MACHINE(machine)->mmio_exec) {
memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
&fl->mmio, 0, size);
memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
boot_rom);
} else {
memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
size, &error_abort);
memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
boot_rom);
write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
}
The default boot mode uses the ROM. No issue.
The "execute-in-place=true" option creates an alias on the region of
the flash contents and each instruction is then fetched from the flash
drive with SPI transactions.
With old FW images, using an older U-Boot, the machine boots in a couple
of seconds. See the profiling below for a witherspoon-bmc machine using
U-Boot 2016.07.
qemu-system-arm -M witherspoon-bmc,execute-in-place=true \
-drive file=./flash-witherspoon-bmc,format=raw,if=mtd \
-drive file=./flash-witherspoon-bmc2,format=raw,if=mtd \
-nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
...
U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
Watchdog enabled
DRAM: 496 MiB
Flash: 32 MiB
In: serial
Out: serial
Err: serial
Net:
(qemu) info sync-profile
Type               Object  Call site                 Wait Time (s)      Count  Average (us)
---------------------------------------------------------------------------------------------
BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365         0.25311   12346237          0.02
condvar    0x5618970cf220  softmmu/cpus.c:423              0.05506          2      27530.78
BQL mutex  0x56189610b2e0  util/rcu.c:269                  0.04709          2      23544.26
condvar    0x561896d0fc78  util/thread-pool.c:90           0.01340         83        161.47
condvar    0x56189610b240  softmmu/cpus.c:571              0.00005          1         54.93
condvar    0x56189610b280  softmmu/cpus.c:642              0.00003          1         32.88
BQL mutex  0x56189610b2e0  util/main-loop.c:318            0.00003         34          0.76
mutex      0x561896eade00  tcg/region.c:204                0.00002        995          0.02
rec_mutex            [ 2]  util/async.c:682                0.00002        493          0.03
mutex                [ 2]  chardev/char.c:118              0.00001        404          0.03
---------------------------------------------------------------------------------------------
However, with recent U-Boot versions, it takes quite a while to reach
DRAM training: close to a minute. See the profiling below for an
ast2500-evb machine using U-Boot 2019.04.
qemu-system-arm -M ast2500-evb,execute-in-place=true \
-net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0 \
-drive file=./flash-ast2500-evb,format=raw,if=mtd \
-nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
qemu-system-arm: warning: Aspeed iBT has no chardev backend
qemu-system-arm: warning: nic ftgmac100.1 has no peer
QEMU 7.0.50 monitor - type 'help' for more information
U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)
SOC : AST2500-A1
RST : Power On
LPC Mode : SIO:Enable : SuperIO-2e
Eth : MAC0: RGMII, , MAC1: RGMII,
Model: AST2500 EVB
DRAM: 448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
MMC: sdhci_slot0@100: 0, sdhci_slot1@200: 1
Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
*** Warning - bad CRC, using default environment
In: serial@1e784000
Out: serial@1e784000
Err: serial@1e784000
Net: eth0: ethernet@1e660000
Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
, eth1: ethernet@1e680000
Hit any key to stop autoboot: 2
(qemu) info sync-profile
Type               Object  Call site                 Wait Time (s)      Count  Average (us)
---------------------------------------------------------------------------------------------
condvar    0x561f10c9ef88  util/thread-pool.c:90          10.01196         28     357570.00
BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365         0.29496   14248621          0.02
condvar    0x561f110325a0  softmmu/cpus.c:423              0.02231          2      11152.57
BQL mutex  0x561f102362e0  util/rcu.c:269                  0.01447          4       3618.60
condvar    0x561f10236240  softmmu/cpus.c:571              0.00010          1        102.19
mutex      0x561f10e9f1c0  tcg/region.c:204                0.00007       3052          0.02
mutex                [ 2]  chardev/char.c:118              0.00003       1486          0.02
condvar    0x561f10236280  softmmu/cpus.c:642              0.00003          1         29.38
BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426         0.00002        973          0.02
BQL mutex  0x561f102362e0  util/main-loop.c:318            0.00001         34          0.41
---------------------------------------------------------------------------------------------
Something in the layout of the FW is making a big difference. One
difference that could be relevant is that the recent versions use a
device tree.
There might be no good solution to this issue, but so far I have failed
to analyze it properly. Is there a way to collect information on the
usage of Translation Blocks?