From: Cédric Le Goater
Subject: Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
Date: Tue, 5 Jul 2022 14:35:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0
Hello Alex,

On 6/30/22 11:43, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:
On 6/29/22 20:24, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:
On 6/29/22 16:14, Alex Bennée wrote:
Cédric Le Goater <clg@kaod.org> writes:
On 6/24/22 18:50, Cédric Le Goater wrote:
On 6/23/22 20:43, Peter Delevoryas wrote:
On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
On 6/23/22 12:26, Peter Delevoryas wrote:

[Peter] Signed-off-by: Peter Delevoryas <pdel@fb.com>

[Cédric] Let's start simple without flash support. We should be able to load FW blobs in each CPU address space using loader devices.

[Peter] Actually, I was unable to do this, perhaps because the fb OpenBMC boot sequence is a little weird. I specifically _needed_ to have a flash device which maps the firmware in at 0x2000_0000, because the fb OpenBMC U-Boot SPL jumps to that address to start executing from flash. I think this is also why fb OpenBMC machines can be so slow.

    $ ./build/qemu-system-arm -machine fby35 \
        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
        -d int -drive file=fby35.mtd,format=raw,if=mtd

[Cédric] Ideally we should be booting from the flash device directly using the machine option '-M ast2600-evb,execute-in-place=true' like HW does. Instructions are fetched using SPI transfers. But the amount of code generated is tremendous.

[Alex] Yeah, because there is a potential race when reading from HW, so we throw away TBs after executing them: we have no way of knowing if the code has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM execution code) which cleaned up this handling.

[Cédric] See some profiling below for a run which barely reaches DRAM training in U-Boot. Some more profiling on both ast2500 and ast2600 machines shows:

* ast2600-evb,execute-in-place=true :

 Type      Object          Call site                 Wait Time (s)     Count  Average (us)
 ---------------------------------------------------------------------------------------------
 BQL mutex 0x564dc03922e0  accel/tcg/cputlb.c:1365        14.21443  32909927          0.43

[Alex] This is unavoidable as a HW access needs the BQL held, so we will go through this cycle for every executed instruction. Did I miss why the flash contents are not mapped into the physical address space? Isn't that how it appears to the processor?

[Cédric] There are two modes:

    if (ASPEED_MACHINE(machine)->mmio_exec) {
        memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
                                 &fl->mmio, 0, size);
        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                    boot_rom);
    } else {
        memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
                               size, &error_abort);
        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                    boot_rom);
        write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
    }

The default boot mode uses the ROM. No issue. The "execute-in-place=true" option creates an alias of the flash contents region, and each instruction is then fetched from the flash drive with SPI transactions.

[Alex] I wonder if we were keeping the code before?

[Cédric] With old FW images, using an older U-Boot, the machine boots in a couple of seconds. See the profiling below for a witherspoon-bmc machine using U-Boot 2016.07.

    qemu-system-arm -M witherspoon-bmc,execute-in-place=true \
        -drive file=./flash-witherspoon-bmc,format=raw,if=mtd \
        -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd \
        -nographic -nodefaults -snapshot -serial mon:stdio \
        -enable-sync-profile
    ...
U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)

Watchdog enabled
DRAM:  496 MiB
Flash: 32 MiB
In:    serial
Out:   serial
Err:   serial
Net:

(qemu) info sync-profile
 Type      Object          Call site                 Wait Time (s)     Count  Average (us)
 ---------------------------------------------------------------------------------------------
 BQL mutex 0x56189610b2e0  accel/tcg/cputlb.c:1365         0.25311  12346237          0.02
 condvar   0x5618970cf220  softmmu/cpus.c:423              0.05506         2      27530.78
 BQL mutex 0x56189610b2e0  util/rcu.c:269                  0.04709         2      23544.26
 condvar   0x561896d0fc78  util/thread-pool.c:90           0.01340        83        161.47
 condvar   0x56189610b240  softmmu/cpus.c:571              0.00005         1         54.93
 condvar   0x56189610b280  softmmu/cpus.c:642              0.00003         1         32.88
 BQL mutex 0x56189610b2e0  util/main-loop.c:318            0.00003        34          0.76
 mutex     0x561896eade00  tcg/region.c:204                0.00002       995          0.02
 rec_mutex [           2]  util/async.c:682                0.00002       493          0.03
 mutex     [           2]  chardev/char.c:118              0.00001       404          0.03
 ---------------------------------------------------------------------------------------------

However, with recent U-Boot versions, it takes quite a while to reach DRAM training, close to a minute. See the profiling below for an ast2500-evb machine using U-Boot 2019.04.

    qemu-system-arm -M ast2500-evb,execute-in-place=true \
        -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0 \
        -drive file=./flash-ast2500-evb,format=raw,if=mtd \
        -nographic -nodefaults -snapshot -serial mon:stdio \
        -enable-sync-profile

qemu-system-arm: warning: Aspeed iBT has no chardev backend
qemu-system-arm: warning: nic ftgmac100.1 has no peer
QEMU 7.0.50 monitor - type 'help' for more information

U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)

SOC : AST2500-A1
RST : Power On
LPC Mode : SIO:Enable : SuperIO-2e
Eth : MAC0: RGMII, , MAC1: RGMII,
Model: AST2500 EVB
DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
*** Warning - bad CRC, using default environment

In:    serial@1e784000
Out:   serial@1e784000
Err:   serial@1e784000
Net:   eth0: ethernet@1e660000
Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
, eth1: ethernet@1e680000
Hit any key to stop autoboot:  2

(qemu) info sync-profile
 Type      Object          Call site                 Wait Time (s)     Count  Average (us)
 ---------------------------------------------------------------------------------------------
 condvar   0x561f10c9ef88  util/thread-pool.c:90          10.01196        28     357570.00
 BQL mutex 0x561f102362e0  accel/tcg/cputlb.c:1365         0.29496  14248621          0.02
 condvar   0x561f110325a0  softmmu/cpus.c:423              0.02231         2      11152.57
 BQL mutex 0x561f102362e0  util/rcu.c:269                  0.01447         4       3618.60
 condvar   0x561f10236240  softmmu/cpus.c:571              0.00010         1        102.19
 mutex     0x561f10e9f1c0  tcg/region.c:204                0.00007      3052          0.02
 mutex     [           2]  chardev/char.c:118              0.00003      1486          0.02
 condvar   0x561f10236280  softmmu/cpus.c:642              0.00003         1         29.38
 BQL mutex 0x561f102362e0  accel/tcg/cputlb.c:1426         0.00002       973          0.02
 BQL mutex 0x561f102362e0  util/main-loop.c:318            0.00001        34          0.41
 ---------------------------------------------------------------------------------------------

Something in the layout of the FW is making a big difference. One difference that could be relevant is that the recent versions use a device tree. There might be no good solution to this issue, but I have not managed to analyze it correctly.
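[Editorial aside: the tables above come from QEMU's synchronization profiler (util/qsp.c), which times each lock acquisition and aggregates the wait per lock object and call site. A minimal standalone sketch of that idea, with all names invented (profiled_lock, site_account) and none of QEMU's actual plumbing:]

    /* qsp_sketch.c - per-call-site lock wait profiling, in the spirit of
     * QEMU's util/qsp.c. Build: cc -O2 -pthread qsp_sketch.c -o qsp_sketch */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define MAX_SITES 16

    struct call_site {
        const char *file;
        int line;
        double wait_s;      /* total time spent acquiring the lock */
        long count;         /* number of acquisitions from this site */
    };

    static struct call_site sites[MAX_SITES];
    static int nr_sites;
    static pthread_mutex_t sites_lock = PTHREAD_MUTEX_INITIALIZER;

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Record one acquisition against its call site. (QEMU avoids this
     * extra global lock with cleverer per-thread accounting.) */
    static void site_account(const char *file, int line, double waited)
    {
        pthread_mutex_lock(&sites_lock);
        int i;
        for (i = 0; i < nr_sites; i++) {
            if (sites[i].line == line && !strcmp(sites[i].file, file)) {
                break;
            }
        }
        if (i == nr_sites) {
            if (nr_sites == MAX_SITES) {
                pthread_mutex_unlock(&sites_lock);
                return;     /* table full: drop the sample */
            }
            sites[nr_sites].file = file;
            sites[nr_sites].line = line;
            nr_sites++;
        }
        sites[i].wait_s += waited;
        sites[i].count++;
        pthread_mutex_unlock(&sites_lock);
    }

    /* The trick: a macro captures the caller's file/line, much as
     * qemu_mutex_lock() passes __FILE__/__LINE__ down in QEMU. */
    #define profiled_lock(m) profiled_lock_impl(m, __FILE__, __LINE__)

    static void profiled_lock_impl(pthread_mutex_t *m, const char *file, int line)
    {
        double t0 = now_s();
        pthread_mutex_lock(m);
        site_account(file, line, now_s() - t0);
    }

    static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;

    /* Simulate a vCPU that takes the "BQL" once per instruction,
     * which is what execute-in-place effectively does. */
    static void *vcpu_thread(void *arg)
    {
        for (long i = 0; i < 1000000; i++) {
            profiled_lock(&bql);            /* <- the hot call site */
            pthread_mutex_unlock(&bql);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, vcpu_thread, NULL);
        pthread_create(&t2, NULL, vcpu_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("%-24s %14s %10s\n", "Call site", "Wait Time (s)", "Count");
        for (int i = 0; i < nr_sites; i++) {
            char site[64];
            snprintf(site, sizeof(site), "%s:%d", sites[i].file, sites[i].line);
            printf("%-24s %14.5f %10ld\n", site, sites[i].wait_s, sites[i].count);
        }
        return 0;
    }

[Run with two threads hammering one mutex, it prints a table in the same spirit as info sync-profile, which is why accel/tcg/cputlb.c:1365 dominates the counts in the XIP runs above.]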
Is there a way to collect information on the usage of Translation Blocks?

[Alex] You could expand the data we collect in tb_tree_stats and expose it via info jit.

[Cédric] The "fast" run, U-Boot 2016.07, gives:

Translation buffer state:
gen code size       254880371/1073736704
TB count            1089
TB avg target size  16 max=356 bytes
TB avg host size    278 bytes (expansion ratio: 17.2)
cross page TB count 0 (0%)
direct jump count   501 (46%) (2 jumps=372 34%)
TB hash buckets     1025/8192 (12.51% head buckets used)
TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█ ▁  ▁ ▁|[67.5,75.0]%
TB hash avg chain   1.000 buckets. Histogram: 1|█|1

Statistics:
TB flush count      0
TB invalidate count 0
TLB full flushes    0
TLB partial flushes 2
TLB elided flushes  2338
JIT cycles          2221788409 (0.926 s at 2.4 GHz)
translated TBs      738520 (aborted=0 0.0%)
avg ops/TB          15.7 max=459
deleted ops/TB      2.72
avg temps/TB        32.89 max=88
avg host code/TB    113.7
avg search data/TB  5.2
cycles/op           192.0
cycles/in byte      748.7
cycles/out byte     26.4
cycles/search byte  582.8
gen_interm time     57.6%
gen_code time       42.4%
optim./code time    19.4%
liveness/code time  26.1%
cpu_restore count   0
avg cycles          0.0

and the "slow" one, U-Boot 2019.04:

Translation buffer state:
gen code size       368603795/1073736704
TB count            3052
TB avg target size  16 max=360 bytes
TB avg host size    293 bytes (expansion ratio: 17.6)
cross page TB count 0 (0%)
direct jump count   1431 (46%) (2 jumps=1104 36%)
TB hash buckets     2559/8192 (31.24% head buckets used)
TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁  ▁ ▁|[90,100]%
TB hash avg chain   1.000 buckets. Histogram: 1|█|1

Statistics:
TB flush count      3

[Alex] Something triggers 3 complete TB flushes in the slow run (3x1000 ~ 3000 TBs). This could be the translation buffer overflowing, but:

  fast: gen code size 254880371/1073736704
  slow: gen code size 368603795/1073736704

doesn't look like we are getting close to overflowing. Must be something else.
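[Editorial aside: tb_tree_stats lives in accel/tcg/translate-all.c, where dump_exec_info() walks the resident TBs and prints the "Translation buffer state" block above. A rough sketch of the kind of extension Alex suggests, based on the QEMU 7.0-era code; this is not a tested patch, the field names are approximate, and the new counter is purely hypothetical:]

    /* accel/tcg/translate-all.c (sketch) */
    struct tb_tree_stats {
        size_t nb_tbs;
        size_t host_size;
        size_t target_size;
        size_t max_target_size;
        size_t direct_jmp_count;
        size_t direct_jmp2_count;
        size_t cross_page;
        size_t mmio_tbs;   /* hypothetical: TBs not backed by a RAM page */
    };

    static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
    {
        const TranslationBlock *tb = value;
        struct tb_tree_stats *tst = data;

        tst->nb_tbs++;
        tst->host_size += tb->tc.size;
        tst->target_size += tb->size;
        if (tb->size > tst->max_target_size) {
            tst->max_target_size = tb->size;
        }
        /* Hypothetical extra statistic: count TBs whose guest code is
         * not RAM-backed, i.e. the one-shot TBs created when executing
         * from MMIO. Whether any are still resident when this walk runs
         * depends on how one-shot TBs are managed; the point is only to
         * show where a new counter would be collected. */
        if (tb->page_addr[0] == (tb_page_addr_t)-1) {
            tst->mmio_tbs++;
        }
        return false;
    }

    /* ...and in dump_exec_info(), printed alongside the existing lines: */
    qemu_printf("MMIO (one-shot) TB count  %zu\n", tst.mmio_tbs);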
[Cédric] Here are images to reproduce the issue. These machines use the same AST2500 SoC.

* U-Boot 2016.07

    wget https://github.com/openbmc/openbmc/releases/download/2.9.0/obmc-phosphor-image-romulus.static.mtd
    qemu-system-arm -M romulus-bmc -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic
    qemu-system-arm -M romulus-bmc,execute-in-place=true -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic

* U-Boot 2019.04

    https://github.com/legoater/qemu-aspeed-boot/raw/master/images/ast2500-evb/buildroot-2022.05/flash.img
    qemu-system-arm -M ast2500-evb -drive file=./flash.img,format=raw,if=mtd -nographic
    qemu-system-arm -M ast2500-evb,execute-in-place=true -drive file=./flash.img,format=raw,if=mtd -nographic

Thanks,

C.
TB invalidate count 0
TLB full flushes    0
TLB partial flushes 3
TLB elided flushes  2367
JIT cycles          26479044772 (11.033 s at 2.4 GHz)
translated TBs      10552169 (aborted=0 0.0%)
avg ops/TB          15.0 max=464
deleted ops/TB      2.44
avg temps/TB        32.43 max=89
avg host code/TB    99.0
avg search data/TB  5.0
cycles/op           167.7
cycles/in byte      626.8
cycles/out byte     25.4
cycles/search byte  499.4
gen_interm time     50.4%
gen_code time       49.6%
optim./code time    19.5%
liveness/code time  27.7%
cpu_restore count   0
avg cycles          0.0

A lot more TBs.

C.
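[Editorial aside: in both runs the "translated TBs" figure dwarfs the resident "TB count" (738520 vs 1089 fast, 10552169 vs 3052 slow) because, with execute-in-place, code fetched from MMIO is retranslated on every execution, exactly as Alex described for 873d64ac30: QEMU cannot cheaply tell whether the backing device contents changed, so those TBs are thrown away after running. A toy model of that policy, plain C with invented names and made-up loop sizes, shows how the translation count explodes:]

    /* oneshot_tb.c - toy model of cached vs one-shot translation.
     * Build: cc -O2 oneshot_tb.c -o oneshot_tb */
    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_SIZE 4096

    static bool cached[CACHE_SIZE];
    static long translations;

    /* "Translate" the block at pc. RAM-backed code may be cached;
     * MMIO-backed code (e.g. fetched over SPI) is translated, run,
     * and thrown away every time. */
    static void execute_block(unsigned pc, bool ram_backed)
    {
        if (ram_backed && cached[pc % CACHE_SIZE]) {
            return;                 /* cache hit: no retranslation */
        }
        translations++;             /* stands in for tb_gen_code() */
        if (ram_backed) {
            cached[pc % CACHE_SIZE] = true;
        }
    }

    static long run(bool ram_backed)
    {
        translations = 0;
        for (int i = 0; i < CACHE_SIZE; i++) {
            cached[i] = false;
        }
        /* A 16-block hot loop executed a million times, roughly what a
         * firmware polling loop looks like to the translator. */
        for (long iter = 0; iter < 1000000; iter++) {
            for (unsigned pc = 0; pc < 16; pc++) {
                execute_block(pc, ram_backed);
            }
        }
        return translations;
    }

    int main(void)
    {
        printf("RAM-backed (boot_rom copy):     %ld translations\n", run(true));
        printf("MMIO-backed (execute-in-place): %ld translations\n", run(false));
        return 0;
    }

[The RAM-backed run translates each of the 16 blocks once; the MMIO-backed run translates 16 million times. That is the same shape as the resident-vs-translated gap in the info jit numbers above.]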