Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
From: Alex Bennée
Subject: Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
Date: Thu, 30 Jun 2022 10:43:31 +0100
User-agent: mu4e 1.7.27; emacs 28.1.50
Cédric Le Goater <clg@kaod.org> writes:
> On 6/29/22 20:24, Alex Bennée wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>>
>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>>
>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>>
>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>
>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>>
>>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>> -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>> -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>> Ideally we should be booting from the flash device directly using
>>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>>> of code generated is tremendous.
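(For reference, that means booting with something like:

  qemu-system-arm -M ast2600-evb,execute-in-place=true \
      -drive file=flash.mtd,format=raw,if=mtd -nographic

where flash.mtd is a stand-in for your firmware image; instruction
fetches then go through the SPI flash model instead of a RAM/ROM copy.)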
>>>> Yeah, because there is a potential race when reading from HW we throw
>>>> away TBs after executing them: we have no way of knowing if the
>>>> underlying contents have changed under our feet. See 873d64ac30
>>>> (accel/tcg: re-factor non-RAM execution code) which cleaned up this
>>>> handling.
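To expand on that: for code executing from somewhere that isn't RAM we
generate throw-away single-instruction TBs. From memory the relevant
logic is roughly this (a simplified sketch, not the exact code;
"translate()" stands in for the real generation path):

  /* sketch: tb_gen_code() when the PC is not backed by RAM */
  phys_pc = get_page_addr_code(env, pc);   /* returns -1 for MMIO */
  if (phys_pc == -1) {
      /* one-shot TB: a single guest instruction, with I/O allowed */
      cflags = (cflags & ~CF_COUNT_MASK) | CF_LAST_IO | 1;
  }
  tb = translate(cpu, pc, cflags);
  if (phys_pc == -1) {
      /* never added to the lookup tables, so it is dropped after use */
      return tb;
  }

So every instruction fetched over SPI is translated, run once and thrown
away, which is why the volume of generated code is so large.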
>>>>
>>>>>> See some profiling below for a
>>>>>> run which barely reaches DRAM training in U-Boot.
>>>>>
>>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>>
>>>>>
>>>>> * ast2600-evb,execute-in-place=true :
>>>>>
>>>>> Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>>>> ---------------------------------------------------------------------------------------------
>>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443  32909927          0.43
>>>> This is unavoidable, as a HW access needs the BQL held, so we go
>>>> through this lock/unlock cycle for every executed instruction.
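For context, the call site in that profile (accel/tcg/cputlb.c:1365) is
the slow-path MMIO load; roughly, from memory (a simplified sketch, not
the exact code):

  /* sketch of io_readx(), the slow path for loads that hit MMIO */
  bool locked = false;
  MemTxResult r;

  if (!qemu_mutex_iothread_locked()) {
      qemu_mutex_lock_iothread();   /* the BQL wait measured above */
      locked = true;
  }
  r = memory_region_dispatch_read(mr, mr_offset, &val, op, attrs);
  if (locked) {
      qemu_mutex_unlock_iothread();
  }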
>>>> Did I miss why the flash contents are not mapped into the physical
>>>> address space? Isn't that how it appears to the processor?
>>>
>>>
>>> There are two modes:
>>>
>>>     if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>         /* execute-in-place: alias straight onto the flash
>>>          * controller's MMIO region, so fetches become SPI
>>>          * transactions */
>>>         memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>                                  &fl->mmio, 0, size);
>>>         memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                     boot_rom);
>>>     } else {
>>>         /* default: copy the flash contents into a ROM region */
>>>         memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>                                size, &error_abort);
>>>         memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                     boot_rom);
>>>         write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>     }
>>>
>>> The default boot mode uses the ROM. No issue.
>>>
>>> The "execute-in-place=true" option creates an alias on the region of
>>> the flash contents and each instruction is then fetched from the flash
>>> drive with SPI transactions.
I wonder if we were keeping the code before?
>>>
>>> With old FW images, using an older U-boot, the machine boots in a couple
>>> of seconds. See the profiling below for a witherspoon-bmc machine using
>>> U-Boot 2016.07.
>>>
>>> qemu-system-arm -M witherspoon-bmc,execute-in-place=true \
>>>     -drive file=./flash-witherspoon-bmc,format=raw,if=mtd \
>>>     -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd \
>>>     -nographic -nodefaults -snapshot -serial mon:stdio \
>>>     -enable-sync-profile
>>> ...
>>> U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>> Watchdog enabled
>>> DRAM: 496 MiB
>>> Flash: 32 MiB
>>> In: serial
>>> Out: serial
>>> Err: serial
>>> Net:
>>> (qemu) info sync-profile
>>> Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>> ---------------------------------------------------------------------------------------------
>>> BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311  12346237          0.02
>>> condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506         2      27530.78
>>> BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709         2      23544.26
>>> condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340        83        161.47
>>> condvar    0x56189610b240  softmmu/cpus.c:571             0.00005         1         54.93
>>> condvar    0x56189610b280  softmmu/cpus.c:642             0.00003         1         32.88
>>> BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003        34          0.76
>>> mutex      0x561896eade00  tcg/region.c:204               0.00002       995          0.02
>>> rec_mutex  [ 2]            util/async.c:682               0.00002       493          0.03
>>> mutex      [ 2]            chardev/char.c:118             0.00001       404          0.03
>>> ---------------------------------------------------------------------------------------------
>>>
>>>
>>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>>> Close to a minute. See the profiling below for an ast2500-evb machine using
>>> U-Boot 2019.04.
>>>
>>> qemu-system-arm -M ast2500-evb,execute-in-place=true \
>>>     -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0 \
>>>     -drive file=./flash-ast2500-evb,format=raw,if=mtd \
>>>     -nographic -nodefaults -snapshot -serial mon:stdio \
>>>     -enable-sync-profile
>>> qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>> qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>> QEMU 7.0.50 monitor - type 'help' for more information
>>> U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)
>>> SOC : AST2500-A1
>>> RST : Power On
>>> LPC Mode : SIO:Enable : SuperIO-2e
>>> Eth : MAC0: RGMII, , MAC1: RGMII,
>>> Model: AST2500 EVB
>>> DRAM: 448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>> MMC: sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>> Loading Environment from SPI Flash... SF: Detected mx25l25635e with
>>> page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>> *** Warning - bad CRC, using default environment
>>> In: serial@1e784000
>>> Out: serial@1e784000
>>> Err: serial@1e784000
>>> Net: eth0: ethernet@1e660000
>>> Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>>> , eth1: ethernet@1e680000
>>> Hit any key to stop autoboot: 2
>>> (qemu) info sync-profile
>>> Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>> ---------------------------------------------------------------------------------------------
>>> condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196        28     357570.00
>>> BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496  14248621          0.02
>>> condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231         2      11152.57
>>> BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447         4       3618.60
>>> condvar    0x561f10236240  softmmu/cpus.c:571             0.00010         1        102.19
>>> mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007      3052          0.02
>>> mutex      [ 2]            chardev/char.c:118             0.00003      1486          0.02
>>> condvar    0x561f10236280  softmmu/cpus.c:642             0.00003         1         29.38
>>> BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002       973          0.02
>>> BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001        34          0.41
>>> ---------------------------------------------------------------------------------------------
>>> Something in the layout of the FW is making a big difference. One
>>> difference that could be relevant is that the recent versions use a
>>> device tree.
>>>
>>> There might be no good solution to this issue, but I am failing to
>>> analyze it correctly. Is there a way to collect information on
>>> Translation Block usage?
>> You could expand the data we collect in tb_tree_stats and expose it
>> via "info jit".
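Something like this, against the current translate-all.c (the
max_host_size field is just an illustration of the kind of counter you
could add):

  struct tb_tree_stats {
      size_t nb_tbs;
      size_t host_size;
      size_t target_size;
      size_t max_target_size;
      size_t max_host_size;    /* illustrative new counter */
      /* ... */
  };

  static gboolean tb_tree_stats_iter(gpointer key, gpointer value,
                                     gpointer data)
  {
      const TranslationBlock *tb = value;
      struct tb_tree_stats *tst = data;

      tst->nb_tbs++;
      tst->host_size += tb->tc.size;
      tst->target_size += tb->size;
      tst->max_host_size = MAX(tst->max_host_size, tb->tc.size); /* new */
      /* ... */
      return false;
  }

dump_exec_info() walks the TB tree with this iterator, so anything you
accumulate there comes out in "info jit".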
>
> The "fast" run, U-Boot 2016.07, gives :
>
>
> Translation buffer state:
> gen code size 254880371/1073736704
> TB count 1089
> TB avg target size 16 max=356 bytes
> TB avg host size 278 bytes (expansion ratio: 17.2)
> cross page TB count 0 (0%)
> direct jump count 501 (46%) (2 jumps=372 34%)
> TB hash buckets 1025/8192 (12.51% head buckets used)
> TB hash occupancy 3.32% avg chain occ. Histogram: [0.0,7.5)%|█ ▁ ▁ ▁|[67.5,75.0]%
> TB hash avg chain 1.000 buckets. Histogram: 1|█|1
> Statistics:
> TB flush count 0
> TB invalidate count 0
> TLB full flushes 0
> TLB partial flushes 2
> TLB elided flushes 2338
> JIT cycles 2221788409 (0.926 s at 2.4 GHz)
> translated TBs 738520 (aborted=0 0.0%)
> avg ops/TB 15.7 max=459
> deleted ops/TB 2.72
> avg temps/TB 32.89 max=88
> avg host code/TB 113.7
> avg search data/TB 5.2
> cycles/op 192.0
> cycles/in byte 748.7
> cycles/out byte 26.4
> cycles/search byte 582.8
> gen_interm time 57.6%
> gen_code time 42.4%
> optim./code time 19.4%
> liveness/code time 26.1%
> cpu_restore count 0
> avg cycles 0.0
>
> and the "slow" run, U-Boot 2019.04, gives:
>
> Translation buffer state:
> gen code size 368603795/1073736704
> TB count 3052
> TB avg target size 16 max=360 bytes
> TB avg host size 293 bytes (expansion ratio: 17.6)
> cross page TB count 0 (0%)
> direct jump count 1431 (46%) (2 jumps=1104 36%)
> TB hash buckets 2559/8192 (31.24% head buckets used)
> TB hash occupancy 9.31% avg chain occ. Histogram: [0,10)%|█ ▃ ▁ ▁ ▁|[90,100]%
> TB hash avg chain 1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count 3
Something triggers 3 complete TB flushes in the slow run (3 x ~1000 ≈
3000 TBs). This could be the translation buffer overflowing, but:

  fast: gen code size 254880371/1073736704
  slow: gen code size 368603795/1073736704

it doesn't look like we are getting close to overflowing. Must be
something else.
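(For reference, the usual overflow trigger is in tb_gen_code(); roughly,
from memory:

  /* sketch: in tb_gen_code() */
  tb = tcg_tb_alloc(tcg_ctx);
  if (unlikely(!tb)) {
      /* code_gen_buffer is full: flush all cached TBs and retry */
      tb_flush(cpu);
      cpu->exception_index = EXCP_INTERRUPT;
      cpu_loop_exit(cpu);
  }

but tb_flush() can also be requested explicitly, e.g. by the gdbstub, so
a full buffer isn't the only possible cause.)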
> TB invalidate count 0
> TLB full flushes 0
> TLB partial flushes 3
> TLB elided flushes 2367
> JIT cycles 26479044772 (11.033 s at 2.4 GHz)
> translated TBs 10552169 (aborted=0 0.0%)
> avg ops/TB 15.0 max=464
> deleted ops/TB 2.44
> avg temps/TB 32.43 max=89
> avg host code/TB 99.0
> avg search data/TB 5.0
> cycles/op 167.7
> cycles/in byte 626.8
> cycles/out byte 25.4
> cycles/search byte 499.4
> gen_interm time 50.4%
> gen_code time 49.6%
> optim./code time 19.5%
> liveness/code time 27.7%
> cpu_restore count 0
> avg cycles 0.0
>
>
> A lot more TBs.
> C.
--
Alex Bennée