qemu-arm
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public


From: Alex Bennée
Subject: Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
Date: Thu, 30 Jun 2022 10:43:31 +0100
User-agent: mu4e 1.7.27; emacs 28.1.50

Cédric Le Goater <clg@kaod.org> writes:

> On 6/29/22 20:24, Alex Bennée wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>> 
>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>>
>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>>
>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>
>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>>
>>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>>        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>>        -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>> Ideally we should be booting from the flash device directly using
>>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>>> of code generated is tremendous.
>>>> Yeah because there is a potential race when reading from HW so we
>>>> throw
>>>> away TB's after executing them because we have no way of knowing if it
>>>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>>>> execution code) which cleaned up this handling.
>>>>
>>>>>> See some profiling below for a
>>>>>> run which barely reaches DRAM training in U-Boot.
>>>>>
>>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>>
>>>>>
>>>>> * ast2600-evb,execute-in-place=true :
>>>>>
>>>>> Type               Object  Call site                Wait Time (s)         
>>>>> Count  Average (us)
>>>>> ---------------------------------------------------------------------------------------------
>>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>>>> 32909927          0.43
>>>> This is unavoidable as a HW access needs the BQL held so we will go
>>>> through this cycle every executed instruction.
>>>> Did I miss why the flash contents are not mapped into the physical
>>>> address space? Isn't that how it appear to the processor?
>>>
>>>
>>> There are two modes :
>>>            if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>              memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>                                       &fl->mmio, 0, size);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>          } else {
>>>              memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>                                     size, &error_abort);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>              write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>          }
>>>
>>> The default boot mode uses the ROM. No issue.
>>>
>>> The "execute-in-place=true" option creates an alias on the region of
>>> the flash contents and each instruction is then fetched from the flash
>>> drive with SPI transactions.

I wonder if we were keeping the code before?

>>>
>>> With old FW images, using an older U-boot, the machine boots in a couple
>>> of seconds. See the profiling below for a witherspoon-bmc machine using
>>> U-Boot 2016.07.
>>>
>>>    qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive 
>>> file=./flash-witherspoon-bmc,format=raw,if=mtd -drive 
>>> file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults 
>>> -snapshot -serial mon:stdio -enable-sync-profile
>>>    ...
>>>    U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>>              Watchdog enabled
>>>    DRAM:  496 MiB
>>>    Flash: 32 MiB
>>>    In:    serial
>>>    Out:   serial
>>>    Err:   serial
>>>    Net:
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)        
>>>  Count  Average (us)
>>>    
>>> ---------------------------------------------------------------------------------------------
>>>    BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      
>>> 12346237          0.02
>>>    condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506        
>>>      2      27530.78
>>>    BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709        
>>>      2      23544.26
>>>    condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340        
>>>     83        161.47
>>>    condvar    0x56189610b240  softmmu/cpus.c:571             0.00005        
>>>      1         54.93
>>>    condvar    0x56189610b280  softmmu/cpus.c:642             0.00003        
>>>      1         32.88
>>>    BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003        
>>>     34          0.76
>>>    mutex      0x561896eade00  tcg/region.c:204               0.00002        
>>>    995          0.02
>>>    rec_mutex  [           2]  util/async.c:682               0.00002        
>>>    493          0.03
>>>    mutex      [           2]  chardev/char.c:118             0.00001        
>>>    404          0.03
>>>    
>>> ---------------------------------------------------------------------------------------------
>>>
>>>
>>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>>> Close to a minute. See the profiling below for an ast2500-evb machine using
>>> U-Boot 2019.04.
>>>
>>>   qemu-system-arm -M ast2500-evb,execute-in-place=true  -net 
>>> nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive 
>>> file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults 
>>> -snapshot -serial mon:stdio  -enable-sync-profile
>>>    qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>>    qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>>    QEMU 7.0.50 monitor - type 'help' for more information
>>>       U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>>>      +0200)
>>>       SOC : AST2500-A1
>>>    RST : Power On
>>>    LPC Mode : SIO:Enable : SuperIO-2e
>>>    Eth : MAC0: RGMII, , MAC1: RGMII,
>>>    Model: AST2500 EVB
>>>    DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>>    MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>>    Loading Environment from SPI Flash... SF: Detected mx25l25635e with page 
>>> size 256 Bytes, erase size 64 KiB, total 32 MiB
>>>    *** Warning - bad CRC, using default environment
>>>       In:    serial@1e784000
>>>    Out:   serial@1e784000
>>>    Err:   serial@1e784000
>>>    Net:   eth0: ethernet@1e660000
>>>    Warning: ethernet@1e680000 (eth1) using random MAC address - 
>>> 4a:e5:9a:4a:c7:c5
>>>    , eth1: ethernet@1e680000
>>>    Hit any key to stop autoboot:  2
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)        
>>>  Count  Average (us)
>>>    
>>> ---------------------------------------------------------------------------------------------
>>>    condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196        
>>>     28     357570.00
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      
>>> 14248621          0.02
>>>    condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231        
>>>      2      11152.57
>>>    BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447        
>>>      4       3618.60
>>>    condvar    0x561f10236240  softmmu/cpus.c:571             0.00010        
>>>      1        102.19
>>>    mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007        
>>>   3052          0.02
>>>    mutex      [           2]  chardev/char.c:118             0.00003        
>>>   1486          0.02
>>>    condvar    0x561f10236280  softmmu/cpus.c:642             0.00003        
>>>      1         29.38
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002        
>>>    973          0.02
>>>    BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001        
>>>     34          0.41
>>>    
>>> ---------------------------------------------------------------------------------------------
>>>    Something in the layout of the FW is making a big difference. One
>>> that could be relevant is that the recent versions are using a device
>>> tree.
>>>
>>> There might be no good solution to this issue but I fail to analyze
>>> it correctly. Is there a way to collect information on the usage of
>>> Translation Blocks ?
>> You could expand the data we collect in tb_tree_stats and expose it
>> via
>> info jit.
>
> The "fast" run, U-Boot 2016.07, gives :
>
>
>   Translation buffer state:
>   gen code size       254880371/1073736704
>   TB count            1089
>   TB avg target size  16 max=356 bytes
>   TB avg host size    278 bytes (expansion ratio: 17.2)
>   cross page TB count 0 (0%)
>   direct jump count   501 (46%) (2 jumps=372 34%)
>   TB hash buckets     1025/8192 (12.51% head buckets used)
>   TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  
> ▁|[67.5,75.0]%
>   TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>      Statistics:
>   TB flush count      0
>   TB invalidate count 0
>   TLB full flushes    0
>   TLB partial flushes 2
>   TLB elided flushes  2338
>   JIT cycles          2221788409 (0.926 s at 2.4 GHz)
>   translated TBs      738520 (aborted=0 0.0%)
>   avg ops/TB          15.7 max=459
>   deleted ops/TB      2.72
>   avg temps/TB        32.89 max=88
>   avg host code/TB    113.7
>   avg search data/TB  5.2
>   cycles/op           192.0
>   cycles/in byte      748.7
>   cycles/out byte     26.4
>   cycles/search byte     582.8
>     gen_interm time   57.6%
>     gen_code time     42.4%
>   optim./code time    19.4%
>   liveness/code time  26.1%
>   cpu_restore count   0
>     avg cycles        0.0
>   and the "slow", U-Boot 2019.04 :
>
> Translation buffer state:
> gen code size       368603795/1073736704
> TB count            3052
> TB avg target size  16 max=360 bytes
> TB avg host size    293 bytes (expansion ratio: 17.6)
> cross page TB count 0 (0%)
> direct jump count   1431 (46%) (2 jumps=1104 36%)
> TB hash buckets     2559/8192 (31.24% head buckets used)
> TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ 
> ▁|[90,100]%
> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count      3

Something triggers 3 complete TB flushes in the slow run (3x1000 ~ 3000
TBs). This can be the TB overflowing but:

fast: gen code size       254880371/1073736704
slow: gen code size       368603795/1073736704

doesn't look like we are getting close to overflowing. Must be something
else.

> TB invalidate count 0
> TLB full flushes    0
> TLB partial flushes 3
> TLB elided flushes  2367
> JIT cycles          26479044772 (11.033 s at 2.4 GHz)
> translated TBs      10552169 (aborted=0 0.0%)
> avg ops/TB          15.0 max=464
> deleted ops/TB      2.44
> avg temps/TB        32.43 max=89
> avg host code/TB    99.0
> avg search data/TB  5.0
> cycles/op           167.7
> cycles/in byte      626.8
> cycles/out byte     25.4
> cycles/search byte     499.4
>   gen_interm time   50.4%
>   gen_code time     49.6%
> optim./code time    19.5%
> liveness/code time  27.7%
> cpu_restore count   0
>   avg cycles        0.0
>
>
> A lot more TBs.
>   C.


-- 
Alex Bennée



reply via email to

[Prev in Thread] Current Thread [Next in Thread]