Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?
From: Alex Bennée
Subject: Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?
Date: Fri, 12 Nov 2021 09:49:28 +0000
User-agent: mu4e 1.7.4; emacs 28.0.60
Garrick Toubassi <gtoubassi@gmail.com> writes:
> I went ahead and created a short repro case which can be found at
> https://github.com/gtoubassi/qemu-spinrepro. Would appreciate
> thoughts from anyone or guidance on how to debug.
Well, something weird is going on that is chewing through the code
generation logic. If you run with:
./qemu-system-x86_64 -serial mon:stdio -kernel ~/Downloads/kernel.img
and then press C-a c to bring up the monitor, you can type "info jit" and see:
(qemu) info jit
Translation buffer state:
gen code size 1063758051/1073736704
TB count 1
TB avg target size 1 max=1 bytes
TB avg host size 64 bytes (expansion ratio: 64.0)
cross page TB count 0 (0%)
direct jump count 0 (0%) (2 jumps=0 0%)
TB hash buckets 1/8192 (0.01% head buckets used)
TB hash occupancy 0.00% avg chain occ. Histogram: [0.0,2.5)%|█ ▁|[22.5,25.0]%
TB hash avg chain 1.000 buckets. Histogram: 1|█|1
Statistics:
TB flush count 1
TB invalidate count 237
TLB full flushes 0
TLB partial flushes 514
TLB elided flushes 1748
[TCG profiler not compiled]
The gen code size just grows and grows until an eventual flush, so it's
spending most of its time in the code generator. Weirdly, it has only
translated one block but is happily flushing the code buffer quite
frequently:
Statistics:
TB flush count 14
TB invalidate count 237
TLB full flushes 0
TLB partial flushes 514
TLB elided flushes 1748
[TCG profiler not compiled]
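
To compare successive "info jit" dumps without eyeballing them, something
like the following could pull out the relevant counters. This is only a
sketch: the function name parse_jit_stats and the dictionary keys are made up
for illustration, and the regexes assume the text layout shown above, which
may differ between QEMU versions.

```python
import re


def parse_jit_stats(text):
    """Extract a few counters from pasted "info jit" monitor output.

    Hypothetical helper: it only knows the line formats quoted in this
    thread and silently skips anything it does not recognise.
    """
    stats = {}

    # "gen code size <used>/<total>" is how full the code buffer is.
    m = re.search(r"gen code size\s+(\d+)/(\d+)", text)
    if m:
        used, total = int(m.group(1)), int(m.group(2))
        stats["gen_code_used"] = used
        stats["gen_code_total"] = total
        stats["gen_code_fill"] = used / total

    # Simple "<label> <integer>" counters.
    counters = (
        ("tb_count", r"TB count\s+(\d+)"),
        ("tb_flush_count", r"TB flush count\s+(\d+)"),
        ("tb_invalidate_count", r"TB invalidate count\s+(\d+)"),
    )
    for key, pattern in counters:
        m = re.search(pattern, text)
        if m:
            stats[key] = int(m.group(1))

    return stats


# Lines copied from the first dump above.
sample = """\
gen code size 1063758051/1073736704
TB count 1
TB flush count 1
TB invalidate count 237
"""

stats = parse_jit_stats(sample)
print("buffer fill: {:.1%}".format(stats["gen_code_fill"]))
print("TB count:", stats["tb_count"], "flushes:", stats["tb_flush_count"])
```

For the first dump this reports the code buffer as roughly 99% full with a
single TB. Running it on dumps taken a few seconds apart makes the pattern
easy to see: the flush count climbs while the TB count stays at 1.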
>
> On Tue, Oct 19, 2021 at 3:05 PM Garrick Toubassi <gtoubassi@gmail.com> wrote:
>
> Hello
>
> I have a mystery I haven't been able to run down and would appreciate any
> explanation or advice.
>
> On a mac/intel I am running qemu-system-x86_64 on a simple image which
> bootstraps into 64 bit long mode and then runs a simple
> spin loop (literally for (int i = 0; i < 10000000; i++) {}). This completes
> in ~5 seconds of wall time. After completion it then enters
> user mode (CPL=3) via a fabricated interrupt stack frame and an iretq,
> returning to the same spin loop. In this case it runs about
> 100x faster.
>
> I at first thought maybe the TCG jit somehow isn't kicking in and maybe
> there is some pure interpretation going on but I've run with
> "-trace exec_tb -trace translate_block -d
> out_asm,guest_errors,nochain,int,plugin" and it seems to be running
> "translation blocks", just
> a lot more of them when running the slow loop (or to be more precise running
> one tb many more times according to exec_tb
> logging). Upon inspection the relevant generated assembly is morally
> equivalent between the two as best I can tell. Which implies to
> me it's something outside of the tb. I was thinking perhaps it's regenerating
> the code every time, but logging doesn't show that.
>
> I also was wondering if something about the MMU implementation might slow
> things down when in user mode? In this case both
> loops are running under the same GDT/page table which just happens to mark
> all pages as "user" pages so that when jumping to
> CPL=3 it will still run.
>
> I can package up a reproducible case if it's helpful but wanted to see if
> there is something obvious I am missing in terms of expected
> behavior before doing that.
>
> Thanks!
>
> gt
--
Alex Bennée