I debugged this a bit more and am closer to an explanation. Some crude rdtsc cycle profiling showed that notdirty_write is called a huge number of times in the slow case and, in aggregate, accounts for most of the time. Running QEMU with "-trace memory_notdirty_write_access" shows that the slow case logs this line millions of times:
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
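A rough sketch of how a trace like this could be tallied per address, assuming each line looks exactly like the excerpt above (event name, address, "ram_addr", RAM address, "size", size). The function name tally_notdirty_writes is made up for illustration; a build whose trace backend prefixes lines with a pid or timestamp would need the parsing adapted.

```python
from collections import Counter


def tally_notdirty_writes(lines):
    """Count memory_notdirty_write_access events per ram_addr.

    Assumes the line shape shown in the excerpt:
    memory_notdirty_write_access <addr> ram_addr <ram_addr> size <n>
    Lines that do not match that shape are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if (
            len(fields) == 6
            and fields[0] == "memory_notdirty_write_access"
            and fields[2] == "ram_addr"
        ):
            counts[int(fields[3], 16)] += 1
    return counts


# The four lines from the excerpt above, used as sample input.
sample = [
    "memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4",
    "memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4",
    "memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4",
    "memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4",
]

counts = tally_notdirty_writes(sample)
for addr, n in counts.most_common(5):
    print(f"{addr:#x}: {n}")  # prints "0x7be0: 4" for the sample
```

Fed the full trace instead of the sample, the top entry is the address being hammered. In the excerpt, every line refers to 0x7be0.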
The address 0x7be0 is almost certainly the local variable used for the spin loop (which iterates 10M times). Note that 0x7be0 is on the same page as the boot block at 0x7c00 (with 4 KiB pages, both fall in 0x7000-0x7fff). I set the stack pointer to 0x7c00 so that the stack grows down from the boot block, based on a suggestion for early bootstrapping from
wiki.osdev.org/MBR_(x86).
So I am doing a lot of writes to a page that also contains executable code, which I expect is part of the problem. The reason it's so fast in user mode has nothing to do with privilege level; I just happen to be using a stack there that is not on a page shared with executable code. In fact, if I change the slow version's stack so that it no longer shares a page with executable code (just moving it to 0x6900), it works fine and runs fast. Again, this has nothing to do with privilege level.
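The page arithmetic behind this can be checked with a small sketch. It assumes 4 KiB pages (the usual x86 page size); the helper names page_base and same_page are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_base(addr, page_size=PAGE_SIZE):
    """Return the start address of the page containing addr."""
    return addr & ~(page_size - 1)


def same_page(a, b, page_size=PAGE_SIZE):
    """True if both addresses fall within the same page."""
    return page_base(a, page_size) == page_base(b, page_size)


BOOT_BLOCK = 0x7C00  # where the boot block is loaded

for label, addr in [
    ("loop variable (slow case)", 0x7BE0),
    ("relocated stack top", 0x6900),
]:
    print(
        f"{label}: {addr:#x} is in page {page_base(addr):#x}, "
        f"shares page with boot block: {same_page(addr, BOOT_BLOCK)}"
    )
```

Under that assumption, 0x7be0 and 0x7c00 both land in the page starting at 0x7000, while 0x6900 lands in the page starting at 0x6000, so a stack growing down from 0x6900 stays clear of the code page for its first 0x900 bytes.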
This is easily avoided on my end, and of course having pages that are both executable and written to is not common practice.
But the question remains: if a page is both executable and written to, is it expected that writes to it stay persistently slow? Or is that a possible (fringe) bug?
Thank you!
gt