qemu-devel

Re: notdirty_write thrashing in simple for loop


From: BALATON Zoltan
Subject: Re: notdirty_write thrashing in simple for loop
Date: Sun, 23 May 2021 15:41:32 +0200 (CEST)

Hello,

On Tue, 18 May 2021, Mark Watson wrote:
I'm trying to implement my own machine for amiga emulation using a software
cpu and fpga hardware. For this I have built my own machine which consists
of a large malloced ram block and some fpga hardware mmapped elsewhere into
the memory space.

I'm using qemu to emulate a 68040 on an arm cortex a9 host in system mode.

It is working, though I'm investigating a strange performance issue.

I'm looking for advice on where to look next in debugging this from the
specialist(s) of accel/tcg/cputlb.c please.

I think you need to be more specific about the details or, even better, provide a way to reproduce this without your hardware if possible; otherwise people will not understand what you're seeing. From the above it's not clear to me whether you're emulating the fpga hardware in QEMU or actually running with the fpga (supposedly implementing the Amiga chipset) mapped into the virtual machine's memory, so that accesses to some addresses do something in hardware. In the latter case it may be difficult to reproduce without the hardware, and the hardware itself could be the source of the problem, so it is hard to tell what might be causing your issue.

(Is this related to pistorm or something based on that for full Amiga emulation without Amiga hardware? Just interested, unrelated to this thread.)

To investigate the performance issue I tried to break it down to the
simplest possible case. I can reproduce it with a simple for loop (compiled
without optimisation).
    for (int i=0;i!=0xffffff;++i)
    {
        if ((i&0xffff)==0)
        {
        }
    }

So you do nothing in the loop, just test the loop variable, and this sometimes runs slowly?
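
If it helps anyone try this outside the Amiga setup, a self-contained C version of the loop with a timer around it could look like the sketch below. The loop is the one quoted above; everything else is an assumption added for illustration: the names run_loop and time_loop are made up, the hit counter was not in the original (the body there is empty), and clock_gettime assumes a POSIX-like guest such as the m68k Linux machine, so a different timer would be needed under AmigaOS.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/*
 * The loop from the original mail. The counter is volatile so that it is
 * kept in memory and written back on every iteration, roughly what an
 * unoptimised build does with a stack variable.
 * Returns how many times the inner condition was true.
 */
static unsigned run_loop(void)
{
    volatile int i;
    unsigned hits = 0;

    for (i = 0; i != 0xffffff; ++i) {
        if ((i & 0xffff) == 0) {
            ++hits; /* added for illustration; the original body is empty */
        }
    }
    return hits;
}

/* Hypothetical helper: wall-clock seconds taken by one call of run_loop(). */
static double time_loop(unsigned *hits)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    *hits = run_loop();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_loop() from main() and printing the returned seconds gives one number per environment (user mode, the 'virt' m68k machine, the custom machine) that can be compared directly.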

Running it in user mode on the same host it takes ~0.6 seconds. In the
built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
However in my machine under amigaos I'm seeing it typically taking 5 and a
half minutes! Occasionally it seems to run at the correct speed of <2
seconds, though I have yet to identify why. These are the logs of the
captured code before it goes into the main chain loop.
qemu_slow_stuck_fragment.log
<http://www.64kib.com/qemu_slow_stuck_fragment.log>

The log does not make much sense to me, but I'm also not an expert on TCG and ARM. Why do you have faults while running a simple empty loop, and what are those? Is something flushing the TLB for some reason, or is this just because of the debug logging? I think there are some -d options for mmu debugging that may give more info on TLB usage.

I have verified that this performance change is not due to slow fpga memory
area access, i.e. there are no accesses to that memory region during this.

OK, so then it should be possible to reproduce without that hardware? If so, that would help people understand the issue and give advice, but I see that reproducing it may need understanding the issue first.

I took a look in gdb while running this loop to see what is going on.
Initially I was surprised that I didn't find the code in 'OUT:', however I
guess it makes sense that it has to call into the framework for memory
access. I noticed that a lot of calls to glib are made and see

I rarely use gdb with QEMU so I'm not sure, but normally with the TCG in_asm and out_asm debug options you'll only see a TB when it is first translated, not when it runs later, because then the translated code is run from the TB cache. I think you can more or less disable this with -singlestep, which makes each TB a single instruction and may change caching. At least with that I see all instructions all the time in -d in_asm, so this may help debugging, although it makes things much slower.

g_tree_lookup called a lot. This is caused by notdirty_write being called
'000s of times and each time going into the page_collection_lock and
tb_invalidate_phys_page_fast. I presume this is happening each time that
"i" is incremented on the stack, which clearly has a huge overhead.
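
The presumption above can be made concrete with a toy model in C. This is not QEMU code and all names in it (ToyPage, toy_store, toy_run) are made up. It only encodes one assumption: a store to a RAM page that still holds translated code takes a slow path, and the page goes back to the fast path only once the code on it has been thrown away. Under that assumption, a counter that shares a page with code it never overwrites pays the slow path on every single store.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one guest RAM page; not a QEMU structure. */
typedef struct {
    size_t code_start;          /* start offset of translated code */
    size_t code_end;            /* exclusive end; equal to start = no code */
    unsigned long slow_writes;  /* stores that took the slow path */
    unsigned long fast_writes;  /* stores that went straight to RAM */
} ToyPage;

static bool toy_page_has_code(const ToyPage *p)
{
    return p->code_start != p->code_end;
}

/* Model one guest store of `size` bytes at `offset` within the page. */
static void toy_store(ToyPage *p, size_t offset, size_t size)
{
    if (!toy_page_has_code(p)) {
        p->fast_writes++;       /* no code on the page: plain RAM write */
        return;
    }

    p->slow_writes++;           /* page holds code: slow path is taken */

    if (offset < p->code_end && offset + size > p->code_start) {
        /* The store overlaps the code, so that code is discarded. */
        p->code_start = 0;
        p->code_end = 0;
    }
    /* Otherwise the code survives and the next store is slow again. */
}

/* Issue `n` 4-byte stores to a counter at `counter_off`.
 * Returns the number of slow-path stores recorded on the page. */
static unsigned long toy_run(ToyPage *p, size_t counter_off, unsigned long n)
{
    for (unsigned long k = 0; k < n; k++) {
        toy_store(p, counter_off, 4);
    }
    return p->slow_writes;
}
```

In this model a page with code at offsets 0..255 and a counter at offset 1024 sees every store go slow, while a counter that overlaps the code goes slow only once. Whether the real machine behaves this way is exactly what needs checking, for example by looking at what else lives on the page that holds the stack.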

There are only a few places notdirty_write is called from, so you should be able to identify which of those is firing (if all else fails you could add debug logs, but there may be trace points to enable too). Once we know which place it's coming from, maybe people could tell why that could happen. I don't know if you already know the QEMU debug options; I have collected some that I've used while implementing machine emulation here:

https://osdn.net/projects/qmiga/wiki/DeveloperTips

Even being able to get a proper stack trace from gdb would be very helpful
to understand this. I tried to configure qemu with '--enable-debug' but
still do not get a proper stack if i attach to it. I'm not sure if this is
the case due to it running dynamically compiled code before calling into
this.

The --enable-debug option adds debug symbols to QEMU, but if notdirty_write is called from generated code then you'll probably see that as the source of the calls, so it is hard to tell what has put it there. Still, it may help if you could show some of the back traces you got, in case they make sense to somebody who knows about TCG. Also verify that these excessive calls to notdirty_write only happen when it's running slow, so that they are really the source of the problem and not something that happens normally as well.

Sorry I can't give any more useful advice, but maybe the above gives you some idea of how to debug this further.

Regards,
BALATON Zoltan


