Re: notdirty_write thrashing in simple for loop

qemu-devel

From:	BALATON Zoltan
Subject:	Re: notdirty_write thrashing in simple for loop
Date:	Sun, 23 May 2021 15:41:32 +0200 (CEST)

Hello,

On Tue, 18 May 2021, Mark Watson wrote:

I'm trying to implement my own machine for amiga emulation using a software
cpu and fpga hardware. For this I have built my own machine which consists
of a large malloced ram block and some fpga hardware mmapped elsewhere into
the memory space.

I'm using qemu to emulate a 68040 on an arm cortex a9 host in system mode.

It is working, though I'm investigating a strange performance issue.

I'm looking for advice on where to look next in debugging this from the
specialist(s) of accel/tcg/cputlb.c please.

I think you need to be more specific about the details or, even better, provide a way to reproduce it without your hardware if possible; otherwise people will not get what you're seeing. From the above it's not clear to me if you're emulating the fpga hardware in QEMU or actually running with the fpga (supposedly implementing the Amiga chipset) mapped into the virtual machine's memory, so that accesses to some addresses will do something in hardware. In the latter case it may be difficult to reproduce without it, and the hardware could also be the source of the problem, so it's hard to tell what might be causing your issue.

(Is this related to pistorm or something based on that for full Amiga emulation without Amiga hardware? Just interested, unrelated to this thread.)

To investigate the performance issue I tried to break it down to the
simplest possible case. I can reproduce it with a simple for loop (compiled
without optimisation).
    for (int i = 0; i != 0xffffff; ++i)
    {
        if ((i & 0xffff) == 0)
        {
        }
    }

So you do nothing in the loop, just test the loop variable, and this sometimes runs slow?

Running it in user mode on the same host it takes ~0.6 seconds. In the
built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
However in my machine under amigaos I'm seeing it typically taking 5 and a
half minutes! Occasionally it seems to run at the correct speed of <2
seconds, though I have yet to identify why. These are the logs of the
captured code before it goes into the main chain loop.
qemu_slow_stuck_fragment.log
<http://www.64kib.com/qemu_slow_stuck_fragment.log>

The log does not make much sense to me, but I'm also not an expert on TCG and ARM. Why do you have faults while running a simple empty loop, and what are those? Is something flushing the TLB for some reason, or is this just because of the debug logging? I think there are some -d options for mmu debugging that may give more info on TLB usage.

I have verified that this performance change is not due to slow fpga memory
area access, i.e. there are no accesses to that memory region during this.

OK, so then it should be possible to reproduce without that hardware? If so, that would help people to understand the issue and give advice, but I see that reproducing may need understanding the issue first.

I took a look in gdb while running this loop to see what is going on.
Initially I was surprised that I didn't find the code in 'OUT:', however I
guess it makes sense that it has to call into the framework for memory
access. I noticed that a lot of calls to glib are made and see

I rarely use gdb with QEMU so I'm not sure, but normally with the TCG in_asm and out_asm debug options you'll only see these when the TB is first translated, not when you run it later, because then the translated code is run from the TB cache. I think you can kind of disable this with -singlestep, which makes TBs just a single instruction and may change caching. At least with that I see all instructions all the time in -d in_asm, so this may help debugging although it makes things much slower.

g_tree_lookup called a lot. This is caused by notdirty_write being called
'000s of times and each time going into the page_collection_lock and
tb_invalidate_phys_page_fast. I presume this is happening each time that
"i" is incremented on the stack, which clearly has a huge overhead.

There are only a few places notdirty_write is called from, so you should be able to identify which of those is firing (if all else fails you could add debug logs, but there may be trace points to enable too). Once we know which place it's coming from, maybe people could tell why that could happen. I don't know if you already know the QEMU debug options; I have collected some things that I've used while implementing machine emulation here:

https://osdn.net/projects/qmiga/wiki/DeveloperTips

Even being able to get a proper stack trace from gdb would be very helpful
to understand this. I tried to configure qemu with '--enable-debug' but
still do not get a proper stack if I attach to it. I'm not sure if this is
the case due to it running dynamically compiled code before calling into
this.

The --enable-debug option adds debug symbols to QEMU, but if the function is called from generated code then you'll probably see that as the source of the calls, so it's hard to tell what has put that there. Yet it may help if you could show some of the backtraces you got, in case they make sense to somebody who knows about TCG. Also verify that these excessive calls to notdirty_write only happen when it's running slow, so that it's really the source of the problem and not something that is normal otherwise.

Sorry I can't give any more useful advice, but maybe the above gives you some idea of how to debug this further.


Regards,
BALATON Zoltan
