
Re: [Qemu-discuss] Getting qemu-system-i386 to use more than one core on


From: Jakob Bohm
Subject: Re: [Qemu-discuss] Getting qemu-system-i386 to use more than one core on Cortex A7 host
Date: Wed, 6 Jan 2016 01:03:17 +0100
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 06/01/2016 00:53, Peter Maydell wrote:
On 5 January 2016 at 23:10, Jakob Bohm <address@hidden> wrote:
On 05/01/2016 18:35, Peter Maydell wrote:
(It would also be possible
to use the v8 ARM load-acquire and store-release instructions
rather than full-on barriers, but on v7 I think barriers are
the only answer.)


The load-acquire/store-if-no-conflict instruction pair was introduced
halfway through the ARMv6 architecture, though it may be missing on
some non-A ARMv7 cores, since it is not required for that processor
class.

I think you are thinking of the load-exclusive/store-exclusive
instructions, which did indeed appear in ARMv6 and provide
"only store if no conflict" semantics for implementing atomic
operations. Load-acquire/store-release are different and are
new in ARMv8 -- they are a bit like a normal load/store with a
built-in one-sided barrier: if you do a load-acquire then some normal
loads/stores, other CPUs must see your load-acquire before the
other operations (but loads/stores that happened before the
load-acquire might still be ordered after it). Similarly if
you do some loads and stores followed by a store-release then
other processors must see your store-release last.
(I've simplified rather here, see the architecture manual for
the exact semantics.)
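
To make the one-sided barrier idea concrete, here is a minimal sketch using C11 atomics. It is only an illustration, not code from QEMU: the `mailbox` type and its functions are hypothetical names invented for this example. The producer writes an ordinary payload and then does a store-release on a flag; the consumer does a load-acquire on the flag and only then reads the payload. As far as I know, compilers targeting ARMv8 typically lower these operations to the load-acquire/store-release instructions, while on ARMv7 they have to emit barriers instead, but that is a property of the toolchain and worth checking in the generated assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox: one plain payload guarded by an atomic flag. */
struct mailbox {
    int payload;        /* ordinary, non-atomic data */
    atomic_bool ready;  /* publication flag */
};

void mailbox_init(struct mailbox *m)
{
    assert(m != NULL);
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Producer side: normal store first, then a store-release on the flag.
 * The release ordering means the payload write cannot become visible
 * to other threads after the flag write. */
void mailbox_publish(struct mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer side: load-acquire on the flag, then a normal load.
 * The acquire ordering means the payload read cannot be satisfied
 * before the flag read. Returns false if nothing was published yet. */
bool mailbox_try_consume(struct mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        return false;
    }
    *out = m->payload;
    return true;
}
```

In a real program one thread would call `mailbox_publish()` and another would poll `mailbox_try_consume()`. With relaxed ordering on the flag instead, a weakly ordered host could in principle let the consumer see `ready` set while still reading a stale payload.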


Ahh, sorry, I am not completely up to date on aarch64 assembly yet.

Additionally, I think some ARM MMUs have page- or region-level
memory ordering flags, including some flag combinations that break
normal ARM synchronization instructions.

This is true but not really important for considering QEMU
running on an ARM host -- all the RAM we get from the host
OS will be Normal memory, not Device or Strongly-ordered.


I was thinking of maybe getting kernel help, e.g. via an extension of
madvise() or similar.

But anyway, it might be worth allowing the P5 reordering rules on x86
if that improves the situation.  It might also be worth adding some "is
the host CPU reordering too aggressively" conditionals, at both compile
time and run time, to switch between different TCG multi-core strategies
depending on the exact host CPU.

I'm not sure how you would test at runtime whether the CPU might
decide to reorder accesses -- I think you have to assume the
worst case imposed by the architecture.


Basically by checking for known-safe CPU core models: if, e.g.,
"Cortex-A8" is safe but "Cortex-A9" is not, we could test for that (with
appropriate OS calls, because someone decided to make the ARM CPUID a
privileged instruction).
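
As a rough sketch of what such a check could look like on a Linux host, the code below extracts the "CPU part" field from text in the format of /proc/cpuinfo and compares it against an allowlist. The function names and the allowlist are hypothetical. In particular, listing 0xc08 (Cortex-A8) as "safe" only mirrors the made-up example above and is not a statement about what any real core guarantees.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of the first "CPU part" line in cpuinfo-style text,
 * or -1 if there is none or it cannot be parsed. */
int parse_cpu_part(const char *text)
{
    assert(text != NULL);
    const char *line = strstr(text, "CPU part");
    if (line == NULL) {
        return -1;
    }
    const char *colon = strchr(line, ':');
    const char *eol = strchr(line, '\n');
    if (colon == NULL || (eol != NULL && colon > eol)) {
        return -1;  /* no "key : value" separator on this line */
    }
    const char *p = colon + 1;
    while (*p == ' ' || *p == '\t') {
        p++;
    }
    if (!isxdigit((unsigned char)*p)) {
        return -1;
    }
    /* Base 16 also accepts the "0x" prefix the kernel prints. */
    return (int)strtol(p, NULL, 16);
}

/* Hypothetical allowlist; the entry is a placeholder, not a verified
 * claim about the memory-ordering behaviour of that core. */
bool cpu_part_is_known_safe(int part)
{
    static const int safe_parts[] = { 0xc08 /* Cortex-A8, example only */ };
    for (size_t i = 0; i < sizeof(safe_parts) / sizeof(safe_parts[0]); i++) {
        if (safe_parts[i] == part) {
            return true;
        }
    }
    return false;
}
```

A caller would read /proc/cpuinfo into a buffer, pass it to `parse_cpu_part()`, and choose a multi-core strategy based on `cpu_part_is_known_safe()`. A real check would presumably also compare the "CPU implementer" field, since part numbers are only unique per implementer.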


Another tactic could be to not let more than one virtual core have
actual access to the same page if at least one of them has write
access.  So the minority of code that actually does do multi-core data
updates to the same virtualized memory page and might thus be affected
by ordering rules would cause the emulator to constantly switch the
shared page back and forth, while most other code will just run along
nicely using shared read or exclusive write page accesses.

This is an interesting idea; I guess it would need to be
implemented and benchmarked to see if the overhead on typical
workloads was low enough to make it make sense.
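
To illustrate the bookkeeping the "shared read or exclusive write" scheme would need, here is a small state-machine sketch. It models only the ownership rules, not the actual page protection changes or anything in QEMU; `page_state`, `page_access()` and `MAX_VCPUS` are hypothetical names. The return value says whether an access had to take the slow path (an ownership change), which is the cost that would need benchmarking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VCPUS 32  /* hypothetical limit so readers fit in one word */

struct page_state {
    uint32_t readers;  /* bitmask of vCPUs holding shared read access */
    int writer;        /* vCPU holding exclusive write access, or -1 */
};

void page_init(struct page_state *p)
{
    assert(p != NULL);
    p->readers = 0;
    p->writer = -1;
}

/* Record an access by `cpu`. Returns false if the vCPU already had the
 * needed right (fast path), true if ownership had to change. */
bool page_access(struct page_state *p, unsigned cpu, bool is_write)
{
    assert(p != NULL && cpu < MAX_VCPUS);
    uint32_t bit = UINT32_C(1) << cpu;

    if (p->writer == (int)cpu) {
        return false;  /* exclusive owner may read and write freely */
    }
    if (!is_write && p->writer < 0 && (p->readers & bit) != 0) {
        return false;  /* already one of the shared readers */
    }

    if (is_write) {
        /* Revoke everyone else and hand the page to this vCPU. */
        p->readers = 0;
        p->writer = (int)cpu;
    } else {
        /* Demote an existing writer to a reader, then join as reader. */
        if (p->writer >= 0) {
            p->readers = UINT32_C(1) << p->writer;
            p->writer = -1;
        }
        p->readers |= bit;
    }
    return true;
}
```

Code in which two vCPUs alternately write the same page takes the slow path on every access, which is the constant switching back and forth described above. Pages that are only read, or only written by one vCPU, settle into the fast path after the first access.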

But in the end, if x86 really makes these guarantees even in
multi-socket setups (more than one physical x86 CPU in a suitable
motherboard), despite the normal effects of caching, while ARM doesn't,
that kind of sucks.  Though we shouldn't forget that those are not
the only two architectures involved.

It's just an unfortunate architectural philosophy mismatch,
and ARM and x86 are my usual examples of the two possibilities.
I think most non-x86 architectures go for a weaker memory
model than x86 did, so MIPS and PPC are on the ARM end of
the spectrum.


I suspect it might also be about backward compatibility with millions
of programs that were tested only with single CPU machines where such
ordering would be a natural side effect of shared caches.


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded


