
Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC


From: BALATON Zoltan
Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Date: Tue, 3 Mar 2020 00:01:44 +0100 (CET)
User-agent: Alpine 2.22 (BSF 395 2020-01-19)

On Mon, 2 Mar 2020, Alex Bennée wrote:
BALATON Zoltan <address@hidden> writes:
On Sun, 1 Mar 2020, Richard Henderson wrote:
On 3/1/20 4:13 PM, Programmingkid wrote:
Ok, I was just looking at Intel's x87 chip documentation. It
supports IEEE 754 floating point operations and exception flags.
This leads me to this question. Would simply taking the host
exception flags and using them to set the PowerPC's FPU's flag be
an acceptable solution to this problem?

In my understanding that's what is currently done. The problem with PPC,
as Richard said, is the non-sticky versions of some of these bits, which
require clearing the host FP exception status before every FPU op and
checking it again afterwards, and that seems to be expensive and slower
than using softfloat. So to use hardfloat we either accept that we can't
emulate these bits, or we need to do something other than clearing and
checking the flags around every FPU op.
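
To make the cost concrete, the per-op scheme described above would look
roughly like this in standard fenv.h terms (just a sketch with made-up
names, not a claim about the actual QEMU code):

#include <fenv.h>

#define GUEST_FI 0x1u  /* made-up stand-in for the guest's per-op inexact bit */

/* Sketch only: emulate one guest add while recreating a non-sticky,
 * per-op status bit.  Clearing and re-reading the host flags around
 * every single operation is what makes this approach expensive.
 * (Real code would also need FENV_ACCESS or equivalent compiler flags
 * so the add is not reordered around the fenv calls.) */
static double emulate_fadd(double a, double b, unsigned *guest_flags)
{
    feclearexcept(FE_ALL_EXCEPT);          /* reset host flags         */
    double r = a + b;                      /* the actual hardware op   */
    if (fetestexcept(FE_INEXACT)) {        /* read them back           */
        *guest_flags |= GUEST_FI;
    } else {
        *guest_flags &= ~GUEST_FI;         /* FI is per-op, not sticky */
    }
    /* ...map the other raised host bits to the guest FPSCR here...    */
    return r;
}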

While not emulating these bits doesn't seem to matter for most clients,
and other PPC emulations got away with it, QEMU prefers accuracy over
speed even for rarely used features.

No.

The primary issue is the FPSCR.FI flag.  This is not an accumulative bit, per
ieee754, but per operation.

The "hardfloat" option works (with other targets) only with ieee745
accumulative exceptions, when the most common of those exceptions, inexact, has
already been raised.  And thus need not be raised a second time.
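
In code terms, the generic hardfloat fast path is gated on roughly the
following condition (a sketch of the idea with invented names, not the
exact fpu/softfloat.c source):

#include <stdbool.h>

/* Made-up stand-ins for the softfloat status fields. */
#define FLAG_INEXACT        0x1u
#define ROUND_NEAREST_EVEN  0

/* Sketch: only take the hardware path when no information can be lost.
 * If inexact is already sticky-set and we are in the default rounding
 * mode, an op whose only effect would be to raise inexact again changes
 * nothing, so the host flags never have to be read. */
static bool can_use_hardfloat(unsigned accumulated_flags, int rounding_mode)
{
    return (accumulated_flags & FLAG_INEXACT) &&
           rounding_mode == ROUND_NEAREST_EVEN;
}

For PPC this is exactly what breaks down: even when the sticky bit is
already set, FI still has to be recomputed for the current op.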

Why exactly is it done that way? What are the differences between IEEE
FP implementations that prevent using hardfloat most of the time instead
of only using it in some (although supposedly common) special cases?

There are a couple of wrinkles. As far as NaN and denormal behaviour
goes we have enough slack in the spec that different guests have
slightly different behaviour. See pickNaN and friends in the softfloat
specialisation code. As a result we never try to hand off to hardfloat
for NaNs, Infs and Zeros. Luckily testing for those cases is a fairly
small part of the cost of the calculation.
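
Concretely, the hand-off check screens the operand classes first,
something along these lines (my sketch with an invented helper name; the
real checks in fpu/softfloat.c are organised differently):

#include <math.h>
#include <stdbool.h>

/* Sketch: per the above, NaNs, Infs and zeros (and denormals) never take
 * the hardware path, so guest-specific rules like pickNaN() stay in
 * softfloat.  fpclassify() is cheap relative to the whole emulated op. */
static bool operands_ok_for_hardfloat(double a, double b)
{
    return fpclassify(a) == FP_NORMAL && fpclassify(b) == FP_NORMAL;
}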

Also things tend to get unstuck on changes to rounding modes.
Fortunately that doesn't seem to be super common.

OK, but how do these relate to the inexact flag, and why is that the one
that's checked before using hardfloat? Also the rounding mode is
checked, but why can't we set the same mode on the host, and why is
hardfloat only used in one specific rounding mode? These two checks seem
to further limit hardfloat use beyond the above cases, or are they the
same thing?

You can read even more detail in the paper that originally prompted
Emilio's work:

 "supporting the neon and VFP instruction sets in an LLVM-based
  binary translator"
  https://www.thinkmind.org/download.php?articleid=icas_2015_5_20_20033

I've only had a quick look at it, but it seems not to discuss all the
details. They say the ARM instructions they wanted to emulate have some
non-standard flush-to-zero behaviour where exceptions (including
inexact) are handled differently. Is this related to the check above,
and if so, shouldn't it only apply to the ARM target? Other
standards-compliant targets probably should not be limited by this.

They've also found that clearing and reading the host FP flags is
"slower than QEMU", which is what we currently do for PPC. They say the
solution is not to use host exceptions at all but to compute the
exception flags in software, looking at the inputs and the result,
perhaps with additional FP ops that test for the exception cases.
Unfortunately the paper does not describe exactly how that's done, it
just says it may be described later. It sounds like a kind of softfloat
that uses the FPU for the actual calculation and deduces the exceptions
without access to the intermediate results softfloat may rely on. So
they can use the hardware for the calculation, which should be the
largest part, and compute the flags in software. This way they claim a
1.24 to 3.36 times speedup over the QEMU of the time (using only
softfloat I guess, which is still what we have for PPC today).
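
As an illustration of what deducing the flags in software could look
like, here is one textbook way to detect inexact for an addition without
touching the host status flags at all (the classic two-sum error term;
my own sketch, not necessarily what the paper implements):

#include <stdbool.h>

/* Sketch: do the add in hardware, then recover the rounding error with
 * Knuth's two-sum trick.  The operation was inexact exactly when the
 * error term is non-zero.  Assumes round-to-nearest and no overflow;
 * NaN/Inf inputs would have to be screened out beforehand. */
static double hard_add_detect_inexact(double a, double b, bool *inexact)
{
    double s   = a + b;                  /* the hardware result            */
    double bv  = s - a;                  /* part of b that made it into s  */
    double av  = s - bv;                 /* part of a that made it into s  */
    double err = (a - av) + (b - bv);    /* exact rounding error of a + b  */

    *inexact = (err != 0.0);
    return s;
}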

Per the PowerPC architecture, inexact must be recognized afresh for every
operation.  Which is cheap in hardware but expensive in software.

And once you're done with FI, FR has been and continues to be emulated 
incorrectly.
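
For readers not fluent in FPSCR bits: XX is the sticky inexact exception
bit, while FI ("fraction inexact") and FR ("fraction rounded") describe
only the last operation. In emulation terms the difference is roughly
the following (illustrative field names, not QEMU's actual FPSCR
representation):

#include <stdbool.h>

/* Illustrative only: invented layout, not how QEMU stores the FPSCR. */
struct guest_fpscr {
    bool fi;    /* fraction inexact, valid for the last op only  */
    bool fr;    /* fraction rounded/incremented, last op only    */
    bool xx;    /* sticky inexact exception bit                  */
};

static void update_inexact_bits(struct guest_fpscr *f,
                                bool op_inexact, bool op_rounded_up)
{
    f->fi  = op_inexact;       /* recomputed for every instruction  */
    f->fr  = op_rounded_up;    /* likewise per-op (the awkward one) */
    f->xx |= op_inexact;       /* sticky: only ever accumulates     */
}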

I think CPUs can also raise exceptions when they detect the condition
in hardware, so maybe we should install our own FPU exception handler
and set the guest flags from that; then we don't need to check after
every op and won't have a problem with these bits either. Why is that
not possible, or why isn't it done?

One of my original patches did just this:

 Subject: [PATCH] fpu/softfloat: use hardware sqrt if we can (EXPERIMENT!)
 Date: Tue, 20 Feb 2018 21:01:37 +0000
 Message-Id: <address@hidden>

It's this patch:
http://patchwork.ozlabs.org/patch/875764/

This at least shows where to hook in FP exception handling, but based on
the above paper maybe that's not the best solution after all. It may be
worth a try anyway in case it turns out to be simpler than what they
did.
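
For the record, the trap-based idea would look roughly like the sketch
below: unmask inexact on the host, and when the trap fires, record the
guest bit and bail out of the hardware path for that one op. This is
only a sketch under several assumptions: feenableexcept() is a glibc
extension, siglongjmp out of the handler avoids re-executing the
faulting instruction, a real implementation would redo the op in
softfloat when fell_back is set, and FENV_ACCESS or equivalent compiler
flags would be needed so the add is not moved around the fenv calls:

#define _GNU_SOURCE
#include <fenv.h>
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>

static sigjmp_buf fp_trap_env;

/* Async-signal-safe: just escape back to the emulation path. */
static void sigfpe_handler(int sig)
{
    (void)sig;
    siglongjmp(fp_trap_env, 1);
}

/* Sketch: try the op on the host FPU with inexact unmasked.  If no trap
 * fires, the result was exact and no flag reading is needed.  If the
 * trap fires, record the guest bit and let the caller redo the op in
 * softfloat. */
static double hard_add_with_trap(double a, double b,
                                 bool *guest_fi, bool *fell_back)
{
    signal(SIGFPE, sigfpe_handler);
    *fell_back = false;

    if (sigsetjmp(fp_trap_env, 1)) {
        /* We trapped: the op was inexact.  Clean up the host FP state. */
        fedisableexcept(FE_INEXACT);
        feclearexcept(FE_ALL_EXCEPT);
        *guest_fi = true;
        *fell_back = true;       /* caller redoes this op in softfloat */
        return 0.0;
    }

    feclearexcept(FE_INEXACT);
    feenableexcept(FE_INEXACT);
    double r = a + b;            /* traps to sigfpe_handler if inexact */
    fedisableexcept(FE_INEXACT);

    *guest_fi = false;           /* no trap: the result was exact */
    return r;
}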

The two problems you run into are:

- relying on a trap for inexact will be slow if you keep hitting it

Which is slower? Clearing the exception flags before every op and
reading them again afterwards, or trapping on exceptions? I'd expect
that even if exceptions are common they should be less frequent than
every op (otherwise they would not be "exceptional").
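
If anyone wants to measure the first alternative, the per-op cost of
clearing and re-reading the host flags can be timed with something as
simple as the sketch below (numbers obviously depend on the host CPU and
libc):

#include <fenv.h>
#include <stdio.h>
#include <time.h>

/* Rough sketch: time the feclearexcept/op/fetestexcept pattern that a
 * per-op emulation of the non-sticky bits would need. */
int main(void)
{
    volatile double x = 1.0, y = 3.0, r = 0.0;
    const long n = 10 * 1000 * 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++) {
        feclearexcept(FE_ALL_EXCEPT);
        r = x / y;                          /* raises inexact every time */
        (void)fetestexcept(FE_ALL_EXCEPT);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per op (last result %g)\n", ns / n, r);
    return 0;
}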

- reading host FPU flag registers turns out to be pretty expensive

That's what using exceptions should avoid. If we only need to read and
clear the flags when an exception happens, that should be less frequent
than doing it for every FP op. Hopefully that still holds with the
additional overhead of calling the handler, if all the handler has to do
is set a corresponding flag in a global.

Regards,
BALATON Zoltan
