From: | BALATON Zoltan |
Subject: | Re: [Qemu-ppc] [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros |
Date: | Sun, 27 Jan 2019 21:31:10 +0100 (CET) |
User-agent: | Alpine 2.21.9999 (BSF 287 2018-06-16) |
On Sun, 27 Jan 2019, Mark Cave-Ayland wrote:
> On 27/01/2019 17:26, Richard Henderson wrote:
>> On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
>>>> Could this make the loop slower?
>>> I certainly haven't noticed any obvious performance difference during testing (OS X uses merge quite a bit for display rendering), and I'd hope that with a good compiler and modern branch prediction then any effect here would be negligible.
>> I would expect the i < n/2 loop to be faster, because the assignments are unconditional. FWIW.
> Do you have any idea as to how much faster? Is it something that would show up as significant within the context of QEMU?
I don't have numbers either, but since these vector ops are meant for, and are used for, speeding up repetitive calculations, I'd expect them to be run many times, which means that even a small difference would add up. So I think it's worth trying to make these optimal also for the case when host vector ops cannot be used.
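To make the two loop shapes being compared concrete, here is a minimal standalone sketch in C. It is not the code from the patch: the function names, the plain uint8_t arrays and the VEC_BYTES constant are made up for illustration, and it ignores host endianness and the Vsr* accessor macros entirely. It only shows a merge of the first half of two 16-byte vectors written once with a per-element branch and once as an i < n/2 loop with two unconditional stores.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define VEC_BYTES 16  /* bytes in a 128-bit vector (illustrative constant) */

/*
 * Hypothetical variant A: one iteration per output element, choosing the
 * source operand with a branch on the parity of the index.
 */
static void merge_high_branchy(uint8_t r[VEC_BYTES],
                               const uint8_t a[VEC_BYTES],
                               const uint8_t b[VEC_BYTES])
{
    uint8_t result[VEC_BYTES];  /* temporary, so r may alias a or b */

    for (size_t i = 0; i < VEC_BYTES; i++) {
        if (i & 1) {
            result[i] = b[i >> 1];
        } else {
            result[i] = a[i >> 1];
        }
    }
    memcpy(r, result, VEC_BYTES);
}

/*
 * Hypothetical variant B: the i < n/2 form, with two unconditional
 * stores per iteration and no branch inside the loop body.
 */
static void merge_high_paired(uint8_t r[VEC_BYTES],
                              const uint8_t a[VEC_BYTES],
                              const uint8_t b[VEC_BYTES])
{
    uint8_t result[VEC_BYTES];  /* temporary, so r may alias a or b */

    for (size_t i = 0; i < VEC_BYTES / 2; i++) {
        result[2 * i + 0] = a[i];
        result[2 * i + 1] = b[i];
    }
    memcpy(r, result, VEC_BYTES);
}
```

Both functions produce the same output, so any difference would only be in how well the compiler and the host CPU handle each form. Whether that is measurable inside QEMU is exactly the open question here.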
I don't know of a good benchmark to measure this. Maybe you could try converting some video in Mac OS X or something similar that's known to use AltiVec/VMX. There are also these under MorphOS on mac99:
http://www.amiga-news.de/en/news/AN-2012-02-00011-EN.html
where the mplayer one is mostly VMX bound, I think, and lame is more dependent on floating point ops, but that also has a VMX version (still mainly float, I think). I'd copy the input file to the RAM: disk first to avoid overhead from IDE emulation. But these are probably too short to measure this.
I can't test this now but maybe someone reading this on the list who can try it with and without this series could help.
Regards, BALATON Zoltan