qemu-devel

Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant


From: Alexander Monakov
Subject: Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
Date: Mon, 29 Apr 2024 14:29:58 +0300 (MSK)

On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> > 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> This is describing limitations of Intel's AVX512 implementation.
> 
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
> 
>   https://www.mersenneforum.org/showthread.php?p=614191
> 
>   "Overall, AMD's AVX512 implementation beat my expectations.
>    I was expecting something similar to Zen1's "double-pumping"
>    of AVX with half the register file and cross-lane instructions
>    being super slow. But this is not the case on Zen4. The lack
>    of power or thermal issues combined with stellar shuffle support
>    makes it completely worthwhile to use from a developer standpoint.
>    If your code can vectorize without excessive wasted computation,
>    then go all the way to 512-bit. AMD not only made this worthwhile,
>    but *incentivizes* it with the power savings. And if in the future
>    AMD decides to widen things up, you may get a 2x speedup for free."
> 
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No. The new implementation already saturates the load ports, and Genoa
runs 512-bit AVX instructions at half the throughput of their 256-bit
counterparts (one 512-bit load or two 256-bit loads per cycle). Both
widths therefore move the same 64 bytes per cycle, so there's no obvious
reason why this patch would sacrifice performance there.

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even so it would be a few percent, not 2x.
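
The load-port argument can be checked with back-of-the-envelope
arithmetic. The sketch below is illustrative only: the helper names are
hypothetical and are not part of util/bufferiszero.c, and the per-cycle
load counts are the ones stated in this message, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: peak bytes loaded per cycle, given the width of
 * one vector load in bits and how many such loads issue per cycle.
 */
static size_t load_bytes_per_cycle(size_t vector_bits, size_t loads_per_cycle)
{
    assert(vector_bits % 8 == 0);
    return (vector_bits / 8) * loads_per_cycle;
}

/*
 * Hypothetical helper: lower bound on cycles needed to scan len bytes
 * when the loop is limited only by load throughput (rounded up).
 */
static size_t min_scan_cycles(size_t len, size_t bytes_per_cycle)
{
    assert(bytes_per_cycle > 0);
    return (len + bytes_per_cycle - 1) / bytes_per_cycle;
}
```

With the figures above, load_bytes_per_cycle(512, 1) and
load_bytes_per_cycle(256, 2) both come to 64, so a load-bound scan of a
4096-byte buffer needs at least 64 cycles at either width.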

Alexander
