Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant

qemu-devel

From:	Alexander Monakov
Subject:	Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
Date:	Mon, 29 Apr 2024 14:29:58 +0300 (MSK)

On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> > 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> This is describing limitations of Intel's AVX512 implementation.
> 
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
> 
>   https://www.mersenneforum.org/showthread.php?p=614191
> 
>   "Overall, AMD's AVX512 implementation beat my expectations.
>    I was expecting something similar to Zen1's "double-pumping"
>    of AVX with half the register file and cross-lane instructions
>    being super slow. But this is not the case on Zen4. The lack
>    of power or thermal issues combined with stellar shuffle support
>    makes it completely worthwhile to use from a developer standpoint.
>    If your code can vectorize without excessive wasted computation,
>    then go all the way to 512-bit. AMD not only made this worthwhile,
>    but *incentivizes* it with the power savings. And if in the future
>    AMD decides to widen things up, you may get a 2x speedup for free."
> 
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No. The new implementation saturates the load ports, and Genoa runs 512-bit
AVX instructions at half the throughput of their 256-bit counterparts
(one 512-bit load or two 256-bit loads per cycle), so there's no
obvious reason why this patch would sacrifice performance there.
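
To spell out the arithmetic, here is a minimal sketch. The helper name
load_bytes_per_cycle is made up for illustration, and the loads-per-cycle
figures are the ones stated in the paragraph above, not measurements.

```c
#include <stdio.h>

/*
 * Peak load bandwidth in bytes per cycle: number of vector loads the
 * core can issue per cycle times the vector width in bytes.
 * Hypothetical helper, for illustration only.
 */
static unsigned load_bytes_per_cycle(unsigned loads_per_cycle,
                                     unsigned vector_bits)
{
    return loads_per_cycle * (vector_bits / 8);
}

/* Print both cases using the figures quoted above (assumed, not measured). */
static void print_load_bandwidth(void)
{
    printf("256-bit: %u bytes/cycle\n", load_bytes_per_cycle(2, 256));
    printf("512-bit: %u bytes/cycle\n", load_bytes_per_cycle(1, 512));
}
```

Both cases come out to the same number of bytes per cycle, so a loop that
is already limited by the load ports with 256-bit vectors has no headroom
to gain from switching to 512-bit vectors.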

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even so it would be a few percent, not 2x.
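
For context on the "early checks" the commit message refers to, here is a
rough sketch of the idea, assuming the inline wrapper samples a few bytes
before calling the out-of-line routine. The names is_zero_full_scan and
is_zero_with_early_check are hypothetical, and the plain byte loop only
stands in for the SIMD routine; this is not the code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the out-of-line (SIMD) routine. */
static bool is_zero_full_scan(const unsigned char *buf, size_t len)
{
    unsigned char acc = 0;

    for (size_t i = 0; i < len; i++) {
        acc |= buf[i];
    }
    return acc == 0;
}

/*
 * Hypothetical inline wrapper: test the first, middle and last bytes
 * before paying for a full scan.  A typical non-zero buffer is
 * rejected here, so the expensive routine runs mostly on buffers
 * that are likely to be all zero.
 */
static inline bool is_zero_with_early_check(const void *vbuf, size_t len)
{
    const unsigned char *buf = vbuf;

    if (len == 0) {
        return true;
    }
    if ((buf[0] | buf[len / 2] | buf[len - 1]) != 0) {
        return false;
    }
    return is_zero_full_scan(buf, len);
}
```

With a check like this in front, the vector code is entered rarely and
runs in short bursts, which is the usage pattern where wide-vector
frequency and voltage transitions are hardest to amortize.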

Alexander
