Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
From: Alexander Monakov
Subject: Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
Date: Mon, 29 Apr 2024 14:29:58 +0300 (MSK)

On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> >
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>
> This is describing limitations of Intel's AVX512 implementation.
>
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
>
> https://www.mersenneforum.org/showthread.php?p=614191
>
> "Overall, AMD's AVX512 implementation beat my expectations.
> I was expecting something similar to Zen1's "double-pumping"
> of AVX with half the register file and cross-lane instructions
> being super slow. But this is not the case on Zen4. The lack
> of power or thermal issues combined with stellar shuffle support
> makes it completely worthwhile to use from a developer standpoint.
> If your code can vectorize without excessive wasted computation,
> then go all the way to 512-bit. AMD not only made this worthwhile,
> but *incentivizes* it with the power savings. And if in the future
> AMD decides to widen things up, you may get a 2x speedup for free."
>
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No: the new implementation saturates the load ports, and Genoa runs 512-bit
AVX instructions at half the throughput of their 256-bit counterparts
(one 512-bit load or two 256-bit loads per cycle). So there is no
obvious reason why this patch would sacrifice performance there.

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even then the effect would be a few percent, not 2x.

Alexander
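
The load-port argument in the reply above can be made concrete with a little
arithmetic. The following is only a sketch: the per-cycle load counts are the
figures quoted in the reply (two 256-bit loads or one 512-bit load per cycle
on Genoa), and the names bytes_per_cycle and same_peak_load_bandwidth are
hypothetical helpers, not part of QEMU.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed figures, taken from the reply above rather than measured:
 * per cycle, Genoa sustains two 256-bit loads or one 512-bit load.
 */
enum {
    LOADS_PER_CYCLE_256 = 2,
    LOADS_PER_CYCLE_512 = 1,
};

/* Peak bytes loaded per cycle for a given vector width in bits. */
static unsigned bytes_per_cycle(unsigned width_bits, unsigned loads_per_cycle)
{
    assert(width_bits % 8 == 0);
    return width_bits / 8 * loads_per_cycle;
}

/* True if both vector widths reach the same peak load bandwidth. */
static bool same_peak_load_bandwidth(void)
{
    return bytes_per_cycle(256, LOADS_PER_CYCLE_256) ==
           bytes_per_cycle(512, LOADS_PER_CYCLE_512);
}
```

Under those assumed figures both widths peak at 64 bytes per cycle, so a loop
that is already limited by the load ports has no bandwidth to gain from wider
vectors. This says nothing about clock-frequency effects, which the reply
treats separately.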