From: Alexander Monakov
Subject: Re: [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
Date: Thu, 15 Feb 2024 11:47:36 +0300 (MSK)
On Wed, 14 Feb 2024, Richard Henderson wrote:
> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> double-check with the compiler flags for __ARM_NEON and don't bother with
> a runtime check. Otherwise, model the loop after the x86 SSE2 function,
> and use VADDV to reduce the four vector comparisons.
I am not very familiar with Neon, but I wonder if this couldn't use SHRN
for the final 128b->64b reduction, similar to the 2022 glibc optimizations:
https://inbox.sourceware.org/libc-alpha/20220620174628.2820531-1-danilak@google.com/
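
Roughly, the kind of reduction I have in mind (untested sketch, helper name
made up here; with 32-bit lanes the narrowing shift would be #16, whereas
the glibc code narrows byte masks with SHRN #4):

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

/* Untested sketch: narrow the 4x32-bit comparison mask to 64 bits with
 * a shift-right-narrow, then test the resulting scalar. */
static inline bool vec128_is_zero_shrn(uint32x4_t v)
{
    /* Each lane becomes all-ones if the corresponding input lane was zero. */
    uint32x4_t cmp = vceqzq_u32(v);
    /* SHRN #16: 4x32 -> 4x16, packing the per-lane mask into 64 bits. */
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u16(vshrn_n_u32(cmp, 16)), 0);
    /* All four lanes were zero iff every bit of the packed mask is set. */
    return mask == UINT64_MAX;
}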
In the git history I see that the previous Neon buffer_is_zero was removed
because it was not faster. Was that because integer LDP was as good as vector
loads at saturating load bandwidth on older cores, and things are different now?
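
(For reference, by "integer LDP" I mean the plain scalar path reading adjacent
pairs of 64-bit words, roughly like the untested sketch below; the function
name is made up and alignment/length handling is omitted.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Untested sketch: assumes p is 16-byte aligned and len is a multiple of 16.
 * Each pair of adjacent 64-bit loads is a natural candidate for a single
 * LDP instruction on aarch64. */
static bool buffer_is_zero_ldp_sketch(const uint64_t *p, size_t len)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < len / 8; i += 2) {
        acc |= p[i] | p[i + 1];
    }
    return acc == 0;
}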
Alexander
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> util/bufferiszero.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 74 insertions(+)
>
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 4eef6d47bc..2809b09225 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -214,7 +214,81 @@ bool test_buffer_is_zero_next_accel(void)
> }
> return false;
> }
> +
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +#include <arm_neon.h>
> +
> +#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
> +
> +static bool buffer_is_zero_simd(const void *buf, size_t len)
> +{
> + uint32x4_t t0, t1, t2, t3;
> +
> + /* Align head/tail to 16-byte boundaries. */
> + const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
> + const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
> +
> + /* Unaligned loads at head/tail. */
> + t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
> +
> + /* Collect a partial block at tail end. */
> + t1 = e[-7] | e[-6];
> + t2 = e[-5] | e[-4];
> + t3 = e[-3] | e[-2];
> + t0 |= e[-1];
> + REASSOC_BARRIER(t0, t1);
> + REASSOC_BARRIER(t2, t3);
> + t0 |= t1;
> + t2 |= t3;
> + REASSOC_BARRIER(t0, t2);
> + t0 |= t2;
> +
> + /*
> + * Loop over complete 128-byte blocks.
> + * With the head and tail removed, e - p >= 14, so the loop
> + * must iterate at least once.
> + */
> + do {
> + /* Each comparison is [-1,0], so reduction is in [-4..0]. */
> + if (unlikely(vaddvq_u32(vceqzq_u32(t0)) != -4)) {
> + return false;
> + }
> +
> + t0 = p[0] | p[1];
> + t1 = p[2] | p[3];
> + t2 = p[4] | p[5];
> + t3 = p[6] | p[7];
> + REASSOC_BARRIER(t0, t1);
> + REASSOC_BARRIER(t2, t3);
> + t0 |= t1;
> + t2 |= t3;
> + REASSOC_BARRIER(t0, t2);
> + t0 |= t2;
> + p += 8;
> + } while (p < e - 7);
> +
> + return vaddvq_u32(vceqzq_u32(t0)) == -4;
> +}
> +
> +static biz_accel_fn const accel_table[] = {
> + buffer_is_zero_int_ge256,
> + buffer_is_zero_simd,
> +};
> +
> +static unsigned accel_index = 1;
> +#define INIT_ACCEL buffer_is_zero_simd
> +
> +bool test_buffer_is_zero_next_accel(void)
> +{
> + if (accel_index != 0) {
> + buffer_is_zero_accel = accel_table[--accel_index];
> + return true;
> + }
> + return false;
> +}
> +
> #else
> +
> bool test_buffer_is_zero_next_accel(void)
> {
> return false;
>