Re: [PATCH v1 08/11] target/arm: Implement bfloat16 matrix multiply accumulate

qemu-arm

From:	Richard Henderson
Subject:	Re: [PATCH v1 08/11] target/arm: Implement bfloat16 matrix multiply accumulate
Date:	Tue, 18 May 2021 09:45:18 -0500
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 5/18/21 7:37 AM, Peter Maydell wrote:

> On Sat, 17 Apr 2021 at 01:00, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> This is BFMMLA for both AArch64 AdvSIMD and SVE,
>> and VMMLA.BF16 for AArch32 NEON.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>
>> +void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
>> +{
>> +    intptr_t s, opr_sz = simd_oprsz(desc);
>> +    float32 *d = vd, *a = va;
>> +    uint32_t *n = vn, *m = vm;
>> +
>> +    for (s = 0; s < opr_sz / 4; s += 4) {
>> +        float32 sum00, sum01, sum10, sum11;
>> +
>> +        /*
>> +         * Process the entire segment at once, writing back the
>> +         * results only after we've consumed all of the inputs.
>> +         *
>> +         * Key to indicies by column:
>
> "indices"
>
>> +         *               i   j           i   k             j   k
>> +         */
>> +        sum00 = a[s + H4(0 + 0)];
>> +        sum00 = bfdotadd(sum00, n[s + H4(0 + 0)], m[s + H4(0 + 0)]);
>> +        sum00 = bfdotadd(sum00, n[s + H4(0 + 1)], m[s + H4(0 + 1)]);
>
> I can't make these indices match up with the arm arm pseudocode ones,
> which index by "4*i + 2*k + 0" and "4*i + 2*k + 1", not "2*i + k";
> are we hiding a division by 2 somewhere?


Yes.  We're passing BFloat16 pairs via uint32_t[] to bfdotadd().
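
To make the factor of two concrete, here is a minimal sketch of the index
correspondence. It is not part of the patch: the names elem_index,
word_index and pairs_match_elements are hypothetical, and the sketch
ignores both the segment offset s and the H4() host-endian adjustment
used in the helper. It only assumes that one segment of an operand holds
eight 16-bit BFloat16 elements, viewed either as uint16_t[8] or as
uint32_t[4].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Element index as written in the pseudocode: 4*i + 2*k + half. */
static unsigned elem_index(unsigned i, unsigned k, unsigned half)
{
    return 4 * i + 2 * k + half;
}

/* Index of the 32-bit word holding that pair of elements: 2*i + k. */
static unsigned word_index(unsigned i, unsigned k)
{
    return 2 * i + k;
}

/*
 * Fill one segment with distinct 16-bit patterns, view the same bytes
 * as 32-bit words, and check that word 2*i + k carries exactly the
 * elements 4*i + 2*k + 0 and 4*i + 2*k + 1.  Returns 1 on success.
 */
static int pairs_match_elements(void)
{
    uint16_t elems[8];
    uint32_t words[4];

    for (unsigned e = 0; e < 8; e++) {
        elems[e] = (uint16_t)(0x3f80 + e);   /* arbitrary test patterns */
    }
    memcpy(words, elems, sizeof(words));

    for (unsigned i = 0; i < 2; i++) {
        for (unsigned k = 0; k < 2; k++) {
            unsigned w = word_index(i, k);
            uint16_t pair[2];

            /* The hidden division by 2: both halves map to one word. */
            assert(elem_index(i, k, 0) / 2 == w);
            assert(elem_index(i, k, 1) / 2 == w);

            memcpy(pair, &words[w], sizeof(pair));
            if (pair[0] != elems[elem_index(i, k, 0)] ||
                pair[1] != elems[elem_index(i, k, 1)]) {
                return 0;
            }
        }
    }
    return 1;
}
```

Under these assumptions, each bfdotadd() call that takes n[2*i + k] and
m[2*j + k] consumes both halves of the pair at once, which is why the
"+ 0" and "+ 1" terms of the pseudocode do not appear as separate
indices in the helper.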


r~
