Hi,
As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.
Running a simple program like this:
-------
#include <stdlib.h>

#define SZ 10000000

int main ()
{
  int *a = malloc (SZ * sizeof (int));
  int *b = malloc (SZ * sizeof (int));
  int *c = malloc (SZ * sizeof (int));

  for (int i = 0; i < SZ; i++)
    c[i] = a[i] + b[i];

  return c[SZ - 1];
}
-------
Compiling it without RVV support and running it takes ~50 milliseconds:
$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128
./foo-novect.out
real 0m0.043s
user 0m0.025s
sys 0m0.018s
Building the same program with RVV support slows it down 4-5x:
$ time ~/work/qemu/build/qemu-riscv64 -cpu
rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
real 0m0.196s
user 0m0.177s
sys 0m0.018s
Using the lowest 'vlen' value allowed (128) slows things down even further,
taking it to ~0.260s.
'perf record' shows the following profile on the aforementioned binary:
23.27% qemu-riscv64 qemu-riscv64 [.] do_ld4_mmu
21.11% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us
14.05% qemu-riscv64 qemu-riscv64 [.] cpu_ldl_le_data_ra
11.51% qemu-riscv64 qemu-riscv64 [.] cpu_stl_le_data_ra
8.18% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup
8.04% qemu-riscv64 qemu-riscv64 [.] do_st4_mmu
2.04% qemu-riscv64 qemu-riscv64 [.] ste_w
1.15% qemu-riscv64 qemu-riscv64 [.] lde_w
1.02% qemu-riscv64 [unknown] [k] 0xffffffffb3001260
0.90% qemu-riscv64 qemu-riscv64 [.] cpu_get_tb_cpu_state
0.64% qemu-riscv64 qemu-riscv64 [.] tb_lookup
0.64% qemu-riscv64 qemu-riscv64 [.] riscv_cpu_mmu_index
0.39% qemu-riscv64 qemu-riscv64 [.] object_dynamic_cast_assert
First thing that caught my attention is vext_ldst_us from
target/riscv/vector_helper.c:
    /* load bytes from guest memory */
    for (i = env->vstart; i < evl; i++, env->vstart++) {
        k = 0;
        while (k < nf) {
            target_ulong addr = base + ((i * nf + k) << log2_esz);
            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
            k++;
        }
    }
    env->vstart = 0;
Given that this is a unit-stride load that accesses contiguous elements in
memory, it seems this loop could be optimized or removed, since it's
loading/storing one element at a time. I didn't find a TCG op to do that,
though. I assume ARM SVE might have something of the sort. Richard, care to
comment?
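To illustrate the idea in isolation: for the nf == 1 case the addresses are
fully contiguous, so the per-element loop collapses to a single copy. Below
is a minimal standalone sketch (not QEMU code; the function names are
hypothetical, and a real implementation would still have to go through
probe_access()/cpu_ld* so that page faults, watchpoints, and vstart restart
semantics are preserved):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Per-element version, mirroring the vext_ldst_us loop for nf == 1 and
 * 32-bit elements (log2_esz == 2). In QEMU each iteration goes through
 * the full cpu_ldl_le_data_ra() path, which is where the time is spent.
 */
static void load_per_element(uint32_t *vd, const uint8_t *base, int evl)
{
    for (int i = 0; i < evl; i++) {
        uint32_t v;
        memcpy(&v, base + ((size_t)i << 2), sizeof(v));  /* one "guest" load */
        vd[i] = v;
    }
}

/*
 * Bulk version: conceptually, one bounds/permission check for the whole
 * region followed by a single contiguous copy. This is the optimization
 * being suggested for the unit-stride case.
 */
static void load_bulk(uint32_t *vd, const uint8_t *base, int evl)
{
    memcpy(vd, base, (size_t)evl << 2);
}
```

Both produce the same register contents for a little-endian host; the
difference is purely in how many times the (expensive) memory-access path
is entered.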