From: | Marcus Müller |
Subject: | Re: Volk sqrt ARM performance |
Date: | Sun, 8 Oct 2023 19:31:06 +0200 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 |
Hi Jeff,
you'll want to compile with optimization; otherwise you'd be
intentionally making the native `sqrt` slower than it would be in
a real application. Add `-O2` or `-O3` to your compiler
invocation. Also, you're using floats, not doubles, so use
`sqrtf` in your C code, not `sqrt`! (Your code is really C, not
necessarily how you'd write the same program in C++.)
Also, compared to the time the math itself takes, both in the VOLK and in the libm sqrt case, the uncertainty of your time measurement is large: taking the square root of only 16k values is nearly nothing. You need to run that in a loop of many iterations, preferably with some warm-up to get the branch predictors trained (assuming the CPU *has* branch prediction; the ARM1176JZ-S doesn't, as far as I know).
Hey, luckily your VOLK already ships with such a loop-running
benchmark mockup: `volk_profile -R sqrt` will do exactly that. The
`generic` implementation literally just calls `sqrtf`. Could you
share the output of `volk_profile -R sqrt` with us?
Furthermore, I'm **highly** confused by your results: the ARM1176JZ-S is a 32 bit processor, developed somewhere in the early 2000s; so, by modern standards, it's a painfully slow 32 bit armv6 CPU. It predates both aarch64 and NEON! So, I'm pretty sure that either cpu_features is wrong, or this is not the CPU you're using. In this rare case, I think it's you who is wrong and not the software, because you're also using a /usr/local/lib64 library path, which would quite unambiguously point to a 64 bit OS, which couldn't run on an ARM11.
Could you double-check and *confirm* you're using an ARM1176JZ-S
processor? If you are, are you perhaps running this with
qemu-aarch64 on your armv6 (32 bit!) machine? Can you send us the
`volk_sqrt` you're getting, or at least share what `file
volk_sqrt` says about that binary?
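As a complement to `file volk_sqrt`, a tiny program built with the same compiler and flags can report what it was compiled for. This is only a sketch: the function names are made up, and it relies on the predefined macros `__aarch64__`, `__arm__` and `__ARM_NEON` that GCC and Clang provide. It reports the compile target, not what cpu_features detects on the hardware at run time.

```c
#include <stdio.h>

/* Architecture this code was compiled for, according to the compiler's
 * predefined macros. (Hypothetical helper, for illustration only.) */
const char *compiled_arch(void)
{
#if defined(__aarch64__)
    return "aarch64 (64 bit ARM)";
#elif defined(__arm__)
    return "arm (32 bit ARM)";
#elif defined(__x86_64__)
    return "x86_64";
#elif defined(__i386__)
    return "x86 (32 bit)";
#else
    return "unknown";
#endif
}

/* 1 if the compiler targets NEON / Advanced SIMD with the current flags, else 0. */
int compiled_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the build target. */
void print_build_target(void)
{
    printf("arch=%s pointer_bits=%u neon=%d\n",
           compiled_arch(),
           (unsigned)(sizeof(void *) * 8),
           compiled_with_neon());
}
```

Calling `print_build_target()` from `main` in a program compiled exactly like `volk_sqrt` shows whether the toolchain produces 32 bit or 64 bit code; a 64 bit pointer size would rule out an ARM11 as the machine running it natively.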
We then would need to help you file a bug upstream against
cpu_features, because it'd be impossible for us to build a working
VOLK if cpu_features goes and miscategorizes an ancient 32 bit
machine as aarch64.
Best regards,
Marcus
I modified a simple Volk sqrt program for an ARM1176JZ-S processor to test performance, and the results are puzzling. The following program prints:
dur_VolkSqrt=(0.000000)0.001721 dur_CRTLSqrt=(0.000000)0.000318
The following processor information is displayed. It appears as though NEON is supported.
~/volk-3.0.0/build# cpu_features/list_cpu_features
arch : aarch64
implementer : 65 (0x41)
variant : 0 (0x00)
part : 3336 (0xD08)
revision : 3 (0x03)
flags : asimd,cpuid,crc32,fp
Why are the numbers so slow for Volk versus the CRTL? I may be missing something obvious. Thank you in advance.
Here’s the test program:
// g++ -I /usr/local/include/volk volk_sqrt.cpp -o volk_sqrt -L /usr/local/lib64/ -lvolk
// export LD_LIBRARY_PATH=/usr/local/lib64; ./volk_sqrt
#include <stdio.h>
#include <math.h>
#include <volk.h>
#include <limits.h>
#include <time.h>
#include <sys/time.h>
double get_wall_time()
{
    struct timeval time;
    if (gettimeofday(&time,NULL))
    {
        // Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
int main(int argc, char* args[])
{
    double walStop;
    double walStart;
    double dur_VolkSqrt;
    double dur_CRTLSqrt;
    int N = 1024*16;
    unsigned int alignment = volk_get_alignment();
    float* in = (float*)volk_malloc(sizeof(float)*N, alignment);
    float* out = (float*)volk_malloc(sizeof(float)*N, alignment);
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        in[ii] = (float)(ii*ii);
    }
    walStart = get_wall_time();
    volk_32f_sqrt_32f_a(out, in, N);
    //volk_32f_sqrt_32f(out, in, N);
    walStop = get_wall_time();
    dur_VolkSqrt = walStop - walStart;
    walStart = get_wall_time();
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        out[ii] = sqrt(in[ii]);
    }
    walStop = get_wall_time();
    dur_CRTLSqrt = walStop - walStart;
    printf("dur_VolkSqrt=(%f)%f dur_CRTLSqrt=(%f)%f\n", dur_VolkSqrt/N, dur_VolkSqrt, dur_CRTLSqrt/N, dur_CRTLSqrt);
    volk_free(in);
    volk_free(out);
    return 0;
}