Re: Resampling radio data

On Wed, Feb 17, 2021 at 12:02 PM Marcus Müller <mueller@kit.edu> wrote:

Hi Brian,

On 17.02.21 16:55, Brian Padalino wrote:
> On Wed, Feb 17, 2021 at 10:05 AM Marcus Müller <mueller@kit.edu <mailto:mueller@kit.edu>>
> wrote:
>
> Oh, sorry, didn't mean to imply that! FFT interpolation might work well (if you can live
> with the sinc sidelobes).
>
>
> If the bandwidth is already constrained to 20MHz/23MHz, then there would be no sidelobes -
> correct?

What the FFT resampler does is, in the end, interpolation with 23 sincs, each at the
"original" spectral position.
These sincs' sidelobes of course extent left and right beyond the 23 original "carriers"
(thinking of OFDM here)!

Just wanted to clarify - 20 carriers are occupied, but sample rate goes out to 23 carriers. If we assume the extent of the carriers is already at a stopband and sufficiently bandlimited, the sinc sidelobes would never go above the already established stopband - no extra spectral growth really - just extension of the stopband. Right?

The question here is whether that matters: if the bandwidth of interest is -10 to +10 MHz,
what do I care about what happens beyond that? The answer to that quesiton lies in Mark's
application :)

The level of the newly created spectrum should be at the already established stopband or below it. Right?

> I do have a question, though: Why do you go for a 4600-FFT? why not simply a 23 FFT?
>
> Larger n yields greater efficiencies, right? Doing lots of small calls isn't necessarily
> as efficient as doing less large calls, so long as you can handle the latency and the
> processor can stay fed with data. I figure ~4000 samples is a good compromise, but maybe
> larger ones work out better, too.

Well, quick calc, and acting as if O-notation was actually meaningful on hardware (see
David's answer, he's got a point about caches: on my CPU, a 32×32 bit multiplication
instruction (non-SIMD) takes about 1/400 of getting memory from random positions in RAM
(cache misses), on average, IIRC.

If we've got M samples to process, and we're doing N_1 or N_2=100·N_1 length transforms,
the complexity of the first is

a = O( (M/N_1) · N_·log(N_1) ) = O( M · log(N_1) )

of the second, it'd be

O( (M/100/N_1) · 100·N_1 log(100·N_1) ) = O( M · log(100·N_1) )
= O( M · log(N_1) + M log(100) )

which is worse, but:

>
> For reference, here's some performance benchmarks for FFTW (randomly chosen processor):
>
> http://www.fftw.org/speed/Pentium4-3.06GHz/ <http://www.fftw.org/speed/Pentium4-3.06GHz/>
>
> Single precision, non powers of 2, 1d transforms are some good benchmarks to look at for
> something like this. Obviously some numbers are better than others, but the general trend
> seems to skew towards the mid-thousands as being a decent number to target for maximizing
> throughput.

Real-world data, and I fully agree, always trumps O-notation (which simply is a bad model
for any real-world problem, usually).

Luckily, I've had this same problem before, so... See attachment!

You should be able to call it like

fft_rate -l 46

to set the FFT length to 46, and add a -n 10 to make the FFT do 10 threads (which usually
hurts more than it helps). (Need to multiply the reported rate by the FFT length, it
counts transforms, not samples.)

The single-threaded 2²⁴-FFT I do runs at ca. 34.5 MS/s throughput. If I run

fft_rate -l $(( 23*2*100 ))

I get rougly 87.4 MS/s.
If I run at

fft_rate -l $(( 23*2 ))

I get ca 92 MS/s.

I'd call this a clear "meh" result!

Fair enough regarding the smaller FFT size. I didn't think it would work out this way, but it also got me scratching my head. I wouldn't have imagined parallelizing a bunch of FFT's would make things worse. And then I went to the code, and I see this:

https://github.com/gnuradio/gnuradio/blob/fa3267982aa88e3c8f886d803b57f9b1501f43f6/gr-fft/lib/fft.cc#L231

So this call is initializing fftw as an n-thread operation. The call to fftw is still synchronous, so we're asking fftw to parallelize the FFT and ensure it's all coalesced by the time the call returns. I can understand why this would hurt FFT performance.

I then ran the program with 4 or 8 threads as an argument, and looked at top - but only a single processor was being used. Maybe my gnuradio wasn't compiled with multithreading fftw enabled?

For your example program with defaults, I get a rate_avg on a single core of 91 transforms/sec but only a single core is really being used. When I spin up 4 individual instances, each one gets around 55, so an aggregate of 220 transforms/sec.

When I switch to 4600 for the block size and 4 separate instances, I get around 20k transforms/sec for each, or 80k transforms/sec in aggregate. This translates to 368Msamples/sec - right? Note my processor is an Intel i7-7700k. Also note that changing the n input argument doesn't help me much since it doesn't seem that fftw is really parallelizing anything.

I would imagine setting n would yield the same results as running n instances, but since it doesn't, should the implementation be modified to not rely on fftw to perform the multithreading, but instead ask gnuradio to handle the multithreading via a threadpool?

Marcus, can you see if running n instances versus n threads yields the same performance results on your system?

Brian

From:	Ron Economos
Subject:	Re: Resampling radio data
Date:	Wed, 17 Feb 2021 14:20:54 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0