Hi Brian,
On 17.02.21 16:55, Brian Padalino wrote:
> On Wed, Feb 17, 2021 at 10:05 AM Marcus Müller <mueller@kit.edu <mailto:mueller@kit.edu>>
> wrote:
>
> Oh, sorry, didn't mean to imply that! FFT
interpolation might work well (if you can live
> with the sinc sidelobes).
>
>
> If the bandwidth is already constrained to 20MHz/23MHz,
then there would be no sidelobes -
> correct?
What the FFT resampler does is, in the end, interpolation
with 23 sincs, each at the
"original" spectral position.
These sincs' sidelobes of course extent left and right
beyond the 23 original "carriers"
(thinking of OFDM here)!
Just wanted to clarify - 20 carriers are occupied, but
sample rate goes out to 23 carriers. If we assume the extent
of the carriers is already at a stopband and sufficiently
bandlimited, the sinc sidelobes would never go above the
already established stopband - no extra spectral growth
really - just extension of the stopband. Right?
The question here is whether that matters: if the bandwidth
of interest is -10 to +10 MHz,
what do I care about what happens beyond that? The answer to
that quesiton lies in Mark's
application :)
The level of the newly created spectrum should be at the
already established stopband or below it. Right?
> I do have a question, though: Why do you go for a
4600-FFT? why not simply a 23 FFT?
>
> Larger n yields greater efficiencies, right? Doing
lots of small calls isn't necessarily
> as efficient as doing less large calls, so long as you
can handle the latency and the
> processor can stay fed with data. I figure ~4000
samples is a good compromise, but maybe
> larger ones work out better, too.
Well, quick calc, and acting as if O-notation was actually
meaningful on hardware (see
David's answer, he's got a point about caches: on my CPU, a
32×32 bit multiplication
instruction (non-SIMD) takes about 1/400 of getting memory
from random positions in RAM
(cache misses), on average, IIRC.
If we've got M samples to process, and we're doing N_1 or
N_2=100·N_1 length transforms,
the complexity of the first is
a = O( (M/N_1) · N_·log(N_1) ) = O( M · log(N_1) )
of the second, it'd be
O( (M/100/N_1) · 100·N_1 log(100·N_1) ) = O( M ·
log(100·N_1) )
= O( M · log(N_1) + M log(100) )
which is worse, but:
>
> For reference, here's some performance benchmarks for
FFTW (randomly chosen processor):
>
> http://www.fftw.org/speed/Pentium4-3.06GHz/
<http://www.fftw.org/speed/Pentium4-3.06GHz/>
>
> Single precision, non powers of 2, 1d transforms are
some good benchmarks to look at for
> something like this. Obviously some numbers are better
than others, but the general trend
> seems to skew towards the mid-thousands as being a
decent number to target for maximizing
> throughput.
Real-world data, and I fully agree, always trumps O-notation
(which simply is a bad model
for any real-world problem, usually).
Luckily, I've had this same problem before, so... See
attachment!
You should be able to call it like
fft_rate -l 46
to set the FFT length to 46, and add a -n 10 to make the FFT
do 10 threads (which usually
hurts more than it helps). (Need to multiply the reported
rate by the FFT length, it
counts transforms, not samples.)
The single-threaded 2²⁴-FFT I do runs at ca. 34.5 MS/s
throughput. If I run
fft_rate -l $(( 23*2*100 ))
I get rougly 87.4 MS/s.
If I run at
fft_rate -l $(( 23*2 ))
I get ca 92 MS/s.
I'd call this a clear "meh" result!
Fair enough regarding the smaller FFT size. I didn't
think it would work out this way, but it also got me
scratching my head. I wouldn't have imagined parallelizing
a bunch of FFT's would make things worse. And then I went
to the code, and I see this:
So this call is initializing fftw as an n-thread
operation. The call to fftw is still synchronous, so we're
asking fftw to parallelize the FFT and ensure it's all
coalesced by the time the call returns. I can understand
why this would hurt FFT performance.
I then ran the program with 4 or 8 threads as an
argument, and looked at top - but only a single processor
was being used. Maybe my gnuradio wasn't compiled with
multithreading fftw enabled?
For your example program with defaults, I get a rate_avg
on a single core of 91 transforms/sec but only a single core
is really being used. When I spin up 4 individual
instances, each one gets around 55, so an aggregate of 220
transforms/sec.
When I switch to 4600 for the block size and 4 separate
instances, I get around 20k transforms/sec for each, or 80k
transforms/sec in aggregate. This translates to
368Msamples/sec - right? Note my processor is an Intel
i7-7700k. Also note that changing the n input argument
doesn't help me much since it doesn't seem that fftw is
really parallelizing anything.
I would imagine setting n would yield the same results as
running n instances, but since it doesn't, should the
implementation be modified to not rely on fftw to perform
the multithreading, but instead ask gnuradio to handle the
multithreading via a threadpool?
Marcus, can you see if running n instances versus n
threads yields the same performance results on your system?
Brian