Page 1 of 1

Very Fast Fourrier Transform

PostPosted: Mon Aug 03, 2015 8:37 pm
by tnt

Last month I spent some time writing some FFT code for the epiphany.

You can find the resulting code here : ... _fft_asm.S

This is a classic Decimation-In-Time implementation. The output is out-of-order and needs to be re-ordered (but for some applications, including mine, this can be merged with the next step without overhead). I might take a look if a Decimation-In-Frequency implementation works better/worse on the epiphany ... at the moment I just took the one in the lena example as reference and re-implemented it in assembly.

Every loop has been hand assembled to reach maximum speed. Nearly every cycle is dual-issued, doing a floating point ops at the same time as a integer/store/load ops. It also uses the HW loop feature for zero overhead inner loops. Each instruction has been scheduled to minimize stalls. The loop has also been "pipelined", so at iteration i, the loop does the store or result i-1, the computation of i, and the pre-load of data of i+1, all those interlaced to maximize dual-issuing. You also have to "load" and "flush" the loop pipeline. And also the first and last stage of the FFT decomposition are degenerate and handled in dedicated functions.

This can make the code pretty hard to follow :p

And finally here are the numbers :

Code: Select all
fftw         : 221.977509 Mflops
fftw NEON    : 409.619659 Mflops
epiphany C   : 170.642029 Mflops
epiphany ASM : 668.507629 Mflops

All of theses are for a single core (so one ARM core or one epiphany core), and they dont take memory transfer into account (so for ARM it's working out of cache and epiphany out of local mem).

I thought this was pretty much as good as it was going to get, but turns out there is a commercial lib of optimized primitives for the epiphany that has a FFT that's about 10% faster than this, using Radix-4 instead of Radix-2. I might at some point see if I can do a Radix-4 version, but for the time being it's good enough for me.



Re: Very Fast Fourrier Transform

PostPosted: Mon Aug 03, 2015 9:13 pm
by aolofsson
So your routine on a simple scalar Epiphany core at 600MHz runs 39% faster than FFTW running on a 4-way A9 core at 667MHz?
I'd say that's pretty darn impressive! :D


Re: Very Fast Fourrier Transform

PostPosted: Tue Aug 04, 2015 7:43 am
by tnt
Yeah, I'm pretty happy about it.

The big advantage of the epiphany in this case are:
- Large register file : except for fft data load / store, there is no memory access for temporary results. Despite having loop pipelining and processing 4 data per loop iteration (2 radix-2 ops in //), I only ever use registers, and even only the "caller saver" registers so I don't even need to save/restore them.
- BITR opcode : infinitely useful for this :p
- Easy to predict low level behavior: Because I can understand exactly how the CPU will execute stuff, I can tailor the operations manually much better. Optimizing for ARM (or even worse Intel) has so many rules to follow that I can't keep them all in my head ...

Next step will probably be to extend this for higher point FFTs using multiple cores. (The current one is local mem only, so you can do at most 2048 points, but more realistically 1024 when using double-buffering)

Re: Very Fast Fourrier Transform

PostPosted: Tue Aug 04, 2015 12:26 pm
by aolofsson
That's great to hear! Look forward to your inputs in the following topic.