Last month I spent some time writing some FFT code for the epiphany.

You can find the resulting code here : https://github.com/smunaut/parallella-e ... _fft_asm.S

This is a classic Decimation-In-Time implementation. The output is out-of-order and needs to be re-ordered (but for some applications, including mine, this can be merged with the next step without overhead). I might take a look if a Decimation-In-Frequency implementation works better/worse on the epiphany ... at the moment I just took the one in the lena example as reference and re-implemented it in assembly.

Every loop has been hand assembled to reach maximum speed. Nearly every cycle is dual-issued, doing a floating point ops at the same time as a integer/store/load ops. It also uses the HW loop feature for zero overhead inner loops. Each instruction has been scheduled to minimize stalls. The loop has also been "pipelined", so at iteration i, the loop does the store or result i-1, the computation of i, and the pre-load of data of i+1, all those interlaced to maximize dual-issuing. You also have to "load" and "flush" the loop pipeline. And also the first and last stage of the FFT decomposition are degenerate and handled in dedicated functions.

This can make the code pretty hard to follow :p

And finally here are the numbers :

- Code: Select all
`fftw : 221.977509 Mflops`

fftw NEON : 409.619659 Mflops

epiphany C : 170.642029 Mflops

epiphany ASM : 668.507629 Mflops

All of theses are for a single core (so one ARM core or one epiphany core), and they dont take memory transfer into account (so for ARM it's working out of cache and epiphany out of local mem).

I thought this was pretty much as good as it was going to get, but turns out there is a commercial lib of optimized primitives for the epiphany that has a FFT that's about 10% faster than this, using Radix-4 instead of Radix-2. I might at some point see if I can do a Radix-4 version, but for the time being it's good enough for me.

Cheers,

Sylvain