I just uploaded version 0.3 of my ezesdk, which includes a 'large' FFT sample (i.e. one that can't fit on-chip). After building the SDK, go to samples/fft and run make to build it. Run it with sudo ./fft -r rows -c cols (don't specify more than 4 for either). Because it hardcodes the platform file, it only works on a parallella-16. I'm not using the Parallella SDK because it's just ... well, it's just not much fun to work with, but that means I don't know if this stuff works anywhere but my (rev0) system.
The sample implements a forward complex FFT for 2^20 elements by breaking it into 1024-element steps (2^20 = 1024 x 1024). Each step fits into LDS, so the whole FFT can be calculated using two passes through external memory. It needs to operate out-of-place.
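To make the two-pass idea concrete, here's a minimal host-side sketch in Python of the usual N = n1 x n2 decomposition. It is not the code from the sample: the function names, the buffer layout and the naive dft() standing in for the on-core 1024-point FFT are all assumptions for illustration.

```python
import cmath

def dft(x):
    """Naive O(n^2) forward DFT; a stand-in for the FFT done in local memory."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def two_pass_fft(x, n1, n2):
    """Forward DFT of len(x) == n1 * n2 using two sweeps over 'external' memory."""
    n = n1 * n2
    assert len(x) == n
    tmp = [0j] * n   # intermediate buffer (hypothetical layout)
    out = [0j] * n   # separate output buffer, i.e. out-of-place

    # Pass 1: n2 transforms of length n1 over strided input, then a twiddle multiply.
    for c in range(n2):
        col = dft(x[c::n2])              # elements x[n2*r + c], r = 0..n1-1
        for k1 in range(n1):
            w = cmath.exp(-2j * cmath.pi * c * k1 / n)   # twiddle W_N^(c*k1)
            tmp[k1 * n2 + c] = col[k1] * w

    # Pass 2: n1 transforms of length n2 over contiguous rows, written with stride n1.
    for k1 in range(n1):
        row = dft(tmp[k1 * n2:(k1 + 1) * n2])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

With n1 = n2 = 1024 this is the 2^20 case: every element is read once and written once in each pass, and each sub-transform only ever touches 1024 elements at a time. For a quick check, two_pass_fft(x, 4, 4) on a 16-element list should agree with dft(x) to within rounding.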
Even with synchronous DMA transfers, twiddle factors calculated on-core, and a not particularly sophisticated radix-4 inner loop, memory bandwidth is the limiting factor once more than 4 cores are active (and it's already limiting scaling beyond 2).
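For anyone who hasn't looked at one, this is roughly what the core of a radix-4 step looks like. It's a generic textbook butterfly written in Python as a sketch, not the inner loop from the sample; the function names are made up, and the twiddle multiplies that surround it in a full FFT are left out.

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """Forward 4-point DFT of (a, b, c, d) using adds and one multiply by -j."""
    t0 = a + c
    t1 = a - c
    t2 = b + d
    t3 = (b - d) * -1j   # times -j is just (re, im) -> (im, -re), no real multiplies
    return t0 + t2, t1 + t3, t0 - t2, t1 - t3

def dft4_reference(v):
    """Direct evaluation of the 4-point DFT definition, for checking only."""
    return [sum(v[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]
```

The point of radix-4 over radix-2 is that the butterfly itself needs no real multiplications, so there are fewer twiddle multiplies per output for the same transform size.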
http://a-hackers-craic.blogspot.com.au/ ... inary.html
http://www.users.on.net/notzed/software/ezesdk.html
Given that memory is the bottleneck, the only thing I can think of to get more performance is to serialise (and group where possible) the DMA writes to get a bit more bandwidth. Unless I'm missing something and there's a way to get away with a single main memory pass.
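As a rough feel for what a single pass would buy, here's the external memory traffic worked out in Python. It assumes 8 bytes per element (single-precision complex), which is my assumption rather than something stated above, and it only counts the data itself, ignoring any twiddle or code traffic.

```python
N = 1 << 20              # 2^20 elements
BYTES_PER_ELEMENT = 8    # assumption: two 32-bit floats per complex value

def traffic_bytes(passes, n=N, elem=BYTES_PER_ELEMENT):
    """Each pass reads every element once and writes every element once."""
    return passes * 2 * n * elem

two_pass_mib = traffic_bytes(2) / 2**20   # 32.0 MiB moved for the two-pass scheme
one_pass_mib = traffic_bytes(1) / 2**20   # 16.0 MiB for a hypothetical single pass
```

So under those assumptions a single pass could at best halve the traffic; anything beyond that has to come from getting more out of each transfer.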
Knew I should've left it till tomorrow. The FFT stuff in 0.3 is broken because I checked in some experiments; 0.3.1 coming soon.