Parallella Community

by **notzed** » Sat Jul 05, 2014 8:44 am

I just uploaded 0.3 of my ezesdk, which includes a 'large' FFT sample (i.e. one that can't fit on-chip). After building the sdk, go to samples/fft and run make to build it. Run it using sudo ./fft -r rows -c cols (make sure you don't specify more than 4 for either). Because it hardcodes the platform file, it only works on a parallella-16. I'm not using the parallella sdk because it's just ... well it's just not much fun to work with, but it means I don't know if this stuff works anywhere but my (rev0) system.

The sample implements a forward complex FFT of 2^20 elements by breaking it into sub-transforms of 1024 elements (2^20 = 1024 x 1024), each of which fits into LDS. This allows the FFT to be calculated using two passes through external memory. It needs to operate out-of-place.
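For readers unfamiliar with the two-pass idea, here is a minimal sketch of the standard Cooley-Tukey split N = n1 * n2. It is not taken from ezesdk: the function names `dft` and `two_pass_fft`, the buffer layout, and the naive O(n^2) kernel standing in for the on-chip 1024-point FFT are all assumptions made for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT; a stand-in for the small on-chip kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def two_pass_fft(x, n1, n2):
    """Forward DFT of length n1*n2 using two sweeps over the full array."""
    n = n1 * n2
    assert len(x) == n

    # Pass 1: n2 strided sub-transforms of length n1, each followed by a
    # multiplication with the twiddle factor exp(-2*pi*i * col * k1 / n).
    tmp = [0j] * n
    for col in range(n2):
        sub = dft(x[col::n2])
        for k1 in range(n1):
            w = cmath.exp(-2j * cmath.pi * col * k1 / n)
            tmp[k1 * n2 + col] = sub[k1] * w

    # Pass 2: n1 contiguous sub-transforms of length n2, written back
    # with stride n1 so the result ends up in natural order.
    out = [0j] * n
    for k1 in range(n1):
        sub = dft(tmp[k1 * n2:(k1 + 1) * n2])
        for k2 in range(n2):
            out[k1 + n1 * k2] = sub[k2]
    return out
```

With n1 = n2 = 1024 this gives the 2^20 case: every sub-transform touches only 1024 elements, and the full array is read and written once per pass. The sketch uses a separate buffer for each stage purely for clarity.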

Even with synchronous dma transfers, calculating the twiddle factors on-core, and a not particularly sophisticated radix-4 inner loop, memory bandwidth is the limiting factor beyond 4 cores being active (it's already limiting beyond 2).

http://a-hackers-craic.blogspot.com.au/ ... inary.html

http://www.users.on.net/notzed/software/ezesdk.html

Given memory is the bottleneck, the only thing I can think of to get more performance is to serialise (and group where possible) the DMA writes to get a bit more bandwidth. Unless I'm missing something and there's a way to get away with a single main memory pass.

Knew I should've left it till tomorrow. The FFT stuff in 0.3 is broken as I checked in some experiments; 0.3.1 coming soon.

by **Bikeman** » Sat Jul 05, 2014 9:57 am

Very interesting, thanks for sharing.

You might have read that I have a little bet running that on an FFT of length 3*2^22 (which is used for a specific application of the Einstein@Home project, hence the strange length), the Raspi GPU would outperform the Parallella E16. The Raspi GPU will also be memory bound, but I think its connection to DRAM is a bit faster.

I wonder how you compute the complex roots of unity. In a similar context I have found that if you need sin(2pi*k/N) and cos(2pi*k/N) for k = 0..N-1, it's quite fast to just tabulate the sine and cosine values for the angles alpha_j = 2pi*M*j/N and beta_j = 2pi*j/N, j = 0..M-1, M = sqrt(N), in two tables x 2 (cos & sin) of length M each (so total length 4*sqrt(N) = 4096 here), and then compute the twiddle factors on the fly from those via the trig addition theorem. Is that what you are doing?
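The table scheme described above can be sketched as follows. This is only an illustration under stated assumptions: N is taken to be a perfect square (true for 2^20, not for 3*2^22), and the names `build_tables` and `cos_sin` are hypothetical, not from any SDK.

```python
import math


def build_tables(n):
    """Build coarse (alpha) and fine (beta) cos/sin tables of length M = sqrt(n)."""
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("this sketch assumes n is a perfect square")
    # Coarse angles alpha_j = 2*pi*M*j/N and fine angles beta_j = 2*pi*j/N.
    cos_a = [math.cos(2 * math.pi * m * j / n) for j in range(m)]
    sin_a = [math.sin(2 * math.pi * m * j / n) for j in range(m)]
    cos_b = [math.cos(2 * math.pi * j / n) for j in range(m)]
    sin_b = [math.sin(2 * math.pi * j / n) for j in range(m)]
    return m, cos_a, sin_a, cos_b, sin_b


def cos_sin(k, tables):
    """Return (cos(2*pi*k/N), sin(2*pi*k/N)) from the tables, 0 <= k < N."""
    m, cos_a, sin_a, cos_b, sin_b = tables
    a, b = divmod(k, m)  # k = M*a + b, so the angle is alpha_a + beta_b
    # Angle addition theorem for cos(alpha + beta) and sin(alpha + beta).
    c = cos_a[a] * cos_b[b] - sin_a[a] * sin_b[b]
    s = sin_a[a] * cos_b[b] + cos_a[a] * sin_b[b]
    return c, s
```

Each twiddle factor then costs four table lookups, four multiplications and two additions, instead of a sin/cos evaluation. For a forward transform that uses exp(-2*pi*i*k/N), the sine would simply be negated.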

Cheers
HB

by **notzed** » Sun Jul 06, 2014 1:58 pm

by **Bikeman** » Sun Jul 06, 2014 2:48 pm

by **notzed** » Mon Jul 07, 2014 9:01 am

by **timpart** » Mon Jul 07, 2014 11:59 am

by **notzed** » Mon Jul 07, 2014 2:53 pm

by **notzed** » Thu Jul 10, 2014 8:41 am

by **timpart** » Thu Jul 10, 2014 11:46 am

Thanks for the extra references. I hope I haven't distracted you too much from your original goal.

Tim

by **notzed** » Thu Jul 10, 2014 1:30 pm

Fortunately I have no goals here other than the journey itself and often the side-quests like this are as interesting as the main game. Set end-goals are for work!

I don't have any use for a large fft to start with.

FFT sample for 2^20 elements