FFT sample for 2^20 elements

Postby notzed » Sat Jul 05, 2014 8:44 am

I just uploaded 0.3 of my ezesdk, which includes a 'large' FFT sample (i.e. something that can't fit on-chip). After building the sdk, go to samples/fft and run make to build it. Run it using sudo ./fft -r rows -c cols (make sure you don't specify more than 4 for either); as it hardcodes the platform file it only works on a parallella-16. I'm not using the parallella sdk because it's just ... well, it's just not much fun to work with, but that means I don't know if this stuff works anywhere but my (rev0) system.

The sample implements a forward complex fft for 2^20 elements by breaking it into steps of 1024, which fit into LDS and allow the fft to be calculated using two passes through external memory. It needs to operate out-of-place.
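
For reference, the general shape of that decomposition is the standard 'four-step' split. A plain C sketch (names invented here; a naive recursive radix-2 stand-in for the on-core radix-4 sub-FFT; not the actual ezesdk kernels):

Code:
#include <complex.h>
#include <stdlib.h>

#define PI 3.14159265358979323846f

/* naive out-of-place radix-2 DIT FFT, n a power of two */
static void fft(float complex *x, int n)
{
    if (n == 1)
        return;
    int h = n / 2;
    float complex *t = malloc(n * sizeof *t);
    for (int i = 0; i < h; i++) {
        t[i] = x[2 * i];              /* even samples */
        t[h + i] = x[2 * i + 1];      /* odd samples */
    }
    fft(t, h);                        /* transform halves */
    fft(t + h, h);
    for (int k = 0; k < h; k++) {     /* combine with twiddles */
        float complex w = cexpf(-2.0f * PI * I * k / n);
        x[k] = t[k] + w * t[h + k];
        x[k + h] = t[k] - w * t[h + k];
    }
    free(t);
}

/* N = n1*n2 point FFT as two passes over external memory: each pass
   does length-1024 FFTs out of a small line buffer (the part that
   fits in LDS), with the inter-pass twiddle folded into pass 1. */
static void fft_large(float complex *out, const float complex *in,
                      float complex *tmp, int n1, int n2)
{
    int n = n1 * n2;
    float complex line[1024];                  /* assumes n1, n2 <= 1024 */

    for (int a = 0; a < n1; a++) {             /* pass 1: stride-n1 columns */
        for (int b = 0; b < n2; b++)
            line[b] = in[a + n1 * b];
        fft(line, n2);
        for (int k2 = 0; k2 < n2; k2++)        /* twiddle exp(-2.pi.i.a.k2/N) */
            tmp[a + n1 * k2] = line[k2] * cexpf(-2.0f * PI * I * a * k2 / n);
    }
    for (int k2 = 0; k2 < n2; k2++) {          /* pass 2 */
        for (int a = 0; a < n1; a++)
            line[a] = tmp[a + n1 * k2];
        fft(line, n1);
        for (int k1 = 0; k1 < n1; k1++)
            out[k2 + n2 * k1] = line[k1];      /* natural-order output */
    }
}

With n1 = n2 = 1024 each pass streams the full 2^20 elements through main memory once, which is why two passes (and a scratch buffer the size of the data) is the floor for this approach.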

Even with synchronous dma transfers, calculating the twiddle factors on-core, and a not particularly sophisticated radix-4 inner loop, memory bandwidth is the limiting factor once more than 4 cores are active (it's already limiting beyond 2).

http://a-hackers-craic.blogspot.com.au/ ... inary.html

http://www.users.on.net/notzed/software/ezesdk.html

Given memory is the bottleneck, the only thing I can think of to get more performance is to serialise (and group where possible) the dma writes to get a bit more bandwidth. Unless I'm missing something and there's a way to get away with a single main memory pass.

Knew I should've left it till tomorrow: the fft stuff in 0.3 is broken, as I checked in some experiments. 0.3.1 coming soon.

Re: FFT sample for 2^20 elements

Postby Bikeman » Sat Jul 05, 2014 9:57 am

Very interesting, thanks for sharing.

You might have read that I have a little bet running that on an FFT of length 3*2^22 (which is used for a specific application of the Einstein@Home project, hence the odd length), the Raspi GPU would outperform the Parallella E16. The Raspi GPU will also be memory bound, but I think its connection to DRAM is a bit faster.

I wonder how you compute the complex roots of unity. In a similar context I have found that if you need sin(2*pi*k/N) and cos(2*pi*k/N) for k = 0..N-1, it's quite fast to just tabulate the sine and cosine values for the angles alpha_j = 2*pi*M*j/N and beta_j = 2*pi*j/N, j = 0..M-1, M = sqrt(N), in two tables x 2 (cos & sin) of length M each (so total length 4*sqrt(N) = 4096 here), and then compute the twiddle factors on the fly from those via the trig addition theorem. Is that what you are doing?
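
In C, that two-table recipe looks something like the following sketch (my reading of it; the names are invented here and it's untested on the Epiphany):

Code:
#include <math.h>

#define N (1 << 20)
#define M 1024                 /* M = sqrt(N) */
#define PI 3.14159265358979323846f

/* two tables x 2 (cos & sin): coarse angles 2*pi*M*j/N, fine angles 2*pi*j/N */
static float coarse_s[M], coarse_c[M], fine_s[M], fine_c[M];

static void init_tables(void)
{
    for (int j = 0; j < M; j++) {
        coarse_s[j] = sinf(2.0f * PI * j / M);   /* = sin(2*pi*M*j/N) */
        coarse_c[j] = cosf(2.0f * PI * j / M);
        fine_s[j] = sinf(2.0f * PI * j / N);
        fine_c[j] = cosf(2.0f * PI * j / N);
    }
}

/* sin and cos of 2*pi*k/N via the addition theorem, k = q*M + r */
static void twiddle(int k, float *s, float *c)
{
    int q = k / M, r = k % M;   /* a shift and a mask, M being a power of 2 */
    *s = coarse_s[q] * fine_c[r] + coarse_c[q] * fine_s[r];
    *c = coarse_c[q] * fine_c[r] - coarse_s[q] * fine_s[r];
}

Four tables of 1024 floats is 16K total, which is the footprint notzed runs into below.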

Cheers
HB

Re: FFT sample for 2^20 elements

Postby notzed » Sun Jul 06, 2014 1:58 pm

Bikeman wrote: Very interesting, thanks for sharing.

You might have read that I have a little bet running that on an FFT of length 3*2^22 (which is used for a specific application of the Einstein@Home project, hence the odd length), the Raspi GPU would outperform the Parallella E16. The Raspi GPU will also be memory bound, but I think its connection to DRAM is a bit faster.


I did, yes; I have had it in the back of my mind but haven't looked at it specifically. I'm only looking at powers of 2, though.

Bikeman wrote: I wonder how you compute the complex roots of unity. In a similar context I have found that if you need sin(2*pi*k/N) and cos(2*pi*k/N) for k = 0..N-1, it's quite fast to just tabulate the sine and cosine values for the angles alpha_j = 2*pi*M*j/N and beta_j = 2*pi*j/N, j = 0..M-1, M = sqrt(N), in two tables x 2 (cos & sin) of length M each (so total length 4*sqrt(N) = 4096 here), and then compute the twiddle factors on the fly from those via the trig addition theorem. Is that what you are doing?

Cheers
HB


A 4096-element table of floats == 16K. That's too big to fit on each core and still leave room for data, and if it's not on-core it's only going to compound the memory bottleneck :( Thanks for the hint though; I could use it to reduce the number of calculations needed on-chip a bit, not that they're a limiting factor so far.

Right now I'm directly calculating just enough of e^(I*w) using a hand-rolled sin+cos function for each size I need for a given loop. With a C implementation it's about 45% of the processing time.

Re: FFT sample for 2^20 elements

Postby Bikeman » Sun Jul 06, 2014 2:48 pm

notzed wrote: Right now I'm directly calculating just enough of e^(I*w) using a hand-rolled sin+cos function for each size I need for a given loop. With a C implementation it's about 45% of the processing time.


Sounds substantial (even though it's not the limiting factor, as you said).

Just for the sake of completeness, the recipe I mentioned can also work with smaller tables at the expense of more ops.

Let's say we can spare roughly 4 KB per core, so 1024 floats, so 512 values for sine and cosine each.

Then you need to use three tables: (I use "sincos" as a function that returns a pair of values, the sine and cosine of its argument)

N=2^20

A[j] = sincos(2*pi*j/N) for j = 0..127
B[j] = sincos(2*pi*j*128/N) for j = 0..127
C[j] = sincos(2*pi*j*128*128/N) for j = 0..63

(so actually we require only 320 x 2 x 4 = 2560 bytes)

so for some q = j + 128*k + 128*128*l,

sin(2*pi*q/N) = sin(2*pi*(j + 128*k + 128*128*l)/N)
= sin(2*pi*j/N) * cos(2*pi*(128*k + 128*128*l)/N) + cos(2*pi*j/N) * sin(2*pi*(128*k + 128*128*l)/N)
= A[j][0] * cos(2*pi*(128*k + 128*128*l)/N) + A[j][1] * sin(2*pi*(128*k + 128*128*l)/N)
= A[j][0] * (B[k][1]*C[l][1] - B[k][0]*C[l][0]) + A[j][1] * (B[k][0]*C[l][1] + B[k][1]*C[l][0])

and cos(2*pi*q/N) = cos(2*pi*(j + 128*k + 128*128*l)/N) is left as an exercise to the reader 8-)

Not sure if that would be faster than your hand-rolled sin/cos, probably not. Needless to say, breaking up q = j + 128*k + 128*128*l into j, k and l is "just" extracting bits from the binary representation of q; I'm not sure if this (shifting, logical-AND) is fast on the Epiphany cores?
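
A sketch of the three-table version in C (again just my reading, untested; with the bit extraction and the cosine 'exercise' included):

Code:
#include <math.h>

#define N (1 << 20)
#define PI 3.14159265358979323846f

/* A covers the low 7 bits of q, B the middle 7, C the top 6;
   [0] = sin, [1] = cos, as in the tables above */
static float A[128][2], B[128][2], C[64][2];

static void init_tables(void)
{
    for (int j = 0; j < 128; j++) {
        A[j][0] = sinf(2.0f * PI * j / N);
        A[j][1] = cosf(2.0f * PI * j / N);
        B[j][0] = sinf(2.0f * PI * j * 128 / N);
        B[j][1] = cosf(2.0f * PI * j * 128 / N);
    }
    for (int l = 0; l < 64; l++) {
        C[l][0] = sinf(2.0f * PI * l * 128 * 128 / N);
        C[l][1] = cosf(2.0f * PI * l * 128 * 128 / N);
    }
}

/* sin and cos of 2*pi*q/N for q = j + 128*k + 128*128*l */
static void twiddle(unsigned q, float *s, float *c)
{
    unsigned j = q & 127, k = (q >> 7) & 127, l = q >> 14;
    /* combine the middle and high parts first... */
    float sb = B[k][0] * C[l][1] + B[k][1] * C[l][0];
    float cb = B[k][1] * C[l][1] - B[k][0] * C[l][0];
    /* ...then fold in the low part */
    *s = A[j][0] * cb + A[j][1] * sb;
    *c = A[j][1] * cb - A[j][0] * sb;
}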

Cheers
HB

Re: FFT sample for 2^20 elements

Postby notzed » Mon Jul 07, 2014 9:01 am

Bikeman wrote:
notzed wrote: Right now I'm directly calculating just enough of e^(I*w) using a hand-rolled sin+cos function for each size I need for a given loop. With a C implementation it's about 45% of the processing time.

Sounds substantial (even though it's not the limiting factor, as you said).


Yeah, I thought so too, but it's better than I expected. I just put it in because I had something small that worked, and was going to worry about performance later. The fft code itself isn't very fast either (just a basic radix-4 Cooley-Tukey DIT implementation), but because of the speed of the local memory it runs ok.
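
(For the curious, the inner operation there is just the standard radix-4 DIT butterfly; roughly, with the stage twiddles w1..w3 applied on the way in:)

Code:
#include <complex.h>

/* one forward radix-4 DIT butterfly */
static inline void bfly4(float complex x[4],
                         float complex w1, float complex w2, float complex w3)
{
    float complex a = x[0], b = x[1] * w1, c = x[2] * w2, d = x[3] * w3;
    float complex t0 = a + c, t1 = a - c;
    float complex t2 = b + d, t3 = (b - d) * -I;  /* -I = rotate by -90 degrees */
    x[0] = t0 + t2;
    x[1] = t1 + t3;
    x[2] = t0 - t2;
    x[3] = t1 - t3;
}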

Bikeman wrote: Just for the sake of completeness, the recipe I mentioned can also work with smaller tables at the expense of more ops.

sin(2*pi*q/N) = sin(2*pi*(j + 128*k + 128*128*l)/N)
= A[j][0] * (B[k][1]*C[l][1] - B[k][0]*C[l][0]) + A[j][1] * (B[k][0]*C[l][1] + B[k][1]*C[l][0])

and cos(2*pi*q/N) = cos(2*pi*(j + 128*k + 128*128*l)/N) is left as an exercise to the reader 8-)

Not sure if that would be faster than your hand-rolled sin/cos, probably not. Needless to say, breaking up q = j + 128*k + 128*128*l into j, k and l is "just" extracting bits from the binary representation of q; I'm not sure if this (shifting, logical-AND) is fast on the Epiphany cores?


I would guess that it would end up slower (and less accurate): each sincos takes 12 floating point instructions (a mix of fmul + fmadd) on lookup tables with complex addressing, vs 14 floating point instructions on constants which can all be loaded into registers. I'm using a 6-term Maclaurin series expansion; 5 would probably do, and be fewer ops.
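
That kind of evaluation looks roughly like this (a generic Horner-form sketch with the textbook Maclaurin coefficients, not the actual ezesdk routine):

Code:
/* 6-term Maclaurin series for sin and cos; all coefficients are
   constants that can sit in registers, so it's straight fmul/fmadd */
static void sincosf_series(float x, float *s, float *c)
{
    float x2 = x * x;
    *s = x * (1.0f + x2 * (-1.0f / 6 + x2 * (1.0f / 120
            + x2 * (-1.0f / 5040 + x2 * (1.0f / 362880
            + x2 * (-1.0f / 39916800))))));
    *c = 1.0f + x2 * (-0.5f + x2 * (1.0f / 24 + x2 * (-1.0f / 720
            + x2 * (1.0f / 40320 + x2 * (-1.0f / 3628800)))));
}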

All integer instructions (outside of the ones that go through the fpu) are single-cycle, with no register access penalties for following instructions. And since they can dual-issue with flops, the addressing calculations can usually be hidden, although with the simple instruction set they can add up quickly. The compiler isn't very good at optimising array addressing either.

Re: FFT sample for 2^20 elements

Postby timpart » Mon Jul 07, 2014 11:59 am

notzed wrote: I would guess that it would end up slower (and less accurate): each sincos takes 12 floating point instructions (a mix of fmul + fmadd) on lookup tables with complex addressing, vs 14 floating point instructions on constants which can all be loaded into registers. I'm using a 6-term Maclaurin series expansion; 5 would probably do, and be fewer ops.


For an alternative approach to evaluating functions using fewer terms, lolremez is interesting.

Tim

Re: FFT sample for 2^20 elements

Postby notzed » Mon Jul 07, 2014 2:53 pm

timpart wrote:
notzed wrote: I would guess that it would end up slower (and less accurate): each sincos takes 12 floating point instructions (a mix of fmul + fmadd) on lookup tables with complex addressing, vs 14 floating point instructions on constants which can all be loaded into registers. I'm using a 6-term Maclaurin series expansion; 5 would probably do, and be fewer ops.

For an alternative approach to evaluating functions using fewer terms, lolremez is interesting.

Tim


Hmm, very interesting, thanks.

Re: FFT sample for 2^20 elements

Postby notzed » Thu Jul 10, 2014 8:41 am

So I ended up coming across another tool that can do the same thing as lolremez.

http://sollya.gforge.inria.fr/

It's faster, easier to use for the mathematically challenged, and handles some of the very fiddly stuff with float rounding directly.

Code:
> P = fpminimax(sin(x), [|1,3,5,7|], [|24...|],[0x1p-16000;pi/4]);
> set display=hexadecimal;
> P;
x * (0x1.p0 + x^0x2.p0 * (-0x2.aaaa88p-4 + x^0x2.p0 * (0x2.220dccp-8 + x^0x2.p0 * (-0xc.c967cp-16))))
> set display=decimal;
> P;
x * (1 + x^2 * (-0.1666665375232696533203125 + x^2 * (8.332121185958385467529296875e-3 + x^2 * (-1.951101585291326045989990234375e-4))))
> dirtyinfnorm(sin(x)-P, [0;pi/4]);
2.48825949986510374795541070830762114197913817795157e-9


The problem is optimised using 24-bit mantissa floats, so the result is already in the optimal form for single-precision code.
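
Dropped straight into C, that P is just (coefficients verbatim from the sollya output above, valid on [0, pi/4]):

Code:
/* sin(x) on [0, pi/4]; fpminimax coefficients from the run above,
   worst-case error ~2.5e-9 per the dirtyinfnorm result */
static float sinf_poly(float x)
{
    float x2 = x * x;
    return x * (1.0f
        + x2 * (-0.1666665375232696533203125f
        + x2 * (8.332121185958385467529296875e-3f
        + x2 * (-1.951101585291326045989990234375e-4f))));
}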

Compared to sin() from a run of lolremez, forced to single-precision float:
Code:
> X=x * (SG(9.999999861793420056608460221757732708227e-1) + x^2 * (SG(-1.666663675429951309567308244260188890284e-1) + x^2 * (SG( 8.331584606487845846198712890758361670071e-3) + x^2 * SG(-1.946211699827310148058364912231798523048e-4))));

> X;
x * (1 + x^2 * (-0.16666637361049652099609375 + x^2 * (8.331584744155406951904296875e-3 + x^2 * (-1.94621170521713793277740478515625e-4))))

> dirtyinfnorm(sin(x)-X, [0;pi/4]);
Warning: the given expression is not a constant but an expression to evaluate. A faithful evaluation to 170 bits will be used.
9.0021097355521826457692679614674224316870784965529e-9


Also came across an interesting paper/slide-set from a Sony engineer from the PS2 era.

http://www.ue.eti.pg.gda.pl/~wrona/lab_ ... _aprox.pdf
http://roland.pri.ee/raamatud/AI/fast-m ... ons_p2.pdf

Re: FFT sample for 2^20 elements

Postby timpart » Thu Jul 10, 2014 11:46 am

Thanks for the extra references. I hope I haven't distracted you too much from your original goal.

Tim

Re: FFT sample for 2^20 elements

Postby notzed » Thu Jul 10, 2014 1:30 pm

Fortunately I have no goals here other than the journey itself, and often the side-quests like this are as interesting as the main game. Set end-goals are for work!

I don't have any use for a large fft to start with.

