Parallella Community

by **nickoppen** » Mon Sep 02, 2013 2:00 am

I've spent too much of my life as an analyst! In the old days I would have just started coding.

Anyway, in expectation of my e-16 board arriving any day now, I've started writing up my ideas about how best to program a Feed Forward - Back Propagation neural network to get the most out of the chip. Have a read here:

and feel free to leave comments.

Thanks,

nick

by **roberto** » Mon Sep 02, 2013 9:10 am

Before getting too enthusiastic about moving your neural network to Epiphany, remember that a NN is a partly serial, partly parallel algorithm, so never forget Amdahl's Law: http://en.wikipedia.org/wiki/Amdahl's_law. Adding layers therefore makes the efficiency worse (it means less parallel calculation and more serial calculation).
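For anyone who wants to put numbers on Amdahl's Law, here is a minimal sketch. The function name `amdahl_speedup` and the parallel fractions in the loop are made up for illustration; they are not measurements of any real network or of the Epiphany chip.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative fractions only (assumed, not measured).
for p in (0.50, 0.90, 0.99):
    print(f"parallel fraction {p:.2f}: "
          f"16 cores -> {amdahl_speedup(p, 16):.2f}x, "
          f"64 cores -> {amdahl_speedup(p, 64):.2f}x")
```

The serial fraction sets a ceiling on the speedup no matter how many cores are added, which is the reason for preferring fewer, wider layers.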

To summarize, a single layer is better (one layer is equivalent to N layers if it has enough neurons). Ideally, you should have 1 NEURON = 1 CORE.

I built a NN as "2 inputs --> 2 layers of 4 neurons --> 1 output", and later rewrote it as "2 inputs --> 1 layer of 16 neurons --> 1 output" on my x86. Because that calculation is totally serial (I mean, nothing that could be parallelized is parallelized), I don't know which was better from an efficiency point of view. On Epiphany it will be nice to do the comparison.
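A rough way to compare the two topologies before running anything on hardware is to count the work. This sketch assumes fully connected layers and ignores bias terms and the activation function; `layer_costs` and `widest_layer` are hypothetical helper names.

```python
def layer_costs(sizes):
    """Return (total multiply-accumulates, number of dependent layer steps).

    sizes lists the node count of every layer, inputs first, outputs last.
    Bias terms are ignored to keep the count simple.
    """
    macs = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
    steps = len(sizes) - 1
    return macs, steps


def widest_layer(sizes):
    """Most neurons that can be evaluated at the same time (inputs excluded)."""
    return max(sizes[1:])


for name, sizes in (("2-4-4-1", [2, 4, 4, 1]), ("2-16-1", [2, 16, 1])):
    macs, steps = layer_costs(sizes)
    print(name, "MACs:", macs, "dependent steps:", steps,
          "widest layer:", widest_layer(sizes))
```

Under these assumptions the single wide layer does more arithmetic in total but has one fewer step that must wait on the previous one, and its 16 neurons could in principle go one per core. Whether that wins in practice is exactly the comparison that still needs to be run.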

My two (euro)cents.

by **nickoppen** » Tue Sep 03, 2013 3:48 am

by **timpart** » Wed Sep 04, 2013 1:25 pm

I assume from the comments that people have been making that they are more interested in processing a network with a large number of nodes rather than a tiny network very frequently.

If the network has one or more hidden layers, and the hidden layer has more nodes than either the input or output layers, I wonder if a hybrid approach would work best. Split the hidden layer up among the cores and process with respect to it. This means that calculating the hidden layer from its inputs and calculating the outputs from the hidden layer would use different pieces of code, so the whole thing is hidden layer oriented.
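Here is one possible reading of the hidden layer oriented split, simulated serially in plain Python rather than on the chip. It assumes a sigmoid activation, one hidden layer, and that every core can see all the inputs; the names `split_evenly`, `core_work` and `forward_split` and the weight values are invented for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split_evenly(n_items, n_cores):
    """Return one range of hidden-node indices per core, as even as possible."""
    base, extra = divmod(n_items, n_cores)
    ranges, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def core_work(inputs, w_in_hid, w_hid_out, my_hidden):
    """One core: activate its own hidden nodes, then add their share to each output sum."""
    n_out = len(w_hid_out[0])
    partial = [0.0] * n_out
    for h in my_hidden:
        act = sigmoid(sum(x * w for x, w in zip(inputs, w_in_hid[h])))
        for o in range(n_out):
            partial[o] += act * w_hid_out[h][o]
    return partial


def forward_split(inputs, w_in_hid, w_hid_out, n_cores):
    """Run every core's share in turn, then combine the per-core partial sums."""
    partials = [core_work(inputs, w_in_hid, w_hid_out, r)
                for r in split_evenly(len(w_in_hid), n_cores)]
    return [sigmoid(sum(col)) for col in zip(*partials)]


# Arbitrary example weights: w_in_hid[h][i] and w_hid_out[h][o].
n_in, n_hid, n_out = 3, 10, 2
w_in_hid = [[0.1 * (h + 1) - 0.2 * i for i in range(n_in)] for h in range(n_hid)]
w_hid_out = [[0.05 * (h - o) for o in range(n_out)] for h in range(n_hid)]
inputs = [0.5, -1.0, 0.25]
print(forward_split(inputs, w_in_hid, w_hid_out, 1))
print(forward_split(inputs, w_in_hid, w_hid_out, 4))
```

In this arrangement the hidden activations never leave the core that computed them. The only values that have to be combined across cores are the per-core partial output sums, one per output node.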

Personally I'd avoid intercore communication of intermediate results if possible. The time taken to share an intermediate result might well outweigh doing it all in one core, especially if a two way conversation is needed. If a large number of nodes is involved, the time taken for the core with the most work to finish shouldn't be much more than for the quickest, assuming the spread is as even as possible. (If you have tiny networks then this is much more of an issue.)

Tim

by **nickoppen** » Wed Sep 04, 2013 11:11 pm

Hi Tim,

I definitely agree that adding more cores will eventually slow the whole thing down. Imagine a network with 8 input and 8 hidden nodes running on an e-64. Using my strategy, that will give each core one multiplication to do followed by a lot of shuffling. The cost of the inter-core communication is the biggest unknown for me now. I think that finding the sweet spot will take some experimentation.
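To make the 8 x 8 example concrete, here is a back-of-the-envelope count. It is only a sketch and not necessarily the strategy from the write-up: it assumes each weight lives on exactly one core, that cores holding pieces of the same hidden node's sum are grouped together, and that every extra core in a group costs one transfer. `per_core_load` is a hypothetical helper.

```python
import math


def per_core_load(n_inputs, n_hidden, n_cores):
    """Rough counts for one input-to-hidden layer with weights spread evenly over the cores.

    Returns (multiplies per core, partial-result transfers per hidden node).
    """
    weights = n_inputs * n_hidden
    multiplies_per_core = math.ceil(weights / n_cores)
    cores_in_use = min(n_cores, weights)
    # Cores that hold part of the same hidden node's sum must exchange partial results.
    cores_per_hidden_node = math.ceil(cores_in_use / n_hidden)
    transfers_per_hidden_node = max(cores_per_hidden_node - 1, 0)
    return multiplies_per_core, transfers_per_hidden_node


for cores in (8, 16, 64):
    mults, transfers = per_core_load(8, 8, cores)
    print(f"{cores} cores: {mults} multiplies per core, "
          f"{transfers} transfers per hidden node")
```

Under these assumptions the arithmetic per core shrinks as cores are added while the number of partial results to move grows, so the sweet spot depends on how expensive a transfer is relative to a multiply. That ratio is the unknown that needs measuring.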

Your hidden layer oriented model is interesting. I'll put that on the slow cooker. It might work better for training, which is what I'm working on at the moment.

nick

by **timpart** » Thu Sep 05, 2013 12:32 pm

Another thought struck me this morning. Neural networks are good at "fuzzy" stuff and don't need exact values. Andreas revealed that if you put the processor in floating point truncate mode rather than rounding mode then the float instructions are one cycle quicker. Normally I'd recommend that people avoid truncated arithmetic, but in this case I wonder if it is worth a try, at least for the multiply and sum part and possibly for the function evaluation as well. If you do this, presumably you would have to do the training in the same mode. Considering the number of cycles it would save, and how easy it is to do, I think it would be worth some experimentation. Do you know of any research in this area?
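The effect on the multiply and sum part can be explored on a desktop before touching the chip. The sketch below emulates single-precision arithmetic in Python and compares round-to-nearest with truncation toward zero. It is an approximation and does not model the Epiphany FPU itself (no fused multiply-add, no special denormal handling), and the helper names and test values are made up.

```python
import struct


def f32_round(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_trunc(x):
    """Approximate round-toward-zero conversion to single precision."""
    r = f32_round(x)
    if abs(r) > abs(x):
        # Nearest rounding moved away from zero; step back one unit in the last place.
        bits = struct.unpack("<I", struct.pack("<f", r))[0]
        r = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    return r


def dot(xs, ws, to_f32):
    """Multiply and sum with every intermediate result forced to single precision."""
    total = 0.0
    for x, w in zip(xs, ws):
        product = to_f32(to_f32(x) * to_f32(w))
        total = to_f32(total + product)
    return total


# Arbitrary inputs and weights with alternating signs.
xs = [0.1 * (i + 1) for i in range(64)]
ws = [(-1) ** i * 0.03 * (i + 1) for i in range(64)]
exact = sum(x * w for x, w in zip(xs, ws))
print("double reference:", exact)
print("round to nearest:", dot(xs, ws, f32_round))
print("truncate        :", dot(xs, ws, f32_trunc))
```

Swapping `to_f32` is the only difference between the two runs, so the same `dot` could be dropped into a training loop to check whether a network trained with truncation still converges.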

Tim

by **nickoppen** » Thu Sep 05, 2013 11:22 pm

Hi Tim,

That sounds like something that's worth a look. I'm pretty sure that truncating will not cause any problem for the final outcome at all.

I was initially thinking that I should use doubles rather than floats but I found that floats were fine. Trying to decipher Andreas' hardware techo speak it seems as if I would get a 20% improvement in speed if I'm happy to have a little bit of inaccuracy in the lowest order digit. Am I right there?

I don't really read much research. I work full time for a mining company and have a wife and 2 kids - not really much time to read widely. I'm really a "get in and try it out" sort of guy anyway.

nick

by **Gravis** » Fri Sep 06, 2013 12:24 am

by **ysapir** » Fri Sep 06, 2013 1:01 am

by **nickoppen** » Fri Sep 06, 2013 6:02 am
