censix wrote: Thanks Nick.

Dear Roberto,

I happen to be a mathematician myself, so I would be very interested in the *specific* paper or proof that would support your friend's argument:

A neural network is a "super version" of the Taylor series (

http://en.wikipedia.org/wiki/Taylor_series): everything you can do with a Taylor series you can also do with an NN, but some things you can do with an NN you cannot do with a Taylor series.

Let's assume for a moment that it is indeed true that a 10-100-100-100-1 MLP can be identically reproduced by a 10-X-1 single-hidden-layer MLP, and let's assume that we need X >= 100*100*100 for this to work. A simple calculation shows that it would be MUCH more efficient (= faster) to train the 3-layer network than the 1-layer network, because the number of weights differs significantly between the two:

The 3-layer network has: 100*10 + 100*100 + 100*100 + 100 = 21,100 weights to be trained

The equivalent 1-layer has: 10*100*100*100 + 100*100*100 = 11,000,000 weights to be trained ~ 500 times larger!!!

So if the equivalence holds, you would need roughly 500 times LONGER to train the 1-layer network!
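The weight counts above can be checked with a few lines of Python (a sketch; the two topologies and the X = 100*100*100 assumption are taken straight from the post):

```python
def mlp_weights(sizes):
    """Count the weights of a fully connected MLP given its layer sizes."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

deep = mlp_weights([10, 100, 100, 100, 1])   # three hidden layers
wide = mlp_weights([10, 100 * 100 * 100, 1]) # one hidden layer, X = 10^6

print(deep)         # 21100
print(wide)         # 11000000
print(wide / deep)  # ~521, i.e. roughly 500x more weights
```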

I think it is obvious that in such a situation, unless you have unlimited computing power (which I unfortunately do not), one would choose the 3-layer network, even if the training may be a bit trickier, as I outlined in my previous post. Hence the need for the E16/64 to support more than one layer.

But as I said, I would be very interested in a paper that actually does this comparison properly!

Cheers

Hello Censix.

Let me tell you that 500 times is a minimum estimate. When, on my SERIAL CPU, I moved from a 2-4-4-1 to a 2-16-1 network, the time needed to train the weights grew more than proportionally (I suspect exponentially) compared to the time needed to train the 2-4-4-1's weights. In the end, though, the 2-16-1's weights were trained correctly.

Regarding "the paper of my friend": I just trusted his words (:-/), because I don't have the technical background to question him more deeply and understand the details. He said that a Taylor series can only handle continuous functions, while an NN can also handle discontinuous functions, so NNs can manage a wider class of problems. Sorry I can't tell you more. (IF I SAID SOMETHING WRONG ABOVE, IT IS MY RESPONSIBILITY, not my friend's, since I'm not sure I remember and repeat it correctly.)

Regarding "about 500 times more calculation to do": you should not think only in SERIAL terms. Of course, the larger the number of neurons, the longer the training takes, because there is a larger quantity of calculation... BUT *ONLY* IF YOU USE A SERIAL CPU, the normal CPU all of us have in our own computers.

The key of my argument is that you have to use PARALLEL hardware, where EACH neuron of the same layer is processed in the SAME period of time. In that case (parallel hardware) it doesn't matter how many neurons are in a layer, because all of them are processed at the same time; better still, it becomes more efficient. Say we have a 10-100-100-100-1 network and consider the calculation needed "from input to output"; the same reasoning applies to backpropagation.

1st layer (10->100) need (SUM of a(10)*w1(10)) * 100(neuron) = 1000 mul + 1000 add

2nd layer (100->100) need (SUM of b(100)*w2(100)) * 100(neuron) = 10000 mul + 10000 add

3rd layer (100->100) need (SUM of c(100)*w3(100)) * 100(neuron) = 10000 mul + 10000 add

exit layer (100->1) need (SUM of d(100)*w4(100)) * 1(neuron) = 100 mul + 100 add

total: 1000 + 10000 + 10000 + 100 = 21100 mul and 21100 add

But because they run on *PARALLEL* hardware, all nodes of a given layer can be computed at the same time; if each layer takes time T, you need 4*T to complete one "wave" of calculation through this four-weight-layer network. So, although each single layer is evaluated in parallel, the whole NN is still serial, because layer N+1 must wait for the result of layer N.

By the way, the multiplications within each node are themselves a parallel process, and the sums can also be parallelized as a tree of partial sums (e.g. to sum 128 elements, you sum 64 pairs, then the 32 results, then the 16, and so on: to sum 2^N elements you need only N sequential partial-sum steps; for 100 elements, 7 steps).
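The pairwise partial-sum idea can be sketched like this (illustrative only; on real parallel hardware each pass over the list would run as one concurrent step):

```python
import math

def tree_sum(values):
    """Sum a list by pairwise reduction; each pass corresponds to one
    parallel step on suitable hardware, so the depth is ceil(log2(n))."""
    steps = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:           # an odd element is carried into the next pass
            pairs.append(values[-1])
        values = pairs
        steps += 1
    return values[0], steps

total, steps = tree_sum(list(range(100)))
million_steps = math.ceil(math.log2(1_000_000))
print(total, steps)     # 4950 7  (ceil(log2(100)) = 7)
print(million_steps)    # 20 steps for a million elements
```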

Now move to the 10-1000000-1 topology.

1st layer (10->1000000) need (SUM of a(10)*w1(10)) * 1000000(neuron) = 10million mul + 10million add

exit layer (1000000->1) need (SUM of d(1000000)*w4(1000000)) * 1(neuron) = 1million mul + 1million add

total: 11 million mul and 11 million add, as you said.

All the muls can be done at the same time, and the sums can be parallelized (even if not completely): 1,000,000 < 2^20, so 20 sequential steps are enough to complete the sums of a single layer.

(So let us introduce M and S as units of time to measure a Multiplication and a Sum, keeping in mind that a multiply is slower than an add, so M > S.)

YOUR TOPOLOGY

1st layer (10->100) needs (SUM of a(10)*w1(10)) * 100(neuron) = 1M + 4S (each neuron sums 10 products; 2^4 = 16 >= 10)

2nd layer (100->100) needs (SUM of b(100)*w2(100)) * 100(neuron) = 1M + 7S

3rd layer (100->100) needs (SUM of c(100)*w3(100)) * 100(neuron) = 1M + 7S

exit layer (100->1) needs (SUM of d(100)*w4(100)) * 1(neuron) = 1M + 7S

MY TOPOLOGY

1st layer (10->1000000) needs (SUM of a(10)*w1(10)) * 1000000(neuron) = 1M + 4S (each neuron still sums only 10 products)

exit layer (1000000->1) needs (SUM of d(1000000)*w4(1000000)) * 1(neuron) = 1M + 20S

As far as I know, on the best hardware a multiplication needs 6 clocks and an add needs 1 clock, so let's assume a multiply costs 6 times an add (M = 6S). Then you need 4*6 + (4+7+7+7) = 49 units of time, while I need 2*6 + (4+20) = 36 units of time.

49 : 100 = 36 : X ---> X =~ 73.5%
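The comparison can be redone as a short script under this cost model (one parallel multiply step of cost M per layer, then a pairwise adder tree of depth ceil(log2(fan-in)), with M = 6S as assumed here); it is a sketch of the model, not a simulation of real hardware:

```python
import math

M, S = 6, 1  # clocks per multiply and per add (assumption from the post)

def layer_latency(fan_in):
    """One parallel multiply step, then a ceil(log2(fan_in))-deep adder tree."""
    return M + math.ceil(math.log2(fan_in)) * S

def net_latency(sizes):
    # one weight layer per consecutive pair of sizes; fan-in = preceding size
    return sum(layer_latency(fan_in) for fan_in in sizes[:-1])

deep = net_latency([10, 100, 100, 100, 1])  # 4*6 + (4+7+7+7) = 49
wide = net_latency([10, 1_000_000, 1])      # 2*6 + (4+20)    = 36
print(deep, wide, round(100 * wide / deep, 1))  # 49 36 73.5
```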

All of the above of course needs dedicated hardware; I suspect that constraints in the actual hardware (Adapteva) will penalize the situation. I am convinced that hardware designed for the purpose (an FPGA or, better, an ASIC) could prove what I said.

Please redo my calculations (as I did for yours) and see if there are any errors.

Cheers,

Roberto