Neural Network

Re: Neural Network

Postby nickoppen » Mon Sep 16, 2013 4:02 am

roberto wrote:All of the above is just an idea I worked out in my head. I cannot promise it is smart, but I think it at least deserves to be explored for real: I hope someone with an E64 tries to implement this idea and reports the results.


Hi Roberto,

I've been doing my head in thinking about it too. I've been looking into the different strategies for training, and I'm not sure that my proposed structure would work very well for training, but I've yet to come up with anything different. I agree with you that we just have to implement it and see what the performance is like.

nick
Sharing is what makes the internet Great!

Re: Neural Network

Postby censix » Wed Sep 18, 2013 1:19 pm

Hi, I have been reading parts of this thread and I think it is great that you guys are looking at this. I would like to make some comments.

This paper:
The Google paper is here: http://research.google.com/archive/larg ... s2012.html. As I understand it, they have shown that training the same network asynchronously on many cores, each using part of the training set, is as good as or better than running the whole training set through the network in a linear, synchronous manner.


deals with training DBNs (Deep Belief Networks). Into that category fall, for example, neural nets with more than one hidden layer, sometimes many more. These are also called MLPs (Multi-Layer Perceptrons), e.g. 10-100-80-60-1 is a three(!) hidden-layer neural network, or MLP (it is common not to count the input and the output layers).

When it comes to training MLPs, there is a BIG difference between training a one-layer MLP (e.g. 10-100-1) and training a three-layer MLP (e.g. 10-100-80-60-1). Training 1-layer MLPs is easy. Training >1-layer MLPs is tricky, because the more layers you have, the more slowly the training algorithm converges towards the globally optimal solution and the more likely it is to converge to an unsuitable local optimum. So what people do these days is pre-train the individual layers of the MLP one by one, then put everything together and train the entire MLP.

My point is this: I would very much doubt that it is possible to replicate the performance of an N-layer MLP with a 1-layer MLP, even if the 1-layer network is much, much larger! It may work for small networks with, say, fewer than 20 nodes in each layer, but very likely won't for larger ones. The conclusion is that, if one starts thinking seriously about how to implement the training algorithm (backpropagation) on an E16 or E64, that implementation should be able to handle MLPs with more than 1 hidden layer, IMHO.
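
For reference, a minimal forward-pass sketch of such an MLP (my own illustration in Python/NumPy; the sigmoid activation and random weights are just placeholders, not anyone's actual implementation):

import numpy as np

def forward(x, weights, biases):
    # Forward pass through an MLP: each layer is a weighted sum followed by a sigmoid.
    a = x
    for W, b in zip(weights, biases):
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))  # sigmoid activation
    return a

# A 10-100-80-60-1 network, i.e. three hidden layers (input/output not counted).
sizes = [10, 100, 80, 60, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
print(forward(rng.standard_normal(10), weights, biases))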

Cheers
censix
 

Re: Neural Network

Postby nickoppen » Wed Sep 18, 2013 11:16 pm

Thanks for that censix. There's been a huge amount of work done in this area since I left uni many years ago. Do you know of any papers that compare the performance of deep networks to more shallow ones? I'm interested to see what all of that extra work buys you at the end of the day.

As for the implementation, I'm going to begin with a simple, one-hidden-layer version. There are a number of key areas that I'd like to know more about before attempting anything more ambitious. For example, with larger networks I think that local memory constraints might be a problem. I'm also uncertain whether my whole architecture is the best one for the job.

nick
Sharing is what makes the internet Great!

Re: Neural Network

Postby roberto » Thu Sep 19, 2013 8:02 am

censix wrote:Hi, I have been reading parts of this thread and I think it is great that you guys are looking at this. I would like to make some comments.

CUT

My point is this: I would very much doubt that it is possible to replicate the performance of an N-layer MLP with a 1-layer MLP, even if the 1-layer network is much, much larger! It may work for small networks with, say, fewer than 20 nodes in each layer, but very likely won't for larger ones. The conclusion is that, if one starts thinking seriously about how to implement the training algorithm (backpropagation) on an E16 or E64, that implementation should be able to handle MLPs with more than 1 hidden layer, IMHO.

Cheers


You can have doubts, but there is no reason to. (From here on I am just repeating what a friend of mine with a PhD in mathematics told me.) No matter how many layers you have, a one-layer NN(*) can reproduce the behaviour of a multi-layer NN, no matter how many layers and neurons per layer it has(**). (As I said in another post, it is useful to talk to a maths person.)
A neural network is a "super version" of a Taylor series (http://en.wikipedia.org/wiki/Taylor_series): everything you can do with a Taylor series you can also do with a NN, but some things you can do with a NN cannot be done with a Taylor series.

Cheers.

(*) we do not count the input and output layers when counting layers
(**) maybe to reproduce 10-100-100-100-1 you need 10-1000000-1, but you CAN do it. We are not discussing how much it costs (time, power and so on), only whether it can be done. It can.
roberto
 
Posts: 39
Joined: Sat Mar 09, 2013 2:01 pm

Re: Neural Network

Postby censix » Thu Sep 19, 2013 9:07 am

Thanks Nick.

Dear Roberto,

I happen to be a mathematician myself, so I would be very interested in the *specific* paper or proof that would support the argument of your friend:

A neural network is a "super version" of a Taylor series (http://en.wikipedia.org/wiki/Taylor_series): everything you can do with a Taylor series you can also do with a NN, but some things you can do with a NN cannot be done with a Taylor series.


Let's assume for a moment that it is indeed true that a 10-100-100-100-1 MLP can be identically reproduced by a 10-X-1 single-layer MLP, and let's assume that we need X >= 100*100*100 for this to work. A simple calculation shows that it would be MUCH more efficient (= faster) to train the 3-layer network than the 1-layer one, because the number of weights differs significantly between the two networks:

The 3-layer network has: 100*10 + 100*100 + 100*100 + 100 = 21,100 weights to be trained
The equivalent 1-layer has: 10*100*100*100 + 100*100*100 = 11,000,000 weights to be trained ~ 500 times larger!!!

So if the equivalence holds, you would need roughly 500 times LONGER to train the 1-layer network!
I think it is obvious that in such a situation, unless you have unlimited computing power (which I unfortunately do not have), one would choose the 3-layer network, even if the training may be a bit trickier, as I outlined in my previous post. Hence the need for the E16/E64 implementation to support >1 hidden layers.
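
A quick Python check of those weight counts (my own sketch, assuming fully connected layers and ignoring bias terms):

def n_weights(sizes):
    # Number of connection weights in a fully connected MLP, ignoring biases.
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

deep = n_weights([10, 100, 100, 100, 1])    # 21,100
wide = n_weights([10, 100 * 100 * 100, 1])  # 11,000,000
print(deep, wide, wide / deep)              # the ratio comes out around 520x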

But as I said, I would be very interested in a paper that actually does this comparison properly!

Cheers
censix
 

Re: Neural Network

Postby roberto » Thu Sep 19, 2013 10:35 am

censix wrote:I happen to be a mathematician myself, so I would be very interested in the *specific* paper or proof that would support the argument of your friend.

CUT

So if the equivalence holds, you would need roughly 500 times LONGER to train the 1-layer network!


Hello Censix.

Let me tell you that 500 times is a minimum estimate. When, on my SERIAL CPU, I moved from 2-4-4-1 to 2-16-1, the time needed to train the weights grew more than proportionally (I suspect exponentially) compared to the time needed to train the 2-4-4-1's weights. Anyway, in the end, the 2-16-1's weights were trained correctly.

Related "the paper of my friend": i just trusted his words (:-/), cos i have no the technical basis to go deeper asking - and understand - him. he sayd that Taylor can play only with continue functions, but NN can play also with not continue functions, so NN can manage a wider class of problems. Sorry i can't tell you more. (IF ABOVE I SAID SOME WRONG, IS MY RESPONSABILITY not my friend cos im not sure i remember and repear right.)

related to "about 500 time more calculation to do". You have not only to think in SERIAL way...Of course, the large namber of neuron is, the longer time it is needed to be trained cos larger quantity of calculation... BUT *ONLY* IF YOU USE A SERIAL CPU, the normal cpu all of us has in his own computer. But the key of my think is that you have to use a PARALLEL hardware, where EACH neuron of the same layer is processed in the SAME period of time. In this case (parallel hardware) no matter how many neurons are in the layer, cos all of them will be processed at same time: more, it will be more efficent. Let say we have 10-100-100-100-1 network. we consider calculation needed "from input to output", same think can applyed to "back propagation".

1st layer (10->100) needs (SUM of a(10)*w1(10)) * 100 (neurons) = 1000 mul + 1000 add
2nd layer (100->100) needs (SUM of b(100)*w2(100)) * 100 (neurons) = 10000 mul + 10000 add
3rd layer (100->100) needs (SUM of c(100)*w3(100)) * 100 (neurons) = 10000 mul + 10000 add
exit layer (100->1) needs (SUM of d(100)*w4(100)) * 1 (neuron) = 100 mul + 100 add

total: 1000 + 20000 + 100 = 21100 mul and 1000 + 20000 + 100 = 21100 add

But because they can be done on *PARALLEL* hardware, all nodes of a layer can be computed in the same time T, so you need one T per layer to complete a "wave" of calculation. So, although each single layer is computed in parallel, the whole NN is serial, because layer N+1 must wait for the result of layer N.
By the way, the multiplications within each node are themselves a parallel process, and the sums can also be parallelized as a tree of partial sums (i.e. to sum 128 elements you sum 64 pairs of elements, then the 32 results, then the 16, and so on: to sum 2^N elements you need only N sequential rounds of partial sums; for 100 elements -> 7 steps).
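
A tiny illustration of that pairwise partial-sum idea (my own sketch in plain Python, not Epiphany code):

def tree_sum(values):
    # Sum a list by repeatedly adding adjacent pairs; each round could run fully
    # in parallel, so only about log2(n) sequential rounds are needed.
    rounds = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd leftover element carries over
            pairs.append(values[-1])
        values = pairs
        rounds += 1
    return values[0], rounds

print(tree_sum(list(range(100))))    # (4950, 7) -> 7 sequential rounds for 100 elements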

Now move to the 10-1000000-1 topology.

1st layer (10->1000000) needs (SUM of a(10)*w1(10)) * 1000000 (neurons) = 10 million mul + 10 million add
exit layer (1000000->1) needs (SUM of d(1000000)*w4(1000000)) * 1 (neuron) = 1 million mul + 1 million add

total: 11 million mul and 11 million add, as you said.

All the multiplications can be done at the same time and the sums can be parallelized (even if not completely): 1000000 < 2^20, so 20 sequential steps are enough to complete the sums of a single layer.

(So, let's introduce M and S as units of time for a Multiplication and a Sum respectively, bearing in mind that a multiplication is slower than a sum, so M > S.)

YOUR TOPOLOGY
1st layer (10->100): 1M + 4S (tree sum over 10 products -> 4 steps)
2nd layer (100->100): 1M + 7S
3rd layer (100->100): 1M + 7S
exit layer (100->1): 1M + 7S

MY TOPOLOGY
1st layer (10->1 million): 1M + 4S
exit layer (1 million->1): 1M + 20S

As far as I know, in the best hardware a multiplication needs 6 clocks and a sum needs 1 clock, so let's assume a multiplication costs 6 times a sum (M = 6S). Then your topology needs 4*6 + (4 + 7 + 7 + 7) = 49 units of time, while mine needs 2*6 + (4 + 20) = 36 units of time.

49:100 = 36:X ---> X = ~73.5%
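
A quick way to re-check this estimate (my own sketch in plain Python, assuming one parallel multiply step per weight layer plus ceil(log2(fan-in)) tree-sum steps, with M = 6S):

from math import ceil, log2

def wave_time(sizes, M=6, S=1):
    # Serial time units for one forward "wave": per weight layer, one parallel
    # multiply step (cost M) plus ceil(log2(fan_in)) tree-sum steps (cost S each).
    total = 0
    for fan_in in sizes[:-1]:
        total += M + ceil(log2(fan_in)) * S
    return total

print(wave_time([10, 100, 100, 100, 1]))  # 49 units for the 3-hidden-layer net
print(wave_time([10, 1_000_000, 1]))      # 36 units for the wide 1-hidden-layer net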

All of the above of course needs dedicated hardware; I suspect the constraints of the current hardware (Adapteva) will penalize the situation. I am convinced that purpose-built hardware (an FPGA or, better, an ASIC) could prove what I said.

Please redo my calculations (as I did for yours) and see if there are any errors.

Cheers,
Roberto

Re: Neural Network

Postby timpart » Thu Sep 19, 2013 12:25 pm

roberto wrote:But because they can be done on *PARALLEL* hardware, all nodes of a layer can be computed in the same time T, so you need one T per layer to complete a "wave" of calculation. So, although each single layer is computed in parallel, the whole NN is serial, because layer N+1 must wait for the result of layer N.


I don't really follow your reasoning from here. Surely the parallel hardware may only exist in theory? If you want to do a million multiplies at the same time surely you need 1 million multiplier circuits? I've never heard of anyone creating hardware that can do that (but I don't follow such things).

Regarding programming this, why not reuse the matrix multiply program that has already been written for the Parallella?

To calculate the next layer, multiply a vector (a 1 x n matrix) of previous layer values by a matrix of connection weights, using a big matrix multiply routine. Then take each entry of the resulting vector and apply the output function to it to calculate the value of the next layer.

Repeat as needed for subsequent layers. (Output vector needs to be transposed to be input to next stage).
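
Something along these lines, sketched in NumPy as a stand-in for the Parallella matrix multiply routine (my illustration, not the actual Parallella code):

import numpy as np

def next_layer(values, W, activation=np.tanh):
    # values: 1 x n row vector of previous layer outputs; W: n x m weight matrix.
    # One big matrix multiply gives the weighted sums, then the output function
    # is applied entry-wise; the result is the 1 x m row vector for the next stage.
    return activation(values @ W)

rng = np.random.default_rng(1)
sizes = [10, 100, 100, 1]                  # just an example topology
Ws = [rng.standard_normal((n, m)) for n, m in zip(sizes, sizes[1:])]
layer = rng.standard_normal((1, 10))
for W in Ws:                               # repeat for each subsequent layer
    layer = next_layer(layer, W)
print(layer)

(With the row-vector convention used here no explicit transpose is needed; with a column-vector convention you would transpose the output between stages, as Tim describes.)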

Tim

Re: Neural Network

Postby roberto » Thu Sep 19, 2013 1:55 pm

timpart wrote:I don't really follow your reasoning from here. Surely the parallel hardware may only exist in theory? If you want to do a million multiplies at the same time surely you need 1 million multiplier circuits?

CUT


Tim, in general, we (at least, I) have in mind how to maximize training efficiency. To do that, you need massively parallel hardware. The example of one million neurons was used to demonstrate (at least, I believe I demonstrated) that a one-layer network is not only possible but also more efficient: the trick is to reduce the number of layers as much as possible, because each layer introduces delay. All in all, my proposal was (go back and read my previous posts) to create the best network for the Parallella hardware, i.e. a one-layer network (e.g. 2-64-1), because in my opinion it *should be* more efficient than 2-8-8-1. More generally, a network with N inputs, X hidden layers of Y neurons each, and 1 output should be less efficient than N-64-1, under the constraint that X*Y < 64. All of the above is theory: because I do not have a Parallella to test with myself, I proposed that someone who does have one run the test.

Of course an Epiphany, even with 64 cores, is a "toy" if we want to use a massively parallel one-layer network. But in Adapteva's plans they are thinking of 64K cores in the medium-term future (2020), so I hope we will not have to wait too long.

(Now let me dream for a while.)
Never forget that dedicated ASICs are orders of magnitude better than programmable cores (Epiphany, GPU, etc.), so, despite their cost, it is possible to develop a fast NN-in-hardware. Imagine a network with 2^8 inputs, 2^20 neurons and 2^4 outputs, all in double precision. Who cares that the topology is not modifiable and that it is overkill for your problem? It has plenty of inputs, outputs and neurons, enough to serve a wide class of problems, and, more importantly, it would be FASTER than everything else. It does not exist yet (maybe the NSA already has one in some secret laboratory, as they had tailored ASICs to brute-force the DES algorithm in the past), but the number of transistors needed to build it is tiny compared to the billions present in a current CPU, so only the will to do it is needed; the technology is already here.

Sorry if I diverged from the initial topic.
roberto
 

Re: Neural Network

Postby censix » Thu Sep 19, 2013 5:10 pm

@Roberto

Your calculation is very interesting. It shows that, through massive parallelization, a large 1-layer MLP may even be slightly faster to train than a much smaller 3-layer MLP.

But that does not address the core question that I still have and will continue to have until I find a good paper or a good explanation for it. And that is: is there a formal proof that a 3-layer MLP 10-A-B-C-1 is absolutely equivalent to a 1-layer 10-X-1 MLP, where X ~ A*B*C or X is some other function of A, B, C?

Until this is clear, I would not dare to simply replace N-layer MLPs with large 1-layer MLPs and just hope that it will work.

But the point about being able to parallelize (= speed up) 1-layer MLP training more efficiently than N-layer training makes sense!
censix
 

Re: Neural Network

Postby roberto » Thu Sep 19, 2013 6:08 pm

censix wrote:But that does not address the core question that I still have and will continue to have until I find a good paper or a good explanation for it. And that is: is there a formal proof that a 3-layer MLP 10-A-B-C-1 is absolutely equivalent to a 1-layer 10-X-1 MLP, where X ~ A*B*C or X is some other function of A, B, C?

CUT


censix,

I do not have a formal proof of what you ask. I can only note that in the past I trained a 2-4-4-1 network and later, with the same training set, I was able to train a 2-16-1. Anyway, this proves nothing: maybe it is possible to reduce a 2-A-B-1 to a 2-C-1 but not to reduce a 2-A-B-C-1 to a 2-D-1, or maybe I was just lucky because my training set matched the topology. On the other hand, there is the "brute force" way of checking it for real: train a 2-2-2-2-1 network, then change the topology to 2-X-1 and train that. If it succeeds, that is evidence it is possible, even if a formal demonstration is still missing. It is not a final answer, only an experimental data point. Anyway, give me a few days: I will try to contact that friend of mine and ask for more mathematical details.
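
If anyone wants to try that experiment quickly on an ordinary PC first, a rough sketch with scikit-learn could look like this (my own placeholder target function and layer sizes, not something I have run on a Parallella):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder 2-input target function; swap in the real training set.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

def fit_and_score(hidden):
    # hidden = tuple of hidden layer sizes, e.g. (2, 2, 2) for a 2-2-2-2-1 net.
    net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=5000, random_state=0)
    net.fit(X, y)
    return net.score(X, y)   # R^2 on the training data

print("2-2-2-2-1:", fit_and_score((2, 2, 2)))
print("2-X-1:    ", fit_and_score((64,)))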

Roberto

PS: what if you can reduce a 3-layer NN to a 2-layer NN? Wouldn't that mean that every N-layer network can be reduced to N-1 layers, and so, by extension, that every N-layer NN can be reduced to a 1-layer NN? But I agree, a formal demonstration is needed, because only that can put the word "END" on the debate.
