Multi-core Neural Net Engine

Any technical questions about the Epiphany chip and Parallella HW Platform.

Moderator: aolofsson

Re: Multi-core Neural Net Engine

Postby jar » Thu Sep 28, 2017 7:10 pm

polas wrote:As an aside, quite a few of the algorithms I saw in the literature will store NN weights & values between each feed-forward pass and then combine these together for the back prop at the end of the batch. We don't have the memory for that on the Epiphany, so instead after each feed-forward I combine NN weights & values into a "running total" and then use this directly in the back-prop. It has exactly the same effect, but just a different way of doing it - this raised an interesting point in my mind wrt the changes we will likely need to make to algorithms in future to make them more suited for architectures with very high numbers of cores but low memory per core.
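For concreteness, here is a minimal sketch of that kind of running-total idea, written as per-sample gradient accumulation for a single fully connected layer. The names (N_IN, N_OUT, forward_and_backprop_sample) are purely illustrative and this is not polas's actual code; it just shows one way to get the same memory saving by keeping a running sum instead of storing everything for the whole batch.

Code: Select all
/* Running-total sketch: accumulate per-sample weight gradients instead of
 * storing activations/updates for every sample in the batch.
 * All names here are illustrative. */
#include <string.h>

#define N_IN  16
#define N_OUT  8

float W[N_OUT][N_IN];        /* weights                    */
float dW[N_OUT][N_IN];       /* running total of gradients */

/* hypothetical: run one feed-forward + backprop pass for one sample
 * and add its weight gradient into grad[][] */
void forward_and_backprop_sample(const float *x, const float *target,
                                 float grad[N_OUT][N_IN]);

void train_batch(const float *inputs, const float *targets,
                 int batch_size, float learning_rate)
{
    memset(dW, 0, sizeof(dW));
    for (int s = 0; s < batch_size; s++)
        forward_and_backprop_sample(&inputs[s * N_IN],
                                    &targets[s * N_OUT], dW);

    /* one weight update per batch; same effect as combining at the end */
    for (int o = 0; o < N_OUT; o++)
        for (int i = 0; i < N_IN; i++)
            W[o][i] -= learning_rate * dW[o][i] / batch_size;
}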


Although backprop on Epiphany is interesting, almost all of the current neuromorphic architectures ignore it. The feed-forward pass seems to be where latency and power matter. Yes, Epiphany-III has limited per-core memory and no breakthroughs are likely within 512 kB on a 6+ year old processor, but the architecture can scale well to much larger networks (E5 is 64 MB on-chip). And using a general-purpose core rather than a special-purpose core has benefits for integration into larger applications. Pre- and post-processing of data can be done on the same processor as the neural network. At 64 MB, surely there's space for some interesting/useful networks.

claudio4parallella wrote:HELP HELP!! Collaborative work is called for, and it's interesting for many people!

An easy way to collaborate is to post the code on github. Others can pull, tweak, fork, push, or just view and discuss ideas.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Thu Sep 28, 2017 7:39 pm

Hi, here is my example I'd like to play with...

Here is a zip file that contains:

- bpNN : the executable compiled from https://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/bpn.html with
Code: Select all
sudo gcc bpNN.c -o bpNN -lm


- bpNN-host.c : the host-side bpNN source code
- bpNN-host.elf : the host executable that uploads the adapted bpNN into Core(0,0) for a test
- build.sh : the bash script that compiles the host and device parts and creates the SREC
- e-bpNN.c : the device source code
- e-bpNN.elf : the device executable, which is converted into the SREC
- e-bpNN.srec : the SREC image that is uploaded into Core(0,0) for the test
- run.sh : the bash script that runs the test

I'd like to test whether I can execute the BPNN example on a core.
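For reference, the host-side flow that build.sh/run.sh typically wrap looks roughly like the sketch below, using the standard e-hal calls; the SREC name matches the file list above, but the single-core workgroup and the omitted result read-back are assumptions, not taken from the zip.

Code: Select all
#include <stdio.h>
#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;

    e_init(NULL);                      /* initialise the e-hal            */
    e_reset_system();                  /* reset the Epiphany chip         */
    e_get_platform_info(&platform);

    e_open(&dev, 0, 0, 1, 1);          /* workgroup of one core: (0,0)    */
    e_load("e-bpNN.srec", &dev, 0, 0, E_TRUE);  /* load and start it      */

    /* ... poll a flag / read results back from core memory with e_read() ... */

    e_close(&dev);
    e_finalize();
    return 0;
}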

It's running, but very slowly. Maybe that's it.

Thanks for comments
Last edited by claudio4parallella on Sat Oct 07, 2017 1:24 pm, edited 1 time in total.
claudio4parallella
 
Posts: 68
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby dobkeratops » Thu Sep 28, 2017 8:16 pm

jar wrote:Although backprop on Epiphany is interesting, almost all of the current neuromorphic architectures ignore it. The feed-forward pass seems to be where latency and power matter.


I think this is because the accelerators are destined for robots and end-user devices (e.g. phones), and in data centres they train once, then use many times (i.e. to analyse our searches etc.); it's also because that's where there was an opportunity to create simpler, more focused hardware.

The Epiphany would certainly suit these use cases as well, but it's also more versatile than the more dedicated accelerators, so it should be equally applicable to training.
I would hope the Epiphany would be useful to AI researchers for this reason, and some people still want 'online learning'.

(Although I get the impression the more practical way AI will pan out is devices being fed pre-trained nets downloaded from data centres.)

Anyway, writing a multi-layer convolution function with optional clamp, pooling, and accumulation would be applicable to both.

And now I remember: you can indeed compute the updates to the 'feature maps' (aka 'filters') by convolving the back-propagated errors with the derivatives (something like that); I remember the overall shape of the update procedure.

jar wrote:At 64 MB, surely there's space for some interesting/useful networks.

Absolutely, there are full vision nets that fit in that. It's also interesting to note that it's more on-chip RAM than the PS2 console had; that really is a game-changer.

Last time I was active on this forum I seem to recall there were issues with a limited memory window regarding the e-cores' ability to request DMA from the host, although I also seem to remember you could use the FPGA to do just about anything (presumably blasting data into the e-cores).
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby jar » Fri Sep 29, 2017 5:46 am

claudio4parallella wrote:It's running but very slowly. may be that's it.


It's running slowly because the code is using double precision floating point, which is emulated in software. To make things worse, most subroutines are executing out of DRAM instead of local on-chip memory.
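A minimal illustration of both fixes is sketched below, assuming the device code currently declares its arrays and math in double precision; the array names are hypothetical, and the linker-script path is the usual SDK location but may differ on your install.

Code: Select all
#include <math.h>

#define N_INPUT  4
#define N_HIDDEN 8

/* 1) use single precision so the Epiphany FPU is used instead of
 *    software-emulated doubles */
typedef float real_t;                    /* was: double */

real_t weights[N_HIDDEN][N_INPUT];       /* hypothetical array names */

real_t sigmoid(real_t x)
{
    return 1.0f / (1.0f + expf(-x));     /* expf, not exp */
}

/* 2) link the device code against the internal-memory linker script so the
 *    hot subroutines run from local SRAM instead of external DRAM, e.g.:
 *
 *    e-gcc -O2 -T ${EPIPHANY_HOME}/bsps/current/internal.ldf \
 *          e-bpNN.c -o e-bpNN.elf -le-lib -lm
 */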
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Multi-core Neural Net Engine

Postby dobkeratops » Fri Sep 29, 2017 11:16 am

jar wrote:It's running slowly because the code is using double precision floating point, which is emulated in software. To make things worse, most subroutines are executing out of DRAM instead of local on-chip memory.


Out of interest, could the template library that you showed assist with writing any of this? I think you had a more ambitious plan in mind; if I remember correctly you showed something that allowed writing quite natural-looking code with indexing.

My idea would have been to write a 'convolutional iterator' into which you plug the final 'activation function'.

The sticking point to me seems to be how to express exactly how you're going to cache the weights (filter/feature map) across the cores; depending on the number of layers, feature map size, number of cores, and core scratchpad size (32 kB vs 64 kB), those decisions will be slightly different.

Nonetheless it could boil down to two regimes, plus the sweet spot between them (see the sketch at the end of this post):

'1 core per feature, not enough cores' -> run subsets of the features in multiple passes, e.g. 16 cores, 64 features = 4 passes of 16 features each
'1 core per feature, more than enough cores' -> also parallelize across the image, e.g. 1024 cores, 64 features = divide the image into 4x4 regions and process each in parallel
'sweet spot, 1 core per feature' (passes = 1, or regions = 1)

... there might be other ways of ordering it that I haven't considered here (i.e. the opposite: load chunks of the image, then stream through the features), but those are the two most obvious IMO.

Evaluating a typical vision net, those parameters vary per layer, of course; they go from tens to hundreds of features.
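Here is a rough sketch of how that choice between the two regimes could be made per layer at run time; the core/feature counts and the struct are illustrative only, not part of any existing library.

Code: Select all
#include <stdio.h>

/* Decide how to split one conv layer across the chip, per the two regimes
 * above. All numbers here are illustrative. */
typedef struct {
    int feature_passes;   /* >1: run subsets of features in several passes */
    int image_regions;    /* >1: also tile the image across spare cores    */
} layer_split_t;

static layer_split_t plan_layer(int num_cores, int num_features)
{
    layer_split_t s = { 1, 1 };
    if (num_features >= num_cores) {
        /* regime 1: not enough cores -> multiple passes over the features */
        s.feature_passes = (num_features + num_cores - 1) / num_cores;
    } else {
        /* regime 2: spare cores -> also parallelise across the image */
        s.image_regions = num_cores / num_features;
    }
    return s;
}

int main(void)
{
    layer_split_t a = plan_layer(16, 64);    /* E3-like: 4 passes of 16 features */
    layer_split_t b = plan_layer(1024, 64);  /* big chip: 16 image regions (4x4) */
    printf("%d passes / %d regions\n", a.feature_passes, a.image_regions);
    printf("%d passes / %d regions\n", b.feature_passes, b.image_regions);
    return 0;
}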
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby jar » Fri Sep 29, 2017 2:27 pm

dobkeratops wrote:Out of interest, could the template library that you showed assist with writing any of this? I think you had a more ambitious plan in mind; if I remember correctly you showed something that allowed writing quite natural-looking code with indexing.

Yes, perhaps, but most back ends of neural net engines are performance optimized for the hardware and only touched by domain experts. While readable code is nice, it's less of a requirement in this case.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Multi-core Neural Net Engine

Postby dobkeratops » Sat Sep 30, 2017 9:41 am

Re: pipelining 'network depth', I gather that can be problematic for training (you need to backprop using an evaluated state of the whole network).

And I was just thinking: for robot control, would you want to minimize latency? So you'd always want to parallelize across each layer as much as possible first?

(I was going to say, extending the post above, that there'd be the option of splitting network layers across different cores, but that would prioritise throughput over latency?)
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Thu Oct 05, 2017 4:31 pm

THE MODEL

Hi! I'm building up a configurable NN per core and I'd like to get suggestions from you experts about the possible "model".
I'm imagining and testing different setups, which I'll try to describe here in words.

MODEL0: the pre-convolution of big matrices (images, for example) could be done in parallel by the multi-core array to make it faster (only the initial convolution phase);
MODEL1: one NN per core: each core-NN could be trained and do its duty independently. Multi-core = multi-duty;
MODEL2: each core-NN is trained for the same duty and they work in parallel, each one dedicated to a part of the total input (the image is divided into 16 rectangles, each one handled by one core-NN, for example to look for a face);
MODEL3: each core-NN is trained to do one duty on the same input (an image, for example) and they run in parallel: the first detects faces, the second detects cats, the third detects dogs... the last one detects characters;
MODEL4: only the matrix algebra for big matrices is parallelized among the cores, in order to make training or detection faster;
MODEL5: the cores could be used in parallel to test different combinations of NN layers to find the most efficient training, with the same inputs and outputs to be verified.

Which model do you think could be of value for such a multi-core board, considering that the computation is sequential, layer by layer from input to output, in a continuous loop to reduce the error?

Thanks for your considerations
claudio4parallella
 
Posts: 68
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby dobkeratops » Fri Oct 06, 2017 1:03 am

claudio4parallella wrote:MODEL2: each core-NN is trained for the same duty and they work in parallel, each one dedicated to a part of the total input (the image is divided into 16 rectangles, each one handled by one core-NN, for example to look for a face);

MODEL3: each core-NN is trained to do one duty on the same input (an image, for example) and they run in parallel: the first detects faces, the second detects cats, the third detects dogs... the last one detects characters;



A mix of 2 & 3 is closest to what I have in mind, but I'd explain it slightly differently (and I'm not talking about keeping an 'NN' permanently on each core, so in that respect it's closer to 'using it to accelerate matrix algebra').

Summary: use the multiple cores to accelerate convolving features. This calculation has many nested loops (4D x 3D) which can easily run in parallel.

instead of "detect faces, cats, dogs.." - you use all the cores to co-operate to evaluate one layer of the conv-net at a time: "each core evaluates one feature"

The raw calculation looks roughly like this.

(This is pseudocode, with many missing details, but it explains the overall shape.)
https://www.embedded-vision.com/platinu ... ning-imple

Code: Select all
    for each layer 'l', in series {  // each layer is a 3d array: width x height x features.
         // first layer features = r,g,b
         // early layers: 'features' = oriented edges; middle layers: 'features' = eyes, noses, blobs, etc.;
         // final layer: 'features' = dogs, cats, etc.

        // Now use ALL cores to evaluate:
        for each feature 'k' in parallel {
            // assign one feature per core,
            // i.e. upload the 'feature map' of [layer l, feature k] into core 'k'

            // iterate across the image for this layer & this feature,
            // i.e. stream this layer through all the cores
            // (this could *also* be parallelised, e.g. tiling the image across cores, if num_features < num_cores)

            for each row 'j' of the current layer {        // possibly also divide i,j across cores too
                 for each column 'i' of the current layer {
                      next_layer[k][j/p][i/p] = convolute_features_and_maxpool(
                                        prev_layer, /* at.. */ i,j,  /* filter.. */ layer[l].feature_map[k], p)

                       // layer weights are a 4d array; I've written it here as several 3d arrays.
                       // each layer's 4d array is a different shape, because each layer has a different number of features.
                 }
             }
        }
    }

     // calculation of one neuron..
     function convolute_features_and_maxpool(Image3d& src, int i, int j, Image3d& feature_map, int pooling_size) {
        ASSERT(src.num_features == feature_map.num_features)
        float val = 0.f;

        for (px = 0; px < pooling_size; px++) {    // e.g. for each 2x2 group of pixels..
          for (py = 0; py < pooling_size; py++) {
             float acc = 0.f;                      // evaluate one neuron..
             // sum of 2d convolutions of each previous-layer feature with that layer's feature map.
             // TODO: there are 'bias terms' as well
             for (fx = 0; fx < feature_map.width; fx++) {      // for each pixel of the filter..
               for (fy = 0; fy < feature_map.height; fy++) {
                 for (f = 0; f < src.num_features; f++) {
                    acc += feature_map.get(fx, fy, f) * src.get(i + fx + px, j + fy + py, f)
                    // TODO: I FORGOT BIAS
                 }
               }
             }
             // TODO - activation function, but this will work for 'ReLU' (ReLU is just 'x>0?x:0')
             // max-pooling - only return the largest value from the 2x2 block
             if (acc > val) { val = acc }
          }
        }
        return val;
        // TODO - I've not explained details about edges..
        // usually they clip the output, e.g. output width = input width minus feature width,
        // but sometimes they clip the features, e.g. let the filters overstep
     }

    // p = pooling size, e.g. 2 for a 2x2 reduction each layer

    // each layer's 'weights' are arranged in a 4D array: [previous_layer_features][next_layer_features][filter_width][filter_height]
    // each layer has a different number of 'features', and different filter sizes:
    // layer 0 is 3 features (R,G,B),
    // layer 1 = edges,
    // layer 2 = ... until at the end, the features are 'dog', 'cat', etc.
    // the width & height are reduced by 'max-pooling'
    // the internal layers have more features



So for every step of the 'layer loop', you would upload that layer's 'feature maps' into different cores, and convolve the previous layer's data with them.

I think you will have to bounce the layer data in and out of on-chip memory ('stream one layer through, outputting the next layer') on the E3, but consider how on the E5 you could reserve some of the scratchpad space for this purpose (double-buffer the layers). You'll need some sort of tiling because you can't store it contiguously.

You still want to consider 'model 2', because on the 1024-core chip you have more cores than features.
Of course, with just 16 cores and 32/64/256/1024 features, you need yet another loop across subsets of the features (use all the cores to do the first 16 features, then the next 16, etc.).

(They talk about 'fully connected layers' at the end, but I think the meat of it is the convolutions, and you could set up a net that was just convolutions, with 1xN at the end.)

(I guess I might need a timeline or animation to explain it clearly.)

Yet more details... you might need to double-buffer the 'uploading' of the feature maps? Not sure; I'm assuming 'uploading the feature maps' is much less bandwidth than actually convolving the whole layer.

You will have to think about tiling for the early layers (e.g. how to traverse a 224x224 image with a 32 kB core), but in later layers the width & height are smaller, so the division across features is enough.

Todo: I think the e-cores have a 'broadcast DMA feature', which could be used to copy the layer to each core efficiently (once you've got the feature-maps uploaded)
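A rough device-side sketch of that streaming/double-buffering idea is below, using plain e_dma_copy from e-lib. The tile size, buffer layout, and convolve_tile() are assumptions; a real version would overlap DMA with compute using the DMA descriptor interface rather than blocking copies.

Code: Select all
#include <e_lib.h>

#define TILE_BYTES 2048

/* two local buffers; with blocking e_dma_copy this is just ping-pong, but
 * the layout allows overlapping DMA and compute later */
static float in_buf [2][TILE_BYTES / sizeof(float)];
static float out_buf[2][TILE_BYTES / sizeof(float)];

/* hypothetical: convolve one tile with this core's feature map
 * (the feature map is assumed to have been uploaded already) */
void convolve_tile(const float *in, float *out);

void process_layer(float *layer_in_dram, float *layer_out_dram, int num_tiles)
{
    for (int t = 0; t < num_tiles; t++) {
        int b = t & 1;    /* ping-pong between the two local buffers */

        /* pull one tile of the previous layer from external memory */
        e_dma_copy(in_buf[b],
                   &layer_in_dram[t * TILE_BYTES / sizeof(float)],
                   TILE_BYTES);

        convolve_tile(in_buf[b], out_buf[b]);

        /* push this core's slice of the next layer back out */
        e_dma_copy(&layer_out_dram[t * TILE_BYTES / sizeof(float)],
                   out_buf[b],
                   TILE_BYTES);
    }
}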


I think the layered feature convolution algorithm is highly parallelizable: you can divide up the image, and if you run out of that you could also evaluate it as a pipeline.
With enough cores, you could indeed just keep the 'feature maps' for each layer permanently on a group of cores, and just pass the layer data between them. That's a bit more like what you have in mind when you say 'Core-NN', I think.

Remember, though, that convolutional networks *share weights between many neurons*; it's not really literally like a biological neural net, where you might imagine each core permanently holding a group of neurons.


This is just how I imagine it; if anyone has actually done this, or tried a completely different approach, I'd be very curious to know what works.

I've only done this on a GPU with no control over 'scratchpads' (you just hope the 'feature maps' stay in the L1 caches, and the 'layer activations' hang around in L2 between passes).

The benefit of Epiphany is that the scratchpads are actually bigger than the GPU caches, and instead of 'hoping', you state explicitly up front how the data moves and what to keep.

Epiphany is really about control of data movement more than parallelism (we already get the parallelism on a GPU).
Last edited by dobkeratops on Fri Oct 06, 2017 2:56 am, edited 6 times in total.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby dobkeratops » Fri Oct 06, 2017 2:27 am

This is the function I'd have advocated adding to the PAL library:

// as in the above explanation, applied across the whole 'input' image, to produce the 'output' image
Code: Select all
void convolute_features_and_maxpool(const Array3d& input, const Array4d& feature_maps, int pooling_size, Array3d& output);

// serial version works as explained above

So you'd call this in series (per layer); the library implementation would use the cores to work across the features and image regions. It would have to pick the right strategy based on the actual number of cores, the scratchpad sizes, and the array sizes.

Between the E3 and E5 you have an architectural shift in the ability for the Array3d's to be held on-chip.

You need another version which traverses the feature map in reverse (i-fx vs i+fx above).

Once you have those, **you can implement both forward evaluation and back-propagation for net training**, by convolving 'error images' (I forget the details but I have done this; you use a convolution of errors with activations to generate an increment *for the feature map*).


You might get more insight by looking at a CUDA or OpenCL implementation (I personally did it in OpenCL).
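To make the calling pattern concrete, here is a sketch of how a forward pass would drive that proposed routine layer by layer, rendered in plain C. Array3d/Array4d are placeholder structs, the pointer-based signature is just a C rendering of the declaration above, and none of this is an existing PAL interface.

Code: Select all
/* Placeholder types for the sketch */
typedef struct { int w, h, features; float *data; } Array3d;
typedef struct { int w, h, in_features, out_features; float *data; } Array4d;

/* the proposed library call (C rendering of the signature above) */
void convolute_features_and_maxpool(const Array3d *input,
                                    const Array4d *feature_maps,
                                    int pooling_size,
                                    Array3d *output);

#define NUM_LAYERS 5

void forward_pass(Array3d layer_act[NUM_LAYERS + 1],
                  Array4d layer_weights[NUM_LAYERS])
{
    /* serial over layers; the library parallelises inside each call */
    for (int l = 0; l < NUM_LAYERS; l++)
        convolute_features_and_maxpool(&layer_act[l], &layer_weights[l],
                                       /*pooling_size=*/2, &layer_act[l + 1]);
}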
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk
