Multi-core Neural Net Engine

Any technical questions about the Epiphany chip and Parallella HW Platform.

Moderator: aolofsson

Multi-core Neural Net Engine

Postby claudio4parallella » Wed Sep 27, 2017 12:14 pm

Hi all!,

I'd like to share guidelines and experiences about the subject.
I'm studying the SDK for Movidius / Myriad X (puah!) even though the device isn't available yet.
What better than using the Parallella Epiphany III with 16 cores to do almost the same?

I'm thinking of creating and running a neural network environment (input, hidden and output nodes) per core, ready to be used in real applications.
It should be usable both during parallel training and for parallel real-time recognition once the trained model is available.

I have seen nnP and FASTNN.
I'm thinking of creating a sort of parallel nnP or FASTNN running one instance per core, if I've managed to explain clearly what I have in mind.
What do you think about my initial idea ?

Thanks for any sources, guidelines, links, or material to study and stress my Parallella cluster with...

My best to all of you
Last edited by claudio4parallella on Wed Sep 27, 2017 7:23 pm, edited 1 time in total.
claudio4parallella
 
Posts: 60
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby jar » Wed Sep 27, 2017 2:58 pm

This sounds like a good project. The E3 on Parallella is rather bandwidth constrained, so to achieve performance all of your network weights will need to be in on-chip memory. But then the on-chip memory limits the size of the network. If you need a larger network, the off-chip bandwidth limitation can be somewhat mitigated by running large batch sizes.

Good luck
jar
 
Posts: 293
Joined: Mon Dec 17, 2012 3:27 am

Re: Multi-core Neural Net Engine

Postby dobkeratops » Wed Sep 27, 2017 5:58 pm

I'm absolutely convinced the Epiphany architecture would be perfect for this sort of thing. It has some similarities to the Movidius, but it could go further in future versions (the designed, but not produced, 1024-core version), although the existing board might not perform well compared to more recent devices.

The most important algorithm IMO would be Convolutional Neural Networks (https://en.wikipedia.org/wiki/Convolutional_neural_network). You'd have to figure out the best way to use the local stores (something like storing one filter per core). E.g. you'd probably parallelise the evaluation of each 'feature map' in layer [n+1] from layer [n], streaming the 'current layer' in across all the cores.

Regarding the fact that the Movidius has one big shared on-chip memory: you could still partition off a fraction of each e-core for that purpose (e.g. of the 32k, 24k used by the core, and 8k x number of cores as a shared area used by all). There are many ways to make it work.

Also consider that CNNs have been shown to work OK with 1-bit precision (https://news.ycombinator.com/item?id=11320896), which could help squeeze more on chip. (I hope the 1024-core version has a 'pop count' instruction, which would help here; they list cryptography extensions so it's likely.)

The 1024-core version has almost* 64MB of on-chip memory - that's enough to hold the data for some vision nets entirely on chip: https://github.com/DeepScale/SqueezeNet

With enough cores, you could pipeline the whole thing (for inference), with a group of cores *per layer*. The scope for parallelism is practically unlimited (layers x feature maps x width & height across the image).

I regret I never got into actually doing anything here (I never even got a board). If anyone did implement this it could grab a lot of attention and increase the chances people get behind the 1024-core version, but there are rival architectures with different approaches ongoing (I note the NVIDIA GPUs are getting 4x4 'tensor units' to accelerate one step... I still think the e-cores could do it better).

CNNs are a proven algorithm that seems to handle so many use cases; I think they've done experiments with self-driving cars purely driven by them, etc. (e.g. image input, steering output).

(* minus dead cores, whatever % that will be)


Multi-layer convolution with a user-defined function applied to the result would be a great primitive in PAL (it might need some #define abuse to make this manageable in C), or some templated library building on what jar has presented.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Wed Sep 27, 2017 6:21 pm

dobkeratops wrote:I'm absolutely convinced the epiphany architecture would be perfect for this sort of thing......


Wow! Many thanks for your news.
I appreciate it so much.

Let's wait and see... what I'll be able to do...
claudio4parallella
 
Posts: 60
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby dobkeratops » Wed Sep 27, 2017 6:44 pm

I wonder if they could use some of the remaining chips to build a grid, for dev purposes.

One problem Parallella/Epiphany has is that the architecture scales, but the *software* is hard to scale.

What you can do with 16 cores x 32k each versus 1024 cores x 64k each is qualitatively different; i.e. on the 1024-core device you can keep an interesting problem entirely on-chip. That is a game-changer (in how you structure your solution).

The reason GPUs/CPUs have won the mainstream is that it's relatively easy to just take the same code and run it on a bigger machine, so they could evolve continuously.

I was very interested in jar's template library (recent versions of the compiler apparently make it easier to generate e-core code from a main program). I think the problem I mention is *solvable*: you could build dataflow primitives (such as convolution) and plug in different 'activation functions' etc. to customise for your application.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Wed Sep 27, 2017 8:38 pm

If I understand correctly (please let me say some naive things just to start from somewhere), we need very lightweight code so that the "Convolutional Neural Network" executable loaded onto each core is less than 32K.
I'm looking for a very simple NN source code sample like genann (https://github.com/codeplea/genann), but it exceeds the internal RAM limit when I try to build it with e-gcc.
Any other C NN source code to start with, please?
claudio4parallella
 
Posts: 60
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Thu Sep 28, 2017 1:33 am

I can run this example of a back-propagation neural network

https://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/bpn.html

on each of the 16 or 32 cores individually.

Output is sent to host memory to be saved to a file.

It works.
claudio4parallella
 
Posts: 60
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby polas » Thu Sep 28, 2017 8:07 am

This is a very interesting topic. I have a poster at SC about doing something similar with Python on the Epiphany. The 2017 data science Kaggle competition was about using machine learning to detect lung cancer in 3D CT scans - I developed a simple NN in Python to do this, using the latest version of ePython (in the dev branch, but it will be merged into master in the next few days) to decorate specific functions in the host code and seamlessly offload them to the Epiphany. Basically I offload the linear algebra kernels involved in the feed-forward and back-prop onto Epiphany cores whilst the host does everything else. Each core works on a subset of the input image pixels in parallel.

If I am honest, I did this as a test of ePython, to see whether it and the offload-directives approach we developed were sufficient for an interesting use case such as this. Having said that, I was quite surprised that ePython performed better than a Python-only version running in CPython on both the ARM CPU and a single Xeon core. However, when I moved over to doing the linear algebra via BLAS, that completely beat my ePython version hands down... which isn't particularly surprising, I guess.

The major limitation I found (and this is partly a limit of the expressiveness of our offload directives in ePython currently) is the size of the memory - even with shared memory we had to reduce the resolution of the CT-scanned images, and it limited the size of the hidden layer too (we only used one hidden layer, again for this reason). We are currently doing some research on abstractions around memory hierarchies to enable seamless (and asynchronous, via interrupts and the DMA engine) copying from the "external" (non-visible-to-cores) main memory into core-accessible local or shared memory, which will remove this restriction; if we get it right, hopefully we should be able to hide much of the cost of data copying.

As an aside, quite a few of the algorithms I saw in the literature store NN weights and values between each feed-forward pass and then combine these together for the back-prop at the end of the batch. We don't have the memory for that on the Epiphany, so instead after each feed-forward I combine NN weights and values into a "running total" and then use this directly in the back-prop. It has exactly the same effect, just done a different way - this raised an interesting point in my mind about the changes we will likely need to make to algorithms in future to make them better suited to architectures with very high numbers of cores but low memory per core.

Cheers,
Nick
polas
 
Posts: 46
Joined: Thu Mar 05, 2015 9:41 pm

Re: Multi-core Neural Net Engine

Postby claudio4parallella » Thu Sep 28, 2017 11:50 am

claudio4parallella wrote:I can run this example of Back propagation Neural Network
https://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/bpn.html
into each of the 16 or 32 cores individually.
Output is sent to Host Memory for being saved to file .
It works


HELP! Collaborative work is invited, as this should interest many people!

The BPNN example from the link above runs slowly on one core as a test, compared to the ARM.

Part of the slowdown may be the debugging messages about the core's progress that it sends to the host, but it's still too slow for me.

Could we please coordinate some attempts on this? I'll send you a copy of BPNN already adapted to run on a core, loaded from the host.

Thanks and regards
claudio4parallella
 
Posts: 60
Joined: Thu Aug 10, 2017 3:48 pm

Re: Multi-core Neural Net Engine

Postby dobkeratops » Thu Sep 28, 2017 12:24 pm

claudio4parallella wrote:If I understand correctly (please let me say some naive things just to start from somewhere), we need very lightweight code so that the "Convolutional Neural Network" executable loaded onto each core is less than 32K.
I'm looking for a very simple NN source code sample like genann (https://github.com/codeplea/genann), but it exceeds the internal RAM limit when I try to build it with e-gcc.
Any other C NN source code to start with, please?


So there are two angles: neural nets, and convolutional neural nets. Really it's the latter that gets all the buzz, because of its applicability to image-based problems (camera input etc.). The implementation is quite specific to the shape of the data; a simple 'neural net' tutorial wouldn't take you in a useful direction IMO, as they tend to hard-code the 'hidden layer'. What we need here is convolving 'layer n' -> 'layer n+1' (the layers each being a 3D array with a different number of channels, and the filters from one to the next being a 4D array: previous_channels x features x kernel_width x kernel_height), and you just run that across as many layers as you want.

I've seen a convolution function in PAL, https://github.com/parallella/pal/blob/master/src/dsp/p_conv.c (but you need to check whether they have an e-core version).

This could be a building block toward implementing CNN, but I think it will take a lot of fiddly attention to actually adapt.. nonetheless this would be the best starting point I can imagine.

Once done, this would be a hugely powerful function to build on. Basically you need to extend it to be multi-channel and add an 'activation function'. Initially I would suggest just doing ReLU (max(x,0)), or generalising that a little to two parameters, a clamped output range (e.g. min_output, max_output); add a 'bias' parameter too. Actually it would make sense to build the 'pooling function' right in there as well, i.e. convolve and find the max of 2x2 blocks (with the pooling size as yet another parameter), because that could happen on chip (accumulating temporaries and writing out a smaller amount of data), and it would cut out a huge amount of off-chip communication.

I'm sure that would find other uses in image processing etc (e.g. hard-coding edge detectors, YUV conversion etc etc)

I think the 'multi-channel' ability could start out with strides; e.g. width x height x depth with interleaved channels would look quite similar to a convolution strided by 'depth'.

If I remember right, you can implement CNN training using a convolution function: you just keep extra channels for the 'error' terms etc. You might need an additional variation for flipping the filter. (Perhaps 'convolve-and-accumulate' would be a useful operation too, e.g. dest_image = source_image convolve filters + dest_image.)

I've only personally implemented CNN in OpenCL

I've no idea how much code it would be , but I'd be surprised if it's problematic.

(You mention that you've been looking into the Movidius SDK; I've not seen it in detail, but I seem to remember that chip has 'fixed function units' for various image-processing steps, as well as the general-purpose SHAVE processors, and they describe setting up pipelines. I imagine a convolution 'engine' might help emulate similar functions on the Parallella. Perhaps it would be possible to mimic their architecture for ease of moving ideas across.)


Just writing 'inference' (the forward evaluation, which is purely convolutions/clamp/max-pooling) and using models trained on GPUs would itself be a big step in showing the usefulness of the Parallella architecture, but I'm sure Parallella could accelerate the training too.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk
