Multi-core Neural Net Engine

Any technical questions about the Epiphany chip and Parallella HW Platform.


Re: Multi-core Neural Net Engine

Postby claudio4parallella » Fri Oct 06, 2017 8:24 am

Hi, all very interesting.

I'm keeping my feet on the ground:
- Epiphany-IV and Epiphany-V do not exist (at present, for public use).
- Movidius is doing something with 6 cores, even if with hardware that appears dedicated to accelerating NN and convolution, or simply to fast memory usage...
- We do not like CUDA here with Parallella.
- We do not have OpenCL (COPRTHR 1.6, old image only).
- We have e-lib only.
That's the context within which I'd like to play with the 16 cores (or the 32-core cluster I have) for easy exercises in a classroom.

Re: Multi-core Neural Net Engine

Postby dobkeratops » Fri Oct 06, 2017 12:33 pm

claudio4parallella wrote: Hi, all very interesting.

I'm keeping my feet on the ground:
- Epiphany-IV and Epiphany-V do not exist (at present, for public use).
- Movidius is doing something with 6 cores, even if with hardware that appears dedicated to accelerating NN and convolution, or simply to fast memory usage...
- We do not like CUDA here with Parallella.
- We do not have OpenCL (COPRTHR 1.6, old image only).
- We have e-lib only.
That's the context within which I'd like to play with the 16 cores (or the 32-core cluster I have) for easy exercises in a classroom.


- Movidius is doing something with 6 cores, even if with hardware that appears dedicated to accelerating NN and convolution, or simply to fast memory usage...

The Movidius units are SIMD and VLIW, so it's really 2+ levels of parallelism (12 cores x many SIMD lanes x VLIW). I think it's 128 bits wide (e.g. 4x32-bit, 8x16-bit, 16x8-bit), so more like '96 calculations in parallel' (presumably 12 cores x 8 sixteen-bit lanes; neural nets do fine with 8- or 16-bit precision). The memory model is also slightly different, in that it's conceptually one big shared memory pool that the cores all consult (relying on the wide SIMD registers for managing local data); the 'layer data' would sit there. You're going to have to dive in deeper to achieve the same result tiled across the individual e-core memories (even though they can see each other). SIMD can also be tricky, but in a different way.

https://uploads.movidius.com/1463156689 ... tBrief.pdf I don't expect the Parallella to match this device (it has 2MB of 'shared memory', vs 16x32KB = 512KB on the e-cores, tiled between them, minus however much memory goes on code).

- Epiphany-IV and Epiphany-V do not exist (at present, for public use).


Sure, but the E3 is pretty much 'a devkit for proving algorithms to move to the E5'...

The E3 will be outperformed by modern devices - e.g. a GPGPU implementation on a cheaper Raspberry Pi will do better (in fact the later R-Pis have 4 cores x 4 SIMD lanes in the CPU; even that might do better).

IMO the point of doing this is to convince people to get E5 built.

I strongly believe convolutional neural networks would be a killer app (and indeed many other devices are being built with this in mind). CNNs are a proven algorithm, and they work well here because they share the weights across an image (e.g. you are looking for similar features across it, then features combine in a cascade).
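To make the weight-sharing point concrete, here is a minimal single-channel 2D convolution in plain loop form (nothing Epiphany-specific, and the names are my own): the same k x k filter is reused at every output position, which is what makes the layer small to store and natural to split across image tiles.

```cpp
// Minimal sketch: one 2D convolution, single channel, stride 1, no padding.
// The same k*k weights are applied at every output position (weight sharing).
void conv2d(const float* img, int w, int h,   // input image, w*h pixels
            const float* weights, int k,      // one k*k filter, shared everywhere
            float* out)                       // output, (w-k+1)*(h-k+1)
{
    const int ow = w - k + 1, oh = h - k + 1;
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < k; ++i)
                    acc += img[(y + j) * w + (x + i)] * weights[j * k + i];
            out[y * ow + x] = acc;
        }
}
```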

- We do not like CUDA here with Parallella.
- We do not have OpenCL (COPRTHR 1.6, old image only).


There's overlap in approach though. These 'conv-nets' running on GPUs still rely on keeping data on chip (in the caches), and on massive parallelism (e.g. across the image, and across the features - to utilise thousands of pipelines). And if your GPU code isn't "cache coherent" it will run at 1/10th of the speed. The difference is that the 'cache coherence' has to be expressed explicitly here. The GPUs do OK with this because they *are* designed for image processing, i.e. traversing frame buffers with repeated texture maps combined by shaders, plus various post-processing steps like SSAO and HDR bloom, so they adapted to conv-nets quite well.

for easy exercises in a classroom.


It's not going to be easy :) You're going to have to take these clear algorithms (written with loops) and figure out how to map them to DMA transfers. We had the same idea presented in gamedev with the PS3 (CELL processor), and the industry (which is actually full of hardcore low-level programmers) rebelled because it was so different; game developers pushed for traditional GPUs instead ('compute kernels' to accelerate game physics). Expect to put in roughly 4x as much programming effort to achieve a particular result, compared to other architectures.
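Roughly what that mapping looks like, as a hedged sketch: the outer image loop becomes a loop over tiles that fit in the 32KB per-core scratchpad, pulled in and pushed out with e-lib's blocking e_dma_copy(dst, src, nbytes). The tile sizes, buffer names and process_tile() are placeholders I'm assuming for illustration (and a real convolution tile would also need a halo of k-1 extra rows/columns).

```cpp
#include <e_lib.h>          // e-core side; assumes the header is usable as-is

#define TILE_W 32
#define TILE_H 32

static float tile_in [TILE_H][TILE_W];   // buffers in the core's local scratchpad
static float tile_out[TILE_H][TILE_W];   // (2 x 4KB here, well under 32KB)

extern void process_tile(float* in, float* out);   // placeholder for the per-tile maths

void run_over_image(float* src_ddr, float* dst_ddr, int img_w, int img_h)
{
    for (int ty = 0; ty < img_h; ty += TILE_H)
        for (int tx = 0; tx < img_w; tx += TILE_W) {
            // pull one 2D tile in, a row at a time (DMA moves contiguous bytes)
            for (int j = 0; j < TILE_H; ++j)
                e_dma_copy(tile_in[j], src_ddr + (ty + j) * img_w + tx,
                           TILE_W * sizeof(float));

            process_tile(&tile_in[0][0], &tile_out[0][0]);

            // push the finished tile back out to external memory
            for (int j = 0; j < TILE_H; ++j)
                e_dma_copy(dst_ddr + (ty + j) * img_w + tx, tile_out[j],
                           TILE_W * sizeof(float));
        }
}
```

Double-buffering (kicking off the next tile's DMA while computing on the current one) is the obvious next step, but even this blocking version shows how different the shape of the code is from the original loops.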

The issue is that it is very hard to adapt/re-use code - you have to think about how an algorithm sits with other parts, and heavily re-work it based on the context. But it's not impossible.

I do still believe it's possible to come up with a high-level programming model, based on higher-order functions expressing a dataflow, into which you plug different kernel functions (like 'map-reduce', but with more specific options).

Jar has been experimenting with templates (which are now possible, i.e. getting a main program to generate e-core code).

What I would like to see is a 'templated convolutional iterator' in library code (which could hopefully emerge from this effort..); then you'd have something re-usable.

In some ways the Epiphany gives you a red herring, in that the e-cores can 'read' from anywhere - this makes it easier to get code to run on it, but at that point it will actually be *much worse* than on a CPU, where the code was designed to use a cache. That makes the exercise look deceptively simple.

In this example, if you figure out how to use the DMA and scratchpads for layered convolutions (how to do the tiling), you could later look into generalising it by allowing a templated 'activation function' and 'pooling function' to be plugged in. But the traditional tricks like passing function pointers around won't work (because the e-cores have different instruction memories) - something like the sketch below, with compile-time parameters, is closer.
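As a rough illustration of that kind of re-usable building block (my own made-up names, not an existing library API): the activation and pooling steps are template parameters, i.e. function objects resolved at compile time, so they get inlined into each e-core's own binary instead of being reached through a function pointer.

```cpp
// Convolve a local tile with one K*K filter, then pool P*P windows of the
// result.  Activate and Pool are plain function objects, fixed at compile time.
template <typename Activate, typename Pool, int K, int P>
void conv_pool_tile(const float* in, int w, int h,   // input tile in local memory
                    const float* weights,            // K*K shared filter weights
                    float* out,                      // pooled output tile
                    Activate act, Pool pool)
{
    const int cw = w - K + 1, ch = h - K + 1;        // size of the convolution output
    for (int py = 0; py + P <= ch; py += P)
        for (int px = 0; px + P <= cw; px += P) {
            float pooled = pool.initial();
            for (int wy = 0; wy < P; ++wy)           // pool over a P*P window
                for (int wx = 0; wx < P; ++wx) {
                    float acc = 0.0f;
                    for (int j = 0; j < K; ++j)      // one K*K convolution tap
                        for (int i = 0; i < K; ++i)
                            acc += in[(py + wy + j) * w + (px + wx + i)]
                                 * weights[j * K + i];
                    pooled = pool(pooled, act(acc));
                }
            out[(py / P) * (cw / P) + (px / P)] = pooled;
        }
}
```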

In the pseudocode I posted, I just used 'maximum' for both pooling and activation (starting with 0 to achieve 'relu'), but people also use different functions (average pooling, and many different ideas for activation), and maybe even vary the means of combining the image & filter, and layers (but the default is multiply-add).
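Spelling out that 'maximum for both' idea against the sketch above (again, hypothetical names): seed the pooling accumulator with 0, and a plain max does the ReLU clamp and the max-pool in one operation; average pooling or another activation would just be a different pair of function objects.

```cpp
struct Identity {                                    // activation: pass-through
    float operator()(float x) const { return x; }
};
struct MaxFromZero {                                 // pooling: max, seeded with 0
    float initial() const { return 0.0f; }           // the 0 seed gives ReLU for free
    float operator()(float a, float b) const { return a > b ? a : b; }
};

// usage (illustrative sizes): 3x3 convolution, 2x2 max-pool + ReLU in one pass
// conv_pool_tile<Identity, MaxFromZero, 3, 2>(in, w, h, weights, out,
//                                             Identity{}, MaxFromZero{});
```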

Re: Multi-core Neural Net Engine

Postby dobkeratops » Fri Oct 06, 2017 1:18 pm

I actually think trying to dive in with a complete working general-purpose neural net engine is probably too much; I would guess that trying to adapt an existing framework will just cause confusion and fill the code with red herrings, sending you down blind alleys.

Yes, it can "run C".. but it can't re-use the overall approaches seen in existing CPU source code (push with DMA, vs pull from random sources via pointers, etc). I saw this with CELL: starting with code from other platforms was a disaster.

It's probably better to start just with convolutions, then figure out how to do those efficiently (i.e. how to tile across the cores.. has anyone done this yet?), then extend it to *multi-feature* convolutions. Just do image-processing examples.
(The function I see in PAL is just single-channel; you might be able to adapt it with interleaving, or it might be better to code multi-channel specifically.) Then use that as a primitive for running 'forward inference' for convolutional nets - just try to use a net already trained on a GPU.
(As I mentioned, it *is* then possible to express the back-propagation calculation for conv-nets using convolution operations.)
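For reference, the jump from single-channel to multi-feature looks like this in plain loop form (my own reference code, nothing Epiphany- or PAL-specific): every output feature map sums a k x k convolution over *all* input channels, and this is the inner loop the tiling/DMA questions above eventually have to serve.

```cpp
// Multi-feature convolution layer: out_ch output maps, each summing a k*k
// convolution over every one of the in_ch input planes.  Stride 1, no padding.
void conv_layer(const float* in,  int w, int h, int in_ch,   // in:  in_ch planes of w*h
                const float* wts, int k,        int out_ch,  // wts: out_ch * in_ch * k * k
                float* out)                                  // out: out_ch planes of (w-k+1)*(h-k+1)
{
    const int ow = w - k + 1, oh = h - k + 1;
    for (int o = 0; o < out_ch; ++o)
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < in_ch; ++c) {
                    const float* plane  = in  + c * w * h;               // one input channel
                    const float* filter = wts + (o * in_ch + c) * k * k; // its filter for output o
                    for (int j = 0; j < k; ++j)
                        for (int i = 0; i < k; ++i)
                            acc += plane[(y + j) * w + (x + i)] * filter[j * k + i];
                }
                out[o * ow * oh + y * ow + x] = acc;
            }
}
```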

(are there implementations of edge-detector algorithms for the e-cores?)

Vision nets take a long time to train (hours, days.. weeks...). Can you imagine how much hell you're in for if you're starting out with a device performing 10x slower (hours/days/weeks become days/weeks/months)? Far better to train the net on your PC's GPU, or get one ready-trained.

I suppose you could look at a scaled-down example, e.g. digit recognition, with the assumption that you could increase the layers to deal with more elaborate vision examples later.

Re: Multi-core Neural Net Engine

Postby dobkeratops » Sun Oct 22, 2017 1:22 am

This paper might be interesting (I'm guessing it will have been posted elsewhere, but I remembered this thread):
http://ieeexplore.ieee.org/abstract/document/7726118/
