Parallella Community

Posted: **Tue Oct 13, 2015 6:11 pm**

I made that image quite a while ago, and I think you're correct to point out that the device has 16 compute units, each with one processing element (the image is incorrect). It was generated before I understood the specifics of the COPRTHR OpenCL implementation.

You can use OpenCL to program the Epiphany cores. You should create 16 work groups with 1 thread per work group. Creating more than 16 work groups or more than one thread per work group will oversubscribe the hardware and result in worse performance.
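As a rough illustration of that launch shape, here is a minimal sketch in plain C. The names `ndrange_1d`, `epiphany_ndrange`, `num_work_groups`, and `oversubscribes` are hypothetical helpers made up for this example; they are not part of OpenCL or COPRTHR. The two sizes are the values you would pass as `global_work_size` and `local_work_size` to `clEnqueueNDRangeKernel` for a one-dimensional launch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of Epiphany cores assumed here: 16, as discussed in this thread. */
#define EPIPHANY_CORES 16

/* Hypothetical helper type: a 1-D OpenCL launch shape. */
typedef struct {
    size_t global_size; /* total work-items (global_work_size[0]) */
    size_t local_size;  /* work-items per work group (local_work_size[0]) */
} ndrange_1d;

/* 16 work groups with 1 work-item each: one work group per core. */
ndrange_1d epiphany_ndrange(void)
{
    ndrange_1d r = { EPIPHANY_CORES, 1 };
    return r;
}

/* Number of work groups implied by a launch shape. */
size_t num_work_groups(ndrange_1d r)
{
    /* The global size is assumed to be a multiple of the local size. */
    assert(r.local_size > 0 && r.global_size % r.local_size == 0);
    return r.global_size / r.local_size;
}

/* Nonzero if the shape uses more than 16 work groups
   or more than 1 work-item per work group. */
int oversubscribes(ndrange_1d r)
{
    return num_work_groups(r) > EPIPHANY_CORES || r.local_size > 1;
}
```

With these helpers, `epiphany_ndrange()` gives a global size of 16 and a local size of 1, while a shape such as `{32, 1}` or `{16, 2}` is flagged by `oversubscribes`.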

OpenCL, as the standard exists, lacks a mechanism for inter-core communication, which is just one of the many failures of the Apple/Khronos API design. The model was designed with GPUs from 2008 in mind because CUDA was winning. The OpenCL C language specified memory locality, but not accessibility. Accessibility was implicitly determined by the locality within the standard. Thus, the standard fails to account for architectures like the Epiphany. Either you break the OpenCL standard by introducing non-standard communication mechanisms, resulting in non-portable code, or you keep to the standard and accept the poor performance achieved by global memory synchronization (a weak point with the current Parallella/Epiphany design).

If you're writing OpenCL applications for the Epiphany cores, you will probably be reading and writing to global memory rather than reading/writing to neighboring core local memory. The OpenCL private and local shared memory are the same thing within Epiphany. But since you have a work group size of 1, shared memory is a silly concept (shared with one core).

The OpenCL concept of constant memory does not exist in hardware on the Epiphany cores, but each core has access to 32KB of core local memory which can have small constant data structures replicated across each core.

Hope that helps you. Sorry for the misunderstanding with the old figure.

Posted: **Mon Oct 19, 2015 9:14 pm**

Hi jar,

Thank you very much for your answer.

To summarize what you said:

Programming the Parallella with "standard" OpenCL is a very stupid idea, because the board has to access variables in global memory, right?
And you think that using the Epiphany SDK for inter-core communication gives better performance.

If I have misunderstood you, please let me know.
Thanks again.

Eric

Posted: **Tue Oct 20, 2015 9:50 pm**

I think you got it so far, but there's a little more to it...

If your application has a very high arithmetic intensity (>100 ops/byte) then you can certainly use OpenCL with off-chip (global) memory access. Most applications do not fall into this category. I'll explain...

Global reads with current firmware run around 80 or 90 MB/s and writes are about 3x that. Peak performance is around 19,200 MFLOP/s. Let's say you write excellent code and it achieves 50% of peak performance. You'll need (50% * 19,200 MFLOP/s) / 90 MB/s ≈ 106 FLOPs/byte for the application not to become bandwidth-bound. Since a floating point value is 4 bytes, that corresponds to about 424 floating point operations per floating point value.
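The estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function names `min_flops_per_byte` and `min_flops_per_float` are made up for illustration. Computed without intermediate rounding, the result is about 106.7 FLOPs/byte, or roughly 427 operations per 4-byte float; the 424 figure comes from rounding down to 106 first.

```c
#include <assert.h>

/* Minimum arithmetic intensity (FLOPs per byte of global memory traffic)
   needed so that compute, rather than global bandwidth, is the limit.
   peak_mflops:    peak compute rate in MFLOP/s
   efficiency:     fraction of peak the code actually achieves (0..1)
   bandwidth_mb_s: sustained global memory bandwidth in MB/s
   MFLOP/s divided by MB/s gives FLOPs per byte. */
double min_flops_per_byte(double peak_mflops, double efficiency,
                          double bandwidth_mb_s)
{
    assert(bandwidth_mb_s > 0.0);
    return (peak_mflops * efficiency) / bandwidth_mb_s;
}

/* The same threshold expressed per single-precision (4-byte) value. */
double min_flops_per_float(double peak_mflops, double efficiency,
                           double bandwidth_mb_s)
{
    return 4.0 * min_flops_per_byte(peak_mflops, efficiency, bandwidth_mb_s);
}
```

Calling `min_flops_per_byte(19200.0, 0.5, 90.0)` uses the numbers quoted above; substituting 80.0 for the bandwidth raises the threshold to 120 FLOPs/byte.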

This is why on-chip data re-use is such a hard requirement. You're already fighting against the tyranny of the global bandwidth. And to further pile on, the Khronos OpenCL specification does not address inter-core memory access.

Because Adapteva and Browndeer Technology are pretty open about things, you have access to the ESDK routines if you would like to extend your OpenCL code with inter-core communication optimizations.

Posted: **Wed Oct 21, 2015 4:03 am**

Hi jar,

Thanks for your explanation.

I understand it clearly now.

Eric

Some questions about OpenCL programming and epiphany
