
Re: OpenCL __local memory

PostPosted: Mon Dec 07, 2015 9:18 am
by dobkeratops
jar wrote:Epiphany doesn't really have a hardware workgroup and it can access all memory at any location.

I know. but, did you see my diagram?

It can't access all memory equally well - there is increasing latency with distance. So how about pretending a bunch of nearby cores forms a 'workgroup', and a fraction of their scratchpads is treated as '__local' instead of '__private'. They'd all need a pointer to their top-left core, and their 'local' objects would be relative to that. I think the concept actually maps reasonably well. I've sometimes seen people on these forums suggest 'use memory on a neighbouring core' when trying to deal with larger problems.

Read on for an idea on how to leverage arbitrary writes anywhere as well *...

There's no programming mechanism within OpenCL to allow thread 0 in workgroup 0 to access __private or __local memory from thread 0 in workgroup 1, although Epiphany could do it.

You don't need that to get *some* use of the inter-core addressing. The point is that threads which access each other's memory have to have some sort of common intent.

You *could* even make the whole chip one workgroup, if you really wanted - am I missing something? But I still wouldn't want to.

In my opinion, it would be good to just let OpenCL be and not try to force Epiphany to conform to it. The Epiphany architecture is much more capable than the virtual OpenCL device model.

My view is rather different.

OpenCL is a source of established real-world software that is parallel-aware - the salient point being the assumption that kernel invocations within a batch do not overlap, and the need to be explicit about the dividing lines where data can be synchronised between stages ('all these kernels must complete before data is available to the next...').

The Epiphany's problem is lack of software support, and the complexity of writing bespoke software (look what happened with CELL). It's a chicken-and-egg situation: without code, people won't demand it, hence the potential won't be realised with the 256- and 1024-core versions. People have a strong incentive to write for the OpenCL model because it's available all over the place.

How can you justify committing to writing Epiphany software today, when there are plenty of ARM boards with OpenCL-capable GPUs already, and you don't know if you'll ever get the 64-, 256-, or 1024-e-core chips? A device could appear, designed to run OpenCL, that *does* support divergent flow; you've got a higher chance of success committing to OpenCL software than to e-core-specific software.

So what we need is a way of developing software portably that can run both on the *Epiphany* AND something else more established, be that a multicore CPU, a GPU, or clusters. (It's a shame the window with CELL was missed; IBM ditched that.)

Regular C/C++ doesn't cut it; the level of transformation required is so vast that it's almost misleading to even say 'it runs C'.
(Although elsewhere I've posted contrasting ideas on that; what would help there is a compiler capable of extracting ELFs from a single translation unit via directives.)

Never mind inter-core access: the real issue is how you deal with large off-chip data structures. And never mind the Parallella board itself (think of that as a prototype/devkit); a useful implementation will need to be like the IBM CELL in the PS3, in its ability to work on tasks in main memory via DMA, with the local stores really just being a 'software-managed cache'. I understand the Parallella is basically crippled for DDR access.

Even if we get the huge scratchpads imagined with future stacked memory, I would suggest adapting the above diagram further to consider a fraction of each scratchpad as 'global memory', which happens to contain what you'd put in chip memory today. Then you DO get inter-core writes working just fine. Given the latency issue, you'd still have to DMA chunks for 'reads', surely.

The complexity seems to be that it's a DMA-based architecture rather than a cached architecture. (OpenCL would allow running on a unique machine with non-coherent caches and divergent flow, IMO, since you have well-defined synchronisation points... who knows what future chips will have.)

.. but it would be possible to do shape analysis on OpenCL kernels, transforming them with examples built into the compiler. (I envisage this being like a discovery mechanism for dataflow templates.)

This isn't easy, but I do believe it's a tractable problem.

There will be a simple set of cases where the read and write indices have a linear relationship to the kernel index - so these could be run within a loop that DMAs the next block in and provides the kernel with an adjusted address.
<code>
for (each chunk) {
    DMA next chunk into double buffer;
    set source pointer to the other buffer;
    for (each index in chunk) {
        the kernel...
    }
}
</code>

In the case of a random-indexed 'gather', you'll have to do some sort of software-managed cache, which will be harder on the Epiphany but still not impossible. We know fast code still usually exhibits locality.

*I realise the e-cores could handle inter-core 'scatter' operations efficiently all over the chip (since a store doesn't care about latency): in the big-core future with '__global memory on chip', this could be achieved with tiled arrays. Compile an OpenCL indexed write, e.g. <code>foo[dst_index]=my_value;</code>, into <code>foo_tileptr[dst_index/TILESIZE][dst_index&(TILESIZE-1)]=my_value; // the foo_tileptr's are copied into my scratchpad; the addresses are all over the chip</code> - like a 'software MMU', I guess. T foo[N] becomes something like vector<unique_ptr<array<T,TILESIZE>>>.

In the simpler case of a linear access pattern, you might even be able to devise a task manager that sends kernel invocations to the cores that happen to be holding the data. That goes against the local-memory idea, but not all kernels use that.

In the 'big scratchpads' case you might even think about 'sending the code to the data'.

Whilst this all might sound horrendously complicated, the work described here could also be applied to compiling OpenCL for clusters (DMA becomes network messages) or fancy quad-SLI systems (each of 4 GPUs treated like a large scratchpad, with the same transformations allowing you to work with, say, 16 GB datasets using the same code you'd run on one GPU). (It's such a shame CELL has been and gone... again, this thinking would be applicable to OpenCL for that.)

There are literally millions of programmers on the planet, and compilers already do extremely complex transformations... we've got tools like LLVM and Clang out there in the open, giving us AST analysis and IR to work with.

Re: OpenCL __local memory

PostPosted: Mon Dec 07, 2015 11:34 am
by dobkeratops
Code:
e.g. for   +-------------------------------+
 16 core   |   16x(1/4 of each ScratchPad) |           scale up appropriately,
  chip     |     'global memory'           |             given 128k pads you'd
    +----------------+----------------+    |          assign more% as 'global'
    | 4x(1/4 SPads)  |     workgroup  |    |
    |'local memory'  |    2x2 cores   |    | local memory=
+-------+-------+-------+-------+     |    | ability for near cores
|1/2 SPs|       | core  |       |     |    | to share data fast
|private|       |       |       |     |    |
|Ins+Stk|       |       |       |     |    | global memory=
+-------+-------+-------+-------+     |    | all cores, but long
|       |       |       |       |-----+    | latency on average
|       |       |       |       |     |    |
|       |       |       |       |     |    |
+-------+-------+-------+-------+     |    |
|       |       |       |       |     |    |

one scratchpad
[eg     1/2        |   1/4   |   1/4   ]   of this spad
[code|stack|private|   local |  global ]
[    |     |       |   tile  |  tile   ]
                       |         \-------------- one of 16 tiles on whole chip
                       \-------------- one of 4 tiles in workgroup

voila! a sliding scale between locality and visibility
 leveraging the inter-e-cores load/stores

With bigger future scratchpads
you may afford a higher fraction for 'global'

Memory is discontiguous, but map linear 'openCL arrays'
to tiled-arrays, (maybe strides between cores?)

OpenCL scheduler could look at data-pointers
and try to put threads close to the global
data that they read

Re: OpenCL __local memory

PostPosted: Tue Jul 12, 2016 4:53 pm
by smoothy
Hi jar,

thank you so much for your explanations.
As far as I understand, you get the best performance in local memory by keeping all the memory banks active through balancing. I don't quite understand, though, how you do this on the Epiphany if the concept of local memory normally used in OpenCL doesn't apply to the Epiphany architecture.