
[Paper] A Distributed Shared Memory Model and C++ Templated

PostPosted: Fri Apr 28, 2017 12:59 am
by jar
Title: A Distributed Shared Memory Model and C++ Templated Meta-Programming Interface for the Epiphany RISC Array Processor
Abstract: The Adapteva Epiphany many-core architecture comprises a scalable 2D mesh Network-on-Chip (NoC) of low-power RISC cores with minimal uncore functionality. Whereas such a processor offers high computational energy efficiency and parallel scalability, developing effective programming models that address the unique architecture features has presented many challenges. We present here a distributed shared memory (DSM) model supported in software transparently using C++ templated metaprogramming techniques. The approach offers an extremely simple parallel programming model well suited for the architecture. Initial results are presented that demonstrate the approach and provide insight into the efficiency of the programming model and also the ability of the NoC to support a DSM without explicit control over data movement and localization.

Comments and discussion appreciated

Re: [Paper] A Distributed Shared Memory Model and C++ Templ

PostPosted: Thu May 04, 2017 11:35 pm
by dobkeratops
Just reading through it... I see 'parallel_for', taking a lambda .. looks very interesting. I need to read it again more closely.

If I've understood correctly this is exactly the sort of thing I was after in earlier brainstorming posts?

I notice details re: how you map the calculations onto the grid... I see a mention of offsets in array indexing.

The application of the CLETE-2 package requires a compiler that correctly implements the C++17 standard specification and also correctly optimizes C++ template partial specializations to produce efficient code. In this work we utilize the GCC 5.4 compiler for targeting the Epiphany processor. We additionally rely on the COPRTHR-2 SDK which provides run-time support for the Epiphany processor including support for fast SPMD direct co-processor execution, without requiring offload semantics or co-design with the ARM CPU on the Parallella platform. As a result, the compilation and run-time environment used in this work resembles that of an ordinary Linux platform with a multi-core processor.

So basically that's the holy grail as I see it: the ability to write portable code that can run on the e-cores, but also on other parallel processors, so long as they're not too dissimilar (write once, run on e-cores, clusters, GPUs...).

What I had in mind was building more elaborate 'higher-order functions' (various combinations of map / gather / filter, etc.) which could express the dataflow, to give the Epiphany implementation more opportunity to leverage the scratchpads/DMA;
if I've understood correctly, perhaps those could be built directly (as helper code) on top of what you demonstrate here.

But perhaps this technique is doing all that already through templated types for the indices, with a lot of TMP magic to compile to something efficient.

Is this proprietary (I see 'U.S. Army Research Laboratory')... or can it appear in the SDK? Are you able to put any of this on GitHub?

I still don't have a Parallella myself... I continue to mess with regular GPUs and OpenCL. Knowing the 1024-core chip exists does dramatically increase the motivation to write suitable code for it.

Re: [Paper] A Distributed Shared Memory Model and C++ Templ

PostPosted: Fri May 05, 2017 5:15 am
by jar
I thought you'd like this. Yes, this is similar to the thing you were brainstorming, but you were ahead of your time. GCC wasn't ready (at least version 4.8 with the older Linux image), and neither was some of our software.

And it's not ready for prime time yet. This was an early experiment on Epiphany, a side project from the main effort. There is a lot left to improve, but the intention is to place this on GitHub at some point, and it won't just be for Epiphany. We would like to delay that as long as possible after witnessing what happened with Kokkos: they released an unfinished product to the DOE in a panic to have some semblance of code portability between their next Xeon Phi and Power/GPU supercomputers. The end result was that many things are completely missing or unrefined, and properly fixing them would break existing codes.

I don't think we want to begin implementing 'higher-order functions' but rather enable expressions to be written that compile to efficient code. But I'll keep it in mind. It's not a library, though it might be considered a header-only library. It can't be pre-compiled and shipped as a proprietary package, so it must be open source if anyone is going to use it.

The memory layout accessors vary between architectures and platforms. It's a single line of code, appearing in an application header, that defines the memory layout, and it has a certain complexity to it. Each platform will have defaults, but it's a memory-layout-first approach to parallel computing. The parallel kernel code remains the same and the expression templates handle the rest. Each platform may have specific optimizations baked into its layout description.

Re: [Paper] A Distributed Shared Memory Model and C++ Templ

PostPosted: Fri May 05, 2017 2:34 pm
by dobkeratops
but rather enable expressions to be written that compile to efficient code.

If the underlying template 'magic' does exactly the same job, then great.
I could simply implement my own idea as helper code on top. It sounds like this library is actually more ambitious/general already.

I'm sure it would just take a few examples to make it clear how it works.