generating epiphany code from templates, possible?

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

generating epiphany code from templates, possible?

Postby dobkeratops » Sun Aug 09, 2015 8:08 am

Working on an SMP machine in C++, it's possible to create parallelism helpers as templated higher-order functions (with templated classes generating and managing thread functions), which are very user-friendly to call (see the example below). This is in the spirit of 'map-reduce', but with more specialisation and perhaps finer grain.

GOAL
- the ability to write complex (e.g. >=100kloc) C++ such that the same sourcebase (written using suitable higher-order functions) can be compiled for a variety of parallel platforms:
- (i) traditional SMP, (ii) epiphany, (iii) networked clusters, (iv) APUs, (v) the PS3's Cell, ...
- done by writing per-platform higher-order functions, plus other work in the toolchain
- it should be possible to nest higher-order functions, e.g. par_map(src,[](auto& y){return par_map(y,[](auto& x){return..});})
- (existing multicore CPU code is already like this, because a single thread does inner parallelism with pipelines and SIMD)
- along the lines of 'map-reduce', but with many other functions expressing specific data flows that can be optimised with DMA, cores emulating caches/memoized functions, and so on
- no modifications to the bulk of the application per platform
- the details of parallelism are contained in a library of higher-order functions taking functions or lambdas (with per-platform #ifdefs)
- code spread throughout the entire sourcebase should be offloadable to any processor/coprocessor, at statement-level granularity (lambdas)


e.g. given code which starts life as loops, you can pretty much just turn the loop bodies into lambdas and the loop itself into a call to a higher-order function.
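A minimal sketch of that transformation (assuming the par_map defined at the end of this post; the types and 'expensive_transform' here are invented placeholders):

Code:
#include <vector>

struct Input  { float v; };
struct Output { float v; };

// placeholder for the real per-element work
Output expensive_transform(const Input& in) { return Output{ in.v * 2.0f }; }

// before: a plain serial loop
std::vector<Output> update_all_serial(std::vector<Input>& src) {
    std::vector<Output> results;
    for (auto& x : src)
        results.push_back(expensive_transform(x));
    return results;
}

// after: the loop body becomes a lambda, the loop becomes a call to a higher-order function
std::vector<Output> update_all_parallel(std::vector<Input>& src) {
    return par_map(src, [](auto& x) { return expensive_transform(x); });
}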

Is there currently any way to do something like this with the epiphany toolchain? From what I can see so far, reading the examples, it seems you need to create separate ELFs for the tasks that will run on epiphany cores, refer to them by a string to spawn them, and add makefile rules to compile them.

That's understandable, given it's a different ISA; it's the natural starting point.

This is how things worked on the PS3, and it's why the CELL architecture was ditched for the PS4: the extra software complexity meant you'd just lead your project on the rival SMP platforms, then port.
In the end the industry basically said "no more" and Sony gave up, which is a shame, because the idea is fundamentally sound, IMO.

If this isn't yet possible - any ideas on what it would take to make it possible?

Options:

[1] One (maybe extreme) idea I had was a JIT code-morphing runtime library that could take any ARM host function pointer and generate epiphany machine code from it. I realise that might be a rather large task; however, you're going RISC to RISC without SIMD, so it might not be so bad, and a JIT could determine empirically which loads are local and which are not.

[2] Perhaps this could be achieved with a clang AST plugin (scan the AST, find functions that are going to be spawned in parallel, emit the thread-main source code, ...).

[3] Perhaps it would be possible to compile to LLVM bitcode and use an extended LLVM JIT able to target either the host or the epiphany cores (but how would function pointers work?).

Is anyone in the community interested in something similar?


Example SMP/C++ code..

Code:
// usage: apply the given lambda to each element of a collection, generating
// a new collection, in parallel across worker threads, e.g.
//
//    auto result = par_map(my_collection, [](auto& x){ /* generate an output element */ return ...; });
//
// this is extremely convenient to use; the complexity is hidden.
// other higher-order functions can be written, e.g. map_reduce(...) etc.,
// hence huge amounts of code can be parallelised for little programmer effort,
// almost as easily as writing 'for' loops.

// Implementation: something like the following. 'par_map' is a templated
// higher-order function that manages spawning threads; details may vary,
// e.g. a better implementation might use a thread pool.

#include <pthread.h>
#include <utility>
#include <vector>
using std::vector;

// 'for each element in src, generate a new element by applying the given function'
template<typename T, class F,
         typename U = decltype(std::declval<const F&>()(std::declval<T&>()))>
vector<U> par_map(vector<T>& src,
                  const F& f,
                  int batch_size  = 1,   // TODO: should enforce a minimum per thread too
                  int num_threads = 4) {
   // shared object through which the worker threads arbitrate
   struct ThreadStuff {
      vector<T>*      input;
      vector<U>*      output;
      const F*        f;
      int             next_index = 0;
      int             end_index  = 0;
      int             batch_size = 1;
      pthread_mutex_t mutex;
      // this is the function you'd want as 'main' in an epiphany e-core task:
      // claim a batch of indices under the mutex, apply 'f' to each element,
      // write the results, and repeat until the collection is exhausted.
      static void* thread_main(void* arg) {
         auto* ts = (ThreadStuff*)arg;
         for (;;) {
            pthread_mutex_lock(&ts->mutex);
            int i0 = ts->next_index;
            int i1 = i0 + ts->batch_size;
            if (i1 > ts->end_index) i1 = ts->end_index;
            ts->next_index = i1;
            pthread_mutex_unlock(&ts->mutex);
            if (i0 >= i1) return nullptr;
            for (int i = i0; i < i1; ++i)
               (*ts->output)[i] = (*ts->f)((*ts->input)[i]);
         }
      }
   };
   vector<U> dst(src.size());
   ThreadStuff ts;
   ts.input = &src;  ts.output = &dst;  ts.f = &f;
   ts.batch_size = batch_size;  ts.end_index = (int)src.size();
   pthread_mutex_init(&ts.mutex, nullptr);
   // spawn worker threads which step through the collection, invoking the given
   // lambda; then wait for them all and return the new collection.
   vector<pthread_t> threads(num_threads);
   for (auto& t : threads) pthread_create(&t, nullptr, &ThreadStuff::thread_main, &ts);
   for (auto& t : threads) pthread_join(t, nullptr);
   pthread_mutex_destroy(&ts.mutex);
   return dst;
}



EDIT: it seems OpenMP can do some of what I'm after, but I also want the ability to express arbitrarily complex data-flow within the abstractions, to leverage inter-localstore transfers.
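(For comparison, a minimal OpenMP sketch of the same element-wise map, compiled with -fopenmp; it covers the simple data-parallel case, but not the richer data-flow described above. 'transform' is a placeholder:)

Code:
#include <vector>

float transform(float x) { return x * 2.0f; }   // placeholder per-element work

std::vector<float> omp_map(const std::vector<float>& src) {
    std::vector<float> dst(src.size());
    // OpenMP supplies the worker threads; the loop body plays the role of the lambda
    #pragma omp parallel for
    for (int i = 0; i < (int)src.size(); ++i)
        dst[i] = transform(src[i]);
    return dst;
}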
Last edited by dobkeratops on Tue Aug 18, 2015 12:51 am, edited 15 times in total.

Re: generating epiphany code from templates, possible?

Postby theover » Sun Aug 09, 2015 11:36 am

The technical matters and your grasp of them might be a bit in the way here. A C++ compiler would be a thing to have first. The whole intercommunication with the Epiphany cores needs to be created and put in the compiler, and I think there's only C, so the class code would have to be ported, too.

Nothing against making a template, and you could do it yourself, but are you willing to write a compiler for it?

T.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Sun Aug 09, 2015 11:56 am

The technical matters and your grasp of them might be a bit in the way here.


I've dealt with a very similar architecture, the PS3's Cell processor. At the time I used it, the situation was much like what I see in the samples: the extra boilerplate was needed, and that is what killed the chip :)

What I'm talking about was very nearly possible there: the same code path compiling on both architectures, with ifdefs, but you still had to write some of the boilerplate manually, due to limitations in what could be generated automatically (trivial stuff, but still irritating; anything that must be manually maintained is a problem, because it's error-prone when code changes).

The development time was significantly higher, so everyone led on the PC/Xbox 360, with the PS3 port being an afterthought.

However, time has moved on since then: (i) CELL itself has been out for longer, (ii) there's 'clang' with an AST library for C++, and (iii) C++11 brought lambda functions... I'd have killed for those back in 2006/2007 :)

The whole intercommunication with the Epiphany cores needs to be created and put in the compiler,


Not necessarily, and that's not what I had in mind.

The implementation of the intercommunication could be done in source code, within the templated higher-order function definitions, keeping it out of your main algorithms.

A few ifdefs within the abstraction definitions would let you write the same code for SMP and Epiphany:

Code:
par_foreach(..) {
#ifdef ...                        // SMP
   // set up thread-management objects,
   // spawn threads with their 'thread_func' objects;
   // helper code within the ThreadStuff methods manages stepping each
   //   worker thread through the task queue
#endif
#ifdef PLATFORM_PARALLELLA
   // allocate cores, load code onto them, DMA data into them;
   // templated helper code wraps the given function in a 'main' that
   //   handles the inter-core communication
   // ** the key enabler is the ability to take that one templated function
   //    and generate epiphany object code from it **
#endif
}

Of course you have to be aware of the local stores, but you'd handle this with a nice library of higher-order functions.
e.g. par_map() would take a function that simply takes an input struct and returns an output struct.

The templated boilerplate would handle the DMA transfers and inter-core communication for several cores stepping through different elements of a collection.
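A sketch of the constraint this places on the user's function (assuming the par_map from the first post; the joint types are invented for illustration): the lambda works purely on values the boilerplate can DMA in and out of a core's local store, with no pointers back into main memory.

Code:
#include <vector>

struct JointIn  { float pos[3]; float weight; };   // what gets DMA'd into a core
struct JointOut { float matrix[16]; };             // what gets DMA'd back out

std::vector<JointOut> update_joints(std::vector<JointIn>& joints) {
    // the same call could run on SMP (threads) or on epiphany (code + DMA onto cores),
    // because all data movement is owned by the boilerplate behind par_map
    return par_map(joints, [](const JointIn& in) {
        JointOut out{};
        out.matrix[0] = in.pos[0] * in.weight;     // placeholder computation
        return out;
    });
}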

A C++ compiler would be a thing to have first.


I had assumed that generating epiphany machine code is done in the back end of gcc, which would mean it could be driven by C or C++ equally; you could use C++ on CELL with no problem. Let me check whether that is indeed the case here.

I can understand why the samples might stick to C, because C++ tends to cause disagreement over how to use it, and it can tend to be 'write-only'.

Perhaps someone out there might have tried something similar for other architectures, and parts of this may be available already.


I think this is a fairly common approach for parallelism in C++ - so there must be other people who want to work this way: collaboration should be possible.
Last edited by dobkeratops on Sun Aug 09, 2015 1:21 pm, edited 4 times in total.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Sun Aug 09, 2015 12:07 pm

theover wrote:but are you willing to write a compiler for it?
T.


I also have a compiler for a 'pet language' whose syntax is similar to Rust, and I have considered doing this for it. However, I doubt I could convince others to use my own 'pet language'; I pretty much gave up on it once that sank in. But large parts of the language do work.

https://github.com/dobkeratops/compiler

The design of that language is motivated by this use case: my own dissatisfaction with C++ for parallel code (experience gained on the PS3/360 generation of consoles), inspiration from Rust where it is superior (immutable-by-default and better syntax for functional code), and my dissatisfaction with elements of Rust (too rigid; I think its imposition of safety would make writing DMA-based code quite a bit harder, and it diverges from C++ further).

C++ with lambdas is sufficient though; it has improved a lot since it gained polymorphic lambdas.

I think a tool for C++ would gain way more community traction.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Sun Aug 09, 2015 2:40 pm

EugeneZ wrote:
dobkeratops wrote:I doubt I could convince others to use my own 'pet-language'.

Yeah, especially when programmers get used to UE4 scripts : D



Those graphical tools are nice and have their uses, but I think there's a good reason we don't already use them everywhere.

I don't think people will drop C++ for that.

Nonetheless, one motivation for a new language (over C++) is ease of parsing; it would be nice if source could be edited both graphically and as text...

The issue with anyone's language is that everyone tends to diverge on what they want when moving away from C++. Rust had most of what I wanted, but the last 5% ruined it for me, and I thought they made it too incompatible with C++ for reasons that weren't important enough. Nonetheless the other 95% was so inspiring that it was worth copying :)

OOP syntax, and some functional-language syntax (e.g. |> in F#), already make it quite easy to write pipeline flows.

One feature I implemented in my language was 'UFCS' (from D), which allows free functions to be chained: foo().bar().baz()... In C++ you can only do that with methods inside classes. This is a serious problem for coupling; classes need to know too much about each other. Free functions are superior for decoupling code into independent modules.
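To illustrate the chaining point (toy free functions, invented purely for illustration):

Code:
#include <string>

// free functions: well decoupled, but the calls nest inside-out and read backwards
std::string load(int id)                    { return "data" + std::to_string(id); }
std::string parse(const std::string& s)     { return "[" + s + "]"; }
std::string summarise(const std::string& s) { return s + "!"; }

std::string report(int id) {
    return summarise(parse(load(id)));   // C++: free functions cannot be chained
}
// with UFCS (as in D, or the 'pet language' above) the same call could read
// left-to-right as: id.load().parse().summarise()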

I don't use D because it's based on garbage collection, and isn't as good as Rust in other ways. I prefer single-ownership, RAII-based memory management, with no garbage-collection runtime required.

Re: generating epiphany code from templates, possible?

Postby piotr5 » Mon Aug 10, 2015 10:42 am

I'm not really convinced of your approach. If epiphany were an FPU, then the compiler could treat it as extended assembler syntax, but epiphany can actually execute sequential code and has no hardware support to put such code into sequence. Better imho is software emulation of task-switching on the epiphany: i.e. on the ARM you program as if the epiphany didn't exist, and load and run programs on the epiphany as usual. That's good because it reflects the slow connection between the two processors. Then on the epiphany, do as you suggest: assign some free core at runtime or compile time to execute the task you want, and either copy the code into that core's local memory or let it run instruction by instruction from the current core. The only problem is code relocation when multiple tasks are running. Basically those are two approaches, runtime or compile time: deciding at runtime allows two programs to share your epiphany; deciding at compile time means you must own the whole epiphany chip...

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Mon Aug 10, 2015 12:25 pm

I'm not really convinced of your approach.


Let me try to explain it better, along with the context (past and future).

If epiphany were an FPU, then the compiler could treat it as extended assembler syntax.


That's definitely *not* what I have in mind; it's medium-to-large functions that you want running on it. Lambdas can of course call other functions, so you'd want to assemble a ~16k block of code from the call graph rooted at the specific lambda; and of course you can still plug ordinary function calls into those lambdas. Lambdas just make it easier.
If it's just numeric acceleration, we already have SIMD and GPGPU.
'par_map()' is about using several worker threads to run different iterations of a complex loop on different cores.

What would be *perfect* is if you could also automatically generate pipelines from functions that are otherwise too large.

This is something we frequently did manually: 'update' for some complex object might be thousands of lines, logically one function or loop body, but empirically it would turn out to be too large for the i-cache, so it would be broken up into multiple stages. Logically it would still be one function and one 'pass' through the data, but on the epiphany you could allocate several adjacent cores to run the different stages.
Then you'd get 2-dimensional parallelism, e.g. rows doing 'different elements' and columns doing 'different pipeline stages of one element'.

Now imagine 'map-reduce' but with pipelined map & reduce stages..

What I'd do right now, though, is just write additional versions of 'par_map()' or 'par_update()' that happen to take several functions to apply in sequence, then twiddle some parameters for granularity (empirically optimising without having to butcher the source).
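A minimal sketch of that (assuming the par_map from the first post): a two-stage variant that applies 'stage1' then 'stage2' to each element. On SMP it simply composes the two functions; an epiphany implementation could instead place the stages on adjacent cores and stream intermediates between local stores.

Code:
#include <vector>

template<typename T, class F1, class F2>
auto par_map2(std::vector<T>& src, const F1& stage1, const F2& stage2,
              int batch_size = 1, int num_threads = 4) {
    // the composed lambda is what would be split across adjacent cores on epiphany
    return par_map(src,
                   [&](T& x) { return stage2(stage1(x)); },
                   batch_size, num_threads);
}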

My mental model comes from the way we used the PS3's CELL: the cores would run large tasks, like "calculate the joints for a character, blending all its animation cycles together". (On epiphany you might split that further, but the stages would still be quite complex, like decoding animation-cycle data or applying constraints to a skeleton.)

but epiphany can actually execute sequential code and has no hardware support to put such code into sequence.

Yes; as above, that's why I want to be able to use it in a similar way to how multicore is used.
That's the point of 'par_foreach' and similar abstractions: 'run the update function for each character', etc.

I think there will be a sliding scale, between current multicore, manycore, and GPU approaches. Some of the massively threaded multicore chips and NUMA systems might be closer to the manycore approach.

Better imho is software emulation of task-switching on the epiphany: i.e. on the ARM you program as if the epiphany didn't exist, and load and run programs on the epiphany as usual.


This is exactly what limited CELL's usefulness (and epiphany's, with the current toolchain), and it's why the PS4 ditched it.

You have sourcebases of >=100kloc developed economically to run on many platforms and millions of devices.

If you have to put a lot of effort into writing special code for one unit, it's not going to get the same refinement; it's going to be stillborn.

You'll want literally hundreds, even thousands of tasks. Having to manually manage separate ELFs, even naming them, becomes a burden. What should logically be simple function calls (or function passing) and function definitions becomes obfuscated with noise, and moving code around becomes a chore.
The only useful information would be a simple hint, but even that should be determined by the compiler (e.g. by measuring the size of the functions reachable).

Some makefile magic could handle it, but it's just another barrier, and it makes your source much harder to comprehend.

If it's a special-case, focussed application, it probably has an ASIC already (like video or 3D rendering), or GPGPUs might handle it fine, or an FPGA might work (e.g. neural nets; I read about Microsoft using FPGAs to accelerate search engines).

that's good because it reflects the slow connection between the two processors.


I'm assuming this is a property of the board really being a 'prototype'/'dev-kit'... the intent was something to inspire experimentation.

It's still connected to DDR, which I call 'main memory';
ARM and Epiphany should be 'equal players' for consuming, producing, and traversing large datasets in DDR.

I'm assuming future versions will have a faster connection to DDR (I read the RPi has 3.5GB/sec of memory bandwidth, while the epiphany has 1.5GB/sec, though I realise it has 2 extra links, 4 in total; I wondered if they could have used both horizontal links and placed some of the DDR address space 'left' and 'right' of it, and with the current setup, stacking boards increases the DRAM bandwidth).

Part of what I see in their literature is an ambition to integrate epiphany cores into future SoCs, which would be great IMO, and closer to the CELL setup.

In Sony's CELL there was an L2 cache shared between the PowerPC 'main processor' (the 'big' core) and the SPU auxiliary 'little' cores (analogous to epiphany cores): fast communication between them without consuming DRAM bandwidth, staying on-chip. That's where the epiphany belongs, IMO.

I have wondered if the FPGA's memory could be used in a similar manner: providing more buffering between the ARM and the Epiphany for temporaries without hitting DDR. That could be another scratchpad (just as there are L1 and L2 caches, perhaps we could have the concept of an L2 scratchpad). The PS2 also had something similar: a main CPU, a scratchpad, main memory, a GPU, and vector units (DSPs with their own scratchpads); the CPU's scratchpad was a middle ground where temporary data could be placed when moving between units, without consuming DRAM bandwidth.



What would be perfect is if all cores in the system used the same ISA, with a few of them optimised for larger, low-latency tasks and more of them optimised for smaller, high-throughput tasks.

However, we are where we are, with OSes running on established ISAs for the former, so it seems dealing with mixed ISAs is here to stay (as we already do with CPU and GPU). I think they did the right thing pairing the parallella with an ARM.

One benefit of allowing the same code to run on both is that the 'host' CPU, which might spawn a large workload on the cores and otherwise just wait for the result, can pick up part of the workload itself.

I've actually done this myself on CELL, and I've heard of others doing it too.

One option, if you can estimate the cost of tasks, is to have the low-latency cores pick up the larger chunks.

Then on the epiphany, do as you suggest: assign some free core at runtime or compile time to execute the task you want, and either copy the code into that core's local memory or let it run instruction by instruction from the current core. The only problem is code relocation when multiple tasks are running. Basically those are two approaches, runtime or compile time: deciding at runtime allows two programs to share your epiphany; deciding at compile time means you must own the whole epiphany chip...


The thing to remember is:

For simple tasks we've already got GPUs, FPGAs, and ASICs.

The 'uncharted territory' is applying massive parallelism to complex sourcebases: programs that are 100kloc and up. These have to be portable, and if you have to do too much special work for one platform, that platform will be neglected. You can't start writing a major new application for one architecture when there's a constant churn of devices in use; you want the cost of your software development spread across as many potential users as possible.

That's the territory of game engines. CELL was a good hardware design, IMO, but the tools weren't there, so developers always led on the Xbox 360; Sony learnt their lesson and ditched the chip.

The epiphany has one major step forward: the ability to load and store directly off-chip and between cores; I think that should make moving code onto it easier.


What would be absolutely perfect is if the compiler could emit relocatable code for both core types, plus a mapping table so that the code for one core type can be found from the other, and a library for assembling a blob for an epiphany core, given one particular function as the root.
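A hypothetical sketch of what that mapping might look like; every name here is invented for illustration and is not part of any existing toolchain:

Code:
#include <cstddef>

// one relocatable epiphany code blob, built from a function's reachable call graph
struct EpiphanyBlob {
    const unsigned char* code;         // relocatable epiphany machine code
    std::size_t          size;         // bytes to copy into a core's local store
    std::size_t          entry_offset; // entry point within the blob
};

// pairs the ARM-side entry point (the function pointer used in source)
// with the equivalent epiphany code
struct OffloadEntry {
    void*        host_fn;
    EpiphanyBlob blob;
};

// the (hypothetical) compiler would emit one such table for the runtime to search
extern const OffloadEntry g_offload_table[];
extern const std::size_t  g_offload_table_count;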

The long and short of all this is that you'd be able to use the same abstractions that let you run multicore programs now, and suddenly you'd have huge numbers of applications able to run on the epiphany architecture, more potential customers, and the market to mass-produce the tasty 64-core, 256-core etc. variants.

From what I know of optimising code for conventional CPUs (where cache coherency and memory latencies matter), and from what I saw of CELL, I believe that with the appropriate tools the epiphany should be able to handle large sourcebases, and not just be an isolated, niche accelerator. (And thinking about it, if you have hundreds or thousands of cores, you could literally have hundreds or thousands of separate functions in flight at once.)

I also believe it should be possible to have a single chip that can handle large sourcebases AND deliver high computational throughput, whereas at the moment we must choose one capability or the other (CPU vs GPU).

There's a chicken-and-egg situation with several components here. Ideally I would like to see a few features actually added to C/C++ to make this easier: transitive const, a pragma making 'restrict' the default for const pointers, and a write-only pointer type. Then, who knows, future mainstream CPUs might end up getting DMA controllers, and/or hints for separate cores that don't need coherency checks. (I have already seen hybrids: cached CPUs with small amounts of scratchpad, and/or cache-control instructions sufficient to make a cache work like one.)
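To illustrate the 'restrict' point: promising the compiler that two pointers cannot alias lets it keep values in registers and reorder loads and stores. Standard C++ has no 'restrict', only compiler extensions such as __restrict (GCC/Clang/MSVC); the wish above is for const pointers to imply this by default.

Code:
// with the no-alias promise, the compiler need not reload src[i] after each store to dst
void scale(float* __restrict dst, const float* __restrict src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}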

Re: generating epiphany code from templates, possible?

Postby piotr5 » Tue Aug 11, 2015 12:55 pm

dobkeratops wrote:You have sourcebases of >=100kloc developed economically to run on many platforms and millions of devices.

If you have to put a lot of effort into writing special code for one unit, it's not going to get the same refinement; it's going to be stillborn.

You'll want literally hundreds, even thousands of tasks. Having to manually manage separate ELFs, even naming them, becomes a burden. What should logically be simple function calls (or function passing) and function definitions becomes obfuscated with noise, and moving code around becomes a chore.
The only useful information would be a simple hint, but even that should be determined by the compiler (e.g. by measuring the size of the functions reachable).

I disagree here: what you are talking about is sloppy program design.
It's best to use the DDD paradigm: four layers, namely infrastructure, application, UI, and the actual domain. When programming the domain layer you just need to execute the code on the epiphany alone, with no support from the ARM; the ARM will be busy with the GUI and disc management anyway. That way, porting to another platform is just a matter of replacing a call with loading and running an epiphany program.

Take a graphics card, for example: in the past the card performed simple tasks like drawing triangles and putting textures on them. Now you can program them on-chip; you can alter vertex data and change textures on the fly, and all of that is done inside the GPU, not the CPU. This is a clear separation between the 3D-graphics domain and the rest of the program. Nobody tries to perform part of the vertex movements and texture changes on the CPU anymore, so why make such a mistake on parallella? Even if you found a way to make such a mixture portable and gained some speed, you'd introduce a lot more opportunity for bugs, and in essence the programs would become less comprehensible because of the complexity hidden in task assignment...

The idea of dual-compiling everything is good though. In Linux it is common to have 32-bit and 64-bit versions of libs in parallel, so you can choose at runtime which to use. It would be great if someone could implement this on AMD's Kaveri and Godavari, so that part of the OS runs on the 32 cores in the GPU, thereby making better use of the available hardware: i.e. a hybrid system, with task-switching done by converting data from one assembler language to the other and running the corresponding functions (in the data, all callbacks would need to be replaced and data alignment satisfied). Then, if it works on the more prominent platform (AMD), port it to parallella, because it would work there too (although less efficiently, since there are two chips instead of one). That way you have a strong argument telling people "hey, come over, parallella is just like AMD's APU but an open platform, and consumes 1/10 of the power at the same frequency!" As it is now, AMD's APU isn't popular because its single-thread speed is about the same as Intel's, maybe slower. So if you want to make parallel programming more popular, your code must also run on AMD's approach and push its popularity that way...

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Tue Aug 11, 2015 10:15 pm

I disagree here: what you are talking about is sloppy program design.

No, it's just reality.
As computers get faster, the complexity of what we want them to do explodes too:
programs get bigger, and there are more cases to handle.


It's best to use the DDD paradigm: four layers, namely infrastructure, application, UI, and the actual domain. When programming the domain layer you just need to execute the code on the epiphany alone, with no support from the ARM; the ARM will be busy with the GUI and disc management anyway. That way, porting to another platform is just a matter of replacing a call with loading and running an epiphany program.
Take a graphics card, for example: in the past the card performed simple tasks like drawing triangles and putting textures on them. Now you can program them on-chip; you can alter vertex data and change textures on the fly, and all of that is done inside the GPU, not the CPU. This is a clear separation between the 3D-graphics domain and the rest of the program.

Game engines still have literally hundreds of thousands of lines of code managing what the GPU does, and it's all performance-critical, because a games machine has a weak CPU in order to devote more of the hardware budget to the GPU: you can't have the GPU stalling while it waits for the CPU.

Nobody tries to perform part of the vertex movements and texture changes on the CPU anymore, so why make such a mistake on parallella?

You're underestimating the computational load of the rest, and underestimating the volume of performance-critical code.

EDIT: and historically, CELL on the PS3 was actually fast and efficient enough for vertex work, and versatile enough for CPU work; this is a new domain that Epiphany could aim for, instead of being both a weak CPU and a weak GPGPU.


This debate is 10 years old already: Sony did the experiment with CELL, giving a highly similar chip to a profitable industry, and it failed... because the language wasn't there.

The community relies on C++ to write complex applications, and CELL just didn't map well to it.

No game developer could afford to put a few years into designing a new functional language for it.

Lambdas and auto make it a lot easier, but they arrived in C++ *far* too late for the PS3. We have them now, though, and can use them to parallelise large sourcebases at fine grain.

We had complex sourcebases with 100kLOC of code and already needed to leverage multiple cores to run all of it; then we moved to the PS3 and had to struggle with splitting off separate ELFs and building DMA. The development time was at least 2x, so no one led on that platform, and the Xbox 360 became dominant.

So Sony learnt their lesson and moved to a conventional design.

What doesn't work is assuming there's a 'small amount of code' that's critical to offload, or even assuming you can plan up-front which code needs to go over. A lot of optimisation is purely empirical.

What we need is parallelism across the board.

We used to have the '80:20' rule. But as you state, the GPU handles the 20%; what's left on the CPU is a *huge* amount of code that needs to run across the whole processor.

Modern CPUs spend silicon and energy trying to extract parallelism from a single thread (register renaming, complex OOOE); that's because we're in a transition where our tools are designed for single-threaded code. We need to move more of our software over to parallel; more of the parallelism needs to be stated up front.


Even if you found a way to make such a mixture portable and gained some speed, you'd introduce a lot more opportunity for bugs, and in essence the programs would become less comprehensible because of the complexity hidden in task assignment...


That's life. We ask computers to do complex things; bugs will happen.

The idea of dual-compiling everything is good though. In Linux it is common to have 32-bit and 64-bit versions of libs in parallel, so you can choose at runtime which to use. It would be great if someone could implement this on AMD's Kaveri and Godavari, so that part of the OS runs on the 32 cores in the GPU, thereby making better use of the available hardware: i.e. a hybrid system, with task-switching done by converting data from one assembler language to the other and running the corresponding functions (in the data, all callbacks would need to be replaced and data alignment satisfied). Then, if it works on the more prominent platform (AMD), port it to parallella, because it would work there too (although less efficiently, since there are two chips instead of one). That way you have a strong argument telling people "hey, come over, parallella is just like AMD's APU but an open platform, and consumes 1/10 of the power at the same frequency!" As it is now, AMD's APU isn't popular because its single-thread speed is about the same as Intel's, maybe slower. So if you want to make parallel programming more popular, your code must also run on AMD's approach and push its popularity that way...


AMD's APU is already mainstream for games: a variant is used in both consoles (Xbox One and PS4).
Gamedev has been fully parallel for nearly 10 years now.

The best would be one instruction-set architecture with 'big' and 'little' cores: some optimised for low latency, some optimised for high throughput, and the ability for tasks to migrate between them.

Some sort of JIT might have virtues, i.e. being able to dynamically compose functions into scratchpad-sized chunks.

Right now the awkwardness is that the split between the two ISAs has to be decided on a file-by-file basis.
As I've explained, this is just too much overhead in program complexity. To write parallel code effectively you need to leverage C++ templates to generate convenient higher-order functions. On conventional machines, compilers can also leverage SIMD for very fine-grained parallelism; parallella is trying to replace SIMD with more cores, which means we need very fine-grained task spawning to compete.
Gamedev gets to exploit 4-way SIMD pretty much all the way through with a vector maths library: a single register holding x,y,z is the most common approach (even if that's not the best use of SIMD, it's still useful enough), it's been mainstream for ~15 years, and newer SIMD architectures make it easier to autoparallelise (with gather instructions etc.).

Every 'for' loop is a potential set of tasks, and every big function or loop body is a potential pipeline. You cannot expect programmers to manually faff around with separate ELFs at that granularity; yet on an Intel chip, the machine figures that out for you, and delivers, with very complex OOOE.

If you have to do it manually, you simply won't get your software written as fast as competitors on other platforms, and we'll be stuck with this CPU/GPGPU split forever.

It needs an advanced compiler.

I should point out there was another attempt at this kind of hardware on the PC, the Ageia PhysX chip: another CELL/Parallella-esque design (though they never opened up the ISA; they intended you to use it through one physics library). The same thing happened: it was squeezed out by multicore x SIMD improvements on one side and GPUs getting more general on the other.
Last edited by dobkeratops on Fri Aug 14, 2015 1:59 pm, edited 1 time in total.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Wed Aug 12, 2015 3:05 am

Let me explain this another way.

This is what the parallelism in games already looks like: it's '2-dimensional'. And I'm just talking about the CPU here, not the GPU.

Code:
                          main
                     /    |    |    \               coarse grain: threads,
                  core  core  core  core ..         one per major task, e.g. 'each character'
                  /||\  /||\  /||\  /||\
                     SIMD & pipelines ..            fine grain (loop bodies, e.g. 'each joint'):
                                                    10s-100s of concurrent operations
                                                    in existing CPU code


Take the Xbox 360 for example: it had 3 physical cores x a 12-deep pipeline x 4-way SIMD (with a dot-product instruction, so keeping x,y,z in one register worked well).
It was in-order, so loops were unrolled to achieve this; code and data had to be reworked so that loop iterations were guaranteed independent.
You could also use hyperthreading, so you could count it as 6 threads x an effective 6-deep pipeline.
(Of course, on better CPUs with deep OOOE you don't need to unroll; the CPU finds the parallelism between successive loop iterations for you.)
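A minimal sketch of that kind of rework: unrolling by four with independent accumulators so that an in-order core's deep pipeline stays busy, rather than serialising on one dependency chain (scalar here for clarity; on the 360 each accumulator would itself be a 4-wide SIMD register):

Code:
float dot_unrolled(const float* a, const float* b, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;   // four independent chains
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];           // handle the remainder
    return (s0 + s1) + (s2 + s3);
}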

What hit Sony badly is that at the first level ('across cores') this code was already so big and complex that mapping it to the CELL SPUs, with DMA and scratchpads, was an absolute nightmare. And it was both computationally *AND* logically complex: e.g. AI for characters figuring out responses using raytraces and routing (interacting with large data-sets such as AI navigation meshes, collision meshes, skeleton matrices, and animation cycles), and data-driven animation skeletons with complex constraint systems.

You were suddenly constrained from 3 cores back to 1 PowerPC core (the PPU), and that's why the PS4 ditched the approach.

We all knew the scratchpad is a 'software-managed cache': properly programmed, it matched a CPU's ability to deal with large, complex data-sets. It was extremely potent for raytracing, for example. But the mistake was assuming 'it's just some key parts that need to be sped up'... no; we already had huge sourcebases that needed to be spread across the whole processor.

On the current console generation you have 8 cores, again with SIMD, and of course a more general GPGPU, all sharing memory. Code is parallelised, and when that *still* isn't enough it is moved to the GPGPU for assistance; but of course you don't do *everything* as GPGPU, because it doesn't handle complex code well.

For something like the Epiphany or Parallella to have a place, it would be nice to solve the problem of getting *both* complexity and throughput; being MIMD, it has a shot at that - it is theoretically capable.

(I view the parallella board as a prototype; 1.5GB/sec to main memory cripples it. I'm pretty sure a modern ARM with 4 cores x 4-way SIMD, with caches allowing the use of large, complex data structures, would outperform it for practical game-engine code. I realise that's probably a trade-off for a more versatile board; I suppose they needed to keep GPIO pins available on the FPGA, which is more useful for embedded tasks, and I know it's not a games machine.

But I'm still interested in the architecture; I still think it has potential, and I would personally prefer a manycore machine to what we have in the mainstream today.

I realise the chip itself could get 6GB/sec if all its links were connected to memory.)
Last edited by dobkeratops on Sun Aug 23, 2015 2:49 pm, edited 3 times in total.
