generating epiphany code from templates, possible?

Discussion about Parallella (and Epiphany) Software Development


Re: generating epiphany code from templates, possible?

Postby piotr5 » Wed Aug 12, 2015 8:35 am

dobkeratops wrote:AMD's APU is already mainstream for games: a variant is used in both consoles (Xbox One and PS4).
Gamedev has been all parallel for nearly 10 years now.

The best would be one instruction-set architecture with 'big' and 'little' cores... some optimised for low latency, some optimised for high throughput, and the ability for tasks to migrate between them.

It seems you're missing the point. We're talking about heterogeneous system architecture: two machine languages sharing a single memory. Examples of this kind of architecture are the Parallella, AMD's Kaveri and Godavari, and the ODROID-XU3 with its Samsung processor. All three are available below $200 and offer roughly the same order of magnitude of parallel performance. (AMD offers 4x4 GHz plus 8x4x<1 GHz wavefronts with 16 SIMD cores each -- you could see this as 512 SIMD or as 32 MIMD; I prefer the latter and call them high bit-width cores. The ODROID-XU3 has 4x2 GHz and 4x1.4 GHz cores, both programmed with the same machine language. The Parallella has 2 ARM cores and 16 slower cores.) In other words, you don't need to actually upload the program; a lot of complexity is lifted compared to the ten-year-old combination of CPU and GPGPU. Therefore you can use the same source code for both architectures if you take care that shared data is actually compatible. No JIT; you do need to decide which architecture your code will be compiled for, though, and you need to split the source code so that a single file is compiled for a single machine language. If your program is well designed, this is no problem.

Parallelizing loops that way makes no sense though: the administrative overhead and program complexity don't make shared computation worth the hassle. Parallel loops are a very cheap shot at automatic parallelization -- easy to implement, but rarely useful. More useful is an approach that encompasses loop parallelization too: investigate which pieces of data processing are independent of each other and do those in parallel. The risk here is overlooking some hidden independence; detecting it is quite a new and complicated problem for advanced algebra. So, until we teach computers to do algebra for us, we need to optimize and parallelize by hand, in addition to some AI. Keep in mind, all this is just about MIMD; SIMD can be happy with mere loop parallelization, although I've heard SIMD compilers also seek out similar instructions that could be parallelized, even though that rarely succeeds. Isn't that how MMX and SSE optimization works?
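A minimal sketch of what "shared data is actually compatible" can mean in practice: a single header included by both builds, with fixed-width types and a compile-time size check. The struct and field names are purely illustrative.
Code: Select all
// shared_state.h -- a hypothetical header included by both the ARM-side
// and the Epiphany-side builds, so that "shared data is compatible".
// Fixed-width types and a flat layout, so the two compilers agree on
// size and field offsets.
#pragma once
#include <cstdint>

struct SharedBlock {
    std::uint32_t frame_id;    // written by ARM, read by Epiphany
    std::uint32_t item_count;  // number of valid entries in items[]
    float         items[256];  // payload processed on the Epiphany cores
    std::uint32_t done_flag;   // set by the Epiphany side when finished
};

// Catch accidental layout drift at compile time on both toolchains.
static_assert(sizeof(SharedBlock) == 4 + 4 + 256 * 4 + 4,
              "SharedBlock layout differs between compilers");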

As has been mentioned already, complexity comes from all the little peculiarities of the various systems: Epiphany requires the program to be loaded into local memory, AMD works best if you make use of the cache, and so on. Traditionally the compiler is responsible for these details, but if the compiler doesn't do its work, the source code becomes more complicated with all the exceptional stuff. Obviously compiler developers are overburdened with optimizing for all the possible architectures. IMHO a program should be composed of two layers: the source code expressing the intention of the programmer, and some sort of plugin for the compiler to help implement these intentions on the various architectures. Templates are one possibility for building these two layers.
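A rough sketch of that two-layer split using templates: the algorithm states the intent once, and a backend policy stands in for the per-architecture "plugin". The backend names and bodies here are illustrative placeholders, not any real SDK.
Code: Select all
#include <cstddef>

struct HostBackend {
    template <class F>
    static void for_each(std::size_t n, F body) {
        for (std::size_t i = 0; i < n; ++i) body(i);   // plain loop on the ARM side
    }
};

struct EpiphanyBackend {
    template <class F>
    static void for_each(std::size_t n, F body) {
        // A real "plugin" would distribute iterations across cores and
        // stage data into local memory; placeholder: same loop.
        for (std::size_t i = 0; i < n; ++i) body(i);
    }
};

// Layer 1: programmer intention, written once.
template <class Backend>
void scale(float* data, std::size_t n, float k) {
    Backend::for_each(n, [&](std::size_t i) { data[i] *= k; });
}

// Layer 2: the architecture choice, e.g. scale<EpiphanyBackend>(buf, n, 2.0f);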
piotr5
 
Posts: 230
Joined: Sun Dec 23, 2012 2:48 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Wed Aug 12, 2015 11:30 am

piotr5 wrote:It seems you're missing the point. We're talking about heterogeneous system architecture: two machine languages sharing a single memory.


The point that you're missing is the expectation that you can always plan up front which code should end up on which processor.
It doesn't work that way, and you are underestimating the complexity of the code that needs parallelising. That's probably because you're used to complex CPUs which parallelize for you at runtime.

Optimisation is an empirical process. You can make a reasonable guess, but you don't actually know where the bottlenecks are until you measure.

Epiphany, like the Cell SPU, is not like GPGPU: it's fully capable of running the same complexity of code as a CPU. And that's the point of having it, versus the existing solutions of CPU, GPGPU and FPGA.

Examples of this kind of architecture are the Parallella, AMD's Kaveri and Godavari, and the ODROID-XU3 with its Samsung processor.

The closest example is the Sony/Toshiba/IBM CELL; Epiphany has more in common with that than with any of the other examples you've posted. I am speaking from experience with this architecture.
But Epiphany has further enhancements, i.e. the ability to generate off-chip transactions directly from loads & stores, that make it more feasible for it to do a CPU's job.


One thing Sony explicitly had to go around telling people was 'don't think of them as accelerators - they are your main processors'. Coming from the PS2, we had something that was more restricted (DSPs, basically: VU0 and VU1), and that was a mindset they needed to break.

All three are available below $200 and offer roughly the same order of magnitude of parallel performance. (AMD offers 4x4 GHz plus 8x4x<1 GHz wavefronts with 16 SIMD cores each -- you could see this as 512 SIMD or as 32 MIMD; I prefer the latter and call them high bit-width cores. The ODROID-XU3 has 4x2 GHz and 4x1.4 GHz cores, both programmed with the same machine language. The Parallella has 2 ARM cores and 16 slower cores.) In other words, you don't need to actually upload the program; a lot of complexity is lifted compared to the ten-year-old combination of CPU and GPGPU. Therefore you can use the same source code for both architectures if you take care that shared data is actually compatible. No JIT,


The thing that GPGPU can't do, which CELL and a CPU can (and which a combination of ARM + Epiphany should also be able to do), is *nested parallelism* - multiple levels of fork & join, a bigger outer task (itself parallel) with inner loops parallelized. On GPGPU the tasks can't interact at the right granularity.
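A minimal sketch of that two-level fork & join in plain C++, with std::async standing in for whatever task system a real engine would use; the character/joint split and the placeholder work are illustrative.
Code: Select all
#include <future>
#include <vector>

// Placeholder per-joint work; a real engine would do IK, skinning, etc.
static void update_joint(int character, int joint) {
    volatile float x = static_cast<float>(character * joint);
    (void)x;
}

void update_all(int num_characters, int joints_per_character) {
    std::vector<std::future<void>> outer;
    for (int c = 0; c < num_characters; ++c) {               // outer fork
        outer.push_back(std::async(std::launch::async, [=] {
            std::vector<std::future<void>> inner;
            for (int j = 0; j < joints_per_character; ++j)   // inner fork
                inner.push_back(std::async(std::launch::async, update_joint, c, j));
            for (auto& f : inner) f.get();                   // inner join
        }));
    }
    for (auto& f : outer) f.get();                           // outer join
}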

They're not going to carve out a new space if developers restrict their thinking to what already happens on CPU+GPGPU, CPU+SIMD, CPU+FPGA. And we need to solve the problem that defeated Sony's attempt.
I firmly believe there IS a niche there - the same one Sony went for. The fact we use GPGPU for it now restricts its potential; I view it as an evolutionary mistake. You have to choose between complexity and throughput, and straddling that boundary is awkward.

This is how innovation works - you look for new opportunities, things that look crazy now because they haven't been done yet. All our technology comes from this. It used to be crazy to claim that people could fly, or that a high-level language could be as fast as assembler. I remember when assembly programmers didn't take C seriously (I was one :) ), and ditto when C programmers didn't take C++ seriously.

You spot the opportunity, fill a new space, and the paradigm shifts.

This is what adapteva need to succeed with this.

I do believe their hardware is a good idea.

For yet another take, see what Intel is doing with 'Knights Landing' - merging complex cores & throughput capability. It's on their roadmap.

You mentioned previously that "no one would do vertex transformations on the CPU"
Guess what.

Sony CELL did actually have sufficient throughput to do vertex transformations - AND was sufficiently versatile to handle any task the CPU could do (because it could step through large datasets).

And with some shoe-horning to encourage more use, Sony did actually end up supplying libraries for this: vertex shading, and 'backface culling' (a speedup was possible by getting SPUs to transform vertices, assemble primitives and do the backface cull ahead of the GPU - then you rewrote an index list and fed *that* to the GPU, so that it would only run the full vertex shader for vertices that touched forward-facing primitives; a modern GPU makes some compromises over what's possible in order to shoe-horn rendering into a simplified paradigm).

It was even sometimes used for pixel work, e.g. the last stages of deferred lighting.
And yet its main role was running the same code that ran on an SMP CPU on the rival platform.

This is the kind of space you'd be back in if Adapteva succeeds - throughput AND greater complexity - whereas with existing options you must choose between the two extremes.

When I look at Adapteva's Epiphany I see 'version 2.0' of an experiment Sony performed - we have experience of what they were trying to achieve and what went wrong.

During its development, CELL was originally intended to be the graphics chip, an evolution of how the PS2 worked (where the CPU had one DSP to help, and vertex transformations were done by a second DSP; there were going to be 2 CELL chips). In the end it couldn't compete with GPUs for pixel operations (raster & texture sampling), and throwing in a GPU with vertex shaders led to an awkward hardware setup that restricted its potential. But the horsepower was definitely there for vertex transformations.

GPUs have subsequently gained extra complexity to handle more techniques (geometry shaders), but CELL was already capable of those tasks: advanced tessellation etc., more scope in how to encode & decode vertex data, more scope to use information between vertices.

you do need to decide which architecture your code will be compiled for, though, and you need to split the source code

THAT is the part that prevented this architecture from reaching its full potential - this is the lesson the gamedev community & Sony learned over the past 10 years.
If you have to split manually, you can't handle the full range of complexity and fine-grained interactions it is capable of, and the other options (SIMD, CPU/GPGPU) are superior; the development cost becomes prohibitive.

so that a single file is compiled for a single machine language. If your program is well designed, this is no problem. Parallelizing loops that way makes no sense though: the administrative overhead and program complexity don't make shared computation worth the hassle.
Parallel loops are a very cheap shot at automatic parallelization -- easy to implement, but rarely useful.

They're *nested*: fine and coarse grain, 2 levels of fork and join. And it DOES work - this is how game engine code works already. It does take effort to rework data structures... but it's proven.
On a complex OOOE CPU the 'automatic parallelization' is done by hardware; you take it for granted. On the Xbox 360 we had to do it manually with loop unrolling, which made it very visible. The plus side was that it encouraged us to rework data to maximise the amount of fine-grained parallelism possible.

For Adapteva to match what Intel does, you will need to make that parallelism explicit, the same as on the Xbox 360, and leverage inter-core transactions to handle the fine-grained interactions, minimizing the overhead of starting & stopping the group of parallel tasks - the difference between making deliveries with many cars or with a few 18-wheeler trucks, or 'something in the middle' (pickup trucks). E.g. imagine trying to make door-to-door deliveries with the truck.

If you don't, you will waste the full potential of MIMD to do what GPGPU can't.

The opportunity to exceed Intel is that handling it at compile time saves runtime cost (heat, transistors). (And Intel has spotted a similar opportunity, which is why they're doing 'Knights Landing'.)

More useful is an approach that encompasses loop parallelization too: investigate which pieces of data processing are independent of each other and do those in parallel. The risk here is overlooking some hidden independence; detecting it is quite a new and complicated problem for advanced algebra. So, until we teach computers to do algebra for us, we need to optimize and parallelize by hand, in addition to some AI.

Yes, and the established pattern for this is higher-order functions; 'map_reduce' is actually a famous example. I'm talking about writing pretty much all your source this way. Wherever you have an iteration, you write it as a higher-order function taking a lambda; then you can toggle to a _par version with a simple switch, and empirically establish the best granularity.
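A small sketch of that pattern, assuming nothing beyond standard C++: the serial and _par versions share the same call shape, so switching a call site is a one-token change. The implementation of the parallel version is a placeholder, not a recommendation.
Code: Select all
#include <cstddef>
#include <future>
#include <vector>

template <class F>
void for_each_index(std::size_t n, F body) {            // serial version
    for (std::size_t i = 0; i < n; ++i) body(i);
}

template <class F>
void for_each_index_par(std::size_t n, F body) {         // parallel version
    std::vector<std::future<void>> tasks;
    for (std::size_t i = 0; i < n; ++i)
        tasks.push_back(std::async(std::launch::async, [=] { body(i); }));
    for (auto& t : tasks) t.get();
}

// Establishing granularity empirically is then just renaming the call:
//   for_each_index(joints.size(),     [&](std::size_t i){ solve(joints[i]); });
//   for_each_index_par(joints.size(), [&](std::size_t i){ solve(joints[i]); });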

I have looked with great interest at the world of functional programming. It's ruined by garbage collection, but I am convinced lambdas and internal iterators are the way to go.

Keep in mind, all this is just about MIMD; SIMD can be happy with mere loop parallelization, although I've heard SIMD compilers also seek out similar instructions that could be parallelized, even though that rarely succeeds. Isn't that how MMX and SSE optimization works?

Well, in gamedev it's very common to use SIMD for the 'Vector3'/'Vector4' datatypes - most of the calculations (at every level) use 3D vector maths, it's ubiquitous - but the newer architectures are more suited to what you describe, with wider SIMD (8-way) and 'gather' instructions, making it easier to apply SIMD to general problems.
And it was actually possible to work like that on the Sony CELL in a very convoluted way (again really needing fancy compiler magic that never appeared): you could load independent structure data into registers, permute it into SIMD lanes, work on that, then permute back and store to independent structures again.
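A small sketch of the data-layout side of that: the same scale operation over an array-of-structures versus a structure-of-arrays, the latter being the form a vectorising compiler (or those permute tricks) wants. Names are illustrative.
Code: Select all
#include <cstddef>

struct Vec3 { float x, y, z; };

// AoS: each SIMD lane would need to gather fields from strided locations.
void scale_aos(Vec3* v, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) { v[i].x *= k; v[i].y *= k; v[i].z *= k; }
}

// SoA: each component is contiguous, so a vectorising compiler can map
// 4 or 8 elements onto one instruction without permutes or gathers.
struct Vec3SoA { float *x, *y, *z; };

void scale_soa(Vec3SoA v, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) v.x[i] *= k;
    for (std::size_t i = 0; i < n; ++i) v.y[i] *= k;
    for (std::size_t i = 0; i < n; ++i) v.z[i] *= k;
}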

As has been mentioned already, complexity comes from all the little peculiarities of the various systems: Epiphany requires the program to be loaded into local memory,


Loading into local memory is the same as a cache, except that instead of the machine figuring it out at runtime, you have the opportunity (and responsibility) to use compile-time information upfront.
In gamedev we had machines with in-order processors, which grind to a halt on a cache miss, which meant we needed to put prefetch instructions in manually for optimum performance. Moving from Intel to the Xbox 360 was an education in what Intel does behind the scenes, i.e. by taking it away :)

What happened was that CELL and the Xbox 360 actually needed the same data-layout tweaks; it's just that the actual coding for CELL was more convoluted, and you needed to do things slightly differently depending on whether data was in local or global memory.
Epiphany has a step forward that can simplify this: namely the ability to use the same address space for 'on-core' and 'out-of-core' addressing (I know on the current board that's restricted to a 32MB window, but a strong recommendation I would give is to maximize the amount of space that is addressed the same way).

AMD works best if you make use of the cache, and so on. Traditionally the compiler is responsible for these details, but if the compiler doesn't do its work, the source code becomes more complicated with all the exceptional stuff. Obviously compiler developers are overburdened with optimizing for all the possible architectures.

IMHO a program should be composed of two layers: the source code expressing the intention of the programmer, and some sort of plugin for the compiler to help implement these intentions on the various architectures. Templates are one possibility for building these two layers.


Yes, this makes perfect sense; now we're on the same page. Templates and macros can sort of do it, but the required analysis is whole-program and beyond what they can do, which is why we must do things manually at the moment.

One thing I did on CELL was a set of smart pointers that handled the local-store DMA: given any pointer to main memory, you just assigned it to a local temporary and dereferenced that, and via constructor/destructor it would do a DMA load, then a store on going out of scope. With some #ifdefs, the same code would compile for the Xbox 360 (the smart pointer would just pass the original pointer across and the optimiser would use the original value).
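A rough reconstruction of that idea (not the original code): a scoped wrapper that copies the object in on construction and writes it back on destruction on a local-store target, and collapses to a plain pointer elsewhere. The TARGET_LOCAL_STORE macro and the memcpy stand-ins for DMA are illustrative, and it assumes POD types.
Code: Select all
#include <cstring>

template <class T>
class LocalCopy {
public:
#if defined(TARGET_LOCAL_STORE)             // e.g. CELL SPU / Epiphany core
    explicit LocalCopy(T* remote) : remote_(remote) {
        std::memcpy(&local_, remote, sizeof(T));                  // "DMA in"
    }
    ~LocalCopy() { std::memcpy(remote_, &local_, sizeof(T)); }    // "DMA out"
    T* operator->() { return &local_; }
    T& operator*()  { return local_; }
private:
    T* remote_;
    T  local_;
#else                                        // cached CPU: pass-through
    explicit LocalCopy(T* p) : p_(p) {}
    T* operator->() { return p_; }
    T& operator*()  { return *p_; }
private:
    T* p_;
#endif
};
// Usage (illustrative): { LocalCopy<Joint> j(&joints[i]); j->angle += d; } // written back at scope exit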

This was a reasonable 'better than nothing' way of adapting code for portability - but it would be so much better if those could be swapped in by a compiler plugin, and done empirically, i.e. from measurement (run the code on the CPU first with extra instrumentation, then decide the granularity of the switch).

An analogy I would use is registers, stack and machine code vs a modern compiler.
You used to have to manually manage which data goes on stack and in registers.
Then compilers advanced, and they could reliably and automatically (a) allocate registers and (b) track for you which variables move between stack & registers when the registers spill - even though, from the perspective of a 1985 assembly language programmer, those would be chalk & cheese.

Think of 'fine' and 'coarse' grain parallelism in the same way. At the minute we do it manually, but a compiler should be able to take the same code and figure out where it's to be done... as you shuffle code around between loops and functions, how it best maps to the underlying hardware might change, even though it's logically the same.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Wed Aug 12, 2015 4:48 pm

more attempts at diagrams to explain the point...
Code: Select all
CPU - SMP x SIMD x DeepPipelines - high complexity, low throughput

typical pattern.. this is what mainstream game engine CPU code does in recent years
10's-100's (best case) of operations in parallel.
e.g. xbox 360 peak CPU performance required:
 3 cores x 12 pipeline stages x 4 SIMD lanes = 144 concurrent operations
i'm drawing this diagram as 4 x 4, but those numbers vary between platforms

                     main
                      |
                      *                       while (game is running..)
            /       /    \      \
       /         |        |         \
    core       core      core       core       outer loops eg "for each (character){" - threads on cores.
      |          |         |          |
      *          *         *          *        inner loops e.g. "  for each(character.joint){"
   / | | \    / | | \   / | | \    / | | \     pipeline & SIMD parallelism
   | | | |   |  | | |   | | | |    | | | |     complex desktop OOOE cpu manages deep pipeline
   \ | | /    \ | | /   | | | |    | | | |     in-order cpu eg xbox360 needs manual unrolling
      *          *      | | | |    \ | | /     new intel designs simplify compiler generated SIMD here
      |          |      \ | | /       *
      *          |         *          |        complex code between loops
   / | | \       |         *          |        needs rapid acess to results     
   | | | |       |      / | | \       *        data between stages should stay in caches
   | | | |       *      | | | |    / | | \
   \ | | /    / | | \   \ | | /    | | | |
      *       | | | |      *       \ | | /
      |       | | | |      |          *
    etc
      |          |      \ | | /       |
      |          |         *          |
       \         \         /         /
            \       \   /      /
                      *                        "}"
                      |                        next major set of tasks..
                      *                        "for each (spatial chunk..){"
             /      /   \      \
       /         /         \         \
      *          *         *          *        "  for each (object in chunk..){"
    / | | \    / | | \   / | | \    / | | \

Major groups of tasks may also be run in parallel, e.g. when core counts were low, 'update and render' ran in parallel; or to leverage more cores where the 'outer tasks' don't have enough.
There might be background low-priority tasks to fill the gaps.


There are really 3 levels of parallelism - cores, SIMD *and* pipelines - but you
only really get to max out all of them in inner loops. Code 'in the middle' tends to waste one or the other.
The optimisation process typically starts with single-threaded code, which is then reworked to improve utilisation of all 3 levels with this pattern.

GPU -SPMD - low complexity, high throughput
   
      only suitable for very large batches, larger cost to fork/join   
      too limiting to handle *everything*

      but much higher *throughput*
      great for pixels, vertices, huge numbers of particles
      with more effort, can be applicable to more, but there's a good reason we still have CPUs..

                         main
     /   | | | | | | |  | | | | | | | | | | | | | | \       
     |   | | | | | | |  | | | | | | | | | | | | | | |

*** driving the Epiphany (or CELL SPUs) this way is a fail. *** The GPU handles these cases better, for less programming effort.

EPIPHANY - MIMD manycore - the opportunity is to handle complex code, like the CPU, but with GPU levels of throughput.

However, given that Epiphany ditches SIMD and deep pipelines for more cores,
you need to spawn more tasks for inner loops and manage more fork/join communication.
This programming overhead kills the usability, hence limits adoption.

**This is the software problem that needs solving** - how to write complex patterns of multi-level parallelism
as conveniently as C++14 and complex CPUs can do it today.


Future permutations? They might do 4 ARM cores + 64 small Epiphany cores + some GPU, all sharing a final-level cache to share intermediate data on-chip... whatever. Epiphany cores should be able to take some of the load of CPU work, or render geometry shaders/vertex shaders, and take over GPGPU work. Eventually the GPU could wither and just become texture units accessible to the Epiphany; the CPU could wither too, as mainstream code gets more parallel.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby piotr5 » Thu Aug 13, 2015 8:18 am

dobkeratops wrote:(I know on the current board that's restricted to a 32MB window, but a strong recommendation I would give is to maximize the amount of space that is addressed the same way).


AFAIK it already is maxed out from the hardware side; what is lacking is Linux support. Linux should distinguish between secure shadowed memory and insecure shared memory. As we know, each core in the virtual 64x64 grid, from Epiphany's point of view, has a memory range of 1M assigned; in total the grid has 4k times 1M, 4G altogether. If the Epiphany cores are in the NW corner of that grid, accessing the first 4M of memory will attempt to access the first 4 Epiphany cores. So the first 4M are shadowed and thereby protected from being altered or read by Epiphany programs; that's where Linux should store encryption keys and all those things. After that area, there's 60M of memory that can be accessed from Epiphany at exactly the same addresses as from ARM. Then another 4M is shadowed, again followed by 60M of shared memory. The same goes for the 4M area starting at 256M, except that it won't be shadowed by actual cores; any access there will be sent through the south eLink. And at address 260M there's again a shared area.

In total there are shadowed areas at 0, 64, 128, 192, 256, ..., 512, ..., 960: that's 16 times 4M, 64M of shadowed memory available for the OS and the root user, and 16 times 60M of shared memory, 960M total for the application that's using Epiphany, split into 60M chunks. I.e. any alloc will return at most 60M of contiguous memory! Currently software isn't taking this limitation into consideration; maybe this is again something that needs to be introduced into the compiler, the idea of non-contiguous arrays as a drop-in replacement for contiguous ones -- especially in kernel programming, which would have to juggle 4M blocks of contiguous free memory...
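A sketch of what such a "non-contiguous array as a drop-in replacement" might look like: several separately allocated blocks behind a normal indexing operator. The block size is a template parameter standing in for the ~60M windows described above; everything here is illustrative.
Code: Select all
#include <cstddef>
#include <memory>
#include <vector>

template <class T, std::size_t BLOCK_ELEMS>
class ChunkedArray {
public:
    explicit ChunkedArray(std::size_t n) : size_(n) {
        std::size_t blocks = (n + BLOCK_ELEMS - 1) / BLOCK_ELEMS;
        for (std::size_t b = 0; b < blocks; ++b)
            blocks_.push_back(std::make_unique<T[]>(BLOCK_ELEMS));  // one 60M-style chunk each
    }
    T& operator[](std::size_t i) {                  // looks like a flat array
        return blocks_[i / BLOCK_ELEMS][i % BLOCK_ELEMS];
    }
    std::size_t size() const { return size_; }
private:
    std::size_t size_;
    std::vector<std::unique_ptr<T[]>> blocks_;
};
// e.g. ChunkedArray<float, (60u << 20) / sizeof(float)> big(total_elements);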

As you said, we are on the same page. My argument is that ideas akin to Domain-Driven Design (DDD) are needed to allow a simple program to be extended into a complex one without losing the ability to extend it into an even more complicated shape later. If you split your functions and your data accordingly from the start, then you won't face the problems you did. You simply cannot parallelize a badly written program by hand, and automatic parallelization will have its limits there too. That's why I don't acknowledge the problems you see.

However, one problem I do see, which you seem to neglect: a heterogeneous system needs multiple compilers. On Parallella, part of the program needs to be compiled for ARM and part for Epiphany, and maybe another part for the FPGA. Therefore, if you're in the middle of a loop, you likely won't just replace the loop body with an Epiphany version; instead you'll put that loop onto Epiphany as a whole and let other copies of the loop body spawn from there. On the other hand, if you program the FPGA, then you'll put the loop body (maybe JIT-compiled) onto the hardware. The reason is that the FPGA is high-throughput, while Epiphany is low-latency with a long lag for communication with the outside. When optimizing you must keep this difference in mind and structure your program accordingly. Hence one whole domain will be executed on Epiphany, while the ARM makes use of pipelines it formed on the FPGA.

You said the complexity of the rest is high, but this is not fully true. In a single application there are many domains to consider: a game takes care not just of graphics, but also physics, AI and sound. Each of these is a domain of its own, all running in parallel, all extremely complex. But remove them from your program and the remainder is quite simple in comparison to each of them. Hence my suggestion is to put all of them onto Epiphany (well, maybe let the FPGA handle sound); the ARM then becomes a mere helper commanded by Epiphany, serving data or replacing old data with more urgently needed data, and mainly taking care of the user interface and all the underlying logic in the application layer. If Epiphany needs to use the FPGA, let it; put it in command. This way you lose one particular complexity in your program: you won't need to juggle multiple compilers with a single source-code file. Instead, in the Epiphany source code you just implement all this par_ stuff and let it perform dynamic loading of programs already stored on Epiphany onto some other Epiphany core. I.e. the Epiphany program is compressed into 32k and expands over the whole chip...

BTW, there's an example of creating additional memory on the FPGA. Has anybody tried out how quickly Epiphany can read or write there? I suspect it'll be faster than main memory...
piotr5
 
Posts: 230
Joined: Sun Dec 23, 2012 2:48 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 13, 2015 12:20 pm

BTW, there's an example of creating additional memory on the FPGA. Has anybody tried out how quickly Epiphany can read or write there? I suspect it'll be faster than main memory...


That's extremely interesting - I'd posted a similar question. I think this would be highly useful for ARM<->Epiphany communication, bypassing DDR. Even if it's not faster, the fact that it doesn't burn up bandwidth will help - but I suspect, as you claim, it will be faster, because DDR is on the other side of the FPGA (right?), so it's one less 'hop' between physical chips on the board?

As you said, we are on the same page. My argument is that ideas akin to Domain-Driven Design (DDD) are needed to allow a simple program to be extended into a complex one without losing the ability to extend it into an even more complicated shape later. If you split your functions and your data accordingly from the start, then you won't face the problems you did.


This works sometimes, and we all knew this when we got the PS3. We had the PS2 before, which already introduced us to 'separate instruction set architectures' (CPU + DSPs).
The only problem is that between the PS2 and the PS3 the complexity of game engines multiplied enormously.

The complexity will keep exploding until we have the 'Star Trek holodeck'.

Code you wouldn't have predicted would need to be accelerated turned out to need accelerating. The glue between systems matters (see below).

You simply cannot parallelize a badly written program by hand, and automatic parallelization will have its limits there too.

The process is that we rework data until parallelisation is possible, e.g. multiplying out character joints: the naive algorithm cannot be parallel (e.g. a tree of joints, or an array of joints with parent pointers). But then we divide the joints into layers, and each joint within a layer can be computed in parallel. So you move from, say, 128 joints to a collection of 3-4 batches of ~16 parallel joints (see the sketch below).
You even need to do this for maximum performance on a single thread, because it's really a parallel machine internally, with deep pipelines, SIMD, OOOE...
And if I've understood Epiphany's approach (many simple cores), what you do with single-threaded loops on a normal machine will have to become spawning sub-tasks on the Epiphany...
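A sketch of that layering rework, assuming only standard C++: joints grouped by depth, parallel within a layer, joined between layers (a child only depends on already-finished parents). The pose maths is reduced to a placeholder and the data layout is illustrative.
Code: Select all
#include <future>
#include <vector>

struct Joint { int parent; /* -1 for root */ float local_pose, world_pose; };

void update_skeleton(std::vector<Joint>& joints,
                     const std::vector<std::vector<int>>& layers) {
    for (const auto& layer : layers) {                 // sequential between layers
        std::vector<std::future<void>> tasks;
        for (int j : layer)                            // parallel within a layer
            tasks.push_back(std::async(std::launch::async, [&joints, j] {
                const Joint& jt = joints[j];
                joints[j].world_pose = (jt.parent < 0)
                    ? jt.local_pose
                    : joints[jt.parent].world_pose + jt.local_pose;  // placeholder maths
            }));
        for (auto& t : tasks) t.get();                 // join before the next layer
    }
}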
That's why I don't acknowledge the problems you see.

I'm talking from experience. The similar CELL architecture has been and gone in the games industry, and this is why. We already had the concept of 'middleware', separate libraries, which had started to become popular in the PS2 era. The problem is the fine-grained interactions, so the separate libraries really become a complete engine, and what matters is the process of developing that library. There is far more work & runtime cost in the integration between modules than you seem to think, and it's far less predictable than you seem to imagine.

However, one problem I do see, which you seem to neglect: a heterogeneous system needs multiple compilers.

Yes. That's what we had on the PS3, and that's why Sony ditched it. It complicated game development so much that developers always started on the Xbox 360; Sony knew they needed to aim at being the "lead platform".

But this is a theoretically solvable problem. Imagine if you could just tag a function with a pragma, and the compiler would spit out the 'glue' for moving between one core and the other. That's *what it needs*. I use the analogy of registers and stack: two physically different representations of the same logical entity ('a variable').

And just as lambdas automatically extract the shared variables between a function body and the lambda, we need to be able to split code *at that granularity*.

You're not going to get 100s-1000s of parallel operations if you need to manage it manually.
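A small sketch of why the lambda capture list is the natural unit for that glue: the captures are exactly what would have to be marshalled across the ARM/Epiphany boundary. offload_to_epiphany() below is hypothetical; here it just runs the body on the host, standing in for the compiler-generated glue (copy captures to shared memory, load the kernel, signal, wait).
Code: Select all
#include <cstddef>
#include <vector>

template <class F>
void offload_to_epiphany(F kernel) {   // hypothetical stand-in for generated glue
    kernel();                          // runs the body locally in this sketch
}

void blur(std::vector<float>& img, int width, float strength) {
    // Captured: img, width, strength -- precisely what would need to cross
    // the ARM/Epiphany boundary if this body were compiled for the other side.
    offload_to_epiphany([&img, width, strength] {
        for (std::size_t i = width; i + width < img.size(); ++i)
            img[i] += strength * (img[i - width] + img[i + width] - 2.0f * img[i]);
    });
}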

On Parallella, part of the program needs to be compiled for ARM and part for Epiphany, and maybe another part for the FPGA. Therefore, if you're in the middle of a loop, you likely won't just replace the loop body with an Epiphany version; instead you'll put that loop onto Epiphany as a whole

That, right there, is what is too complex.

Standard CPUs have deep pipelines and SIMD, giving parallelism within one instruction stream. To properly utilize Epiphany, you'll need to split those off manually. It's too much work to be practical by hand - it's almost like going back from C to assembly.

Earlier you stated 'auto-parallelizing loops rarely works' - but this is the process we were forced to go through for in-order, deeply pipelined processors. It rarely works initially, but you rework your data until it does, resulting in the pattern I showed above.

And the granularity cannot be determined up front.

These loops are scattered throughout the source base.

and let other copies of the loop body spawn from there. On the other hand, if you program the FPGA, then you'll put the loop body (maybe JIT-compiled) onto the hardware. The reason is that the FPGA is high-throughput, while Epiphany is low-latency with a long lag for communication with the outside. When optimizing you must keep this difference in mind and structure your program accordingly.

Hence one whole domain will be executed on Epiphany, while the ARM makes use of pipelines it formed on the FPGA. You said the complexity of the rest is high, but this is not fully true. In a single application there are many domains to consider: a game takes care not just of graphics, but also physics, AI and sound. Each of these is a domain of its own, all running in parallel, all extremely complex.

No.
There is fine-grained communication between the parts. Code must be shuffled around constantly during development. 'Each in a domain of its own', but with much more fine-grained communication than you seem to realise.

For example, sound effects are driven using intermediate data from the physics state (the forces from the wheels). Or details of the animation constraints and physics might need to be fudged so that a character holding a gun accurately aims at an onscreen cursor, in a way that isn't physically accurate but is just easier to control from a certain view.
You won't know that until after you've handed it to designers and they've changed their mind about how they want it to work.

This is a creative process.

People will come up with new ideas constantly; those are just two examples I pulled out.
Code needs to be in a fluid form. You cannot plan a rigid structure up front. The interfaces change.

The point is, you're constantly trying to find ways of doing things that other people haven't done yet.

It is an inherently exploratory, experimental, creative process.

We've been through this process, and Sony found out the hard way that it's impractical.

But it is theoretically possible - we are just waiting for better tools.

And as I understand it, the point of the Parallella project is to inspire the community to actually develop these tools in the open - so here we are, brainstorming.

This architecture has been around for ~10 years and it's not commonplace because current tools are not sufficient. In fact, if you consider it as an extension of the CPU+DSP approach, it goes back much further. (I know phone SoCs have specialized DSPs, but they don't interact with the rest of the system at the same granularity as the components of a game engine.)

Whether it's JIT, a compiler plugin, or AI using feedback from a profile build... we need to get to a point where parallelising can be done routinely without too much manual programmer effort - because that's what we're used to with Intel hardware already.
It needs to be done at compile time rather than at runtime.
Last edited by dobkeratops on Thu Aug 13, 2015 12:56 pm, edited 3 times in total.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby sebraa » Thu Aug 13, 2015 12:37 pm

piotr5 wrote:AFAIK it already is maxed out from the hardware side; what is lacking is Linux support.
The 32 MB limit stems from the Epiphany's direct mapping between core and address, and the developers' wish to provide a contiguously addressable block of memory.

piotr5 wrote:Linux should distinguish between secure shadowed memory and insecure shared memory.
I think you are just rambling.

You can't have two 32-bit address spaces on a 32-bit processor without overlaying them in some way, and that is exactly why the Epiphany cores on the Parallella are not located at (0,0). Epiphany addresses never overlap ARM addresses, so when either of these chips addresses any address, it is clear where the access needs to go. Whether the accesses are actually performed depends on the silicon connecting all the parts. But this has nothing to do with "secure memory", "root user memory" or anything along those lines. You are mixing all kinds of concepts together, producing a somewhat incoherent description of a system which may or may not work, without taking the consequences into account (or just explaining them away).

That is philosophy, not engineering.

piotr5 wrote:BTW, there's an example of creating additional memory on the FPGA. Has anybody tried out how quickly Epiphany can read or write there? I suspect it'll be faster than main memory...
The main limit currently is the interface between Epiphany and FPGA, and the main memory is fast enough to saturate it. So no, it (currently) wouldn't be faster than main memory.

Have you ever programmed the Epiphany, in, like, a real project?
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 13, 2015 12:47 pm

sebraa wrote:The main limit currently is the interface between Epiphany and FPGA, and the main memory is fast enough to saturate it. So no, it (currently) wouldn't be faster than main memory.


A question I have is: would you get lower latency? Latency can be as big a problem as bandwidth.

I don't yet have a Parallella; the problem I describe above interests me.

From what I gather reading around, the actual performance of the board will be disappointing - it's intended for experimentation. I'm familiar with the process of getting experimental hardware, where the next version is different based on lessons learnt from the prototype.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby sebraa » Thu Aug 13, 2015 2:23 pm

I would say that latency doesn't matter here.

You have to differentiate between "local" accesses (core-local 32 KB memory) and "remote" accesses (anything else).
For local reads and writes, you won't experience any latency.
For remote writes, you won't experience any latency (assuming an idle mesh); writes are fire-and-forget. If the mesh is busy, there is round-robin arbitration.
For remote reads, you always have a huge latency, so you should avoid them in any case.

The available bandwidth is defined by the external eLink interface, not the memory you attach to the other side. So I wouldn't expect much difference. Also, FPGA RAM is expensive.
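A minimal sketch of the "push, don't pull" pattern those latency rules imply: the producer writes into a mailbox that lives in the consumer's local memory, so the consumer only ever polls locally and nobody issues a remote read. The Mailbox layout is illustrative and write ordering is simplified.
Code: Select all
struct Mailbox {
    volatile float result;
    volatile int   ready;       // 0 = empty, 1 = result valid
};

// Producer core: two fire-and-forget remote writes into the consumer's SRAM.
void produce(Mailbox* consumer_box, float value) {
    consumer_box->result = value;   // payload
    consumer_box->ready  = 1;       // publish (ordering simplified here)
}

// Consumer core: spins on memory that is local to it -- no remote reads.
float consume(Mailbox* my_box) {
    while (!my_box->ready) { /* spin on local memory */ }
    my_box->ready = 0;
    return my_box->result;
}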
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 13, 2015 3:25 pm

sebraa wrote:For remote reads, you always have a huge latency, so you should avoid them in any case.


But we have situations where we need to calculate something that affects the address of the next data to read. This is unavoidable. GPUs use many threads to hide such latencies; CPUs use threads & OOOE.
Anything really interesting requires a complex memory access pattern (e.g. raytracing - I mean against a complex polygon soup, not just some spheres...).

Parallella would have to use some deferring mechanism (trace against a top-level grid, spit out lists of ray fragments with requests for grid cells; sort those requests; stream the ray fragments & grid cell contents into local stores; repeat, spawning more fragments...*).
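A very rough sketch of the shape of that deferral scheme: rays are binned by the grid cell they need next, and each bin is processed as a batch so the cell's contents only have to be brought into local memory once per batch. The actual intersection work is omitted and all types are illustrative.
Code: Select all
#include <cstddef>
#include <vector>

struct RayFragment { int ray_id; float t_enter; };        // per-ray state
struct Cell        { std::vector<int> triangle_ids; };    // grid cell contents

// Phase 1 (not shown) emits (cell, fragment) requests instead of chasing
// pointers. Phase 2 processes each cell's queued fragments together.
void process_deferred(std::vector<Cell>& grid,
                      std::vector<std::vector<RayFragment>>& bins) {
    for (std::size_t cell = 0; cell < bins.size() && cell < grid.size(); ++cell) {
        if (bins[cell].empty()) continue;
        // Here the cell's triangles would be streamed into a core's local
        // store once; each fragment is intersected against them, and any ray
        // that continues gets re-binned into its next cell for the next pass.
        for (const RayFragment& frag : bins[cell]) {
            (void)frag;        // intersection test omitted in this sketch
        }
    }
}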

Temporary FPGA memory would be a great place to accumulate such deferred temporary data, existing for the same reason we have L2/L3 caches.

Point taken that FPGA memory is a precious resource*.
The ideal setup would be for the ARM, FPGA and Epiphany etc. to share a single big cache buffering everything that goes out to DDR - this was done with a shared L2 in the CELL (PPU and SPU DMA went through it, so you could get data between the PPU & SPU without hitting main memory).
(* Speaking of which, I wonder if you could implement a sort algorithm on the FPGA, and an 'indexed gather' DMA channel, exactly for handling the kind of case I describe above.)

Another question I have is whether the FPGA could implement a shared cache (servicing both ARM & eLink transfers, with lazy writes to DDR, and avoiding some temporaries ever reaching DDR).


Reading around a bit more, it seems OpenMP (which I'd overlooked) can handle nested parallelism. And I gather the SDK does have OpenMP support; I wonder which cases are supported.
Can a piece of ARM code already fan out, spawning Epiphany threads, or would you still need 'the whole application' on Epiphany to use this effectively?

The approach I'm familiar with is manual nested parallelism: (threads x (pipeline + SIMD)), but on Epiphany that will need to be (cores x more cores), and if you have complex outer code maybe (ARM cores x Epiphany cores).
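For reference, the plain host-side OpenMP shape of that nested parallelism - whether the Epiphany OpenMP port maps the inner level onto Epiphany cores is exactly the open question above; the loop bodies are placeholders.
Code: Select all
#include <omp.h>
#include <vector>

void update(std::vector<std::vector<float>>& characters) {
    omp_set_max_active_levels(2);                 // allow two nested levels
    #pragma omp parallel for                      // outer: e.g. characters
    for (int c = 0; c < (int)characters.size(); ++c) {
        #pragma omp parallel for                  // inner: e.g. joints
        for (int j = 0; j < (int)characters[c].size(); ++j)
            characters[c][j] *= 0.5f;             // placeholder work
    }
}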
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby sebraa » Thu Aug 13, 2015 9:04 pm

dobkeratops wrote:But we have situations where we need to calculate something that affects the address of the next data to read. This is unavoidable.
Local reads don't have latency, remote reads do. This is unavoidable, too.

dobkeratops wrote:Temporary FPGA memory would be a great place to accumulate such deferred temporary data, existing for the same reason we have L2/L3 caches.
Temporary FPGA memory would neither be faster nor have a lower latency than the current main memory. Since the Epiphany is an uncached architecture, you have to make sure beforehand that your data is where you need it. If you can't guarantee this, then Epiphany will not perform well; this is by design.

dobkeratops wrote:Another question I have is whether the FPGA could implement a shared cache (servicing both ARM & eLink transfers, with lazy writes to DDR, and avoiding some temporaries ever reaching DDR).
Your local memory is the perfect spot to put temporaries: it is incredibly fast, has no latency, and it scales to millions of cores. Any shared cache system will not scale.

dobkeratops wrote:Can a piece of ARM code already fan out, spawning Epiphany threads, or would you still need 'the whole application' on Epiphany to use this effectively?
The ARM code loads code into the Epiphany, which then runs it.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm
