generating epiphany code from templates, possible?

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 13, 2015 10:00 pm

dobkeratops wrote:Can a piece of ARM code already fan out spawning epiphany threads.. or would you still need 'the whole application' on epiphany to use this effectively.
The ARM code loads code into the Epiphany, which then runs it.

I mean for OpenMP:

Code: Select all
int main()                  // runs on the ARM host
{
    // ... host work on ARM ...
    #pragma omp ...         // some offload directive
    {                       // this block runs on the Epiphany
        // ... device work ...
    }
    // ... more host work on ARM ...
}

or even
Code: Select all
int main()
{
    #pragma omp ...
    {    // ARM: outer 2 threads
         // ... ARM work ...
         #pragma omp ...
         {    // Epiphany: inner N threads
              // innermost work (on cores optimised for working in parallel on limited sequential data)
         }
         // ARM does the final reduction (a core optimised for working serially on large data)
         // (what would be perfect is if the communication of temporaries between Epiphany & ARM stayed in the L2 cache)
    }
}
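For reference, here is a minimal sketch of what the first pattern looks like in standard OpenMP 4.x offload syntax. Whether the Parallella OpenMP port accepts exactly these clauses is an assumption on my part; the point is just the host/device split around one offloaded region.

Code: Select all
#include <cstdio>

int main() {
    const int N = 1024;
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = float(i); b[i] = 2.0f * i; }   // ARM: prepare data

    // Offload region: the enclosed loop is intended to run on the accelerator
    // (the Epiphany, if the toolchain maps 'target' to it), with explicit
    // data movement into and out of device memory.
    #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[10] = %f\n", c[10]);                                  // ARM: use the results
    return 0;
}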




sebraa wrote:
dobkeratops wrote:but we have situations where we need to calculate something that affects the address of the next data to read. This is unavoidable.

Local reads don't have latency, remote reads do. This is unavoidable, too.

The local stores on the Epiphany are tiny; some extra buffering doing useful sorting work might have value.

One option would be to spawn the random load, then 'yield' (swap to another task on the same core). Or push the state and the load to another core's message queue (but this still means chewing up precious on-chip data).

Or perhaps make the core pause, powering down and allowing cooling (the memory can still be doing useful work keeping data on chip, so it's not as if it's wasted).

Latencies are unavoidable, but the powerful rendering we have on GPUs is a testament that addressing large datasets *can* be tamed - I think it's just random reads & writes in the same area that are problematic.

dobkeratops wrote:Another question I have is: could the FPGA implement a shared cache (servicing both ARM & eLink transfers, lazy writes to DDR, and avoiding some temporaries ever reaching DDR)?


Your local memory is the perfect spot to put temporaries: it is incredibly fast, has no latency, and it scales to millions of cores. Any shared cache system will not scale.

But a shared *outer* cache for multiple inner uncached machines might be a compromise between the extremes (everything cached, or everything uncached). Caches aren't inherently bad; it's just multiple coherent caches that are a problem, as I understand it. Is ditching the final cache level just throwing the baby out with the bathwater?

It would be interesting to compare various shades of grey between the extremes:
- 1024 cores x 128 KB local store
- 1024 cores x 64 KB local store + 16 MB of outer shared write-back L2 cache
- 1024 cores x (32 KB local store + 16 KB of read-only L1 cache - only works for pages marked read-only, or needs a 'sized precache' instruction that marks which parts of the lines are valid, to leverage compile-time knowledge of fine-grain immutability/unique ownership via 'restrict' pointers) + 16 MB of outer shared L2 cache
- or even an L3 cache, with 4 coherent L2 caches, one for each quarter of the grid (16x16 cores), etc.


There are some tasks, like raytracing or rendering with a complex data flow, where you either need caches or some large intermediate data store to sort into.

I can certainly see how you could sort chunks of textures into tiles and so on.. but the problem with putting ALL that intermediate data in *local memory* is that the local memory itself is a precious resource - you want it active. I'm talking about data that's going to be streamed out and back in again - and probably unavoidably large - so if you don't have caches, you need some large buffer in the middle.

Tiled renderers attempt to do away with random framebuffer access, but to do so they perform scene capture - i.e. transformed vertices have a longer lifetime, requiring a large buffer to sort them into tiles.

It's a tradeoff: you can't wish away the complexity of random addressing - the cost pops up again somewhere else.

Fancy, say, trying to adapt a C++ compiler to run *on the epiphany cores*.. if it is truly a general-purpose computing solution.

I think what's going on is that any attempt to rework code to not use caches is going to have some sort of sorting stage.

A cache is providing hardware assist for sorting (hardware bucketing of data).

For reference, the CELL didn't entirely ditch caches: it had a shared L2, and tiny 'atomic operation' caches to assist arbitrating between tasks.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby sebraa » Fri Aug 14, 2015 3:01 pm

dobkeratops wrote:
dobkeratops wrote:Can a piece of ARM code already fan out spawning epiphany threads.. or would you still need 'the whole application' on epiphany to use this effectively.
The ARM code loads code into the Epiphany, which then runs it.
I mean for OpenMP (...code...)
I don't see why this shouldn't be possible. However, you would need to compile your code fragment into a fully working application, then send it to the cores you want to run that code on, then send the data you want to operate on to the respective cores, let them compute, then collect the results from the cores. If you have pre-compiled fragments available on the Epiphany, you only need to do the last two steps.
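For concreteness, a rough host-side sketch of those steps, using the e-hal calls from the Epiphany SDK as I remember them (treat the exact signatures, the 'e_task.elf' name and the buffer addresses as assumptions, not gospel):

Code: Select all
#include <stdio.h>
#include <e-hal.h>                 // Epiphany host-side library from the ESDK

int main() {
    e_platform_t platform;
    e_epiphany_t dev;
    int input = 42, result[16] = {0};

    e_init(NULL);                  // bring up the host driver
    e_reset_system();
    e_get_platform_info(&platform);

    // 1. open a 4x4 workgroup and load a pre-compiled device ELF onto it
    e_open(&dev, 0, 0, 4, 4);
    e_load_group("e_task.elf", &dev, 0, 0, 4, 4, E_FALSE);

    // 2. send input data into each core's local memory (0x2000 is just a
    //    placeholder for whatever address the device program expects)
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            e_write(&dev, row, col, 0x2000, &input, sizeof(input));

    // 3. let the cores compute
    e_start_group(&dev);
    /* ... wait or poll for completion; the protocol is up to the application ... */

    // 4. collect the results back from each core's local memory
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            e_read(&dev, row, col, 0x3000, &result[row * 4 + col], sizeof(int));

    e_close(&dev);
    e_finalize();
    return 0;
}

With pre-compiled fragments already resident on the Epiphany, the compile-and-load steps drop out, as noted above.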

dobkeratops wrote:
sebraa wrote:
dobkeratops wrote:but we have situations where we need to calculate something that affects the address of the next data to read. This is unavoidable.
Local reads don't have latency, remote reads do. This is unavoidable, too.
the local stores on the epiphany are tiny, some extra buffering doing useful sorting work might have value.
And if that extra buffer is too small (remember, FPGA memory is tiny, and wouldn't be much faster than the current DRAM!), you still have to go out to even larger and slower secondary memory. If your problem is too large and you can't cut it into smaller pieces (which fit the available fast memory), then you will not be efficient. On the Epiphany that means 32 KB per core, and if you can't fit it there, it will not be efficient. Don't compare a sub-1 W, 16-core, 512 KB Epiphany to an over-200 W, 1500-core, 2 GB GTX680. If your problem doesn't fit the tool, then you have to either change the problem or the tool.

dobkeratops wrote:there are some tasks like raytracing or rendering with a complex data flow where you either need caches, or some large intermediate data-store to sort.
Then use a system which has either caches or some large intermediate data-store. Choose the right tool for the job.

dobkeratops wrote:Fancy, say, trying to adapt a C++ compiler to run *on the epiphany cores* .. if it is a truly a general purpose computing solution
A Z80-based system with 64 KB of RAM is a general-purpose computing solution too (it powered computers for over a decade), but you wouldn't fancy a C++ compiler (or even a full C compiler) on one of those either, would you? The Epiphany is not a desktop CPU, nor is it a GPU, and if you imagine using it as either, you will be disappointed.

dobkeratops wrote:for reference the CELL didn't entirely ditch caches: it did have a shared L2, and tiny 'atomic operation' caches to assist arbitrating between tasks.
And if the cache was too small, you had to go to main memory instead, paying the price.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Fri Aug 14, 2015 4:35 pm

sebraa wrote:And if that extra buffer is too small (remember, FPGA memory is tiny, and wouldn't be much faster than the current DRAM!) you still have to go out to even larger and slower secondary memory


Sure, but everything is shades of grey.

Look at it this way: how about having a default configuration where the whole FPGA is used for something useful? If its memory reduces the amount that DDR needs to be used when communicating between cores, or between the ARM & the cores, that frees up DDR bandwidth for other purposes (simultaneous use by ARM and Epiphany).

Or, by buffering/batching loads & stores for you, it might make better use of DDR burst transactions.

It would increase the size of your working set before needing to go 'off-chip'. I do believe that's helpful for a sorting stage.

(Then, if someone has a better use for something specific, they can tailor it: remove that cache, replace it with something dedicated, and the same software still runs - some parts will be slower, but other parts will be faster.)
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby piotr5 » Fri Aug 14, 2015 8:19 pm

@sebraa: thanks for the clarification, I completely forgot about the virtual memory at core (0,0). Nonetheless, the truth is that those 32 MB of shared memory are set aside before Linux starts, to prevent programs from occupying them and to give Epiphany programs a consistent memory location. So you're right, but I insist that theoretically one could re-program Linux to cooperate with Epiphany programs somehow. There are 64 rows and 64 columns of 1 MB memory regions addressable by the Epiphany; only 4 MB per row are used on the 16-core version, so the other 60 MB are available for shared memory - 960 MB in total are addressable with the same address from the Epiphany as well as from the ARM side. What we don't have is Linux support for those 60 MB regions. Is there a placement-new requesting a specific memory address? If not, then compiler support and support from libc would be needed too. Some secure Linux should attempt to protect that memory from being altered by the Epiphany...
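The 64 x 64 x 1 MB figure falls out of how the Epiphany composes its flat 32-bit addresses, as I understand the architecture reference: the upper 12 bits are the core ID (6-bit row, 6-bit column) and the lower 20 bits address 1 MB within that node. A small sketch (the example core coordinates are arbitrary):

Code: Select all
#include <cstdint>
#include <cstdio>

// Compose an Epiphany global address: upper 12 bits = core ID (6-bit row,
// 6-bit column), lower 20 bits = offset into that node's 1 MB window.
// Hence the 64 x 64 grid of 1 MB regions in the flat 4 GB address map.
static std::uint32_t global_addr(std::uint32_t row, std::uint32_t col, std::uint32_t offset) {
    return (row << 26) | (col << 20) | (offset & 0xFFFFF);
}

int main() {
    // The first core of the 16-core chip sits at (32, 8), i.e. core ID 0x808,
    // so offset 0x2000 in its local memory is globally visible at 0x80802000.
    std::printf("0x%08x\n", global_addr(32, 8, 0x2000));
    return 0;
}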
dobkeratops wrote:I'm talking from experience. The similar CELL architecture has been and gone in the games industry. This is why. We already had the concept of 'middleware', separate libraries. which had started to become popular since the PS2 era. The problem is the fine grain interactions, so the separate libraries really become a complete engine. So what matters is the process of developing the library. There is way more work & runtime cost in the integration between modules than you seem to think, and its' far less predictable than you seem to imagine.

I have high respect for your experience, but times change - maybe some new ideas are available now which you didn't have back then? For example, I have observed an interesting trend towards the usage of source code instead of libs, i.e. don't use binaries: every single piece of the software is compiled together with your whole project. This way it becomes possible to move the 3rd-party code onto another processor architecture and code for all the different kinds of phones and tablets. I suspect the reason is multicore systems, as started by AMD's 8-core system and the highly energy-efficient Intel versions: you tell the build system how many threads it should start, and the compiled product is available after a few minutes.

Especially when it comes to object-oriented programming, with source code you can improve the structure - as you said, the system is always changing and should never be trapped in a structure fixed at the beginning. The actual source code won't need to be changed, but the administrative stuff in headers and such will need the change, and as a consequence the actual splitting of bigger functions into smaller ones will need to happen. That won't work with closed source.

Btw, why do you keep talking so much about sorting? As I understand it, in graphics the POV changes smoothly, slowly moving along a fixed line, so the z-distance data will also change gradually and no actual sorting is needed: most of the time everything is already sorted, and only small changes happen as objects pass at their minimum distance to the movement curve. Did nobody ever get the idea to handle it more efficiently? Are all programmers sorting the whole set of triangles over and over again? (Of course graphics is just an example; many other areas too go through only smooth changes and require no full sort algorithm.)

Anyway, back on topic. AFAIK OpenMP is using message-passing, so you could pass code in a message in the same way you command your video card to compile some C program into a pixel shader - but isn't that a big overhead? An external level of cache is a nice idea, but theoretically it won't help much. In practice, however, someone should test it; maybe there still are bugs in how memory is handled. As I understand it, the Epiphany frequency is tuned to match the FPGA frequency, which is supposed to match the ARM, which again must somehow match the DDR frequency. I'm not sure the Epiphany and DDR work well together though - I remember somewhere in the forum it was mentioned they don't, or rather didn't on the old FPGA image...

Again my point of view, since you didn't reply to it thoroughly enough: such things as inner loop and outer loop are IMHO ideas from the past. I always try to program in a way that all for and while instructions are located in the header files of the C++ programs, and of course I make use of the std-lib facilities for processing loops with lambdas and such, but they too go into header files mainly, and I prefer to avoid tail recursion. Therefore I'd never get into the situation of trying to optimize out some loop body to be executed on other hardware. As you say, the communication between multiple parallel loops is what the source code should focus on; that's where complexity actually needs to become disentangled. So, in addition to my whole loop-avoidance, what else do you suggest is required for parallel programming? Why do you keep insisting on pragmas to steer which software will be responsible for compiling my source code?

I see objects as non-hierarchical, therefore I don't understand your concept of rough-grained and fine-grained loops. Of course I have an arm and I have fingers on its end. Moving the arm will move the fingers, so there is a clear dependency; the other hand is largely independent though. But I never would see fingers as a more fine-grained version of the arm. There's no hierarchy there, except for a static similarity between arm and finger, a similarity that exists only at compile time.

The same goes for objects splitting up into multiple triangles. Of course the object contains the triangles, but wouldn't it be better to treat a container of triangles as something distinct from the actual object? Until you need to draw it, the triangles can be manipulated in relation to the object's centre; only objects within a certain distance can affect them, and even such a nearby object will only interact with a small part of the current object's container of triangles. In other words, putting triangles in a container imparts a certain order on them; the object, however, is composed of many different ways to sort the triangles, each of which must be represented by some container of pointers to their actual data. These different ways of sorting belong in the main object, while the triangles might be a better fit in a global container of textures, since a single triangle might wander from one object to another in case the object breaks or gets damaged. Again, just an example of how to replace your idea of fine-grained and coarse-grained with something that actually has a meaning in the very domain language programmers would use daily. Coarse or fine are concepts from a meta-language; they don't represent the actual activities of the program!
piotr5
 
Posts: 230
Joined: Sun Dec 23, 2012 2:48 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Fri Aug 14, 2015 9:10 pm

Interesting discussion.. thanks for partaking :)

piotr5 wrote:for example I observed an interesting trend towards usage of source code instead of libs.


Right - nail, head.
...that's because it's easier than splitting code into strict 'domains' (which is what was tried first)... you need fine-grain interaction.
Also, C++ allows you to inline and specialise a lot with template libraries (the best bit!) - so even if your code can be split at source level, when it's actually compiled it's intermingled.
You miss this if you use binaries. It's this specific ability that I'm after here. (When I say lambdas, I do want them to inline into a stub on the Epiphany.)


piotr5 wrote:I have high respect for your experience. but times change, maybe some new ideas are available now which you didn't have back then?

I've always been highly theoretical :) and of course I've seen a lot along the way. I'm still even here because of speculation, not experience (experience would say "don't touch it with a barge pole!"). And yes, things have changed, a lot.

It was the experience of the PS3 and Xbox 360 that got me to research functional programming. (The experience there, of course, was 'C++03 was not sufficient..')

piotr5 wrote:this way it becomes possible to move the 3rd party code onto another processor architecture and code for all the different kinds of phones and tablets. I suspect the reason is multicore systems, as started by amd's 8-core system and the highly energy-efficient intel-versions.


So, to refine this picture: what happened in 2005-2006 was that Microsoft came to us with the 360 - a multicore design - which extrapolated ahead of the existing PC curve by going multicore-SMP. Sony came to us with the PS3 - very similar to the Epiphany - an extrapolation of the dual DSPs the PS2 had. More power on paper, more on-chip memory, more computational power.

What Sony said was, "we hope middleware will use the SPUs" (we need a word - 'little cores', whatever) - but what happened in practice is that we had all that hugely complex code being developed already, and we could adapt to new ideas more than twice as fast working on the Xbox 360. So it became the lead platform. Around that time there was the Ageia PhysX chip too, for PCs (again, 'little cores' on an expansion card). I remember explaining back then how encouraging the parallel evolution was.. and I was wrong :) the future was SMP, not little cores.

The problem with SPUs is that you don't have enough time to refine everything - you have deadlines; things *must* be released by a certain date. And whilst the SPU handled fully optimised code amazingly well, on the 360 you could get '80%' of the performance with maybe 1/4 of the effort (whatever those numbers are - that was the strategic effect). Out of the box, SPUs couldn't even run most programs. I wrote a set of lazy smart-pointer DMA wrappers to facilitate a quick port; looking around, it seems there were eventually attempts to compile 'normal' code with a literal software-emulated cache.. they were that desperate to get any use out of those cores.

PCs then evolved down the multicore route. The PhysX was squeezed out between SMP x SIMD on one side and GPGPU on the other - a great shame - and this is the exact space the Epiphany could be going for (re: a PCI card).

I'm stubborn, I still think the world did something wrong - shoe-horning general-purpose power into a GPU, and stretching the Intel architecture so far (those ancient 8-bit instructions driving a deeply pipelined vector machine.. ?!)


piotr5 wrote:btw, why do you keep talking so much about sorting?

Good question.
When you look at any code that uses random access through a cache and you think "how the hell do we do this with DMA?", the answer is sorting:
instead of random accesses, you spit out some sort of buffer of requests, then sort them, then traverse the data along with the requests for each datum.

(Today something similar has been popularised as "MapReduce" on clusters: take a data set, map it producing key-value pairs, sort the intermediates by key, then reduce the values for each key.)
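A minimal sketch of that pattern in portable C++ (the 'Request' type and the function names are just for illustration): instead of chasing each index through a cache, batch the requests, sort them by the index they touch, and then stream through the data in address order.

Code: Select all
#include <algorithm>
#include <cstddef>
#include <vector>

// Instead of 'out[k] = f(data[index[k]])' with random reads, batch the
// requests, sort them by the data they touch, then visit the data serially
// (so it can be streamed or DMA'd in contiguous chunks).
struct Request { std::size_t index; std::size_t out_slot; };

template <class T, class F>
void gather_sorted(const std::vector<T>& data,
                   std::vector<Request>& requests,
                   std::vector<T>& out, F f) {
    // 'map' phase: key each request by the element it needs
    std::sort(requests.begin(), requests.end(),
              [](const Request& a, const Request& b) { return a.index < b.index; });
    // traverse/'reduce' phase: the data is now touched in ascending order
    for (const Request& r : requests)
        out[r.out_slot] = f(data[r.index]);
}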

An example in graphics: z-buffer vs tile-based deferred rendering (not to be confused with deferred shading).
Z-buffer rendering - random writes into a big framebuffer - might use a cache.
How do you get rid of that? Accumulate primitives, with 'key' = on-screen grid cell, and sort them into tiles; then 'reduce': gather the primitives for a tile, render it on chip, and spit out the finished tile.

MapReduce is very general; the tiled rendering example is very specific.

This is what I want to do in cross-platform templates: build up a library of slightly specialised functions like that, and adapt them to different situations by plugging lambdas in.

Accumulating & sorting requests - that's what this FPGA memory might be good for?
I wonder if the FPGA could actually sort for you... could you assist the Epiphany with a sorted, indexed DMA channel - almost hardware assist for something like MapReduce?

piotr5 wrote:in graphics the POV is changing smoothly, slowly moving on a fixed line, so also the z-distance data


Right - I'm talking about sorting *in many different contexts* to get around the absence of a cache: sorting objects by grid cell for collisions, sorting game-script messages sent to actors...
I just talk about graphics a lot because it's been my interest.

I think you can apply the Epiphany architecture to any problem, eventually, with this approach. It's just that today most software leverages the built-in sorting ability of a cache, and all our source is written that way.

As he mentioned, separate memories scale much further. In the future, with Moore's law slowing, optical interconnect is a potential technology.. maybe it will become easier to cluster huge numbers of cores with optical interconnect between them. (I also wondered if they'd extend the Epiphany grid in 3D, or just use layers for stacked memories to make the 2D cores smaller..)

piotr5 wrote:so you could pass code in a message in the same way you command your video-card to compile some c-program into pixel-shader. but isn't that a big overhead? an external level of cache is a nice idea


Basically the complexity appears somewhere - whether it's a temporary buffer for sorting, or L1 caches with idle threads to hide latencies - the cost must be paid; of course one approach might be better than the other.

GPGPU uses an L2 cache to communicate between threads, and for coherence when writing to frame buffers (so it's not writing individual pixels; rather, it caches a small tile, draws nearby pixel fragments into it, etc.).

piotr5 wrote:what else do you suggest is required for parallel programming? why do you keep insisting on pragmas to steer which software will be responsible for compiling my source code?


[1] My own opinion is that higher-order functions & lambdas are the way to go, which C++ does quite well now. The issue I see here is how to take that from SMP to the Epiphany.
[2] Something to automatically generate pipelines, perhaps automatically splitting expensive functions off (e.g. a piece of game actor code doing ray traces: these would be deferred, and the code either side split into 2 stages, the second taking the ray result).
Perhaps this can be done with compiler plugins, or by analysing IR, ..

Without [2] we can do this semi-manually, e.g. write a function 'map_defer_map(data, first_stage, expensive_function, second_stage)'. That's why I'd go for [1] first.
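To make that concrete, here's a rough sketch of what such a 'map_defer_map' helper could look like on a plain SMP machine (the name comes from the post above; the implementation and the batching point are my own assumptions - on the Epiphany the middle batch is where work would be handed to other cores):

Code: Select all
#include <cstddef>
#include <utility>
#include <vector>

// One logical loop body, split around an expensive deferred call:
// stage 1 emits a request per element, the expensive calls run as a batch
// (the natural point to farm work out to remote cores), stage 2 consumes
// each result alongside its element.
template <class T, class Stage1, class Expensive, class Stage2>
void map_defer_map(std::vector<T>& data,
                   Stage1 first_stage, Expensive expensive_function, Stage2 second_stage) {
    using Req = decltype(first_stage(std::declval<T&>()));
    using Res = decltype(expensive_function(std::declval<Req&>()));

    std::vector<Req> requests;
    requests.reserve(data.size());
    for (auto& d : data)                            // stage 1: gather requests
        requests.push_back(first_stage(d));

    std::vector<Res> results;
    results.reserve(requests.size());
    for (auto& r : requests)                        // deferred batch of expensive calls
        results.push_back(expensive_function(r));

    for (std::size_t i = 0; i < data.size(); ++i)   // stage 2: consume the results
        second_stage(data[i], results[i]);
}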

I speculate that Coarse x Fine x Pipelines would saturate hundreds of cores just fine :) for the kind of complex CPU code in today's game engines (e.g. 8 x 8 x 4 - numbers in that ballpark), and of course with hundreds to thousands of cores you'd do more elaborate physics, and AI in different ways (more like GPGPU code).

That is of course open to question, others may have different ideas.

I see a GPU as a big hardware 'higher-order function' specialised for rendering, taking a vertex shader and a pixel shader like lambdas. Its success is down to the simplicity of slotting shaders in - hence many people could use it.

piotr5 wrote:why do you keep insisting on pragmas to steer which software will be responsible for compiling my source code?

OpenMP is similar but, as you say, limited: too much is built into the compiler, and it only gives you one pattern. Building templates for a traditional SMP threaded machine, you have a lot of control. (I think it's great that someone's done OpenMP support here.)

I take a lot of inspiration from what I see in the Haskell world & LISP, but I don't personally like Haskell enough to use it. (LISP is usually too dynamic, but it's the origin of the whole 'higher-order function' approach.. the term 'map-reduce' comes from their use of 'map' and 'reduce'..)

piotr5 wrote:I see objects as non-hierarchical, therefore I don't understand your concept of rough-grained and fine-grained loops. of course I have an arm and I have fingers on its end.


There's still locality to exploit here. E.g. on the PS3, you could fit the intermediate workspace for a character on one SPU; 'for each character..' - it's very natural to get a whole SPU working on each. (Epiphany cores are smaller, 32 KB instead of 256 KB, so perhaps '1 character' would actually be handled by 4-8 cores - overspill in each other's memory? or arbitrating via the external buffer? or one master pointing at data in subordinates?)

Calculations for one character don't affect the next, but there are many interactions *inside* one.
Then finer-grain parallelism is used within the local store (some work between the individual joints is independent).

Hierarchies usually mean pointer-chasing, which is bad. But what I'm referring to here is data locality, which is something good that you can exploit.

Trees are not so good now (too much pointer chasing), but a 2-level split into 'major' and 'minor' is (tiles & subtiles, objects sorted into grid chunks.. whatever).

In physics engines you might encounter the phrases 'broad and narrow phase' and 'islands'.

piotr5 wrote:of course the object contains the triangles, but wouldn't it be better to treat a container of triangles as something distinct from the actual object?
So, as you correctly observe, these days on PCs/phones the actual vertices & triangles stay with the GPU; CPU data structures just own and reference them.

Code: Select all
types of processing on horizontal axis.

high versatility,complex interactions   <-------------------> high throughput, specialised, simple interactions
serial                                     <--------------------------------->                         easier parallelism

control code     AI       collision/physics      animation                 vertices       texels/pixels
   (not exact, these functions spread along this axis, roughly in this order)


Distant past PS1..
[Mips CPU  ]                                ['GTE' vector coprocessor, works like fpu]                         [tinyGPU]

PS2
[Mips CPU]                              [  VU0 DSP        ][VU1 DSP               ]   [         simple GPU   ]

past - PS3:-
[PowerPC ]        [------------- Cell SPU ----------------------]                [---------- GPU------------]

present day PC, phones, APU, PS4:-

[-------------------CPU  SIMD--------------][  ---------------------------------GPGPU  GPU-----------------]

manycore future??   <<< This is what I want :)
[legacyCPU][ ------------------------...epiphany cores....???-----------------------------------][simple GPU]


dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby sebraa » Mon Aug 17, 2015 12:09 pm

dobkeratops wrote:Look at it this way - how about having a default configuration where the whole FPGA is used for something useful. If its' memory reduces the amount that DDR needs to be used when communicating between cores or between ARM & cores - that frees up the DDR bandwidth for other purposes (simultaneous use by ARM and Epiphany).
How much memory do you think you can get? If I read the spec correctly, we are talking about 256 KB of memory here. On the other hand, you lose the ability to use any other FPGA logic together with the Epiphany, which doesn't seem really useful to me.

piotr5 wrote:nonetheless, truth is those 32M shared memory is set aside before linux starts to prevent programs from occupying it, to allow epiphany programs to use a consistent memory-location.
Yes, and the same memory layout works for the 64-core version. This is important, you know? If you are designing a platform (the Parallella is a platform!), then you should not tailor it to the smallest possible configuration, but to the biggest possible configuration instead.

piotr5 wrote:so you're right, but I insist that theoretically one could re-program linux to cooperate with epiphany programs somehow.
Given some changes to the hardware and the Linux kernel, probably. Without changing everything, currently no.

piotr5 wrote:there's 64 rows and 64 colums of 1M memory-cells addressable by epiphany. only 4M per row are used on the 16-core version. the other 60M are available for shared memory. 960M in total are addressable with same address from epiphany as well as from ARM side. what we don't have is linux-support for those 60M regions.
Actually, we do ... it is called mmap(). But you need contiguous physical memory, which is not too easy to get on any decent operating system; it is possible however. You would need to extend the FPGA logic with some kind of small MMU to map the Epiphany addresses to some sensible physical addresses, matching the (userspace) memory allocations.

So yes, it is possible. It is just not implemented, and as long as nobody implements it, it won't be done. Your turn now.

piotr5 wrote:I have high respect for your experience. but times change, maybe some new ideas are available now which you didn't have back then? for example I observed an interesting trend towards usage of sourcecode instead of libs.
I like how you think everyone else is stupid. And no, people do not turn to "source code instead of libs", they turn to intermediate, high level, assembly. Think GLSL, SPIR-V, CUDA, LLVM IR. It saves the runtime cost of analysing source code, but keeps the freedom to run it on different architectures.

piotr5 wrote:an external level of cache is a nice idea, but theoretically it wont help much. in practice however, someone should test it, maybe there still are bugs in how memory is handled. as I understood, epiphany frequency is tuned to match fpga-frequency, which is supposed to match ARM, which again must match somehow the DDR frequency. not sure if epiphany and DDR work well together though. I remember somewhere in the forum it was mentioned they don't, or rather didn't on the old fpga image...
Do the testing, then. The Epiphany bandwidth is 1 GB/s at 1 GHz, or 600 MB/s at 600 MHz clock frequency. The DDR should be able to handle that just fine.

dobkeratops wrote:As he mentioned separate memories scales much further.
If you mean me, I just repeated Andreas' stance: Centralized memory (i.e. caches and external DRAM) cannot scale by design, only distributed memory does (slightly simplified here). This is a very reasonable assumption, and the Epiphany architecture implements that by having a local, fast scratchpad memory.

dobkeratops wrote:I speculate Coarse x Fine x Pipelines would saturate 100's of cores just fine :)
What about 100'000s of cores?
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Mon Aug 17, 2015 2:18 pm

sebraa wrote:What about 100'000s of cores?


One step at a time.
They haven't even mass-produced the 64-core version yet.

*Right now* you can use ~100 levels of concurrency with existing mainstream CPU techniques (a few threads x pipeline x SIMD), and it's generally harder to match this on the Epiphany architecture (with today's tools).

Of course there are many potential applications of this chip.
I'm talking about my area of interest - game engines. These are data-intensive, have complex use of data (many interacting objects), and complex programs (hundreds of klocs).
Games already leverage GPUs, which scale to huge core counts across vertices and pixels - that seems to be a solved problem, and resolution demands will go up for VR.

Is there space for a new device? See what I'm talking about above - I would personally like to see something like this:
instead of CPU+SIMD plus GPGPU, I would like a manycore throughput CPU, suitable for the vertex and maybe even pixel workloads, as well as AI, physics, collision...

I think these have only ended up separate for evolutionary reasons.

The approach I'm talking about would be suitable for ~128 cores *with existing code*, compiled portably without too much modification. I think you'd have to compare 1 traditional core to 4-16 Epiphany cores, depending on context.

And in the context of games, above that, you could still use spare cores for vertex shading, if your theoretical future box replaced the GPGPU with Epiphany.

Or you might start writing more brute-force AI which is easier to trivially parallelise ('try 16 different random actions, pick the action with the lowest cost..'). That would show up in my programming model as another nesting level.

sebraa wrote:If you mean me, I just repeated Andreas' stance: Centralized memory (i.e. caches and external DRAM) cannot scale by design, only distributed memory does (slightly simplified here). This is a very reasonable assumption, and the Epiphany architecture implements that by having a local, fast scratchpad memory.


Sure. And this isn't new to me - I've worked with scratchpad-based machines since 1995. Even before manycore in games, we had early graphics chips accessing memory with DMA, and the scratchpad was used to avoid coherence issues and to keep devices off the same bus. And we heard all this (latency/coherency) for the PS3.

caches cannot scale by design

MUTABLE caches cannot scale by design.
GPUs trivially scale to thousands of concurrent threads using caches for textures, because they're read-only, or any write->read is clearly separated ('1. render a shadow buffer from the light; 2. render shadowed objects from the camera').
Code is also mostly read-only (just flush for new pages).
We're used to separating out immutable data now: const correctness, restrict..

Right now, if you had the ability to produce, say, a 1024-core version, I would in preference choose 512 cores with scratchpad + read-only L1 caches (maybe even just an i-cache, like the PS1) and a single high-level shared cache (i.e. an L2, like a GPU), and possibly hyper-threading to hide latencies.

(Also imagine some new 'read-only/non-aliased guarantee' prefetch instructions, applicable to const/restrict pointers.. I bet you could handle more mutability than today.)

That would still scale - and be applicable to a larger range of problems today (hence more likely to be mass-produced).
It would still differ from a GPU by being MIMD and having the potential for fine-grain inter-core communication.

But I can see how a new language, or a very complex set of tools, could let you compile software to work without any cache, as per the existing Epiphany design. (I can see how groups of cores could be used as a cache, with functions: e.g. instead of "read this data, apply the function, use the result", you'd "send a message to the cluster of cores caching this collection, asking it to apply this function to this element and send the result to the next stage". That work needs to be generated routinely by a compiler, as easily as a programmer can use a cache today by writing 'i=...; b=f(x[i]); ..do stuff using b..' - 3 lines of traditional code become 3 ELFs on the current Epiphany programming model.)
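To make the contrast concrete, a toy sketch - the 'MessageQueue' type and the core roles here are hypothetical stand-ins, not an existing Epiphany API. The cached form is the three-line comment; the cache-less form is the same logic split into three cooperating programs.

Code: Select all
#include <cstddef>
#include <deque>

// Hypothetical inter-core mailbox; a stand-in for whatever message/DMA
// mechanism a real implementation would use.
template <class T>
struct MessageQueue {
    std::deque<T> q;
    void push(const T& t) { q.push_back(t); }
    bool pop(T& t) { if (q.empty()) return false; t = q.front(); q.pop_front(); return true; }
};

static float f(float v) { return v * 2.0f; }          // placeholder for the real work

struct Query  { std::size_t i; };
struct Answer { float b; };

// Cached machine - the whole thing is three lines in one place:
//   i = ...;  b = f(x[i]);  /* ..do stuff using b.. */

// Cache-less split - the same logic as three cooperating programs:
void requester_core(MessageQueue<Query>& to_owner, std::size_t i) {
    to_owner.push({i});                               // 1. emit the read as a request
}
void owner_core(MessageQueue<Query>& in, MessageQueue<Answer>& out,
                const float* local_slice, std::size_t base) {
    Query q;
    while (in.pop(q))                                 // 2. the core holding that slice
        out.push({f(local_slice[q.i - base])});       //    applies f where the data lives
}
void consumer_core(MessageQueue<Answer>& in) {
    Answer a;
    while (in.pop(a)) { /* 3. ..do stuff using a.b.. */ }
}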

We have a chicken-and-egg situation.

Without the chip, no one has an incentive to develop the language/tools.
Without the language/tools, no one has an incentive to use the chip.
(I know this is why they made the Parallella.)
Meanwhile, important, limited, specialised functions (e.g. graphics, video) get dedicated hardware, or can run on FPGAs.

I do believe it's possible - it's just that today, with current tools, the software complexity doesn't scale - which is why mainstream game consoles no longer use scratchpads (the PS3 was the last), and they have unlimited future scaling in the GPU, where the vast majority of computational power is needed.

To date, this software challenge defeated Sony, Toshiba and IBM, who needed the exact same problems solved for their CELL architecture. They invested billions developing it and have given up. But the world has accumulated experience since then.

The concept still interests me, which is why I'm here.

sebraa wrote:How much memory do you think you can get? If I read the spec correctly, we are talking about 256 KB of memory here. On the other hand, you lose the ability to use any other FPGA logic together with the Epiphany, which doesn't seem really useful to me.

So imagine that, out of the box, the FPGA is applied this way - accelerating ARM<->Epiphany communication - and if you have a better use for the FPGA, great, you just shrink or eliminate that. Scalability... and it makes the board more useful to more people for less effort. You haven't lost anything; the beauty of caches is that they are transparent to software.

sebraa wrote:I like how you think everyone else is stupid. And no, people do not turn to "source code instead of libs", they turn to intermediate, high level, assembly. Think GLSL, SPIR-V, CUDA, LLVM IR. It saves the runtime cost of analysing source code, but keeps the freedom to run it on different architectures.

You're both right - you're talking about different use cases.

Game engines have indeed moved to 'source code instead of libs' for the bulk of the complex code, because it allows templates/inlining, which C++ uses heavily, and because there is a lot of fine-grain interaction between libraries. And people use GLSL, SPIR-V, CUDA.. to leverage GPUs for isolated tasks (across a small amount of the total source code).

My interest here is the new ground: the uncharted territory of complex code on a high-throughput device.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby piotr5 » Mon Aug 17, 2015 3:58 pm

@dobkeratops: you still didn't say whether you have looked into HSA. Meanwhile there is a C++ compiler for it (Linux only, of course; Windows development is, as usual, just company-driven, and we know how clumsy a hardware company can be when it comes to software development). And of course there's the thousand-page manual on the low-level aspects of the Kaveri APU (whose low-level language allegedly just mimics HSA syntax) available from AMD. Watts per core is still high, but it's a good starting point for getting some development tools done. Also an interesting pointer towards Adapteva's competition, which they provided on Twitter a month ago: http://www.technologyreview.com/news/539416/startup-attempts-to-reinvent-the-cpu-to-make-computers-less-power-hungry/

And thanks for showing me the new application for sorting algorithms - for ray-tracing it actually makes sense. (Although I always complain about raytracers refusing to reuse data from the previous frame for the next.) You're right, this would be useful for a software-emulated cache. Has anybody actually tried to create a software-emulated cache? I think the best starting point is to look at the Linux source code, at how they handle virtual memory, and then put similar code into GCC. Linus might hate cache-less architectures - he said 4 cores is about enough for everybody - but his kernel certainly has something interesting to offer on the topic of resource management.

As for code transformation, down to actually producing 3 binaries, I am still not convinced this is the correct path to walk. Here it shouldn't be the computer adapting to the bad habits programmers learned in school; instead, programmers should get into the habit of splitting loops up into 3 functions: one for looping, one for data fetch and transform, and one for post-processing - and of course many more if your pipeline has more steps for juggling the index.

You know, I learned C++ from various sources, but apart from ddd (where I learned that the letter i is a bad name for an iterator or index), I was mostly influenced by the Ultimate++ project (upp.sf.net). There I learned that if you just introduce the "Moveable" attribute to various containers, then no garbage collection is ever needed again. Garbage collection is needed because the compiler cannot touch the position of an object for as long as the object exists; but if the lib knows something is moveable, then it can be stored directly into an array, and for resizing that array you'd just move the object. There I understood that the concept of constant variables has a similar function: to communicate some valuable info for the various libs to use. Well, it's not the libs that need such info, it's the programmers who must become aware of the special properties of their objects, so they can use them in a special way.

Now we need something similar on a functional level, but this won't work if people keep coding a mess of loops and sequential code that knows no limits. In order to think of code in a functional way, you should see different parts of your function as enclosed entities. Already the fact that you declare the input and output of a function is a great help in determining which data has which relationship. Lambdas don't help here though, since in C++ they can have side-effects...
sebraa wrote:
piotr5 wrote:I have high respect for your experience. but times change, maybe some new ideas are available now which you didn't have back then? for example I observed an interesting trend towards usage of sourcecode instead of libs.
I like how you think everyone else is stupid. And no, people do not turn to "source code instead of libs", they turn to intermediate, high level, assembly.

Well, I can only say what dobkeratops said from his point of view too: interesting discussion.. thanks for partaking :)

Of course I have to thank you for the rest of your message too; it is always good to find an opposing mind who is willing to share what is correct and what is wrong, putting up some actual facts.
piotr5
 
Posts: 230
Joined: Sun Dec 23, 2012 2:48 pm

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Mon Aug 17, 2015 5:09 pm

piotr5 wrote:has anybody actually tried to create a software-emulated cache?

Yes -
IBM did this on Cell and gave it out as sample code. Looking at GCC options (I haven't touched a PS3 in years), it looks like it actually got integrated into the compiler too.

(I speculate that on the Epiphany it might be useful to make a cluster of cores work like cache lines, for a specific vector?)
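For anyone curious what a software-emulated cache boils down to, here's a minimal direct-mapped sketch. The 'dma_fetch' below is just a stand-in for whatever transfer primitive the target provides (a DMA call on Cell or Epiphany); the line size and line count are arbitrary.

Code: Select all
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the platform's DMA/transfer primitive; here it just copies
// from an ordinary host pointer so the sketch is self-contained.
static void dma_fetch(void* local_dst, std::uintptr_t remote_src, std::size_t n) {
    std::memcpy(local_dst, reinterpret_cast<const void*>(remote_src), n);
}

// A tiny direct-mapped software cache: a tag check on every access, and a
// whole-line fill from remote memory on a miss - roughly what the Cell
// sample code automated.
constexpr std::size_t LINE  = 128;               // bytes per cache line
constexpr std::size_t LINES = 64;                // 8 KB of local store used as cache

static std::uint8_t   cache_data[LINES][LINE];
static std::uintptr_t cache_tag[LINES];          // remote line address, 0 = empty

static std::uint8_t* cached_read(std::uintptr_t remote_addr) {
    std::uintptr_t line_addr = remote_addr & ~static_cast<std::uintptr_t>(LINE - 1);
    std::size_t    slot      = (line_addr / LINE) % LINES;
    if (cache_tag[slot] != line_addr) {          // miss: fetch the whole line
        dma_fetch(cache_data[slot], line_addr, LINE);
        cache_tag[slot] = line_addr;
    }
    return &cache_data[slot][remote_addr - line_addr];
}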

piotr5 wrote:linus might hate cache-less architectures, he said 4 cores is about enough for everybody,

I guess 4 cores is just a sweet spot for current software & machines, but that just sounds like another 'no one needs more than 640k' comment on his part.. I'm sure the core count will rise as the years go by.
piotr5 wrote:but his kernel certainly has something interesting to offer on the topic of resource management.

Absolutely. Reading around, I heard that it's actually possible to use Linux to share the address space across a cluster - "single system image".. fascinating stuff.
It is just an extension of the cache hierarchy, ultimately.



piotr5 wrote:as for code-transformation, down to actually producing 3 binaries, I still am not convinced this is the correct path to walk. here it shouldn't be the computer adapting to bad habits programmers learned in school, instead programmers should get a habit of splitting loops up into 3 functions, one for looping, one for data-fetch and transform, and one for post-processing.


This automatic split is not what I'd start with (it's nowhere near low-hanging fruit), but it should be an ambition for the community.

The point of computers is to make life easier - to automate chores.

Doing it manually will make your software ~4x the size, harder to maintain, and no longer portable. You split one logical function between 3 separate locations, and you had to create 3 names and refer between them. In practice the code becomes "write once, throw away because I no longer understand it", and harder for multiple contributors to work on.

The point here is: you're losing the parallelism a normal CPU gives you for free with OOOE, superscalar issue and pipelines.
The best practice will depend on the platform; on a traditional chip, this code *should* be written in one place.
It will even vary from one generation to the next - e.g. if they move from 32 KB to 128 KB, the optimum split might change dramatically.

You could also profile to determine what the best split is, or the best split might change as the program evolves.

Per-platform differences should be handled by the compiler, otherwise your code becomes a mess of #ifdef special cases and becomes unworkable.

Less software, longer development time, fewer users, less efficiency of mass production... bye bye architecture.

Just automate it :) Put the time into the compiler, and then increasing amounts of real-world 'in use' software would become usable. The compiler should make a best guess, then you go in and manually supply hints in trouble spots if you know you can do better (if you're lucky enough to have time to spare..).

I can imagine just writing more 'higher-order functions' to deal with it, but even that is a barrier between one programmer and another (agreeing what to call them? looking up the right one to call? the complexity of naming & searching source code explodes):
Code: Select all
par_map_with_2_stages_and_remote_call_in_the_middle(collection, first_stage, second_collection, middle_remote_stage, second_stage)



piotr5 wrote:there I learned that if you just introduce the "Moveable" attribute to various containers, then no garbage-collection is needed ever again.

Yeah, this side of C++ is great: unique ownership, move semantics. Rust copies it and gives slightly better defaults. You can do this sort of thing manually in C; it's just easier to make mistakes and you must write more.

Some people claim garbage collectors are OK, but surely on a manycore machine it would be a nightmare tracing ownership through all the states in flight..

piotr5 wrote:if people keep coding a mess of loops and sequential code that knows no limits. in order to think of code in a functional way, you should see different parts of your function as enclosed entities. already the fact that you declare input and output of a function is great help in determining which data has which relationship. lambdas don't help here though, since in c++ they can have side-effects...


Yes, unfortunately C++ is not perfect. That's why I set about writing a pet language :) (which of course has its own problems..)
When I say 'lambda' I really do mean PURE lambdas (or at least ones with restricted side effects known to work in parallel). To be ubiquitous, parallel iteration needs to be as simple to write as for loops in C.
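A small illustration of the distinction in plain C++ (the 'parallel_map' here is just a stand-in for whatever parallel dispatch the runtime provides - nothing Epiphany-specific): the first lambda is pure and safe to run element-wise on any number of cores; the second captures shared state by reference and mutates it, which is exactly the kind of side effect a 'pure' qualifier or a compiler plugin would flag.

Code: Select all
#include <vector>

// Assume f(xs[i]) may run on any core, in any order; an OpenMP loop is just
// one possible implementation of that contract.
template <class T, class F>
std::vector<T> parallel_map(const std::vector<T>& xs, F f) {
    std::vector<T> out(xs.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(xs.size()); ++i)
        out[i] = f(xs[i]);
    return out;
}

void example(const std::vector<float>& xs) {
    // PURE: the result depends only on the argument - parallelises trivially.
    auto squared = parallel_map(xs, [](float x) { return x * x; });

    // IMPURE: captures 'sum' by reference and mutates it - a data race the
    // type system never sees; this is what a 'pure' annotation would reject.
    float sum = 0.0f;
    auto also_squared = parallel_map(xs, [&sum](float x) { sum += x; return x * x; });

    (void)squared; (void)also_squared; (void)sum;
}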

It's just a case of making do as best as possible at the minute.
Rust was encouraging because it is immutable by default, but the details of its lambdas STILL break down in the parallel case (you have to use an unsafe hack to pass them, so it's no safer than C++). It also used to have a special syntax that made a lambda look like a loop body.

For the minute, we just have to accept it as a hazard to verify with testing..

The next step might be some compiler plugin to warn you about impure lambdas, or continuing to pester the standards committee until they add a 'pure' keyword of some sort..
Last edited by dobkeratops on Mon Aug 17, 2015 9:17 pm, edited 1 time in total.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: generating epiphany code from templates, possible?

Postby theover » Mon Aug 17, 2015 7:03 pm

I stand by my answer. It might be fun for some people to entertain an amount of raving lunacy, but when push comes to shove, you need technically sound answers and implementations that work. The tradeoffs around these kinds of subjects and architectures were mainly worked out in the 60s and 70s; there's little honour in straightforward tradeoffs about cache schemes and basic policies, and there are statistics you can do on that, incorporating bandwidth considerations, which can work but are limited in their true analytic and practical value. It's fine to talk about, but there are other reasons for the way things are:

engineering time and cost, testing constructs for which there is no good (kernel) software, and, in the case of the Parallella, the proof of making a low-energy machine with good educational value.

Redoing the computer-architecture electrical engineering of the 80s is nice, but that game requires that you yourself play too - so make something! Might be fun.

T.V.
theover
 
Posts: 181
Joined: Mon Dec 17, 2012 4:50 pm
