generating epiphany code from templates, possible?

Discussion about Parallella (and Epiphany) Software Development


Re: generating epiphany code from templates, possible?

Postby dobkeratops » Mon Aug 17, 2015 7:27 pm

I stand by my answer. It might be fun for some people to entertain an amount of raving lunacy, but when push comes to shove, you need technically sound answers and implementations that work. The tradeoffs around these kinds of subjects and architectures were mainly settled in the 60s and 70s; there's little honour in straightforward tradeoffs about cache schemes and basic policies, and there are statistics you can do on that, incorporating bandwidth considerations, that can work but are limited in their true analytic and practical value. It's fine to talk about, but there are other reasons for the way things are.


Whilst a lot of the basic theory was established back then, things have changed: the types of applications we write, the contexts in which computers are used, the complexity of compilers, the complexity of software, the number of programmers in the world, the number of lines of code...

It used to be considered absurd that you could use a compiler for high-performance software (I remember, because I grew up writing assembly) - but now you can.

but when push comes to shove, you need technically sound answers, and implementations that work

I stand by my assertion - the point of this board is new tools. There are solutions we have yet to realise.

I can tell you from experience in a domain where something similar should have succeeded: the existing software approaches are NOT sufficient.

On a traditional architecture, the point where you spawn a thread is given a function pointer: the total reachable code & stack size should be analytically determinable - a compiler could tell you whether it's efficient to run that (i.e. whether it can all fit into one local store without paging)*.

Using templates you can separate the messy details from your actual processing - you still consider them, but you can change them independently and, importantly, *per platform*. You can make your software easier to maintain and tweak.

Doing this requires that the compiler can generate the boilerplate by inlining (driven by code you manually design, in headers - including managing local DMA buffers & transfers, whatever it takes), so you don't have a function to manually move into a separate ELF and point at as 'main'. (One higher-order function could manage an entire chain, with several ELFs, e.g. 'par_pipelined_map(src, stage1,stage2,stage3)', whatever.)
I usually don't like complex abstractions, but for this sort of thing they're the lesser evil - e.g. setting up a double-buffered DMA loop with prologue & epilogue: the same pattern is going to be repeated so many times. And if you remove paging mechanisms from hardware, you'll find yourself needing similar software for large collections (parts of one will migrate between different physical locations, even though you'll want to consider it logically as one entity - and you might want to implement it differently for the special case where it does fit entirely in one local store).
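To make that concrete, here is a minimal sketch of the kind of template I mean. 'dma_start'/'dma_wait' are hypothetical placeholders for whatever DMA primitives the platform layer provides (assume dma_wait blocks until all outstanding transfers complete - a simplification):
Code: Select all
#include <algorithm>
#include <cstddef>

// hypothetical platform primitives - NOT a real Epiphany SDK API
void dma_start(void* dst, const void* src, std::size_t bytes);
void dma_wait(); // waits for all outstanding transfers (simplified)

// double-buffered map over a large array in DDR: prefetch the next chunk
// and write back the previous one while computing the current one, with
// the prologue & epilogue folded into the loop structure.
template <typename T, typename Fn, std::size_t CHUNK = 256>
void dma_map(T* dst, const T* src, std::size_t count, Fn body) {
    static T buf[2][CHUNK];                        // two local-store buffers
    dma_start(buf[0], src, std::min(count, CHUNK) * sizeof(T)); // prologue
    std::size_t base = 0, cur = 0;
    while (base < count) {
        std::size_t n = std::min(count - base, CHUNK);
        dma_wait();                                // current chunk has arrived
        if (base + n < count)                      // prefetch the next chunk
            dma_start(buf[cur ^ 1], src + base + n,
                      std::min(count - base - n, CHUNK) * sizeof(T));
        for (std::size_t i = 0; i < n; ++i)        // compute on local data
            buf[cur][i] = body(buf[cur][i]);
        dma_start(dst + base, buf[cur], n * sizeof(T)); // write results back
        base += n;
        cur ^= 1;
    }
    dma_wait();                                    // epilogue: final write-back
}

Write that once, in a header, and every 'map over a big array' in the program gets the double-buffering for free.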

If you can't write parallel code this easily - as easily as you can now on mainstream SMP machines, which are everywhere - the chip will not find the range of customers & uses it deserves, and (as seems to have happened with the 64-core version) it won't get the efficiency of mass production.

People might do things manually for supercomputers, but those are a small market compared to mass-produced consumer devices - and don't they tend to use versions of architectures that still rely on mass-market efficiency anyway (albeit with special boards and interconnects)?

(* And this is why I put the idea out there of a JIT/code-morpher. You could trace the call-graph there, and assemble half a scratchpad's worth of code - but it would be better to do this at compile time, surely.
This might 'cost' many man-hours to write, but I presume that's cheaper than designing & producing a new chip & board; spread over the community this 'cost' should be cheap.

I talk about the middle ground of an i-cache in case they're considering how to extend the architecture - an i-cache shouldn't have the coherency problems of d-caches, and it seems it would vastly simplify the tools and leverage existing approaches.

You could imagine a migratory path, from (i) cached multicore, to (ii) i-cache + immutable-cache + scratchpad + shared L2 (a bit like a GPU), to (iii) uncached, once there's enough software to demand it... and you wouldn't be stuck with the level of pain the CELL caused, whilst still getting a large fraction of the benefit.)

Redoing electrical engineering from the 80s about computer architecture is nice, but that game requires you yourself play too, so, make something! Might be fun.

I understand the coherency tradeoff - I've seen a full range of machines on this spectrum (various mixes of scratchpads, i-caches, d-caches, DMA), and the rationale behind them was explained.

The mainstream situation is as follows: after trying out various options, the industry consensus was that DMA/scratchpad complicates software too much and, more importantly, prevents code being portable. (Hell, they don't even bother with the full range of cache-control instructions anymore.) Applications in this domain are too complex, and the range of possible hardware too varied, to narrow your choice.
The industry has settled on SMP + GPGPU, which is easy to scale and port.

But I think this is a failure of current software approaches, not a failure of hardware.

C is a good portable assembler - and worth having - but not versatile enough.
C++ is better - it gives you more ways of creating abstractions to generate code, allowing you to port & tweak implementation details.
Clang's AST, lambdas and 'auto' fill in some possibilities that we desperately needed back in 2006.
Haskell is the closest example of 'how it should be done', but suffers from garbage collection.

No suitable off-the-shelf tools exist, because the devices are too 'niche'.

My best guess is to persevere with C++, use an appropriate subset, and layer things on top of it (but I know what the ideal language should look like).

Re: generating epiphany code from templates, possible?

Postby piotr5 » Tue Aug 18, 2015 2:20 pm

templates generating epiphany-code is the topic. so let's look at an actual example: the cout object and its friends.

if the part you want to have compiled for epiphany contains cout, what should happen? ideally linux should create a new terminal for each epiphany-core this code is running on, and the cout object, in the context of parallelized loops or whatever, should tell the arm which message with which parameters should be sent to the newly created console.

of course this means you'd need to abandon the stdlib implementation of cout; you can't use a global object for that. but on the other hand, object-creation on the epiphany would need to cause activity in the linux kernel (creation of a new device).

another lesson that can be learned from this example: apart from local and shared memory, a parallel architecture also has hidden memory to cope with. you said triangles are stored on the GPU in games; this is a good example of such hidden memory. the gpu has its own address space, and accessing it is done either through a mirrored virtual address-space or by sending messages to the device. either way, this memory should be considered read-only or write-only: no read nor write access to it is performed directly; instead its memory-address is just passed to the device for moving around or transforming or whatever. it's a bit like playing chess through a keyhole.

so judging from your description this kind of memory has a name - what is it called? what do you call pointers to memory-locations inside a gpu? what do you call pointers stored inside the gpu which point into the cpu's address space? imho the first object we need is exactly this kind of pointer implemented in c++...
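For illustration, a minimal sketch of what such a pointer could look like in C++ (the name and design are invented here, not an existing library): an opaque, typed handle to memory in another address space, with no dereference operator, so the host can do address arithmetic and hand it to messaging/DMA APIs but never read or write through it directly.
Code: Select all
#include <cstddef>
#include <cstdint>

// hypothetical 'pointer into another address space' (GPU, e-core local store)
template <typename T>
class remote_ptr {
    std::uintptr_t addr;                  // address in the device's space
public:
    explicit remote_ptr(std::uintptr_t a) : addr(a) {}
    remote_ptr operator+(std::ptrdiff_t n) const {
        return remote_ptr(addr + n * sizeof(T));   // typed arithmetic is fine
    }
    std::uintptr_t raw() const { return addr; }    // only for messaging APIs
    // deliberately no operator*: the host cannot dereference device memory
};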

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Wed Aug 19, 2015 4:54 pm

piotr5 wrote:if the part you want to have compiled for epiphany contains cout, what should happen? ideally linux should create a new terminal for each epiphany-core this code is running on, and the cout object, in the context of parallelized loops or whatever, should tell the arm which message with which parameters should be sent to the newly created console.

of course this means you'd need to abandon the stdlib implementation of cout; you can't use a global object for that. but on the other hand, object-creation on the epiphany would need to cause activity in the linux kernel (creation of a new device).


Yes; you'd have to be careful about what you put on the cores, but I don't worry about this as a hazard.

By templates, I mean the language mechanisms for creating new abstractions - not the full existing C++ standard library. Nonetheless, being able to toggle 'debug' and 'release' builds (debug would just make normal threads with access to cout) is very helpful.

I have seen how C++ abstractions can be applied to both data-parallelism & DMA already. So the only barrier to creating portable, parallel, data-flow abstractions is the current need for the programmer to manually do the work of the i-cache with separate ELFs.

another lesson that can be learned from this example: apart from local and shared memory, a parallel architecture also has hidden memory to cope with. you said triangles are stored on the GPU in games; this is a good example of such hidden memory. the gpu has its own address space, and accessing it is done either through a mirrored virtual address-space or by sending messages to the device. either way, this memory should be considered read-only or write-only: no read nor write access to it is performed directly; instead its memory-address is just passed to the device for moving around or transforming or whatever. it's a bit like playing chess through a keyhole.

Yes; read/write access hints known at compile time are very important, and critical for parallelism.

For this reason C/C++ has the combination of const and restrict - a hint for non-aliased pointers. (C++ has additional rules that different types do not alias.) When it's known that a const pointer is non-aliased, it's safe for the compiler to cache data from memory in registers. (I would like to see C++ get a 'write-only' pointer hint too; those are separate battles.)
In console dev we were taught to use restrict to maximize the use of registers and increase compiler scheduling opportunities (vital on the older in-order processors); that's just instruction-level parallelism of course, but the same principle applies.
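For example (restrict is a C99 keyword; in C++, GCC and Clang spell it __restrict__ as an extension):
Code: Select all
// promising the compiler that dst and src never alias lets it keep
// src values in registers and schedule the loop freely, instead of
// conservatively reloading after every store.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}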

I speculate it should be possible to get some automatic scratchpad use from a compiler, by treating it like the register file (a 'register' for a whole struct, say). However, my plan is to use higher-order functions that describe the data-flow, so the DMA can be done in the epiphany version of the template definition.

GPUs can do massively parallel reads of large data structures (vertices & textures) through caches, because the hardware assumes they're read-only for a given batch of work, clearly separated from the writes (the framebuffer).

(Rust is nice in having 'const/restrict' as its default - much better; you mark the mutable pointers instead.)

I believe GCC already has non-standard extensions for global purity hints (i.e. marking that a function treats the globals as 'const'); adding those to the C++ standard would help.
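These are the __attribute__((pure)) and __attribute__((const)) function attributes - a sketch of the kind of hint I mean:
Code: Select all
// GCC's non-standard purity hints:
// 'pure'  - no side effects; result depends only on arguments and globals.
// 'const' - stricter: the function doesn't even read global memory.
__attribute__((pure))  int table_lookup(int i);
__attribute__((const)) int clamp01(int x);

int sum_table(int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += table_lookup(i) + clamp01(i); // repeated calls with the same
    return s;                              // argument can be merged (CSE)
}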

For the minute we have to accept this as a hazard for testing, and something to communicate between programmers:
Code: Select all
// this abstraction only works for pure functions


the gpu has its own address space, and accessing it is done either through a mirrored virtual address-space or by sending messages to the device.


Consoles, SoCs (and the AMD APU architecture) tend to have unified memory. Game-console SDKs give you slightly lower-level APIs (instead of OpenGL/DirectX) that expose this. (Some have eDRAM, but that's more like a giant GPU scratchpad, I guess - it's used for temporaries.)

This is much better, IMO. You are right that we must consider separate address spaces for PCs.

For GPGPU they try to make it more seamless these days with virtual-memory paging, as you mention - e.g. NVIDIA's Unified Memory.

You have complex spatial data (e.g. scenery polygons) - you want it manipulated by 'update' code (available to physics & AI), and read by 'render' code.

I believe the correct model is to just have one address space accessed by 'big and little cores', all on a level playing field. The epiphany should be thought of as 'parallel inner loops', not a separate coprocessor or accelerator. It should target the same space as Intel's Knights Landing. The finer-grained fork/join parallelism is what distinguishes this from GPGPU - the ability to use parallel results immediately in your complex CPU code.

I know that NUMA setups & Single-System-Image can have separate memories with logical pages shifted between separate physical memories - that's intriguing as well.

It would be nice to flag entire pages as 'immutable': e.g. the previous frame's state is const; generating the next frame reads it and writes to 'write-only' memory, then the buffers swap. (You can approximate this today - see the sketch below.)
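A minimal sketch of that enforcement on a POSIX system, using the real mprotect call (addresses must be page-aligned; this is a debugging aid rather than a performance feature):
Code: Select all
#include <sys/mman.h>
#include <cstddef>

// flag the previous frame's pages read-only for the duration of the
// update, so any stray write faults immediately.
void freeze_state(void* state, std::size_t bytes) {
    mprotect(state, bytes, PROT_READ);
}

// when the buffers swap, the old pages become the next write target.
void thaw_state(void* state, std::size_t bytes) {
    mprotect(state, bytes, PROT_READ | PROT_WRITE);
}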
parallel architecture also has hidden memory to cope with.


I realise the current board has this 32MB shared window, for reasons of what works with the current kernel & ARM SoC.
I think they have options for improving on that. I wonder if the FPGA would be sufficient to implement an MMU, so it could use the same logical addresses as linux; failing that, I gather there's a cut-down version of the linux kernel for machines with no MMU (uClinux, I think?).

I think with the current board they could expose multiple 32MB banks, which would be a step forward. I can also imagine a fancy custom indexed DMA channel in the FPGA doing 'scatter/gather' work anywhere in DDR, with epiphany cores processing in the middle.

Re: generating epiphany code from templates, possible?

Postby piotr5 » Thu Aug 20, 2015 12:55 pm

the unified memory approach is a dead end. when each core has its own cache (as it does on the new APUs), then writes to shared memory require the respective caches to be updated. so the more cores you have, the more complicated such cache updates become - exponentially so. also, variables or even objects stored in registers have no actual memory-address. in many ways you get variables introduced which are not visible from other cores! so even with all the MMU magic, you still need an abstract object encapsulating all the different kinds of invisible variables!

speaking of the const keyword in c++, there's also std::atomic. it comes with some interesting functions (memory synchronization ordering):
- memory_order (C++11): defines memory ordering constraints for the given atomic operation (typedef)
- kill_dependency (C++11): removes the specified object from the std::memory_order_consume dependency tree (function template)
- atomic_thread_fence (C++11): generic memory order-dependent fence synchronization primitive (function)
- atomic_signal_fence (C++11): fence between a thread and a signal handler executed in the same thread (function)

ever looked at them and what they do?
dobkeratops wrote:I have seen how C++ abstractions can be applied to both data-parallelism & DMA already. So the only barrier to creating portable, parallel, data-flow abstractions is the current need for the programmer to manually do the work of the i-cache with separate ELFs

I really fail to see the complexity here, which you claim must be eliminated. what's wrong with using external tools or compiler-plugins? and if you already accept that, why put the programmer under the delusion that the code will run on a sequential cpu instead of in parallel? the reason I put my loops into headers is to remind myself that they are not actual loops on all platforms! if the loop-body isn't using info from the previous iteration, both the compiler and the programmer need to know that. therefore the loop-body must be in a separate function! if I put a loop in the source code to be compiled, then it's already a loop with the body split off; maybe an iterator or something is performing the loop body. this way the program is easier for the compiler to comprehend. and if I use boost::coroutine it actually becomes comprehensible to the programmer too...

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 20, 2015 1:30 pm

the unified memory approach is a dead end. when each core has its own cache (as it does on the new APUs), then writes to shared memory require the respective caches to be updated.


I don't believe this is so.

GPUs successfully read large data structures through caches, with thousands of concurrent operations - and they do this in game consoles with unified memory now.

All it takes is a clear division of *reads* and *writes*.
This can be done by flagging memory pages as read-only. In graphics, it's 'read textures / write framebuffers'. In general-purpose programming, it could be done with pure functions (read inputs, return an output - a pure function does not read its own output or modify its inputs, therefore outputs & inputs do not need cache-coherency).

Did you know some CPUs have 'allocate-zeroed' cache-control hint instructions (mark an entire line as zero - allocating workspace in the CPU without needing to read state from DRAM), 'cache-line invalidate' (discard a cache line, hinting it needn't go back to main memory), and write-back hints?

Also remember caches are hierarchical. The APU benefits from sharing data between CPU & GPU in the higher-level caches. This lets them work more closely on a problem, sharing temporaries without going to DRAM (e.g. the CPU could buffer a load of requests for the GPU, and get results back in a queue).

It's only the L1 caches that suffer from coherence problems - and even then, only if they're writable and you haven't given a hint that they don't overlap.

The outermost cache does not have the coherence problem - there's only one. (It can be used to share data between cores: the address-mapping mechanism does the work of getting the right data to the right place.)

In console dev, data for the GPU is usually placed in pages where the CPU cache is disabled - but write buffering still works. It's effectively 'write-only for the CPU, read-only for the GPU', plus eviction hints ('write this line back now').

So just generalise that - give individual cores cacheable read-only windows, whatever.
This is workable because so much data in games is immutable most of the time.

People even talk about double-buffering the entire game-state, so that it can be read in parallel.

So instead of ditching caches - we should just be adding new hints.

ever looked at them and what they do?

Are these library functions, not keywords? I haven't looked into them - I've dealt with building my own abstractions using OS-level threads, and some atomic operations to arbitrate (e.g. worker threads consulting a shared 'next' index, etc.). We needed this long before C++11 (back in 2006). C++11 and C++14 make it all easier with polymorphic lambdas. I don't take the standard libraries as gospel; I'm more interested in the tools the compiler gives me.

There is a big difference between *concurrency* and *parallelism*. My interest is the latter; data-parallelism works extremely well for game-engine code, IMO. The threading helpers in the C++ standard libraries might be targeted at concurrency?


I really fail to see the complexity here, which you claim must be eliminated. what's wrong with using external tools or compiler-plugins?


It's the complexity of creating separate ELFs for things that could be statements written inline with lambdas - the programmer manually doing the job of the i-cache. On a traditional machine you get statement-level parallelism. To match this on Epiphany, either the compiler or the programmer must do the work explicitly. If it's the programmer, the complexity of the source code explodes. We have seen this with CELL.

With SMP, your entry point for a worker thread can be a function pointer generated within a template instantiation. How do you do that with the epiphany?
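To show what I mean, here's the SMP case in miniature (standard C++; 'Worker' and 'Job' are just illustrative names): the template instantiation produces the entry point, and the OS thread starts from it - no per-task ELF to build or name.
Code: Select all
#include <thread>

template <typename Task>
struct Worker {
    static void run(Task* t) { (*t)(); }  // entry point generated per-Task
};

struct Job {
    void operator()() { /* ... the actual work ... */ }
};

int main() {
    Job j;
    std::thread t(&Worker<Job>::run, &j); // spawn from a generated pointer
    t.join();
}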

As it stands today, you can parallelise a C++ source base for SMP *far* more easily than for something like Epiphany or CELL.

8 cores that you can actually use everywhere might be better than 64 cores that find minimal use... but even then it's not a fair comparison: one Intel chip has hundreds of instructions in flight, plus SIMD.


At risk of repeating myself :) this complexity is what killed the CELL, and why it's not in the PS4. Sony were all set to scale it - just add more cores - and developers rebelled. (They sent Mark Cerny around; I remember the meeting at our office. I was actually saying "CELL is great, it just needs a new language" - I mentioned Erlang, he mentioned Haskell - but 4 of my colleagues contradicted me and just said "please don't use something like CELL again...".)
This was the story the industry over. They took the hint and didn't repeat the mistake.


Maybe one higher-end ARM or Intel core is the equivalent of 4-16 epiphany cores, so you'd have to saturate 32-64 epiphany cores just to compete? And if you have an obviously parallel workload, GPUs already handle that across thousands of cores. It's a powerful combination.

The problem is that programs have to be very fluid. If your algorithm & behaviour are mixed in with too many technical details, you can't modify or maintain them; programming becomes a 'write-only' process. But with C++ on shared memory (far more so with polymorphic lambdas), you can build abstractions that separate out the messy details.
You might think you can plan ahead, but demands change, and often you are inherently trying to do "what hasn't been done before". Source code has to keep options open.

Cross-Platform is a huge issue too.

Seriously, the more work the compiler can do - the better.

Continual improvements in software let us work on problems too big for our own minds - and this is equally applicable to software itself; hence the evolution of programming languages. C++ as it stood in 2006 was definitely not sufficient. Now we have polymorphic lambdas. There are still more features I want.

why put the programmer under the delusion that the code will run on a sequential cpu instead of in parallel?

The magic of lambdas is that they give you parallel abstractions *at statement granularity*.
Compare:
Code: Select all
// serial
for (auto& d : data) {
    do_stuff_with(d);            // the loop body
}

// does the same as above, but in parallel -
// (parallel_for_each is the user-defined abstraction discussed here)
parallel_for_each(data, [](auto& d){
    do_stuff_with(d);            // same body, now a lambda
});

The latter can be used all over the place, just as easily as you write a loop. They can be nested too. Minimal 'noise' in the program.
But what does that look like on epiphany today? You have to create and name a separate file! (I don't mind the definition of 'parallel_for_each' being separate, because that's re-usable and ubiquitous - the separation makes it easier to see what's going on.)

why put the programmer under the delusion that the code will run on a sequential cpu instead of in parallel?

It's not about 'delusion' - it's about the ease of expressing something, then the ease with which you can change it (e.g. changing the granularity of parallelism based on profiler feedback, changing the program in response to changing demands in your quest to create something unique, or finding more parallelism as you refactor your data gradually).

Parallelism really is *everywhere*, at the level of instructions and statements: OOOE + superscalar extracts it at runtime, compilers do scheduling (you maximise it with loop unrolling on in-order chips), SIMD does it...

By dropping the 'big core' features, with the current epiphany C programming model, to fully exploit the chip you'd need to make ELFs for pretty much every function and every loop body - 'mechanical' work that should be automated.

The principle of 'little cores' is to shift complexity from the processor into the compiler (the same strategy as VLIW).

At the minute, without a sufficient compiler, that burden is put on the programmer - hence the lack of adoption and the limited software ecosystem. Chicken/egg: no software -> no hardware; no hardware -> no software.

The software ecosystem for mainstream processors has grown with incremental improvements over *decades*.

therefore loop-body must be in a separate function

To clarify - the loop body, or the definition of the 'loop'?
Loops can be replaced with *higher-order functions* - functions taking functions. It's the *higher-order function* that goes in the header, #ifdef'd per platform.
The loop *body* stays in your program, as shown above. You build a library of reusable parallel constructs: replace 'for' with 'parallel_for'. Much like OpenMP, but with more control (the ability to create your own abstractions in headers). The way you write programs is by composing abstractions.
It really should be that easy.

It IS on SMP - and with enough work on the compiler, it could be on epiphany too. The header side is roughly this (a minimal sketch, assuming a hypothetical TARGET_SMP build define; an epiphany branch would generate the DMA/offload boilerplate instead of spawning OS threads):
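Code: Select all
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// one portable abstraction, #ifdef'd per platform
template <typename Container, typename Fn>
void parallel_for_each(Container& data, Fn body) {
#if defined(TARGET_SMP)
    std::size_t nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 1;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t)
        workers.emplace_back([&data, body, t, chunk] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                body(data[i]);               // the user's loop body
        });
    for (auto& w : workers) w.join();
#else
    for (auto& d : data) body(d);            // plain serial fallback
#endif
}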

These higher-order functions can encode the necessary data-flow hints (e.g. no random read/write access to any collection; some well-defined flow). The usual one is complete purity (read the inputs to produce an output), but you could still parallelise an 'update' (modify something in place) so long as you don't randomly read.

The barrier here is the inability to spawn a task from a generated function pointer.

The call graph is known at compile time. This *should* be solvable in the compilers. A 'software-managed i-cache'? Compile one large program, pre-assembled into 16k code pages based on the call-graph, and keep a global-to-local function-pointer translation table in each page. Yes, that sounds like a mess, but this is the cost of NOT having an i-cache... you pay it somewhere else.
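A very rough sketch of that bookkeeping (every name here is hypothetical): each page carries a table mapping the program-wide IDs of the functions it calls out to, to local-store addresses once the callee's page is resident.
Code: Select all
#include <cstdint>

using entry_fn = void (*)();

struct CodePage {
    static const int MAX_XCALLS = 32;
    std::uint32_t global_id[MAX_XCALLS]; // callee's program-wide function ID
    entry_fn local_entry[MAX_XCALLS];    // resolved local-store address
};

entry_fn resolve_call(CodePage& page, int slot) {
    // a real runtime would check residency here, DMA the callee's page in
    // if needed, then patch local_entry[slot]. calls *within* a page skip
    // all of this and use plain direct branches.
    return page.local_entry[slot];
}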

If no such compiler appears, then we're waiting for a many-core processor with at least an i-cache - and until then, the CPU+GPGPU combination will win.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Thu Aug 20, 2015 9:08 pm

Very interesting - strange that no one mentioned this before:

https://github.com/adapteva/epiphany-sd ... re-caching

It seems there is an attempt to provide flat addressing of program code - a small runtime doing the job of an i-cache.
I wonder if this would be sufficient for some of my use cases.

It sounds like this works on a function-by-function basis.

I think an ideal scheme would cluster functions into pages, and distinguish intra-page from inter-page calls. Then you'd only need the cache-management overhead for calls between pages;
maybe replicate some commonly-used leaf functions in multiple pages to increase locality and decrease management overhead. (A page, or 'precalculated cache', could be described as references into the main program.)

This still wouldn't solve the problem of code split between epiphany & ARM, though.
It would be sufficient if your machine only had a small ARM for 'compatibility', the bulk of the silicon was devoted to epiphany cores, and they could address all of DDR memory.

The main case that interests me is 'start this core, given this global function pointer'.

It might be possible to trace the call-graph at a task entry point: if it fits within 16k, assemble a precalculated page; otherwise, use the software-managed cache.

Most performance-critical stuff should be sufficiently small to fit in 16k of code, but obviously having the program dynamic would make development so much easier. The key thing is eliminating the manual allocation, and enabling template-generated code to straddle multiple cores (e.g. the template generates management code on the 'current' core, and the worker-'thread' stub code on subordinate worker cores - and that could be nested, for multiple levels of fork-join parallelism).

I think this would be worth putting time into.

Re: generating epiphany code from templates, possible?

Postby piotr5 » Fri Aug 21, 2015 8:13 pm

imho there is one detail wrong in the software-caching, and generally in the compiler's approach towards optimizing cache-use: as a programmer I'd like to get an error message if my program doesn't fit into the cache, like we do on epiphany. if I know the compiler has difficulties, maybe a small tweak could improve the speed and readability of my program at the same time? if my function is too big to fit on epiphany cores, then my function is badly coded!

dobkeratops wrote:it's about the ease of expressing something, then the ease with which you can change it

I think it's the other way around. when looking at what awkward languages there are out there, programmers are willing to put stuff into completely incomprehensible languages, far far away from natural human languages, if that language only offers some easy way to change the code without losing the original ideas. i.e. students don't care what language they use, as long as it allows for some kind of object-oriented approach and modularity of functions. so the ease of expressing something seems in reality to be of slightly lower priority than the ease of change (of course only after those students have gained some experience with more complicated projects). that's why I think we should get rid of the whole idea of computer programs being a sequence of loops and conditionals held together with a few data-manipulations. why can't data be the driving force behind a program, and the loops or conditionals merely the glue putting data into an actual structure?

as I understood it, you'd like to compile the loop body into a data-set that gets uploaded onto epiphany cores and started. as I understood it, this is also how gpgpu works. I agree that from the point of view of the arm, epiphany programs are just data. but why must they be stored in the executable? I can understand why you want to put the loop body close to the actual loop. but now we have iterators and the associated iteration-conditionals abstracted out from the loop, so why not put the loop body elsewhere too? however, I agree, putting the loop body near the loop should be a possibility. many programs already are designed that way; no need to change them all. unfortunately, what you suggest cannot allow separation of loop body from loop, because a function that exists only in binary format at compile time and is linked in at link time cannot be retroactively transformed into a parallella program. i.e. you'd need to mark that function for e-compilation, and as I pointed out with the cout example that really creates a lot of additional problems (context-oriented programming).


as for flat memory, I didn't say throw away the caches. amd's apu has 512 simd cores organized into 32 wavefronts, and each wavefront has its own cache. suppose the caches share a single communication line, and each cache announces changes in memory to the others: that's potentially 512 such messages! obviously it can't work that way. instead, more likely the caches know of each other what memory they have shared - but this approach isn't scalable, unless you merely increase the simd-width and not the wavefront count. that's the secret of the gpu: each wavefront can affect only a limited set of data; you can't have one end access the other end's data, because all pointers are in-/de-cremented in parallel. add another wavefront and you add new problems for your hardware. how many wavefronts can you have running at the same time on nvidia chips? do they share memory? how many mimd cores do graphics cards have, when you count each wavefront as a mimd processing unit?

what you said about additional cache-management, preventing reads or preventing writes, is exactly what I meant when I said that by going parallel you have to cope much more with hidden memory areas outside of the flat shared memory!
so my suggestion is to write something akin to a shared library. i.e. you mark a function or lambda as designated for epiphany, and some dynamic loader then schedules those functions onto epiphany, maybe multiple functions at once so the epiphany avoids waiting for new uploads. then the compiler puts the machine code and a relocation table into the binary, or into a separate e-Lib. from the programmer's POV these lib-functions are data that needs to be put on epiphany and started. the only change in the compiler then is some epiphany-attribute for functions; it'll create new pointers in addition to the ones for functions on arm, maybe with an __e_ prefix for the function pointers. mind you, the compiler does neither the caching nor any dependency-tracking. the functions I listed are of interest here because they are an attempt to organize dependency-tracking!

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Fri Aug 21, 2015 9:31 pm

piotr5 wrote:imho there is one detail wrong in the software-caching, and generally in the compiler's approach towards optimizing cache-use: as a programmer I'd like to get an error message if my program doesn't fit into the cache, like we do on epiphany.


I agree with some of this but disagree with the whole statement.

You still want the option of overspill, as part of the development process.
And you don't even know whether paging might be more efficient, by keeping low-frequency code out. The problem with a fixed program area is that different parts of the program are used at different rates, yet they consume the same premium resource.

You don't just sit down and write 100,000 lines according to a plan.

It's an iterative process - you learn about the problem as you go.

There should certainly be a build option to spit out such a warning, or another level to make it an error.
And yes, when everything is working well, I think you could replace the software-managed cache with precompiled clusters of functions, paged on demand, without the software-cache testing overhead between calls within the same page.
One source file will generate several pages. Inner loops will generate separate pages.

I want the software caching scheme to be more elaborate for sure.

if I know the compiler has difficulties, maybe a small tweak could improve the speed and readability of my program at the same time? if my function is too big to fit on epiphany cores, then my function is badly coded!

Here I strongly disagree.

[1] Concretely: the actual size for Epiphany is arbitrary. Different machines have different-sized caches and local stores. CELL had 256k; most CPUs seem to have a 32k i-cache, backed up by L2 for overspill. (Does the entire stack frame have to fit in the cache? No! Only the tip is used rapidly.)
The lack of runtime management might *over-constrain* program size. Part of the big deal about epiphany (vs a GPU) is 'divergent flow' - which can mean a split between high-frequency & low-frequency code. It's not a "bad program" if you got that wrong; this is just a matter of empirical tweaking.

So if your code fits in a 256k CELL SPU, but not in 32k of epiphany, it's suddenly "bad"?
No. That's just an arbitrary per-platform constraint.

You need the same program to compile for different granularities.


The fact that epiphany has such small local memories is part of why I want automatic division, eventually - I think the chip they have today will need unusually small tasks, but that's temporary.

It's very likely future versions of the Epiphany will increase it - their road map talks about 1MB of local memory eventually.

So what are you going to do... manually re-shuffle your functions for each version?!
This is like someone convinced you that you have to suffer, and you accepted it.

I'd start out splitting a bit more with lambdas, *but written in one place*, with the ability to easily change it to compile several stages in one place (or split them into a pipeline). E.g. update(data, [](){stage1; stage2; stage3}); becomes update_3(data, [](){stage1}, [](){stage2}, [](){stage3}); - it's all still in one place, but the abstraction can change its implementation to merge those stages on one core, or set up a pipeline across 3 cores, depending on the local-memory size. Spelled out (sketch below):
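Here 'update'/'update_3' are the hypothetical abstractions under discussion, given trivial serial definitions so the two call forms compile; a platform version could fuse the stages on one core or pipeline them across three.
Code: Select all
#include <vector>

template <typename C, typename F>
void update(C& c, F f) { for (auto& d : c) f(d); }

template <typename C, typename F1, typename F2, typename F3>
void update_3(C& c, F1 f1, F2 f2, F3 f3) {
    for (auto& d : c) { f1(d); f2(d); f3(d); } // fused; could be a pipeline
}

void stage1(float& d) { d += 1; }
void stage2(float& d) { d *= 2; }
void stage3(float& d) { d -= 3; }

void example(std::vector<float>& data) {
    // one lambda containing all three stages...
    update(data, [](auto& d){ stage1(d); stage2(d); stage3(d); });
    // ...or the stages handed over separately, still written in one place
    update_3(data, [](auto& d){ stage1(d); },
                   [](auto& d){ stage2(d); },
                   [](auto& d){ stage3(d); });
}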

[2] The most logical flow is not always the most optimal implementation - so we need tools that can move code around for us.
Usually optimisation REDUCES readability; the more the compiler does, the better.

It's for a similar reason that C++ libraries use a lot of header files, code generation & inlining - the best boundaries for abstraction and composing libraries do NOT always follow binary lines.

Separation adds management overhead - naming and referring to entities. Writing things in place avoids this complexity.

Jonathan Blow explains this very well in his 'language for games' talks on YouTube. (The series is worth watching, but I've got a very specific point he makes in mind here.) Sorry for arguing from authority - it's just that he explains it better than I can, with a well-presented video, probably more convincing than reading my ramblings here.

Specifically, he mentions how he was taught 'small functions are good' in school, and in practice discovered the opposite. He says the placement of a unique piece of code within a big function expresses useful information, whereas separation adds useless information: (i) an arbitrary name, and (ii) new uncertainties about the state for which that code is valid. Even if you're talking pure code, you've added more overhead to express it - subsets of pure functions can be impure, and vice versa. Making *everything* pure overcomplicates things too - see all the state-monad stuff in Haskell. (I gave up on that for the same reason I gave up on OOP.)

I agree with about 90% of what he says, but disagree with the need for such a radical departure from C++. I believe a replacement can start with a re-syntaxed subset.

piotr5 wrote:
dobkeratops wrote:it's about the ease of expressing something, then the ease with which you can change it

I think it's the other way around. when looking at what awkward languages there are out there, programmers are willing to put stuff into completely incomprehensible languages, far far away from natural human languages, if that language only offers some easy way to change the code without losing the original ideas. i.e. students don't care what language they use, as long as it allows for some kind of object-oriented approach and modularity of functions. so the ease of expressing something seems in reality to be of slightly lower priority than the ease of change (of course only after those students have gained some experience with more complicated projects).

I don't actually like OOP languages; I like multi-paradigm. I want both: easy expression AND easy change.
(Forcing things into classes is, for me, unnatural - I prefer to think in terms of functions and data, and to be able to move them around independently.)

The goal is both: easy to express, easy to change. The part I'm complaining about here is the manual split - analogous to the manual work of moving values between stack & registers, which hardly anyone does anymore because compilers are good enough (you can just code in C and inspect the output). It's vastly more productive to put time into the compiler's register-allocation strategy than to code manually in assembler.

programmers are willing to put stuff into completely incomprehensible languages, far far away from natural human languages,


I think here you're just talking about badly designed languages.
I accept nothing as gospel - everything is in a state of evolution.
I accept people have to use whatever is available now. But we must have a vision of where we're headed, in order to improve.
What we have today is only "the least bad option", not how it SHOULD be. 5, 10, 20 years from now it will be different, and we have to make it so.

When things are clear and un-obfuscated, they are easier to change. Good information is things like const-correctness. Bad information is things like manual memory allocate/free - better when RAII can generate that automatically, paired within scope blocks, from constructors/destructors.

that's why I think we should get rid of the whole idea of computer programs being a sequence of loops and conditionals held together with a few data-manipulations. why can't data be the driving force behind a program, and the loops or conditionals merely the glue putting data into an actual structure?

Yes, I agree with this - it's compatible with what I'm on about: data transformed by functions; higher-order functions applying a function to a collection (and it's very convenient to have anonymous functions for this). Not OOP, but polymorphic along the way, which makes it easier. UFCS allows a.foo(b) syntax for free functions, and that syntax makes data-flow easier to write: a.foo(b).bar(c).baz(d) reads as a -> foo -> bar -> baz -> result.

I talk about loops because that's what's always been easiest to write in C. When programmers write loops, it's usually (in functional terms) some pattern like 'map', 'filter', 'zip', 'reduce' etc. Loops are still sometimes easier to write than finding the right helper function. (Another subject is shape analysis, automating that search.)

So if we accept that a loop can be written such that its body is parallelizable, it's easier for me to say "loop" than "higher-order function" or "internal iterator". We might just be arguing over names here: 'a block of code that will be run several times'.

With modern compilers, we don't think so literally. We rely on the compiler's ability to perform all sorts of optimisations so that we can build better abstractions. E.g. it's perfectly valid to see a trivial loop and assume the compiler can turn it into SIMD. Some compilers will change repeated index addressing into pointer incrementing for you, and so on.

All I'm talking about here is the ability to write parallel abstractions as easily as you can write loops - without extra noise, without needing to create separate files & intermediates, without needing to faff around with makefiles to do it.

lambdas & high-order-functions achieve that. They're awesome.

(But it could be loops. Naming these abstractions is yet another problem - to which I believe the solution is shape analysis - but that's completely orthogonal to this one.)

as I understood it, you'd like to compile the loop body into a data-set that gets uploaded onto epiphany cores and started. as I understood it, this is also how gpgpu works.


I agree that from the point of view of the arm, epiphany programs are just data. but why must they be stored in the executable?


Because they are an integral part of your program.
And it's not necessarily 'stored in the executable' that I'm after - rather 'generated from the same source, at fine grain', without you having to think about it.

Logically they occupy the same place as "the inner loops". They should be as easy and straightforward to write as for loops in C.

When this is possible, we will get ubiquitous parallelism.
We don't have it expressed in source today - mostly OOOE finds it for us at runtime. But you still 'improve the parallelism' to increase the effectiveness of OOOE (and on in-order chips you do it manually, with unrolling).

I can understand why you want to put the loop body close to the actual loop. but now we have iterators and the associated iteration-conditionals abstracted out from the loop, so why not put the loop body elsewhere too?


Because you have to name an extra entity - an unnecessary entity. You're making the program bigger by adding that reference.

It's like having to name individual statements.

It increases the amount of manual work needed to do something simple. It might literally be one statement - and we already get those in parallel.

This can be automated.

If I've understood what the software-caching example does, it might actually be possible now with the epiphany SDK for one e-core to spawn another - but not yet between ARM and epiphany.


The purpose of computers is to make life easier, and that applies to their own evolution too. By automating tasks, computers let us achieve things we can't achieve with our minds alone - and that includes improving computers & writing their software.

however, I agree, putting the loop body near the loop should be a possibility. many programs already are designed that way, no need to change them all. unfortunately, what you suggest cannot allow separation of loop body from loop, because a function that exists only in binary format at compile time and is linked in at link time,

Just like in 2006 I could see "we need lambdas" - the lack of lambdas wasn't because they're impossible; it's because no one had got around to adding them to C++ yet. It was a long slog, but eventually we got them, and C++14 was the icing on the cake with polymorphic lambdas. I would have ditched C++ by now if that hadn't happened.

It's the same here. The epiphany tools must make spawning a task as easy as it is in C++. I am happy to mark hints for caching, what's immutable, whatever - but not to do the unnecessary work of creating extra manual bureaucracy, verbosity for its own sake, when it can be inferred.

such a function cannot be retroactively transformed into a parallella program. i.e. you'd need to mark that function for e-compilation, and as I pointed out with the cout example it really creates a lot of additional problems (context-oriented programming).


This is a tractable problem - the information is there: 'from this point in the call graph, compile for the other processor'.

as for flat memory, I didn't say throw away the caches.

Sure - it's just that, generally, people here have gone from one extreme to the other. An i-cache is explicitly read-only, and I think that actually allows more efficient internals. I keep hearing "caches don't scale", when GPUs prove they do scale for immutable data.

amd's apu has 512 simd cores organized into 32 wavefronts, and each wavefront has its own cache. suppose the caches share a single communication line, and each cache announces changes in memory to the others: that's potentially 512 such messages! obviously it can't work that way.

Exactly - it doesn't. It works because the majority of inter-process communication happens in the L2; the internal caches are predominantly private (for locals) or read-only (for textures). GPUs are designed to predominantly read, do a lot of parallel work, then write to a framebuffer.

I don't see why we can't extend CPUs to do this too, by giving them immutable-data hints. (Programs would then require good const-correctness to work well - that's fine by me, and I'd prefer const to be transitive too.) You could literally say the cache only works when you provide a 'non-overlapping' or 'immutable' hint. You could have 4 coherent cores, plus N non-coherent cores whose caches require the extra hints, all still working on the same program and the same data.

instead more likely the caches know of each other what memory they have shared, but this approach isn't scalable, unless you merely increase the simd-width and not the wavefront count.

But GPUs are proven to scale - they have hundreds to thousands of cores already. I keep explaining: it's because they don't (usually) read & write in the same area. Maybe they can do more of that now, but they might slow down dramatically where they do.

Graphics runs many independent invocations of functions, reading textures through caches. If you wanted to parallelize 'game-update' to the same extent, you could double-buffer the game state, making it all read-only, like a texture: next_state = update(const previous_state). In miniature (sketch below):
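Code: Select all
#include <cstddef>
#include <vector>

struct Entity { float pos, vel; };

// a pure function of the immutable previous frame: reads old state only
Entity step(const Entity& e) {
    return Entity{ e.pos + e.vel, e.vel };
}

// every element of 'next' depends only on 'prev', so the whole loop is
// trivially parallelizable - no write/read hazards, no coherency traffic.
void frame(const std::vector<Entity>& prev, std::vector<Entity>& next) {
    for (std::size_t i = 0; i < prev.size(); ++i)
        next[i] = step(prev[i]);
    // the caller swaps the two buffers afterwards
}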

that's the secret of the gpu: each wavefront can affect only a limited set of data; you can't have one end access the other end's data, because all pointers are in-/de-cremented in parallel.


Yes - it's a bit like a hardware map-reduce (map vertices; map pixels; scatter & reduce pixels via the z-buffer). That's completely scalable.

add another wavefront and you add new problems for your hardware.


But no - it DOES scale, because of the restrictions above.

Give a GPU more cores, running on bigger frame buffers, and it will give you more parallelism, ad infinitum.

More textures being read, more shaders executing independently, more pixels written out...

And no coherence problems, unless you ask for them :)

how many wavefronts can you have running at the same time on nvidia chips? do they share memory? how many mimd cores do graphics cards have, when you count each wavefront as a mimd processing unit?


Both NVIDIA & ATI work on similar problems; I believe their high-end cards both have hundreds to thousands of physical computation units, with thousands of threads in flight.

The 980 Ti has 2816 'CUDA cores'.

I know about the limitation of several units sharing the same instruction pointer, but that's different from the data-flow restriction.

what you said about additional cache-management, preventing reads or preventing writes, is exactly what I meant when I said that by going parallel you have to cope much more with hidden memory areas outside of the flat shared memory!


But subsets of the flat areas can be flagged immutable or uncached. We do this now; it works. We have massively parallel GPUs and CPUs accessing a flat address space in consoles. The flat address space makes streaming engines more efficient, and allows physics & graphics to share the same vertices/matrices if need be.

so my suggestion is to write something akin to a shared library. i.e. you mark a function or lambda as designated for epiphany, and some dynamic loader then schedules those functions onto epiphany, maybe multiple functions at once so the epiphany avoids waiting for new uploads.

But you shouldn't have to complicate the program structure at all for this. Just mark data as 'const', and mark a loop as 'parallel' (well, replace it with par_map or whatever) - done. No more bureaucracy needed, and it's as easy as writing loops is now.

If the compiler generates a 'shared library' for your inner loops/lambdas, great.

then the compiler puts the machine code and a relocation table into the binary, or into a separate e-Lib. from the programmer's POV these lib-functions are data that needs to be put on epiphany and started. the only change in the compiler then is some epiphany-attribute for functions; it'll create new pointers in addition to the ones for functions on arm, maybe with an __e_ prefix for the function pointers. mind you, the compiler does neither the caching nor any dependency-tracking. the functions I listed are of interest here because they are an attempt to organize dependency-tracking!

That's doable: within the #ifdef EPIPHANY - in the appropriate platform-dependent version of the parallel abstractions - the offload function can be marked.

The template itself generates boilerplate on both ARM and Epiphany: the code spawning & managing the worker threads on the ARM, and the stubs for the workers themselves on the Epiphany.


This is why it's awkward at the moment, for no good reason other than 'we haven't yet written the tools to handle it'.

If you could write the whole program on epiphany, great. But we don't know what the next board will have - 4 ARM cores + 64 e-cores? And on the current card the e-cores are crippled by the 1.5gb DDR access; you need the ARM + FPGA to max out the capability of the memory.

Will they stick with traditional ARM cores for running a host? Ideally I would like one ISA handling everything; it's just the evolutionary path that means we don't have that now, IMO. (You could probably make something like the e-core using the ARM ISA - I've already seen MIPS adapted with scratchpads. Or vice versa: you could make a 'main CPU' that runs the Epiphany ISA, and recompile linux for that.)

some dynamic loader then schedules those functions onto epiphany, maybe multiple functions at once so the epiphany avoids waiting for new uploads.


This is an interesting detail. The point of abstractions is that the actual implementations can be more elaborate, doing exactly what you suggest: how will functions be grouped? maybe many common functions can be clustered; maybe when you try to run a new task, you can check which core already has it uploaded; etc.

My goal is to separate all this from the main body of the program - not because "I don't want to deal with it", but because "I know the actual complexity will explode", and "don't repeat yourself".

By writing it this way, I'll be able to put *more* time into efficient core-management schemes - tweaking them independently of the program's meaning, without worrying about them getting in the way, and re-using the resulting schemes all over the place.


Keep in mind the need for cross platform development.

You need to hedge your bets, and develop software that can run on different architectures. Is adapteva actually going to get traction and deliver? Or will we be stuck with the CPU/GPU combination?

Why would I want to write code specifically for one platform if I don't even know it's going to be available? :)

Whereas if you can write one program that compiles reasonably well for x86, ARM, ARM+Epiphany, ST P2012, ARM with some FPGA inner loops, or Intel Knights Landing... that's so much better.

All these platforms will independently evolve from generation to generation, based on their ability to run common software. E.g. GPUs already moved away from VLIW based on benchmarking CUDA/OpenCL code. Hardware and software bounce off each other.

I've seen SO many platforms come and go now that I view them in a continuum.

SMP and Knights Landing ('predicated SIMD' - SIMD that's more general) give a reason to write programs like this for traditional computers.

If you can just recompile the same application code for epiphany by porting the abstractions, the available software will increase - hence more users, hence more efficient mass production.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Sat Aug 22, 2015 12:15 am

One more important point:

A filename is a terrible place to identify something, because it has no type information.

BAD:
"spawn this file to run on this data" - possible error of incompatibility. How do you know what files are valid?

GOOD:
"run this function across this data" - the compiler can tell you if its right, or in the case of templates, & overloading - it can even generate or pick the right function for you.

Type information is good: it makes things easier to express, provides compile-time error checks, and communicates more from one programmer to another.

You should not have to lose this great benefit to make something parallel.

Re: generating epiphany code from templates, possible?

Postby dobkeratops » Sat Aug 22, 2015 11:03 am

https://msdn.microsoft.com/en-us/library/hh265136.aspx

C++ AMP is very close to what I'm after - it allows fine-grained use of GPU code in C++ programs (and note that on existing devices, that's already thousands of cores): parallelism at the granularity of a lambda, removing the need to create a separate file, making it natural to write as an integral part of your program.

They give you 'parallel_for_each', and some helper classes to abstract moving data to & from GPU memory.
The latter is complexity & overhead that you can eliminate if you have unified memory - it's precisely the problem AMD's APU approach is designed to solve. To write portable code today, you have to assume you may need that copy.

From their description...
What new language feature does C++ AMP introduce?
Microsoft added the restrict(amp) feature, which you can apply to any function (including lambdas) to declare that the function can be executed on a C++ AMP accelerator. The restrict keyword instructs the compiler to statically check that the function uses only those language features that are supported by most GPUs, for example, void myFunc() restrict(amp) {...}
Microsoft or other implementers of the open C++ AMP spec could add other restrict specifiers for other purposes, including for purposes that are unrelated to C++ AMP.


So it seems what I might be after is C++ AMP support for Epiphany, with 'a different restrict specifier' (e-core code should be more versatile than GPU code).

My ideas on abstractions are a little different: instead of just 'parallel_for_each', I would have more ways of encoding a specific data flow - mapping from one vector to another, 'zip' (2 inputs, 1 output), combining with gather or scatter operations, etc.

There'd be potentially unlimited permutations, and you'd maybe tame that with lazy expression-templates mapping the combinations you actually write. E.g. eventually you'd just write something like src1.gather_by_indices(src2).map([](){...}).filter([](){...}).scatter_by(src3).into(dest), and that would call gather_by_indices__filter__scatter_by__into(src1,src2,fn1,fn2,src3,dest). A toy version of the lazy-chain mechanism, pared down to gather+map (all names invented here):
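Code: Select all
#include <cstddef>
#include <vector>

// each combinator only records what to do; nothing runs until .into(dest),
// where a real implementation would pick a fused kernel (or a DMA plan)
// for the whole chain.
template <typename Src, typename Idx, typename Fn>
struct GatherMap {
    const Src& src; const Idx& idx; Fn fn;
    template <typename Dst>
    void into(Dst& dst) const {                 // the execution point
        dst.resize(idx.size());
        for (std::size_t i = 0; i < idx.size(); ++i)
            dst[i] = fn(src[idx[i]]);           // fused gather + map
    }
};

template <typename Src, typename Idx>
struct Gather {
    const Src& src; const Idx& idx;
    template <typename Fn>
    GatherMap<Src, Idx, Fn> map(Fn fn) const { return {src, idx, fn}; }
};

template <typename Src, typename Idx>
Gather<Src, Idx> gather_by_indices(const Src& s, const Idx& i) { return {s, i}; }

void example(const std::vector<float>& src1, const std::vector<int>& src2,
             std::vector<float>& dest) {
    gather_by_indices(src1, src2).map([](float x){ return x * 2; }).into(dest);
}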

I want this to abstract over the use of DMA vs CPU pointers: because a GPU has caches it doesn't need DMA, which is why they do it that way.

Also their interfaces were evidently designed before C++ had Polymorphic Lambdas.

I was already aware of these from Rust (if C++ hadn't got them, I would have moved over to Rust permanently) - they clean up a lot of the noise. E.g. where C++ AMP has you manually writing 'index<1> idx' in the lambda argument, with polymorphic lambdas that can be inferred from the other parameters (auto idx).

Originally, Rust had a great feature that made this more natural: 'do notation' (not the same as Haskell's), where you could write do parallel_foreach(data) |&d| { ...what to do with element 'd'... }. Swift (and a few other languages) also has this: if you place a lambda after a function call, it sticks the lambda in as the last parameter.
