Re: past lessons from CELL,
One of the things that made it hard to use compared with a conventional machine (and hence unpopular with developers) was the way it dealt with latency.
The programmer needed to figure out up front what data was arriving when, and the software had to schedule transfers manually; hence patterns such as double-buffered DMA and so on. That works fine for streaming data through the same kernel, but it's hard to apply to complex, varied workloads.
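For anyone who never touched it, a minimal sketch of the classic SPU-side double-buffer loop, using the CELL SDK's MFC intrinsics from spu_mfcio.h. process() is a placeholder compute kernel and the chunk size is illustrative; it assumes total is a nonzero multiple of CHUNK:

#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer; illustrative size */

void process(volatile char *buf, unsigned n);  /* placeholder kernel */

void stream(unsigned long long ea, unsigned long long total)
{
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));
    int cur = 0;

    /* kick off the first transfer before entering the loop */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned long long off = CHUNK; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        /* start fetching the next chunk on the other tag... */
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);
        /* ...then wait only for the current chunk, and process it
           while the next transfer is still in flight */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);
        cur = nxt;
    }
    /* drain the final chunk */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process(buf[cur], CHUNK);
}

The point being: the overlap of compute and transfer is entirely hand-scheduled, around hard-coded buffer sizes.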
GPUs and CPUs deal with the problem through threading and/or out-of-order execution (OOOE): when waiting on a resource, one thread simply stalls, allowing another to run (or allowing earlier instructions to complete). This is easier for software to scale across hardware generations, since it isn't programmed around specific buffer sizes, timings, and so on.
In theory, sorting and restructuring your whole approach was the solution for CELL; in practice, developers rarely had time to do this on cross-platform projects, the PS3 became a hated platform, and Sony ditched the architecture.
Does Parallella have the same 'issue', requiring this kind of rethink and paying more attention to sorting data and code together?
If it does, fair enough. I understand this may be the machine's reason for existing: a device designed for specialised tasks that CPUs and GPUs can't yet handle efficiently. I've been through the exercise in the past when dealing with CELL. I know it *is* possible to do just about anything this way, and it *is* an enjoyable process (as a mental exercise), but having done it I completely understand why some say a new language is required (I see Erlang is taken a little more seriously now, and maybe there are other options).

For handling complex use cases, I imagine a language implementation actually doing the sorting on the CPU and distributing data between queues: throw the code for a new task onto a core, then throw the data at it. Sony had a 'job'/'task' manager a bit like this, but most people used lower-level access, adapting C/C++ codebases the hard way.
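To make that concrete, a rough sketch of what such host-side distribution could look like. Every name here (job_t, dispatch, the affinity rule) is invented for illustration; it isn't Sony's job manager or any real Parallella API, and it assumes the queues have already been initialised:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical job descriptor: which kernel to run, and where its data lives. */
typedef struct {
    uint32_t kernel_id;   /* index into a table of compiled kernels */
    void    *data;        /* payload staged contiguously by the host */
    uint32_t size;
} job_t;

#define QUEUE_LEN 64

/* One lock-protected ring per core; a real implementation would more
   likely use lock-free rings in each core's local memory. */
typedef struct {
    job_t buf[QUEUE_LEN];
    unsigned head, tail;
    pthread_mutex_t lock;
} job_queue_t;

/* Host-side scheduler: group jobs by kernel_id so each core keeps running
   the same code over a stream of data, then distribute them. */
void dispatch(job_queue_t *queues, int ncores, job_t *jobs, int njobs)
{
    for (int i = 0; i < njobs; i++) {
        /* crude affinity: same kernel always lands on the same core */
        job_queue_t *q = &queues[jobs[i].kernel_id % ncores];
        pthread_mutex_lock(&q->lock);
        q->buf[q->tail % QUEUE_LEN] = jobs[i];  /* assumes the ring never fills */
        q->tail++;
        pthread_mutex_unlock(&q->lock);
    }
}

The interesting part is the sorting policy, not the queue: keeping the same kernel resident on a core is exactly the "sort data and code together" discipline CELL demanded, just lifted into the runtime instead of the application.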