Re: past lessons from CELL,
One of the things that made it hard to use compared with a conventional machine (and hence unpopular with developers) was the way it dealt with latency.
The programmer needed to figure out up front what data was arriving when, and the software had to schedule transfers manually; hence patterns such as double-buffered DMA and so on. That works fine for streaming data through the same kernel, but it's hard to apply to complex, varied workloads.
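For anyone who never touched it, a minimal sketch of the classic SPU-side double-buffer loop, using the CELL SDK's MFC intrinsics from spu_mfcio.h. process() is a placeholder compute kernel and the chunk size is illustrative; it assumes total is a nonzero multiple of CHUNK:

#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer; illustrative size */

void process(volatile char *buf, unsigned n);  /* placeholder kernel */

void stream(unsigned long long ea, unsigned long long total)
{
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));
    int cur = 0;

    /* kick off the first transfer before entering the loop */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned long long off = CHUNK; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        /* start fetching the next chunk on the other tag... */
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);
        /* ...then wait only for the current chunk, and process it
           while the next transfer is still in flight */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);
        cur = nxt;
    }
    /* drain the final chunk */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process(buf[cur], CHUNK);
}

The point being: the overlap of compute and transfer is entirely hand-scheduled, around hard-coded buffer sizes.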
GPUs and CPUs deal with the problem through threading and/or out-of-order execution (OOOE): when waiting on a resource, one thread simply stalls, allowing another to run (or allowing earlier instructions to complete). This is easier for software to scale across hardware generations, since it isn't programmed around specific buffer sizes, timings, and so on.
In theory, sorting and restructuring your whole approach was the solution for CELL; in practice, developers rarely had time to do this on cross-platform projects, the PS3 became a hated platform, and Sony ditched the architecture.
Does Parallella have the same 'issue', requiring this kind of rethink and paying more attention to sorting data and code together?
If it does, fair enough. I understand this may be the machine's reason for existing: a device designed for specialised tasks that CPUs and GPUs can't yet handle efficiently. I've been through the exercise in the past when dealing with CELL. I know it *is* possible to do just about anything this way, and it *is* an enjoyable process (as a mental exercise), but having done it I completely understand why some say a new language is required (I see Erlang is taken a little more seriously now, and maybe there are other options).

For handling complex use cases, I imagine a language implementation actually doing the sorting on the CPU and distributing data between queues: throw the code for a new task onto a core, then throw the data at it. Sony had a 'job'/'task' manager a bit like this, but most people used lower-level access, adapting C/C++ codebases the hard way.
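To make that concrete, a rough sketch of what such host-side distribution could look like. Every name here (job_t, dispatch, the affinity rule) is invented for illustration; it isn't Sony's job manager or any real Parallella API, and it assumes the queues have already been initialised:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical job descriptor: which kernel to run, and where its data lives. */
typedef struct {
    uint32_t kernel_id;   /* index into a table of compiled kernels */
    void    *data;        /* payload staged contiguously by the host */
    uint32_t size;
} job_t;

#define QUEUE_LEN 64

/* One lock-protected ring per core; a real implementation would more
   likely use lock-free rings in each core's local memory. */
typedef struct {
    job_t buf[QUEUE_LEN];
    unsigned head, tail;
    pthread_mutex_t lock;
} job_queue_t;

/* Host-side scheduler: group jobs by kernel_id so each core keeps running
   the same code over a stream of data, then distribute them. */
void dispatch(job_queue_t *queues, int ncores, job_t *jobs, int njobs)
{
    for (int i = 0; i < njobs; i++) {
        /* crude affinity: same kernel always lands on the same core */
        job_queue_t *q = &queues[jobs[i].kernel_id % ncores];
        pthread_mutex_lock(&q->lock);
        q->buf[q->tail % QUEUE_LEN] = jobs[i];  /* assumes the ring never fills */
        q->tail++;
        pthread_mutex_unlock(&q->lock);
    }
}

The interesting part is the sorting policy, not the queue: keeping the same kernel resident on a core is exactly the "sort data and code together" discipline CELL demanded, just lifted into the runtime instead of the application.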