by mhonman » Thu Aug 01, 2013 8:33 pm
Apologies in advance, this has become a mini-essay (or sermon?) and is loaded with opinion, possibly outdated views, and a fair amount of Parallel Processing 101 material that should be obvious to most who post here.
But, to the original question "is our mindset wrong", my CFD-tinged view is "yes and no".
Yes, because we are so accustomed to breaking down the solution to a problem into a sequence of steps that it is hard to recover the inherent parallelism at a later stage. So IMO the dreamt-of parallelizing compilers are a wrong-end-of-the-telescope approach.
No, because problem domains (in CFD at least) are continuous, and are described by continuous equations. The solution in every part of the problem domain is coupled to every other part. No matter how you slice and dice the parallelism in the problem, a naive solution demands a phenomenal amount of communication.
So in a nutshell there is no substitute for a smart person (or team) with an excellent understanding of both the problem domain and the limitations of the computing device at hand. Efficiencies differ by orders of magnitude between naive solutions and ones where simplifying assumptions have been introduced - so a clever solution may not need a parallel system in the first place.
The ideal is of course a clever solution which retains the inherent parallelism of the problem domain. That requires the engineers or scientists to "think parallel" - but once they do, the solution can be mapped onto any sufficiently flexible parallel computing platform.
Aside: There *are* other ways of solving these problems, one that used to be popular is analog CFD, often known as a wind tunnel. Other than the millions that it costs to build and operate a transonic wind-tunnel, and the time needed to produce a series of precision CNC-machined models, it works quite well.
Jokes aside, there are programming languages that make it easier to express the solution to a problem in concurrent terms - some of these have their own sub-forums here and they would be worth investigating before attempting to write parallel programs in C.
But I digress...
In order to really reap the benefits of parallel processing both the system and the software must be scalable, i.e. overheads remain constant when the problem size and parallelism increase in proportion to each other.
That in turn means that a good architecture should be one in which you get more of everything when the number of processing elements is increased - more memory size, memory bandwidth, and communication bandwidth - and the processing elements are functionally independent of each other. That allows a big problem to be broken down into a multiplicity of localised partial solutions, without the processing elements having to contend for shared resources (and therefore having performance limited by the speed of those resources).
As an architecture, Epiphany does all of this really well. In particular communication is low-latency and fast in comparison to its computational power, and non-localised communication is not prohibitively expensive.
And we have to remember this is not an HPC cluster we are discussing, it is a single inexpensive chip - so yes, it has its limitations (per-core RAM in particular).
However all of these really neat features are of no use unless we are able to "think parallel" in devising algorithms that solve our problems. For example, to think in terms of what happens at a cell in solution space and how it interacts with adjacent cells (cellular automata are a classic example), rather than thinking in terms of a series of operations that are performed on arrays or matrices - scalable parallelisation of which is usually predicated on an infinitely fast global communication fabric.
Once a parallel algorithm has been devised, it should usually be possible to adapt it for implementation on whatever computing device is to hand - though IMO a MIMD architecture like Epiphany is relatively easy to target because the programmer is not restricted to having all the processing elements operating in lock-step, and is thus not forced to think parallel at all times.
That said, 20 years ago people were successfully implementing some pretty sophisticated CFD methods on the 65536-core SIMD Connection Machine.