Parallella Community

Posted: **Mon Jul 27, 2015 2:04 pm**

Title: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Link: http://www.sciencedirect.com/science/ar ... 0915011564

Authors: Sebastian Raase, Tomas Nordström

Publication: International Conference On Computational Science, ICCS 2015

Source: Yes, see attachment. (Note: This is my code from the master thesis; there have been no substantial changes for the paper.)

Keywords: Many-core; Epiphany; Computational Fluid Dynamics; Lattice Boltzmann

Posted: **Mon Jul 27, 2015 3:34 pm**

Very nice work!!

Have you looked into scaling the work to more cores? How many cores, and how much memory per core, would you need to make the off-chip bandwidth issue "go away"?

Andreas

Posted: **Mon Jul 27, 2015 7:07 pm**

The external memory bandwidth limits the number of lattices I can copy out to the host. As I was told at the conference presentation, one often doesn't need the whole lattice, but only selected variables (e.g. density, velocity) for each node, which reduces the required bandwidth from 19 to 2-3 floats per node (in 3D).
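To illustrate the reduction, here is a minimal sketch of how per-node density and velocity can be computed from the 19 distribution values before anything is copied out. It assumes the standard D3Q19 velocity set and lattice units; the function names are made up for this example and are not taken from the attached code. Which of the resulting quantities are actually sent to the host is up to the application.

```python
from itertools import product

# D3Q19 velocity set: all lattice vectors with squared length <= 2
# (1 rest vector, 6 axis neighbours, 12 face diagonals).
D3Q19 = [c for c in product((-1, 0, 1), repeat=3) if sum(x * x for x in c) <= 2]


def macroscopic(f):
    """Return (density, velocity) for one node from its 19 distribution values."""
    if len(f) != len(D3Q19):
        raise ValueError("expected 19 distribution values")
    rho = sum(f)  # zeroth moment: density
    # First moment divided by density: velocity components (x, y, z).
    u = tuple(sum(fi * c[k] for fi, c in zip(f, D3Q19)) / rho for k in range(3))
    return rho, u


def rest_equilibrium():
    """Distribution of a fluid at rest with unit density (standard D3Q19 weights)."""
    weights = {0: 1 / 3, 1: 1 / 18, 2: 1 / 36}  # keyed by squared vector length
    return [weights[sum(x * x for x in c)] for c in D3Q19]
```

Calling `macroscopic(rest_equilibrium())` gives unit density and zero velocity, up to floating-point rounding.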

However, this assumes that the Epiphany holds the whole lattice, so the maximum simulation size is limited by the number of cores and the amount of memory per core. In real simulations, you'd need lattices about 4-5 orders of magnitude larger (1 to 10 GB, maybe more), so keeping the lattice fully in local memory is plainly infeasible. The required external memory bandwidth is then dictated by the processing speed.

With my current code, each core processes about 2.8 million nodes (2D) or 0.34 million nodes (3D) per second, which translates to about 100 MB/s (2D) or 26 MB/s (3D) per core. However, the 3D case in particular could probably be optimized further (I couldn't test it with -O3, which helped the 2D case immensely).
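The bandwidth figures can be reproduced with a quick back-of-the-envelope calculation. This sketch assumes 4-byte single-precision floats, 19 values per node in 3D (as mentioned above) and 9 values per node in 2D (the usual D2Q9 model, an assumption here), and counts the traffic in one direction only.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision values


def stream_bandwidth_mb_s(nodes_per_second, values_per_node):
    """Per-core bandwidth in MB/s needed to move every processed node once."""
    return nodes_per_second * values_per_node * BYTES_PER_FLOAT / 1e6


bw_2d = stream_bandwidth_mb_s(2.8e6, 9)    # about 100.8 MB/s
bw_3d = stream_bandwidth_mb_s(0.34e6, 19)  # about 25.8 MB/s
print(f"2D: {bw_2d:.1f} MB/s, 3D: {bw_3d:.1f} MB/s")
```

If every node has to be read and written back each iteration, the total traffic would be roughly twice these values.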

On a side-note: It scaled perfectly on the 64-core Epiphany as well.

Posted: **Mon Jul 27, 2015 7:17 pm**

Based on experience with Epiphany in many other domains, it's much more interesting to remove the DRAM from the equation (getting rid of the training wheels:-)).

How would your code perform on a system with a 64 x 64 x 64 3D torus of Epiphany-III cores?
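For scale, a rough estimate of what such a system could hold on-chip. This assumes 32 KB of local memory per Epiphany-III core and, unrealistically, that all of it is available for lattice data (code, stack and buffers are ignored), with 19 single-precision values per node.

```python
CORES_PER_DIM = 64
LOCAL_MEM_BYTES = 32 * 1024  # assumption: 32 KB local memory per core
BYTES_PER_NODE = 19 * 4      # assumption: 19 single-precision values per 3D node

total_cores = CORES_PER_DIM ** 3
total_bytes = total_cores * LOCAL_MEM_BYTES
max_nodes = total_bytes // BYTES_PER_NODE  # upper bound, ignores code and stack

print(f"{total_cores} cores, {total_bytes / 2**30:.0f} GiB, ~{max_nodes / 1e6:.0f} M nodes")
```

Under these assumptions the aggregate local memory is 8 GiB, which falls inside the 1 to 10 GB range quoted above for realistic lattices.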

Posted: **Mon Jul 27, 2015 7:34 pm**

Posted: **Mon Jul 27, 2015 8:13 pm**

The Lattice Boltzmann method is an O(n) method per node per iteration, so no amount of core or data scaling will help reduce the off-chip bandwidth overhead for this implementation. The key point is that the method is iterative and converges to a steady-state solution. So a lot of work can be done on the device without ever copying a result to shared memory (DRAM), provided the cores share boundary data using inter-core communication. I enjoyed reading the paper, but I think the lack of inter-core communication was one of the weaker points. After each lattice update, each core should write its edge node data to the appropriate neighboring core's memory rather than to shared memory.

Posted: **Mon Jul 27, 2015 8:31 pm**

Instead of writing the boundary data to the neighboring core (which would require additional on-core memory), the code reads the boundary data directly from the neighboring core, using shared memory. It is not necessary to write anything to shared memory (obviously, the host then never gets any results or status updates).
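A toy sketch of this "pull" scheme, for illustration only: a periodic 1D chain of blocks, one per hypothetical core, with just a right-moving and a left-moving population per node. The data layout and names are invented for this example and are not taken from the actual code. Each block reads the one edge value it needs straight from its neighbor's data, so no halo buffer is stored.

```python
def stream_pull(blocks):
    """One streaming step over a periodic 1D chain of blocks.

    blocks[b] = {"right": [...], "left": [...]} holds the right- and
    left-moving populations of the nodes owned by hypothetical core b.
    Returns new blocks; the input is left untouched (double buffering).
    """
    n = len(blocks)
    new_blocks = []
    for b, blk in enumerate(blocks):
        left_nb = blocks[(b - 1) % n]
        right_nb = blocks[(b + 1) % n]
        # Right-movers arrive from the node to the left; at the block edge
        # that node lives in the neighbouring core's memory.
        right = [left_nb["right"][-1]] + blk["right"][:-1]
        # Left-movers arrive from the node to the right.
        left = blk["left"][1:] + [right_nb["left"][0]]
        new_blocks.append({"right": right, "left": left})
    return new_blocks


blocks = [
    {"right": [0, 1, 2], "left": [10, 11, 12]},
    {"right": [3, 4, 5], "left": [13, 14, 15]},
]
stepped = stream_pull(blocks)
```

The sketch sidesteps the ordering problem by building new lists; on real hardware, the cores would have to synchronize so that a neighbor's edge values are read before that neighbor overwrites them.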

Posted: **Mon Jul 27, 2015 9:05 pm**

Posted: **Mon Jul 27, 2015 9:09 pm**

Posted: **Mon Jul 27, 2015 9:15 pm**
