On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Announcements of academic papers and technical reports based on Parallella or the Epiphany architecture.

On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby sebraa » Mon Jul 27, 2015 2:04 pm

Title: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Link: http://www.sciencedirect.com/science/ar ... 0915011564

Authors: Sebastian Raase, Tomas Nordström

Publication: International Conference on Computational Science, ICCS 2015

Source: Yes, see attachment. (Note: this is my code from my master's thesis; there have been no substantial changes for the paper.)

Keywords: Many-core; Epiphany; Computational Fluid Dynamics; Lattice Boltzmann
Attachments
lbe.tar.gz: Lattice Boltzmann on Epiphany (11.35 KiB, downloaded 641 times)

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby aolofsson » Mon Jul 27, 2015 3:34 pm

Very nice work!!

Have you looked into scaling the work to more cores? How many cores and how much memory per core do you need to make the off-chip bandwidth issue "go away"?

Andreas

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby sebraa » Mon Jul 27, 2015 7:07 pm

The external memory bandwidth limits the number of lattices I can copy out to the host. As I was told at the conference presentation, one often doesn't need the whole lattice, but only selected variables (e.g. density, velocity) for each node, which basically reduces the required bandwidth from 19 to 2-3 floats per node (in 3D).
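As an illustration (my sketch here, not the code in the attachment; the function name and the D3Q19 velocity ordering are assumptions), the density and velocity can be accumulated per node on the device, so that only the components actually needed leave the chip instead of all 19 distributions:

/* Hypothetical helper: reduce one D3Q19 node to density and velocity.
 * The velocity-set ordering below is an assumption; adapt to the real code. */
static const int cx[19] = { 0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0 };
static const int cy[19] = { 0, 0, 0, 1,-1, 0, 0, 1, 1,-1,-1, 0, 0, 0, 0, 1,-1, 1,-1 };
static const int cz[19] = { 0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1, 1, 1,-1,-1 };

void node_moments(const float f[19], float out[4])
{
    float rho = 0.0f, ux = 0.0f, uy = 0.0f, uz = 0.0f;
    for (int i = 0; i < 19; i++) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
        uz  += cz[i] * f[i];
    }
    out[0] = rho;        /* density                       */
    out[1] = ux / rho;   /* velocity = momentum / density */
    out[2] = uy / rho;
    out[3] = uz / rho;
}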

However, this assumes that the Epiphany holds the whole lattice, so the maximum simulation size is limited by the number of cores and the amount of memory per core. In real simulations, you'd need lattices about 4-5 orders of magnitude larger (1..10 GB, maybe more), so keeping the lattice fully in local memory is plainly infeasible. In that case, the required external memory bandwidth is dictated by the processing speed.

With my current code, each core processes about 2.8 million nodes (2D) or 0.34 million nodes (3D) per second, which translates to about 100 MB/s (2D) or 26 MB/s (3D). However, the 3D case in particular could probably be optimized further (I couldn't test with -O3, which helped the 2D case immensely).
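As a back-of-the-envelope check (assuming these figures correspond to moving the full distribution set per node: 9 floats for D2Q9, 19 floats for D3Q19, at 4 bytes per float):

#include <stdio.h>

/* Data rate implied by the stated node throughput:
 * nodes/s * floats/node * 4 bytes/float. */
int main(void)
{
    double bw_2d = 2.8e6  *  9 * 4;   /* D2Q9:  ~100.8 MB/s */
    double bw_3d = 0.34e6 * 19 * 4;   /* D3Q19: ~25.8 MB/s  */
    printf("2D: %.1f MB/s, 3D: %.1f MB/s\n", bw_2d / 1e6, bw_3d / 1e6);
    return 0;
}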

On a side-note: It scaled perfectly on the 64-core Epiphany as well.

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby aolofsson » Mon Jul 27, 2015 7:17 pm

Based on experience with Epiphany in many other domains, it's much more interesting to remove the DRAM from the equation (getting rid of the training wheels:-)).

How would your code perform on a system with a 64 x 64 x 64 3D torus of Epiphany-III cores?

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby sebraa » Mon Jul 27, 2015 7:34 pm

aolofsson wrote:Based on experience with Epiphany in many other domains, it's much more interesting to remove the DRAM from the equation (getting rid of the training wheels:-)).
I don't particularly care if there is a DRAM or not. :-) But for a Lattice Boltzmann implementation, I have to keep a lot of state (the lattice) between timesteps, and if I cannot keep it inside the Epiphany, then I have to store it somewhere else and both read and write it once per timestep.

aolofsson wrote:How would your code perform on a system with a 64 x 64 x 64 3D torus of Epiphany-III cores?
The slowest core decides the speed of the whole system, so the iteration time is limited by (a) processing speed, which is not an issue; (b) the overhead of barriers; (c) read and write accesses to the next neighbor (*); (d) getting the results out of the system. So it should perform similarly. :-)

(*) It is possible to get rid of the reads by having an extra boundary layer and some additional synchronization.

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby jar » Mon Jul 27, 2015 8:13 pm

The Lattice Boltzmann method does a constant amount of work per node per iteration (O(n) overall), so no amount of core or data scaling will help reduce the off-chip bandwidth overhead for this implementation. The key point of this method is that it is iterative and converges to a steady-state solution. So there can be a lot of work on the device without ever having to copy a result to shared memory (DRAM), if the cores share boundary data using inter-core communication. I enjoyed reading the paper, but I think the lack of inter-core communication was one of the weaker points. After each lattice update, each core should write its edge node data to the appropriate neighboring core's memory rather than to shared memory.
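A minimal sketch of that push-style exchange in one direction; it assumes the e-lib e_group_config structure and e_get_global_address(), the buffer names and EDGE_FLOATS are made up, and synchronization is only hinted at in comments:

#include <string.h>
#include <e_lib.h>

#define EDGE_FLOATS 256  /* hypothetical amount of edge data per exchange */

float edge_out[EDGE_FLOATS];                 /* this core's right-edge nodes          */
volatile float halo_from_left[EDGE_FLOATS];  /* written remotely by the left neighbor */

/* Push this core's right edge into the right neighbor's halo buffer (on-chip). */
void push_right_edge(void)
{
    unsigned row = e_group_config.core_row;
    unsigned col = e_group_config.core_col;

    if (col + 1 < e_group_config.group_cols) {
        /* Translate our local symbol into the neighbor's global address
         * and write straight into its local memory; no DRAM involved. */
        float *remote = (float *) e_get_global_address(row, col + 1,
                                                       (const void *) halo_from_left);
        memcpy(remote, edge_out, sizeof(edge_out));
    }
    /* A barrier or a per-buffer ready flag is still needed before the
     * neighbor reads halo_from_left in the next timestep. */
}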

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby sebraa » Mon Jul 27, 2015 8:31 pm

Instead of writing the boundary data to the neighboring core (which would require additional on-core memory), the code reads the boundary data directly from the neighboring core's local memory through the shared address space. It is not necessary to write anything to external shared memory (although the host then obviously never gets any results or status updates).

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby aolofsson » Mon Jul 27, 2015 9:05 pm

jar wrote:The Lattice Boltzmann method does a constant amount of work per node per iteration (O(n) overall), so no amount of core or data scaling will help reduce the off-chip bandwidth overhead for this implementation. The key point of this method is that it is iterative and converges to a steady-state solution. So there can be a lot of work on the device without ever having to copy a result to shared memory (DRAM), if the cores share boundary data using inter-core communication. I enjoyed reading the paper, but I think the lack of inter-core communication was one of the weaker points. After each lattice update, each core should write its edge node data to the appropriate neighboring core's memory rather than to shared memory.


Are you referring to sebraa's implementation or the Epiphany in general? If the whole state could fit in the array, then wouldn't the bandwidth go to ~0 after the initial data is brought in? So compute is something like Time * O(n).

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby aolofsson » Mon Jul 27, 2015 9:09 pm

sebraa wrote:I have to keep a lot of state (the lattice) between timesteps, and if I cannot keep it inside the Epiphany, then I have to store it somewhere else and both read and write it once per timestep.


64 x 64 x 64 cores is ~262,000 cores, which at 32 KB per core is ~8 GB of on-chip Epiphany memory.

Why isn't this enough to store all state?

Re: On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

Postby jar » Mon Jul 27, 2015 9:15 pm

aolofsson wrote:Are you referring to sebraa's implementation or the Epiphany in general? If the whole state could fit in the array, then wouldn't the bandwidth go to ~0 after the initial data is brought in? So compute is something like Time * O(n).


I am referring to sebraa's implementation on a hypothetical array of Epiphany cores large enough to fit the problem. Hypothetically, that takes off-chip bandwidth to zero between the initial load and the final store after XXXX iterations.

Buffering the shared data would require extra memory on each core, and it is a bit problematic in the 3D case because the edge data is about two thirds of the working set*, so the cores would be moving a lot of data relative to computation (though it stays on-chip rather than off-chip). The 2D case is the most interesting to me, and you could do it with less than 15% boundary-data overhead**.

If a hypothetical 128 KB Epiphany core existed, the inter-core communication overhead in the 3D case could be reduced to 43% of the working set***.

* Assuming a large Epiphany chip with a grid of 7x6x7 nodes per core... 1 - (5x4x5)/(7x6x7) = 66%
** Assuming a grid of 26x26 nodes per core... 1 - (24x24)/(26x26) = 14.8%
*** Assuming a large 128 KB Epiphany chip with a grid of 12x11x12 nodes per core... 1 - (10x9x10)/(12x11x12) = 43%
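For completeness, a tiny program (my own sketch, assuming one halo layer on every side of the stored tile) that reproduces those percentages:

#include <stdio.h>

/* Fraction of a stored tile that is boundary/halo data, assuming one
 * halo layer on every side (interior dimension = dimension - 2). */
static double halo3d(int x, int y, int z)
{
    return 1.0 - (double)((x - 2) * (y - 2) * (z - 2)) / (x * y * z);
}

static double halo2d(int x, int y)
{
    return 1.0 - (double)((x - 2) * (y - 2)) / (x * y);
}

int main(void)
{
    printf("3D,  32 KB core, 7x6x7:    %.0f%%\n", 100.0 * halo3d(7, 6, 7));     /* ~66%   */
    printf("2D,  32 KB core, 26x26:    %.1f%%\n", 100.0 * halo2d(26, 26));      /* ~14.8% */
    printf("3D, 128 KB core, 12x11x12: %.0f%%\n", 100.0 * halo3d(12, 11, 12));  /* ~43%   */
    return 0;
}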
