Epiphany vs Arm performance

Posted: Mon Jul 10, 2017 5:19 pm
by Nerded
Hello,

I am currently running the omp4-epiphany examples. In the pi-kernel example, I am getting an extremely poor benchmark for the Epiphany compared to the ARM. With 16 Epiphany cores, the calculation completes in 0.96 seconds. With 1 ARM core, it completes in 0.16 seconds, and with 2 ARM cores, in just 0.085 seconds. Is this performance discrepancy expected? If so, what exactly is the point of the Epiphany processor? Thank you.

The examples are located here: https://github.com/parallella/parallella-examples/blob/master/omp4-epiphany/pi_kernel/pi.c

Re: Epiphany vs Arm performance

Posted: Mon Jul 10, 2017 7:17 pm
by jar
A couple things...

1) There is no hardware support for 64-bit double precision (DP) on E3, so DP code runs through software emulation and will have difficulty outperforming the ARM cores, which do have hardware DP support. So this is expected.
2) The OpenMP programming model is designed for symmetric multiprocessing (SMP), whereas Epiphany should primarily be thought of as a networked cluster of distributed memory where partitioned global address space (PGAS) models are used. Although OpenMP may use the global shared memory on Parallella, the model lacks semantics for addressing Epiphany's local core memory.
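
To make point 1 concrete, here is a minimal sketch (not the omp4-epiphany code) of a midpoint-rule pi integration written with single-precision `float` only, which avoids the emulated 64-bit path. It assumes the original kernel accumulates in `double`, as the emulation remark implies. `NUM_CORES` matches the 16 cores mentioned in the thread; `STEPS_PER_CORE`, `pi_partial`, and `pi_float` are hypothetical names, and the loop over cores is a serial stand-in for whatever offload mechanism is actually used.

```c
#include <assert.h>

#define NUM_CORES 16        /* Epiphany-III core count, per the thread */
#define STEPS_PER_CORE 1024 /* hypothetical work size per core */

/* Midpoint-rule partial sum of 4/(1+x^2) over one core's slice of [0,1],
   using only single-precision arithmetic. */
float pi_partial(int core)
{
    assert(core >= 0 && core < NUM_CORES);
    const float h = 1.0f / (NUM_CORES * STEPS_PER_CORE); /* step width */
    float sum = 0.0f;
    for (int i = 0; i < STEPS_PER_CORE; ++i) {
        /* midpoint of this core's i-th interval */
        float x = ((float)(core * STEPS_PER_CORE + i) + 0.5f) * h;
        sum += 4.0f / (1.0f + x * x);
    }
    return sum * h;
}

/* Add up the per-core partial sums (serial stand-in for a reduction). */
float pi_float(void)
{
    float pi = 0.0f;
    for (int core = 0; core < NUM_CORES; ++core)
        pi += pi_partial(core);
    return pi;
}
```

Calling `pi_float()` returns an approximation of pi. A `float` result agrees with pi to fewer digits than a `double` version would, so this trade only makes sense when single precision is acceptable for the application.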

You had asked earlier whether it was possible, not if it was a good idea. Exploring the OpenMP + MPI programming paradigm is useful for education, but not ideal for a Parallella cluster (with few exceptions).

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 11:38 am
by Nerded
Well, as far as MPI + X goes on the Epiphany, what would be my best option for X in terms of performance? Like I said before, I am working with a cluster of these Parallella boards.

Also, should I expect better performance from the Epiphany than from the ARM in general (given a proper implementation)?

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 1:07 pm
by jar
For "X", I use the COPRTHR SDK for offload and OpenSHMEM for on-chip inter-core.

As far as performance goes, it depends on the application. Applications with high arithmetic intensity will do well, but Epiphany on the Parallella board has limited off-chip bandwidth (less than 300 MB/s write). There is a non-zero offload overhead time, but if the computational kernel is called repeatedly, those secondary calls can be fast.

On the Parallella board, there are two ARM Cortex-A9 cores at 667 MHz with 4 SP FLOPS/cycle (NEON SIMD instructions) for a peak performance of 2*4*0.667 = 5.336 GFLOPS. Epiphany-III has 16 cores at 600 MHz with 2 SP FLOPS/cycle (fused multiply-add) for a peak performance of 16*2*0.6 = 19.2 GFLOPS. I find the Epiphany scalar code easier than vector code, particularly when writing optimized assembly.
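
The peak figures are just cores × FLOPS per cycle × clock in GHz. The sketch below redoes that arithmetic and adds one roofline-style estimate that is an added assumption, not something claimed in this post: dividing peak GFLOPS by off-chip bandwidth gives the FLOPs per byte a kernel would need before off-chip bandwidth stops being the limit. The function names are hypothetical, and 0.3 GB/s is used as an upper bound taken from the "less than 300 MB/s" figure.

```c
#include <assert.h>

/* Peak GFLOPS = cores * FLOPs per cycle per core * clock in GHz. */
double peak_gflops(int cores, int flops_per_cycle, double clock_ghz)
{
    assert(cores > 0 && flops_per_cycle > 0 && clock_ghz > 0.0);
    return cores * flops_per_cycle * clock_ghz;
}

/* Roofline-style break-even arithmetic intensity (FLOPs per byte):
   below this value, off-chip bandwidth rather than the FPUs bounds
   the achievable rate under this simple model. */
double breakeven_flops_per_byte(double peak, double bandwidth_gb_per_s)
{
    assert(peak > 0.0 && bandwidth_gb_per_s > 0.0);
    return peak / bandwidth_gb_per_s;
}
```

With the numbers from this thread, `peak_gflops(2, 4, 0.667)` gives about 5.336 and `peak_gflops(16, 2, 0.6)` gives about 19.2. `breakeven_flops_per_byte(19.2, 0.3)` gives about 64, so under this model data streamed from off-chip would need roughly 64 or more single-precision operations per byte to approach peak, which is why high arithmetic intensity matters.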

Good luck!

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 2:30 pm
by Nerded
Awesome! This is some great information. So, seeing as COPRTHR is for the offloading, and OMP is currently doing the offloading, would MPI + (OpenSHMEM + OpenMP) be possible in this case? Either way, I sincerely appreciate your help!

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 3:46 pm
by jar
I have never tried (OpenSHMEM + OpenMP) and I have some doubts that it will work. Each has a different memory model, but I'm not familiar enough with the OpenMP implementation to say with certainty that they're incompatible.

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 3:59 pm
by Nerded
Okay, do you know of any examples of MPI + a threaded OpenSHMEM implementation? Or, I guess, just a threaded OpenSHMEM implementation?

Re: Epiphany vs Arm performance

Posted: Tue Jul 11, 2017 8:50 pm
by jar
Start with this one:
https://github.com/USArmyResearchLab/op ... le/c_nbody

Here are a few more:
https://github.com/USArmyResearchLab/op ... r/example/

The examples include just the device-level parallelism (threaded OpenSHMEM). If you find you like OpenSHMEM, you could use that instead of MPI for your node-level parallelism. So you would effectively have an OpenSHMEM + OpenSHMEM code.

Re: Epiphany vs Arm performance

Posted: Wed Jul 12, 2017 12:55 am
by Nerded
Interesting! I was definitely unfamiliar with the scope of OpenSHMEM. Do you believe this is a more efficient method for cross-node communication?

Also, a beginner SHMEM question: does it effectively pool the memory of the 16 eCores? As in, am I working with 512 KB instead of 32 KB × 16?

Re: Epiphany vs Arm performance

Posted: Wed Jul 12, 2017 1:33 am
by jar
SHMEM was originally an API developed by Cray back in the 1990s. It has more recently become standardized so that there are many implementations that follow a common API. In my opinion, it's a cleaner, more intuitive API than MPI 1 and 2 for communication primitives and one-sided communication. MPI version 3 added one-sided communication. MPI also has a lot of other routines, such as parallel file I/O, that don't really apply to Epiphany. OpenSHMEM focuses on moving data in a one-sided, asynchronous manner. There are a bunch of test codes so you can see how you might use specific routines.

You can also read the OpenSHMEM 1.3 specification.

Nerded wrote: Also, a beginner SHMEM question, but does it effectively pool the memory of the 16 e cores? As in, I am working with 512 KB instead of 32 KB × 16?


No, OpenSHMEM efficiency is gained with partitioned memory. But we are working on something else for the lazy :-)
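
As a rough picture of what "partitioned" means here, the toy model below is plain C that runs on a host. It is not the OpenSHMEM API, and every name in it (`local_mem`, `toy_put`, `toy_local`) is hypothetical. Each of the 16 processing elements (PEs) owns a separate 32 KB region, and data reaches another PE only through an explicit one-sided copy that names the destination PE. A request that would spill past one PE's 32 KB is rejected rather than flowing into a neighbor, which is the difference from a pooled 512 KB.

```c
#include <assert.h>
#include <string.h>

#define NUM_PES 16               /* 16 cores, per the thread */
#define LOCAL_BYTES (32 * 1024)  /* 32 KB of local memory per core */

/* One private 32 KB region per PE; there is no single 512 KB array
   that a program can index across. */
static unsigned char local_mem[NUM_PES][LOCAL_BYTES];

/* Hypothetical one-sided put: copy n bytes from the caller's buffer into
   dest_pe's local memory at offset dest_off. Returns 0 on success, or -1
   if the PE is invalid or the request does not fit in that PE's 32 KB. */
int toy_put(int dest_pe, size_t dest_off, const void *src, size_t n)
{
    if (dest_pe < 0 || dest_pe >= NUM_PES)
        return -1;
    if (dest_off > LOCAL_BYTES || n > LOCAL_BYTES - dest_off)
        return -1;
    memcpy(&local_mem[dest_pe][dest_off], src, n);
    return 0;
}

/* Read-only view of one PE's local region. */
const unsigned char *toy_local(int pe)
{
    assert(pe >= 0 && pe < NUM_PES);
    return local_mem[pe];
}
```

In this model, `toy_put(3, 0, &value, sizeof value)` places `value` in PE 3's memory only, and a write starting two bytes before the end of a region fails for anything larger than two bytes. Real OpenSHMEM code would use the library's own put/get routines on symmetric data instead of these stand-ins.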