Epiphany vs Arm performance

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

Re: Epiphany vs Arm performance

Postby Nerded » Wed Jul 12, 2017 1:25 pm

Interesting! SHMEM seems to be a good solution. Since I already have the MPI written, I think I will just stick w/ MPI + OpenSHMEM. Thank you for your insight (and code).
Nerded
 
Posts: 19
Joined: Tue Jun 06, 2017 8:30 pm

Re: Epiphany vs Arm performance

Postby Nerded » Wed Jul 12, 2017 4:06 pm

So, if I am understanding the 32KB of memory correctly, this means if I have an array of 57 byte structs, I could only have an array with roughly 574 items per core? If I am computing over a larger dataset I would need the 32MB of DRAM, resulting in a significant performance drop-off?

edit: Also, is this possible with Coprthr 1.6? The download link for Coprthr 2 is broken on the website.
Nerded
 
Posts: 19
Joined: Tue Jun 06, 2017 8:30 pm

Re: Epiphany vs Arm performance

Postby jar » Wed Jul 12, 2017 5:02 pm

This link is broken? http://www.browndeertechnology.com/code ... -2.0.1.tgz

Funny...it seems to work for me. Send me a PM with your email address if you would like me to email it to you.

COPRTHR 1.6 supported OpenCL/STDCL but not the new COPRTHR device-level interface needed for more efficient execution on Parallella. You can read about the COPRTHR 2.0 improvements in "Advances in Run-time Performance and Interoperability for the Adapteva Epiphany Coprocessor".

You should probably pad your 57-byte struct; otherwise you will have misaligned accesses (I think the compiler does this for you unless you give the struct the packed attribute). If the struct has any 64-bit elements, they should be 8-byte aligned. If your 57-byte struct occupies 64 bytes in memory, you could fit 512 of them, assuming no instructions or anything else in the core's scratchpad memory. But since you probably want to compute, expect your instructions and stack to take up perhaps half of that (this depends on the application and could be less).
User avatar
jar
 
Posts: 275
Joined: Mon Dec 17, 2012 3:27 am

Re: Epiphany vs Arm performance

Postby sebraa » Thu Jul 13, 2017 11:20 am

Nerded wrote:So, if I am understanding the 32KB of memory correctly, this means if I have an array of 57 byte structs, I could only have an array with roughly 574 items per core?
Each core is an independent microprocessor running an independent program. A program consists of "code", "data", "stack", plus user-managed memory (e.g. heap). Both the "data" and "stack" segments must reside in the 32 KB of local memory due to the memory model, and executing code out of non-local memory carries a performance penalty. Also, some part of local memory (256 bytes or so) is reserved as part of the ABI.

Nerded wrote:If I am computing over a larger dataset I would need the 32MB of DRAM, resulting in a significant performance drop-off?
This depends on your application. You can use local memory as a cache of "currently active items", and swap in and out items as needed (possibly using DMA). You can store larger amounts of data in the 32 MB of shared memory, or in other cores. Also, the host application can punch data directly into the cores' local memory, which is faster than the cores reading from DRAM.


In any case, when you want to compute over a larger dataset, memory management will be your most important issue. But you need to keep in mind that the total off-chip bandwidth is limited, no matter what you do.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: Epiphany vs Arm performance

Postby Nerded » Thu Jul 13, 2017 1:44 pm

Awesome response sebraa. So, exchanging data between the ARM's memory and each individual eCore is faster than each eCore exchanging data with the shared 32MB?
Nerded
 
Posts: 19
Joined: Tue Jun 06, 2017 8:30 pm

Re: Epiphany vs Arm performance

Postby olajep » Mon Jul 17, 2017 10:38 am

Nerded wrote:Awesome response sebraa. So, exchanging data between the ARM's memory and each individual eCore is faster than each eCore exchanging data with the shared 32MB?

You can use 'e-bandwidth-test' from epiphany-examples to find out which data transfer rates you can expect.
https://github.com/adapteva/epiphany-ex ... width-test
Code:
parallella@parallella:~$ cd ~/epiphany-examples/apps/e-bandwidth-test
parallella@parallella:~$ ./build.sh
parallella@parallella:~$ ./test.sh
ARM Host    --> eCore(0,0) write speed       =   47.06 MB/s
ARM Host    --> eCore(0,0) read speed        =    6.84 MB/s
ARM Host    --> ERAM write speed             =   91.24 MB/s
ARM Host    <-- ERAM read speed              =  131.05 MB/s
ARM Host    <-> DRAM: Copy speed             =  382.75 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1242.38 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  401.46 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  235.80 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =  160.10 MB/s


Note that those numbers are two years old. I'm traveling and couldn't find anything more recent.

// Ola
_start = 266470723;
olajep
 
Posts: 121
Joined: Mon Dec 17, 2012 3:24 am
Location: Sweden

Re: Epiphany vs Arm performance

Postby sebraa » Tue Jul 18, 2017 4:38 pm

Nerded wrote:So, exchanging data between the ARM's memory and each individual eCore is faster than each eCore exchanging data with the shared 32MB?
No, that would be too easy. :mrgreen: All communication between "shared memory" and "Epiphany" uses the same physical wires, and it depends on how you use them to get the maximum performance.

Generally:
  • Writes are much faster than reads.
  • Uninterrupted linear writes (bursts, i.e. only one core sending a big block) are faster than nonlinear writes.
  • Shared memory is uncached for the ARM.
  • DMA allows transfers to happen in the background, but I haven't used it.
  • I would assume DMA writes to be a bit faster than regular writes, for large data blocks (but don't know).
  • I would assume DMA reads to be a lot faster than regular reads (but don't know).

So for input data, you may be best off putting your data into shared memory and reading it linearly (one core at a time) into the eCores using the DMA engines, or having the ARM punch the data directly into the eCores. For output data, using the DMA engine to write to shared memory (again linearly, one core at a time) and having the ARM copy from there to its own memory before processing might be the fastest way. I don't know if the ARM can use DMA to pull the data from shared memory.

Also think about synchronization between the ARM and the Epiphany cores, which will cost additional bandwidth and interrupt the bursts.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm
