Memcpy slow compared to manual copy of data

Discussion about Parallella (and Epiphany) Software Development


Re: Memcpy slow compared to manual copy of data

Postby jar » Fri Jun 16, 2017 4:59 pm

GreggChandler wrote:I believe that I shared with you, albeit privately, that my C++ memory move beat yours--especially in larger blocks. I spent about fifteen minutes writing it. It also handles misaligned memory blocks. My recollection was that your code had more stalls. Your code was smaller, which one would expect for assembly.


I looked up the results you sent me in your PM: the Core->External copies were tied (within +/- 1%) because that case is not a core/processing limitation. The off-chip interface is very slow compared to the on-chip network, and stalls will dominate any memcpy routine there. The shmemx_memcpy routine uses a hardware loop, is properly pipelined, detects and corrects misalignment, and uses short (16-bit) instructions where possible. There is no branch penalty for the hardware loop, unlike the C for loop emitted by GCC. GCC will also increment the pointer offset in a separate instruction rather than using the post-modify increment feature of the ldrd/strd instructions. Optimized memcpy routines on many systems look this complicated, and it's unfortunate that the GNU memcpy routine wasn't optimized.

When I say that shmemx_memcpy handles and corrects misalignment in a high performance manner, this is what I mean:
Code: Select all
char a[8192], b[8192]; // assume both 'a' and 'b' are 8-byte aligned
char* ap = a+1; // 1-byte misalignment offset
char* bp = b+9; // 9-byte offset; ap and bp share the same misalignment (1 mod 8), so both reach 8-byte boundaries together after the first 7 bytes
shmemx_memcpy(ap, bp, 8183); // This will work and approach 2.4 GB/s (8 bytes per 2 clocks)
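For illustration only, the align-head / doubleword-body / byte-tail strategy described above can be sketched in plain C. Note that `aligned_memcpy` is a hypothetical stand-in, not the actual shmemx_memcpy source:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the strategy described above, NOT the real
 * shmemx_memcpy: copy bytes until the destination is 8-byte aligned,
 * then copy a doubleword per step, then mop up the tail.  The fast
 * path only engages when src and dst share the same offset modulo 8,
 * as in the a+1 / b+9 example. */
void *aligned_memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    /* Head: byte copies until d sits on an 8-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: doubleword copies (ldrd/strd territory on Epiphany),
     * taken only if the source is now aligned too. */
    if (((uintptr_t)s & 7) == 0) {
        while (n >= 8) {
            *(uint64_t *)(void *)d = *(const uint64_t *)(const void *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Tail: whatever bytes remain. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```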


IMO, this routine should be considered as a replacement for the GNU memcpy routine; then I could just remove it from the OpenSHMEM library and use memcpy.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Memcpy slow compared to manual copy of data

Postby pascallj » Tue Jun 20, 2017 10:46 am

GreggChandler wrote:[..]
There is often a trade-off between performance and code size. One of the quickest ways to get more performance here is to unroll your loop a bit. In my case, I used memory pointers, auto-incremented them, and put 8 consecutive copy statements in the loop body. Something like:

Code: Select all
uint64_t * d;
uint64_t * s;
d  = ...
s = ...
for (int i = 0; i < 512/8; i++)
  {
  *d++ = *s++;
  *d++ = *s++;
  ...
  *d++ = *s++;
  }


I leave it for you to fill in and complete the above code. Hopefully you get the idea. There are also efficient ways to deal with extra bytes at the beginning and end if you want more general code.
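One possible completion of the elided loop above, with hypothetical names and placeholder sizes (as written in the post, the loop moves 512/8 = 64 iterations of 8 doublewords, i.e. 4 KiB):

```c
#include <stdint.h>

/* Unrolled copy in the spirit of the sketch above: 8 doubleword
 * copies per iteration, 512/8 = 64 iterations, so 512 doublewords
 * (4 KiB) total.  Both pointers must be 8-byte aligned; handling of
 * leading/trailing odd bytes is left out, as in the original. */
void copy_unrolled(uint64_t *d, const uint64_t *s)
{
    for (int i = 0; i < 512 / 8; i++) {
        *d++ = *s++; *d++ = *s++;
        *d++ = *s++; *d++ = *s++;
        *d++ = *s++; *d++ = *s++;
        *d++ = *s++; *d++ = *s++;
    }
}
```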


Thanks, I already optimized my code by using addition instead of multiplication, which indeed increases performance a bit. However, if I remember correctly, once compiler optimizations are enabled, the compiler will already perform these kinds of optimizations.
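The "addition instead of multiplication" trick mentioned here is classic strength reduction, which GCC indeed performs automatically at -O1 and above. A minimal illustration, with hypothetical helper names not taken from the thread:

```c
#include <stddef.h>

/* Per-iteration multiply in the index computation. */
long sum_indexed(const long *a, size_t n, size_t stride)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i * stride];        /* i * stride each iteration */
    return total;
}

/* Strength-reduced version: the multiply is replaced by a running
 * pointer that is advanced by addition only. */
long sum_strided(const long *a, size_t n, size_t stride)
{
    long total = 0;
    const long *p = a;
    for (size_t i = 0; i < n; i++) {
        total += *p;                   /* addition only */
        p += stride;
    }
    return total;
}
```

With optimizations enabled, GCC typically rewrites the first form into the second on its own, which matches the observation in the post.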

sebraa wrote:[..]
If your code and the libraries you use are bigger than the internal memory, then it is not possible. Either reduce the code size or library dependencies. You can check with e-objdump (or e-nm) which functions are located in shared memory, and figure out which ones are performance-critical.

Including header files does not increase the code size. Using the functions does increase the code size, e.g. printf() and related functions are relatively big.


You are right. I had one snprintf call, which apparently pulls in a lot of code. Without that call, the code does fit quite nicely.
pascallj
 
Posts: 5
Joined: Thu Jun 15, 2017 12:19 pm

Re: Memcpy slow compared to manual copy of data

Postby jar » Tue Jun 20, 2017 2:02 pm

If you need to use printf, the COPRTHR 2.0 SDK solves this issue with only a few bytes of code (the character string is even stored off-chip). And you can call arbitrary routines on the host from an Epiphany core with little overhead. Many of the details are in "Advances in Run-Time Performance and Interoperability for the Adapteva Epiphany Coprocessor". You can download the COPRTHR SDK here: http://www.browndeertechnology.com/reso ... prthr2.htm

Re: Memcpy slow compared to manual copy of data

Postby olajep » Tue Jun 20, 2017 4:14 pm

1. Use __builtin_assume_aligned

Code: Select all
...
void* source = __builtin_assume_aligned((void*)0x6000, 8);
void* dest = __builtin_assume_aligned((void*)0x7000, 8);
...


2. We want to use GCC's builtin version of memcpy, not the newlib one. GCC builtins are enabled when you enable optimizations; -O will do.
Code: Select all
parallella@parallella:~ $ e-gcc -O myprogram.c -o myprogram
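Taken together, the two hints might be sketched as below. The buffer names and sizes here are placeholders standing in for the fixed local-memory addresses (0x6000/0x7000) in the post:

```c
#include <stdint.h>
#include <string.h>

static uint64_t src_buf[512], dst_buf[512];

/* Tell GCC the pointers are 8-byte aligned so its builtin memcpy
 * expansion (enabled at -O) is free to emit doubleword loads and
 * stores instead of a conservative byte loop. */
void copy_aligned(void)
{
    void *src = __builtin_assume_aligned(src_buf, 8);
    void *dst = __builtin_assume_aligned(dst_buf, 8);
    memcpy(dst, src, sizeof src_buf);
}
```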


Does that improve your results?

// Ola
_start = 266470723;
olajep
 
Posts: 139
Joined: Mon Dec 17, 2012 3:24 am
Location: Sweden

Re: Memcpy slow compared to manual copy of data

Postby GreggChandler » Tue Jun 20, 2017 11:37 pm

pascallj wrote:
Thanks, I already optimized my code by using addition instead of multiplication, which indeed increases performance a bit. However, if I remember correctly, once compiler optimizations are enabled, the compiler will already perform these kinds of optimizations.



There are no multiplications in the code sample that I supplied! The compiler generates pretty good code in this case--especially with optimizations enabled.
GreggChandler
 
Posts: 66
Joined: Sun Feb 12, 2017 1:56 am
