Memcpy slow compared to manual copy of data

Discussion about Parallella (and Epiphany) Software Development

Memcpy slow compared to manual copy of data

Postby pascallj » Thu Jun 15, 2017 12:40 pm

Hi all,

To compare the speed of different methods of moving data from core to core, I made a small benchmark which copies 4K of memory from one location to another. I tested manual copying and memcpy (DMA will follow). However, I noticed that the manual copy is a lot faster than memcpy. memcpy moves single bytes while my manual copy moves 64 bits at a time, so some difference is expected, but this much seems strange. These are my results:

Code: Select all
4K copy (timer cycles):

Source       -> Dest          Method            Cycles
0x6000       -> 0x7000        manual (32-bit)    28771
0x6000       -> 0x7000        manual (64-bit)    14435
0x6000       -> 0x7000        memcpy           ~676000
0x80806000   -> 0x80807000    manual (32-bit)    41059
0x80806000   -> 0x80807000    manual (64-bit)    20579
0x80806000   -> 0x80807000    memcpy           ~676000
0x80906000   -> 0x80907000    manual (32-bit)    44131
0x80906000   -> 0x80907000    manual (64-bit)    22116
0x80906000   -> 0x80907000    memcpy           ~676000
0x6000       -> 0x80907000    manual (32-bit)    29796
0x6000       -> 0x80907000    manual (64-bit)    14948
0x6000       -> 0x80907000    memcpy           ~676000
0x80906000   -> 0x7000        manual (32-bit)    45157
0x80906000   -> 0x7000        manual (64-bit)    22629
0x80906000   -> 0x7000        memcpy           ~676000


And I am using this code to benchmark it:

Code: Select all
unsigned BENCHMARK_VALUE = 0;
void* source = (void*)0x6000;
void* dest = (void*)0x7000;

e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
/* Manual 64-bit copy; uncomment to benchmark it instead of memcpy.
   (Arithmetic on void* is a GCC extension.) */
for (int i = 0; i < 512; i++) {
   //*(uint64_t*)(dest + i*8) = *(uint64_t*)(source + i*8);
}
/* memcpy copy; comment out when benchmarking the manual loop. */
memcpy(dest, source, 4096);
BENCHMARK_VALUE = e_ctimer_stop(E_CTIMER_0);
BENCHMARK_VALUE = E_CTIMER_MAX - BENCHMARK_VALUE;


I test each variant by commenting out either the manual copy or the memcpy call.

My custom linker script ensures that the region 0x4000-0x7fff is free to use. Compiler optimizations are turned off. However, when I turn them on with -O3, my manual copy also takes ~676000 cycles to finish; with -O1 or -O2 the cycle count drops to 4697 and 4181 respectively. I calculated that, at 14435 cycles (of which 5740 cycles are just the empty loop), the copying speed is roughly 450 MB/s, which is close to what I am reading here. Because e_write is based on memcpy, it gives the same results.

Does anyone know what I am missing here?

Regards,
Pascal

Re: Memcpy slow compared to manual copy of data

Postby pascallj » Thu Jun 15, 2017 2:03 pm

I think I have at least one problem figured out. If I copy memcpy from the Epiphany libgcc source into my own code, it gives roughly the same benchmark values as when I adapt my own code to copy 8 bits at a time instead of 64. So I guess the poor memcpy performance comes from the standard C library residing in external DRAM instead of SRAM.
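
For reference, this is roughly what that workaround amounts to: a plain copy routine compiled into the application itself, so the linker places it in on-chip SRAM rather than external DRAM. This is only a minimal byte-wise stand-in, not the actual libgcc source:

Code: Select all
#include <stddef.h>

/* Minimal local copy routine; it is linked with the application code,
   so it ends up in on-chip SRAM, unlike the C library's memcpy. */
static void *local_memcpy(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}

Timed with the same ctimer harness as above, this should land in the same range as the manual 8-bit copy.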

However, I am still wondering why -O3, instead of optimizing my code for speed, makes it so much slower.

Re: Memcpy slow compared to manual copy of data

Postby sebraa » Thu Jun 15, 2017 4:06 pm

Do not execute any code from external memory if you want reasonable performance. Use internal.ldf as linker script.

Re: Memcpy slow compared to manual copy of data

Postby pascallj » Thu Jun 15, 2017 4:23 pm

Performance is my number one priority. However, with internal.ldf not all of my code fits in memory, according to the linker error. I am currently using a modified LDF based on fast.ldf, so all of my own code and data is in SRAM and only the libraries are in DRAM. Is it possible to put everything in internal RAM?

I am including only these:

Code: Select all
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <e-lib.h>

Re: Memcpy slow compared to manual copy of data

Postby jar » Thu Jun 15, 2017 5:08 pm

You can use my shmemx_memcpy routine if you need the best performance. Nothing else beats it while also handling address misalignment.
https://github.com/USArmyResearchLab/op ... x_memcpy.c

Re: Memcpy slow compared to manual copy of data

Postby pascallj » Thu Jun 15, 2017 5:25 pm

Thanks! I am using the Parallella board for my bachelor thesis, so it would be nice to compare my final results against your implementation.

Re: Memcpy slow compared to manual copy of data

Postby GreggChandler » Fri Jun 16, 2017 7:44 am

jar wrote:You can use my shmemx_memcpy routine if you need the best performance. Nothing else beats it while also handling address misalignment.
https://github.com/USArmyResearchLab/op ... x_memcpy.c


I believe that I shared with you, albeit privately, that my C++ memory move beat yours--especially in larger blocks. I spent about fifteen minutes writing it. It also handles misaligned memory blocks. My recollection was that your code had more stalls. Your code was smaller, which one would expect for assembly.

Re: Memcpy slow compared to manual copy of data

Postby GreggChandler » Fri Jun 16, 2017 8:46 am

pascallj wrote:And I am using this code to benchmark it:

Code: Select all
unsigned BENCHMARK_VALUE = 0;
void* source = (void*)0x6000;
void* dest = (void*)0x7000;

e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
for (int i=0; i<512; i++) {
   //*(uint64_t*)(dest+i*8) = *(uint64_t*)(source+i*8);
}
memcpy(dest, source, 4096);
BENCHMARK_VALUE = e_ctimer_stop(E_CTIMER_0);
BENCHMARK_VALUE = E_CTIMER_MAX - BENCHMARK_VALUE;



There is often a trade-off between performance and code size. One of the quickest ways to get more performance here is to "unroll" your loop a bit. In my case, I used memory pointers, auto-incremented their values, and put 8 consecutive copy statements in the loop body. Something more like:

Code: Select all
uint64_t *d;
uint64_t *s;
d = ...
s = ...
for (int i = 0; i < 512/8; i++)
  {
  *d++ = *s++;
  *d++ = *s++;
  ...
  *d++ = *s++;
  }


I leave it for you to fill in and complete the above code. Hopefully you get the idea. There are also efficient ways to deal with extra bytes at the beginning and end if you want more general code.
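
To be concrete about the shape (the details are still yours to tune), the filled-in loop for the fixed 4 KB case could look like this sketch; it assumes both buffers are 8-byte aligned, and copy4k_unrolled is just an illustrative name:

Code: Select all
#include <stdint.h>

/* Unrolled 64-bit copy: 8 doubleword moves per iteration, 64 iterations
   for 4096 bytes. Assumes 8-byte-aligned buffers and a size that is a
   multiple of 64 bytes. */
static void copy4k_unrolled(void *dst, const void *src)
{
    uint64_t *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;
    for (int i = 0; i < 4096 / (8 * 8); i++) {
        *d++ = *s++; *d++ = *s++; *d++ = *s++; *d++ = *s++;
        *d++ = *s++; *d++ = *s++; *d++ = *s++; *d++ = *s++;
    }
}

Unrolling cuts the per-iteration branch and index overhead; fully general code still needs head/tail handling for misaligned or odd-sized buffers, as mentioned above.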

Re: Memcpy slow compared to manual copy of data

Postby sebraa » Fri Jun 16, 2017 12:16 pm

pascallj wrote:Performance is my number one priority. However, with internal.ldf not all of my code fits in memory, according to the linker error. I am currently using a modified LDF based on fast.ldf, so all of my own code and data is in SRAM and only the libraries are in DRAM. Is it possible to put everything in internal RAM?
If your code plus the libraries you use is bigger than the internal memory, then it is not possible. Either reduce the code size or the library dependencies. You can check with e-objdump (or e-nm) which functions end up in shared memory, and figure out which of those are performance-critical.


Including header files does not increase the code size. Using the functions does increase the code size, e.g. printf() and related functions are relatively big.

Re: Memcpy slow compared to manual copy of data

Postby DonQuichotte » Fri Jun 16, 2017 3:09 pm

Totally agree with sebraa.
Performance is my only goal.
Use internal.ldf or nothing.
Avoid using the libraries as much as possible.
gcc's bit-handling emulation, for example, is really poor; I had to replace those routines with my own (popcount, clz).
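
For example, branch-free replacements along these lines (a sketch only, not my exact routines) avoid the libgcc helpers entirely:

Code: Select all
#include <stdint.h>

/* Branch-free 32-bit population count (classic SWAR bit-twiddling). */
static inline uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Count leading zeros: smear the highest set bit downwards, then popcount. */
static inline uint32_t clz32(uint32_t x)
{
    if (x == 0)
        return 32;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return 32 - popcount32(x);
}

Keeping them static inline in your own source also keeps them in internal SRAM along with the rest of your code.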
