Re: Memcpy slow compared to manual copy of data

GreggChandler wrote: I believe that I shared with you, albeit privately, that my C++ memory move beat yours--especially in larger blocks. I spent about fifteen minutes writing it. It also handles misaligned memory blocks. My recollection was that your code had more stalls. Your code was smaller, which one would expect for assembly.
I looked up the results you sent me in your PM, and the Core->External copies were tied (+/- 1%) in performance because that case isn't limited by the core. The off-chip interface is very slow compared to the on-chip network, so stalls will dominate any memcpy routine. The shmemx_memcpy routine uses a hardware loop, is properly pipelined, and both handles and corrects misalignment. It also uses short (16-bit) instructions where possible. The hardware loop has no branch penalty, unlike a C for loop emitted by GCC. GCC will also increment the pointer offset in a separate instruction rather than emitting code which uses the post-modify increment feature of the ldrd/strd instructions. Optimized memcpy routines on many systems look complicated like this, and it's unfortunate that the GNU memcpy routine wasn't optimized.
When I say that shmemx_memcpy handles and corrects misalignment in a high performance manner, this is what I mean:
Code:
char a[8192], b[8192]; // assume both 'a' and 'b' are 8-byte aligned
char* ap = a+1; // 1-byte misalignment offset
char* bp = b+9; // 9-byte offset; ap and bp share the same offset modulo 8, so both reach an 8-byte boundary after 7 more bytes
shmemx_memcpy(ap, bp, 8183); // This will work and approach 2.4 GB/s (8 bytes per 2 clocks)
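For anyone following along, the "handles and corrects misalignment" idea can be sketched in plain C roughly as below. This is only an illustration and not the shmemx_memcpy source: the name sketch_memcpy and the word64_t typedef are made up for this post, the may_alias attribute is a GCC/Clang extension, and the byte-wise fallback for pointers with different offsets is my assumption, not a description of what the assembly routine does in that case. It also has none of the hardware loop, pipelining, or post-modify addressing discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit word type that is allowed to alias byte buffers (GCC/Clang extension). */
typedef uint64_t __attribute__((may_alias)) word64_t;

/* Hypothetical sketch: byte-copy a short head until both pointers are
 * 8-byte aligned, move the bulk as 8-byte words, then finish the tail. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* The word path is only usable when src and dst have the same
     * offset within an 8-byte word. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 7u) == 0) {
        /* Head: single bytes until the shared boundary is reached. */
        while (n > 0 && ((uintptr_t)d & 7u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: one 8-byte load and one 8-byte store per iteration. */
        while (n >= 8) {
            *(word64_t *)d = *(const word64_t *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Tail bytes, or the whole copy when the offsets differ (assumed fallback). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

With the pointers from the example above (destination a+1, source b+9, 8183 bytes), this sketch would copy 7 head bytes, then 1022 double words, and have no tail left over.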
IMO, this routine should be considered as a replacement for the GNU memcpy routine. Then I could remove shmemx_memcpy from the OpenSHMEM library and just call memcpy.