I agree that it is surprising that auto-increment is not used in the compiled code; I found the same to be true of my own compiled code. The register optimization in the generated code could be better as well, though one would hope that hand-optimized code always wins on that count. It was clear from the compiled output for my locmemcpy() that hand coding it in assembly would improve it. I am not sure that I want to learn yet another instruction set, but it may be advantageous here. My C++ locmemcpy() appears to max out at 2 bytes per cycle.
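For illustration only, here is a minimal sketch of the kind of copy loop under discussion. It is not the actual locmemcpy() source, which is not shown in this post; the name copy_dwords_sketch and its signature are hypothetical. It assumes both buffers are 8-byte aligned and that the length is a whole number of 64-bit words. The `*dst++ = *src++` form is the pattern a compiler could, in principle, map to auto-increment (post-modify) loads and stores; whether it actually does so has to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a local-memory copy routine (not the poster's code).
// Copies `count` 64-bit words from src to dst and returns dst.
// Assumption: the buffers do not overlap and are 8-byte aligned.
std::uint64_t *copy_dwords_sketch(std::uint64_t *dst,
                                  const std::uint64_t *src,
                                  std::size_t count)
{
    assert(count == 0 || (dst != nullptr && src != nullptr));

    std::uint64_t *out = dst;

    // One 8-byte load and one 8-byte store per iteration. The post-increment
    // on both pointers is what auto-increment addressing would implement
    // without separate add instructions.
    for (; count != 0; --count) {
        *out++ = *src++;
    }
    return dst;
}
```

Comparing the compiler output for this loop against a hand-written version would show whether the pointer updates are folded into the load and store instructions or emitted as separate adds.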
My locmemcpy() does appear to beat DMA at times, which surprises me, especially with 1024-byte buffers. I had read the 300-byte transfer figure that nickoppen referred to somewhere before, though I don't recall where. Consequently, I expected DMA to always beat my locmemcpy() with 1024-byte buffers. It doesn't, which led me to suspect my test methodology.
The current code idles the non-active processors while the single-core tests execute and wakes them up at the end of the test. Each test is repeated ten times on each individual core, and the combined-core test is also repeated ten times.
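The repeat-ten-times part of that methodology can be sketched roughly as follows. This is not the attached test code: repeat_test and RepeatStats are hypothetical names, and std::chrono::steady_clock stands in for whatever cycle counter the real test reads on the core. Idling and waking the other cores is not modeled here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical summary of one repeated measurement (names are illustrative).
struct RepeatStats {
    std::size_t runs;  // number of repetitions actually executed
    double min_ns;     // fastest single run
    double mean_ns;    // average over all runs
    double max_ns;     // slowest single run
};

// Runs `test` `repeats` times (ten by default, matching the post) and
// records the fastest, slowest, and mean elapsed time per run.
template <typename Fn>
RepeatStats repeat_test(Fn &&test, std::size_t repeats = 10)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;

    RepeatStats stats{repeats, 0.0, 0.0, 0.0};
    double total_ns = 0.0;

    for (std::size_t i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        test();
        const auto stop = clock::now();

        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        if (i == 0 || ns < stats.min_ns) stats.min_ns = ns;
        if (i == 0 || ns > stats.max_ns) stats.max_ns = ns;
        total_ns += ns;
    }

    stats.mean_ns = total_ns / static_cast<double>(repeats);
    return stats;
}
```

Reporting the minimum and maximum alongside the mean makes it easier to see whether a surprising result, such as a software copy beating DMA, holds across all ten runs or comes from a few outliers.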
test.txt
Statistics: Posted by GreggChandler, Wed Apr 26, 2017 4:57 pm