Well, the numbers that I published are from the built-in memcpy(), which one would hope has been optimized at least a little. (Although my ten-minute hack is pretty close to the library function.)

I also noticed, after publishing my data, that something strange was going on in the last run. Every observation should have 160 samples, but not all of the trials were being displayed. When I switched back to my own barrier, the trials appeared correctly. Apparently the WAND instruction is "experimental". Since I found no documentation of what is "experimental" about WAND, i.e. what works and what doesn't, I will modify my barriers to idle and wait for interrupts.

The current polling in the barriers significantly impacts the DMA measurements because it generates mesh traffic. Then again, perhaps that better reflects a real-world benchmark? The reason I tested with all of the cores executing similar tests is that, hopefully, it would be rare for only one core to be working on a problem.
As an aside, my C++ locmemcpy() did generate ldrd/strd instructions.
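
For anyone wondering what kind of loop tends to lead a compiler to emit doubleword loads and stores, here is a minimal sketch. It is not the locmemcpy() from the measurements; the name locmemcpy_sketch and the 8-byte alignment requirement are assumptions made for illustration, and whether ldrd/strd actually come out depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a hand-rolled copy routine (not the original code).
// Assumption: dst and src are both 8-byte aligned and do not overlap.
void *locmemcpy_sketch(void *dst, const void *src, std::size_t n)
{
    assert((reinterpret_cast<std::uintptr_t>(dst) & 7u) == 0);
    assert((reinterpret_cast<std::uintptr_t>(src) & 7u) == 0);

    // Copy 8 bytes per iteration; this is the part a compiler may
    // lower to doubleword load/store pairs.
    std::uint64_t *d = static_cast<std::uint64_t *>(dst);
    const std::uint64_t *s = static_cast<const std::uint64_t *>(src);
    const std::size_t words = n / 8;
    for (std::size_t i = 0; i < words; ++i)
        d[i] = s[i];

    // Copy the remaining 0..7 bytes one at a time.
    unsigned char *db = reinterpret_cast<unsigned char *>(d + words);
    const unsigned char *sb = reinterpret_cast<const unsigned char *>(s + words);
    for (std::size_t i = 0; i < n % 8; ++i)
        db[i] = sb[i];

    return dst;
}
```

Checking the generated assembly is the only way to confirm which instructions a given build really uses.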