I was wrong! I pulled out the instruction reference PDF and looked at the generated assembly for my locmemcpy() function (I used -S with a verbose option rather than e-objdump). For the most unrolled loop, which moves 8 x 64 bits per pass, the compiler was using auto-incrementing 64-bit ldrd/strd instructions. It was also clever enough to place some of the other housekeeping calculations between the loads and stores, and it adjusted the instructions that came after the housekeeping. It started off with a base address, but once the 8 x 64-bit increment had been applied (between the loads and stores), the later instructions were generated to subtract from the base. Very clever. I was impressed. Hats off to whoever wrote that code generator! It appears to understand the pipeline. I used -O3 to generate the code; I may not have used any optimization on my prior compiles. Unfortunately, it ran out of other calculations to interleave, so the generated code probably still has stalls. I think I will try to measure that.
The section of code that didn't use auto-increment was the tail, where the remaining 7 down to 1 x 64-bit transfers are done. It did, however, put the add instructions between the ldr and str instructions again, which was somewhat clever. I am not familiar enough with the instruction set yet to see if it put two 16-bit instructions in, but it was pretty clever. Hand optimization of the generated code would only be slightly better, and the payoff isn't that great in the tail. The final tail, where the remaining 7 down to 1 x 8-bit transfers are done, was also sub-optimal. However, the optimizer was likely confused by my somewhat non-standard, and already somewhat optimized, C/C++ code. A little hand optimization would probably help there as well. It still was reasonably clever.
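As a rough illustration of the three stages described above, here is a minimal C sketch. It is not the actual locmemcpy() source. The name locmemcpy_sketch, the 8-byte alignment requirement and the exact loop layout are all assumptions made for the example. It only shows the general shape: an unrolled 8 x 64-bit main loop, a 64-bit tail, and a byte tail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, NOT the original locmemcpy().
 * Assumes dst and src are both 8-byte aligned and do not overlap. */
void *locmemcpy_sketch(void *dst, const void *src, size_t n)
{
    uint64_t *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;
    size_t words = n / 8;   /* number of whole 64-bit transfers */
    size_t bytes = n % 8;   /* leftover 8-bit transfers (0..7)  */

    /* Main loop: 8 x 64-bit moves per pass. This is the kind of loop
     * a compiler can map onto double-word load/store instructions. */
    while (words >= 8) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8;
        s += 8;
        words -= 8;
    }

    /* First tail: the remaining 7 down to 1 x 64-bit moves. */
    while (words > 0) {
        *d++ = *s++;
        words--;
    }

    /* Last tail: the remaining 7 down to 1 x 8-bit moves. */
    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (bytes > 0) {
        *db++ = *sb++;
        bytes--;
    }

    return dst;
}
```

Compiling something like this with -O3 -S and comparing it against an unoptimized build is one way to see how much of the scheduling comes from the optimizer.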
My conclusion is that the optimizations were pretty good. A person can probably do better by hand, but the compiler holds up well. This is much better than the early (late '70s and early '80s) MIT compilers and assemblers I worked with.
https://en.wikipedia.org/wiki/NuMachine