Yes, the ACP is an option. I realized I hadn't discussed it just after posting.
On the plus side, you can then use normal memory and don't have to care about the cache. If your driver is well written using the DMA API, then just setting the 'dma-coherent' property in the device tree will turn the cache maintenance ops into no-ops and make dma_alloc_coherent return normal cached memory. So you can plug your HW into the ACP or the HP port and only have to change one property in the device tree. It's also nice that your application then doesn't have to care all that much.
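As a rough illustration (the node name, compatible string, and addresses below are all made up, not from any real binding), the only device-tree change between the two attachments would look something like:

```dts
/* Hypothetical peripheral node: with the device wired to the ACP, adding
 * 'dma-coherent' tells the kernel's DMA API that the device snoops the CPU
 * caches, so no cache maintenance is needed. */
my_cam: camera@43c00000 {
    compatible = "acme,rpi-cam";
    reg = <0x43c00000 0x10000>;
    dma-coherent;   /* drop this line if the device sits on a non-coherent HP port */
};
```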
The downside is that the ACP port has less bandwidth available than the HP port; people have measured it at about half. That bandwidth also impacts the performance of the ARM cores since it's shared (which a benchmark might not show). You also have to be careful about which protection/cache attribute bits you use during your ACP transfers so you don't thrash the cache: you want the DMA writes to not allocate cache lines, just invalidate them (at least for large transfers like the camera's). Exchanging small data with the Epiphany could be fully cacheable, allocating on read/write to speed things up.
Note that the cache thrashing can happen if your app does large accesses to a cacheable zone as well ... which is why a well-written app is always better :p
I started a thread a couple of days ago on the linux-arm mailing list where all the solutions are discussed with their pros and cons, if you want more details.
ATM I'm using vldm and vstm for my optimization; see the fast_write method in rpi_cam.c. Since I'm writing to a file descriptor, I just wanted the write() to be fast, so I do small 8 kB write() calls, and before each call I memcpy those 8 kB from the non-cached shared memory into a small local buffer using vldm/vstm. Because that local buffer is so small and I overwrite it between calls, the writes to it most likely never hit DDR and stay within L1.
A very quick read-speed benchmark I made (basically summing all the int32_t in a large buffer) yields:
- Pure C - cacheable mem : 340 MB/s
- Pure C - non-cacheable : 35 MB/s
- NEON - cacheable mem : 577 MB/s
- NEON - non-cacheable : 482 MB/s
For non-cacheable memory, preload wouldn't help at all (and would probably even be counterproductive): preloading just tells the CPU to pull that memory into the cache so it's already there when you actually read it, which can't happen for non-cacheable memory. For cacheable memory it would probably help, yes.
Replacing the memcpy could definitely help. There is a project called 'fastarm' (https://github.com/hglm/fastarm) that has optimized memcpy/memset replacements, and it also ships a benchmark. On my rev1 (with 533M DDR):
standard memcpy: ( ./benchmark --test 47 --memcpy a )
8M bytes page aligned: 307.47 MB/s
8M bytes page aligned: 307.44 MB/s
new memcpy for cortex using NEON with line size 64, preload offset 192: ( ./benchmark --test 47 --memcpy g )
8M bytes page aligned: 495.45 MB/s
8M bytes page aligned: 495.50 MB/s
But since those are optimized with preload and the like, I'm not sure how they'd perform on non-cached memory. A dedicated memcpy for that zone might do better.
Cheers,
Sylvain