by dar » Thu Sep 18, 2014 10:42 am
The issue with the bb array is likely that your writes are done with a simple loop, whereas aa benefited from caching: its values are re-used in the stencil calculation, which avoids costly reads. Caching both aa and bb would have the most impact when using DMA reads/writes, for which we had extensions.
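To make the caching idea concrete, here is a rough sketch in plain C. It is not your kernel: the 3-point averaging stencil, the name stencil_tiled, and the TILE size are all made up for illustration, and memcpy only stands in for whatever bulk/DMA transfer the platform provides. The point is just that aa is read into a local buffer once per tile, and bb is built locally and written back in one block instead of element by element.

```c
#include <stddef.h>
#include <string.h>

#define TILE 64  /* hypothetical tile size; tune to the local memory available */

/* Hypothetical 1D 3-point averaging stencil over the interior of aa.
 * Both the input tile (plus a one-element halo on each side) and the
 * output tile live in local buffers. memcpy is a stand-in for a bulk
 * (e.g. DMA) transfer; it is NOT the actual DMA extension API. */
void stencil_tiled(const float *aa, float *bb, size_t n)
{
    float a_loc[TILE + 2];  /* cached input tile + halo */
    float b_loc[TILE];      /* cached output tile */

    if (n < 3)
        return;             /* no interior points */

    for (size_t start = 1; start < n - 1; start += TILE) {
        size_t len = n - 1 - start;
        if (len > TILE)
            len = TILE;

        /* one bulk read: aa[start-1 .. start+len] */
        memcpy(a_loc, aa + start - 1, (len + 2) * sizeof(float));

        /* all stencil reads and writes hit the local buffers */
        for (size_t i = 0; i < len; i++)
            b_loc[i] = (a_loc[i] + a_loc[i + 1] + a_loc[i + 2]) / 3.0f;

        /* one bulk write instead of a per-element loop to bb */
        memcpy(bb + start, b_loc, len * sizeof(float));
    }
}
```

The boundary elements bb[0] and bb[n-1] are left untouched here; how you treat them depends on your stencil.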
As far as making the kernel persistent, I will look into this. The issue is that you might not have access to the needed API via OpenCL or STDCL at the moment, but only through the "low-level" coprthr API, which provides pthreads extended to co-processors. Basically, the best way to do it is to use the calls provided by that API. In the end, the distinctions here are not so dramatic, but more practical. So let me look before making a further recommendation. In the meantime, you could introduce a hack like some of the "Ping-Pong" code you might have seen for Epiphany, where mailboxes are used, etc. The idea is that you want the kernel to wait until signaled by the host; then it does the transform, signals the host, and goes back to waiting. And of course you need a way to tell it you are done and it should exit. It's basically pthread programming. This would give you a persistent kernel.
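The wait/transform/signal/exit loop described above could look roughly like this with ordinary host-side POSIX threads. This is only a sketch of the pattern, not coprthr code: mailbox_t, the CMD_* codes, host_run, host_exit, and the doubling transform are all hypothetical names invented for the example, and on the co-processor the mutex and condition variable would be replaced by whatever signaling that API offers.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical command codes; not part of any real API. */
enum { CMD_IDLE, CMD_RUN, CMD_EXIT };

/* Hypothetical "mailbox" shared between host and worker. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int    cmd;    /* set by the host, cleared by the worker */
    int    done;   /* set by the worker when a transform finishes */
    float *data;
    size_t n;
} mailbox_t;

void mailbox_init(mailbox_t *mb)
{
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->cond, NULL);
    mb->cmd = CMD_IDLE;
    mb->done = 0;
    mb->data = NULL;
    mb->n = 0;
}

/* Placeholder for the real transform: doubles every element. */
static void transform(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* The persistent "kernel": wait for a signal, work, signal back, repeat. */
void *persistent_worker(void *arg)
{
    mailbox_t *mb = arg;

    pthread_mutex_lock(&mb->lock);
    for (;;) {
        while (mb->cmd == CMD_IDLE)           /* wait until signaled by the host */
            pthread_cond_wait(&mb->cond, &mb->lock);
        if (mb->cmd == CMD_EXIT)              /* host says we are done */
            break;
        transform(mb->data, mb->n);           /* do the work */
        mb->cmd = CMD_IDLE;
        mb->done = 1;
        pthread_cond_broadcast(&mb->cond);    /* signal the host */
    }
    pthread_mutex_unlock(&mb->lock);
    return NULL;
}

/* Host side: hand over a buffer and block until the worker has finished. */
void host_run(mailbox_t *mb, float *data, size_t n)
{
    pthread_mutex_lock(&mb->lock);
    mb->data = data;
    mb->n = n;
    mb->done = 0;
    mb->cmd = CMD_RUN;
    pthread_cond_broadcast(&mb->cond);
    while (!mb->done)
        pthread_cond_wait(&mb->cond, &mb->lock);
    pthread_mutex_unlock(&mb->lock);
}

/* Host side: tell the worker to leave its loop. */
void host_exit(mailbox_t *mb)
{
    pthread_mutex_lock(&mb->lock);
    mb->cmd = CMD_EXIT;
    pthread_cond_broadcast(&mb->cond);
    pthread_mutex_unlock(&mb->lock);
}
```

On the host you would call mailbox_init, start persistent_worker once with pthread_create, call host_run as many times as needed, and finish with host_exit followed by pthread_join. The thread is created only once, so the launch cost is not paid on every call.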