by dar » Wed Sep 10, 2014 2:52 pm
Kernel launch time was previously measured at 140 msec. This can be a problem for short kernels, but it's the cost of the machinery needed to get everything to work; the eloader requires a filesystem interaction, for example. In such a situation, the best solution is something you cannot do in OpenCL but can do with what COPRTHR provides: use persistent kernels. The concept is not new; it's how pthread applications often work. A full discussion goes beyond what I can post here immediately, but I can try to provide something concrete to go with it. If you look at one of the blog posts I wrote on the "new low-level coprthr API" you will see the pieces of the puzzle at least. Something to remember is that Parallella is not like a workstation with a graphics card. It's very different: there are some things it cannot do, but it has other very useful capabilities that a GPU does not.
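To make the persistent-kernel idea a bit more concrete, here is a rough host-side analogy in plain C++ threads, the way a pthread application might do it: start the worker once, then feed it work through a queue instead of paying the launch cost each time. This is only a sketch of the pattern; PersistentWorker, submit() and wait_idle() are made-up names, and nothing here is COPRTHR API.

```cpp
// Host-side analogy only: one long-lived worker fed through a queue.
// PersistentWorker is a hypothetical name, not part of COPRTHR.
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class PersistentWorker {
public:
    // Start the worker once; this is the only "launch" that is paid for.
    PersistentWorker() : thread_([this] { run(); }) {}

    ~PersistentWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        thread_.join();
    }

    // Hand a unit of work to the already-running worker.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_.notify_all();
    }

    // Block until every submitted job has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so submit() is never blocked by work
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> jobs_;
    int pending_ = 0;
    bool stop_ = false;
    std::thread thread_;  // declared last so the other members exist before it starts
};
```

The point of the design is that the expensive setup happens once in the constructor, and each submit() afterwards costs only a lock and a notify.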
I noticed you have to copy the image into an allocation obtained with clmalloc() from one that I presume is stored within some OpenCV object. Ultimately it would be better to use a clmalloc'd allocation of shareable device memory directly. The problem people run into is that, as in this case, they do not actually control the allocation. If you can provide OpenCV with an allocator, you can do what we do with std::vector<> and boost::multi_array: provide a clmalloc()-based allocator and then, like magic, the memory is shareable. (For an even faster design you could use UVA, which is not officially supported, but we had a switch for this in COPRTHR that effectively enabled a unified address space; for absolute speed, that would eliminate all offload copies.)
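As a rough illustration of the allocator approach, here is a minimal C++11-style allocator plugged into std::vector. To keep the sketch runnable anywhere, shared_alloc() and shared_free() are hypothetical stand-ins backed by malloc/free; in real code they are where the clmalloc()-based allocation and its matching free would go. SharedAllocator and SharedImage are also made-up names, and this is not the allocator shipped with COPRTHR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-ins for the shareable-memory calls (assumption:
// real code would obtain and release device-shareable memory here).
inline void* shared_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void shared_free(void* p) { std::free(p); }

// Minimal allocator: every container allocation goes through shared_alloc().
template <class T>
struct SharedAllocator {
    using value_type = T;

    SharedAllocator() = default;
    template <class U>
    SharedAllocator(const SharedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = shared_alloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { shared_free(p); }
};

// Stateless allocators always compare equal.
template <class T, class U>
bool operator==(const SharedAllocator<T>&, const SharedAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const SharedAllocator<T>&, const SharedAllocator<U>&) noexcept { return false; }

// A pixel buffer whose storage comes from the shareable pool.
using SharedImage = std::vector<unsigned char, SharedAllocator<unsigned char>>;
```

With something like this, a SharedImage buffer behaves as an ordinary std::vector on the host side, while its data() pointer refers to memory that came from the shareable pool, so no separate copy into a clmalloc'd buffer is needed.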
Small point: I noticed you commented out clwait() but kept clflush(). You probably want to reverse this. clflush() was used to help along older GPU SDKs that were not designed to execute kernels as soon as they were enqueued. The COPRTHR design for Epiphany will take up anything in the queue as soon as it shows up. However, since your calls use the NOWAIT flag, you must wait on completion before using the results.
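The same rule can be shown with a loose analogy in standard C++, with no stdcl calls involved: std::async plays the role of a non-blocking enqueue, and waiting on the future plays the role of clwait(). The names enqueue_sum() and sum_after_wait() are made up for this sketch.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a kernel enqueued without blocking:
// the work starts right away and the call returns immediately.
inline std::future<int> enqueue_sum(const std::vector<int>& data) {
    return std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0);
    });
}

inline int sum_after_wait(const std::vector<int>& data) {
    std::future<int> pending = enqueue_sum(data);  // no flush needed, already running
    // ... the host is free to do unrelated work here ...
    pending.wait();        // the step that plays the role of clwait()
    return pending.get();  // the result is only valid after completion
}
```

Nothing has to nudge the work into starting, but the result must not be read until the wait has returned.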
It's useful that your code is posted on GitHub. Let me take a look and see if any immediate suggestions come to mind as a start.
-DAR