Sobel is ~35 times slower using OpenCL

Moderator: dar

Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 03, 2014 5:39 pm

Hi, I am experimenting with some image processing using COPRTHR. I have written a program that reads a 640x480 image from a webcam (OpenCV), then sends the image to the epiphany to perform Sobel edge detection. It works, but it is extremely slow. It is taking about 5.7 seconds to process a single frame from the camera. This is shocking because if I run the same Sobel algorithm in series on the CPU, it only takes about 0.16 seconds to process each frame. I assume I'm doing something wrong, but what could possibly cause it to be so slow?

Thanks

Program: https://github.com/sclukey/parallella-opencv-coprthr
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Wed Sep 03, 2014 6:55 pm

Thanks for sending the git repo for review. Great to see!

Some comments:
- Doubles are emulated in software and will pull in code from external DRAM (100x slower than from local memory).
- Modulo and division operations can be slow (same reason as above).
- The floating-point compare at the bottom is going to be slow (same as above).
- Integer multiplication can be slow due to a mode switch (not the case here).
- Any signed integer work < 32 bits tends to be slow due to lack of sign extension on loads (not the case here, but beware...).

Your kernel code example is below. I made some hacks to the code just to get it to compile with e-gcc and look at it with e-objdump. Perhaps others can comment on whether there is an easier way to do this without hacking up the code?

Code: Select all
#include <stdlib.h>  /* abs() */

void main(
   unsigned int n,
   unsigned int line,
   unsigned int k,
   char* aa,
   char* bb
   )
{
   int i, m;

   int n16 = n / 16;
   int m16 = n % 16;

   /* spread the remainder over the first m16 cores: ifirst = k*n16 + min(k, m16) */
   int ifirst = k*n16 + ((k > m16) ? m16 : k);
   int iend   = ifirst + n16 + ((k < m16) ? 1 : 0);

   double tmp;  /* the double here pulls in emulated floating point */

   for (i = ifirst; i < iend; i++) {

      m = i % line;

      if (m == 0 || m == line-1 || i < line || n-i < line)
         bb[i] = 0;
      else {
         tmp = abs(-     aa[i-line-1]
                   - 2 * aa[i-1]
                   -     aa[i+line-1]
                   +     aa[i-line+1]
                   + 2 * aa[i+1]
                   +     aa[i+line+1])
             + abs(      aa[i-line-1]
                   + 2 * aa[i-line]
                   +     aa[i-line+1]
                   -     aa[i+line-1]
                   - 2 * aa[i+line]
                   -     aa[i+line+1]);
         bb[i] = tmp > 255 ? 255 : tmp;
      }
   }
}
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 03, 2014 9:22 pm

Thanks for looking at the code. I changed the `tmp` variable to an int, and that did help quite a bit. Also, I removed the if statement and the modulus that went along with it, so that helped as well. Now I am getting ~0.78 seconds per frame. This is much better, but it is still about 5 times slower than the basic sequential program, which is not good. Is there anything else I should do? Is the OpenCL implementation just slow and I should switch to the raw Epiphany SDK or something else?
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Wed Sep 03, 2014 11:03 pm

Where are your vectors placed? There is no caching, so to get good performance the data must be explicitly brought into/placed in local memory.

Here is one example showing the concept.

http://www.adapteva.com/white-papers/ef ... ng-opencl/
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby toralf » Thu Sep 04, 2014 3:08 pm

aolofsson wrote: Modulo and division operations can be slow (same reason as above).
Wouldn't that make the Parallella a poor candidate for crypto tasks?
toralf
 
Posts: 8
Joined: Thu Nov 07, 2013 3:41 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Thu Sep 04, 2014 4:20 pm

I think the answer is more complicated, as no application is defined by a single operation. Here is an interesting discussion initiated by solardize (who knows what he is talking about :-)).

viewtopic.php?f=8&t=256&hilit=alexander

Also, check out the Epiphany benchmark by openwall on 'bcrypt':

http://www.openwall.com/presentations/P ... Slides.pdf
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Thu Sep 04, 2014 5:36 pm

So here's the experiment I did: I commented out the entire body of the kernel, which dropped the time to ~0.38 seconds per frame. I assume this means there is ~0.38 seconds per frame of overhead (memory movement and so on) and ~0.4 seconds per frame of computation. Does this sound correct? If so, then even ignoring the memory situation it is still slower than the sequential program.

As for where the memory is placed: what I tried to do is place the entire frame in a single buffer in the device shared memory, so the frame is transferred once and each Epiphany core can just read the part it needs (there is no core-to-core communication needed). That was my intent, but perhaps I misunderstand the architecture.
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby dar » Wed Sep 10, 2014 2:52 pm

Kernel launch time was previously measured at 140 msec. This can be a problem for short kernels, but it's the cost of the machinery needed to get everything to work; the eloader requires a filesystem interaction, for example. In such a situation, the best solution is to do something you cannot do in OpenCL but can do with what COPRTHR provides: use persistent kernels. The concept is not new; it's how pthread applications often work. A full discussion goes beyond what I can post here immediately, but I can try to provide something concrete to work from. If you look at one of the blog posts I wrote on the "new low-level coprthr API" you will see the pieces of the puzzle at least. Something to remember is that Parallella is not like a workstation with a graphics card; it's very different, cannot do some things a GPU can, but has other very useful capabilities that a GPU cannot offer.

I noticed you must make a copy of the image into an allocation obtained with clmalloc() from one that I presume is stored within some OpenCV object. Ultimately it would be better to just use a clmalloc'd allocation of shareable device memory directly. The problem people run into, as in this case, is that they do not actually control the allocation. If you can provide OpenCV with an allocator, you can do what we do with std::vector<> and boost::multi_array: provide a clmalloc()-based allocator, and then, like magic, the memory is shareable. (For an even faster design you could use UVA, which is not officially supported, but we had a switch for this in COPRTHR that effectively enabled a unified address space; for absolute speed that would eliminate all offload copies.)

A small point: I noticed you commented out clwait() but kept clflush(); you probably want to reverse this. clflush() was used to help along older GPU SDKs that were not designed to execute kernels as soon as they were enqueued, whereas the COPRTHR design for Epiphany will take up anything in the queue as soon as it shows up. However, since your calls use the NOWAIT flag, you must wait on completion before using the results.

It's useful that your code is posted on GitHub; let me try to take a look and see if any immediate suggestions come to mind as a start.

-DAR
dar
 
Posts: 90
Joined: Mon Dec 17, 2012 3:26 am

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 10, 2014 4:46 pm

DAR,

Making the kernel persistent does seem like a good idea, but a concrete example would help. I read through the blog post, but I'm not sure which part would make the kernel persistent or really how to go about doing this.

I have changed the code to use the shared memory as much as possible, so the grayscale image and the output image are placed directly in shared memory without needing to copy. I also swapped which of clflush and clwait is commented out. I didn't notice any difference in behavior with or without either of these calls, but I suppose clwait is a good one to use anyway.

With this the time is down to ~0.67 seconds per frame, so it did help a little.

Thank you for taking the time to look into this
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby sebraa » Thu Sep 11, 2014 1:45 pm

Keep in mind that multiple cores fighting over the shared memory drags performance down enormously. In one instance, enabling 15 additional cores (executing code from shared memory, with local data) was the difference between "one iteration per second" and "no iterations yet after 20 minutes" for the algorithm I use.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm
