Sobel is ~35 times slower using OpenCL

Moderator: dar

Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 03, 2014 5:39 pm

Hi, I am experimenting with some image processing using COPRTHR. I have written a program that reads a 640x480 image from a webcam (OpenCV), then sends the image to the epiphany to perform Sobel edge detection. It works, but it is extremely slow. It is taking about 5.7 seconds to process a single frame from the camera. This is shocking because if I run the same Sobel algorithm in series on the CPU, it only takes about 0.16 seconds to process each frame. I assume I'm doing something wrong, but what could possibly cause it to be so slow?

Thanks

Program: https://github.com/sclukey/parallella-opencv-coprthr
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Wed Sep 03, 2014 6:55 pm

Thanks for sending the git repo for review. Great to see!

Some comments:
- Doubles are emulated in software and will pull in code from external DRAM (100x slower than from local memory).
- Modulo and division operations can be slow (same reason as above).
- The floating-point compare at the bottom is going to be slow (same as above).
- Integer multiplication can be slow due to a mode switch (not the case here).
- Any signed integer work < 32 bits tends to be slow due to lack of sign extension on loads (not the case here, but beware...).

Your kernel code example is below. I made some hacks to the code just to get it to compile with e-gcc and look at it with e-objdump. Perhaps others can comment on whether there is an easier way to do this without hacking up the code?

Code: Select all
#include <stdlib.h>  /* abs() */

void main(
   unsigned int n,
   unsigned int line,
   unsigned int k,
   char* aa,
   char* bb
   )
{
   int i, m;

   int n16 = n / 16;
   int m16 = n % 16;

   /* spread the remainder over the first m16 cores: ifirst = k*n16 + min(k, m16) */
   int ifirst = k*n16 + ((k > m16) ? m16 : k);
   int iend   = ifirst + n16 + ((k < m16) ? 1 : 0);

   double tmp;  /* the double here pulls in emulated floating point */

   for (i = ifirst; i < iend; i++) {

      m = i % line;

      if (m == 0 || m == line-1 || i < line || n-i < line)
         bb[i] = 0;
      else {
         tmp = abs(-     aa[i-line-1]
                   - 2 * aa[i-1]
                   -     aa[i+line-1]
                   +     aa[i-line+1]
                   + 2 * aa[i+1]
                   +     aa[i+line+1])
             + abs(      aa[i-line-1]
                   + 2 * aa[i-line]
                   +     aa[i-line+1]
                   -     aa[i+line-1]
                   - 2 * aa[i+line]
                   -     aa[i+line+1]);
         bb[i] = tmp > 255 ? 255 : tmp;
      }
   }
}
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 03, 2014 9:22 pm

Thanks for looking at the code. I changed the `tmp` variable to an int, and that did help quite a bit. Also, I removed the if statement and the modulus that went along with it, so that helped as well. Now I am getting ~0.78 seconds per frame. This is much better, but it is still about 5 times slower than the basic sequential program, which is not good. Is there anything else I should do? Is the OpenCL implementation just slow and I should switch to the raw Epiphany SDK or something else?
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Wed Sep 03, 2014 11:03 pm

Where are your vectors placed? There is no caching, so to get good performance the data must be explicitly brought into/placed in local memory.

Here is one example showing the concept.

http://www.adapteva.com/white-papers/ef ... ng-opencl/
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby toralf » Thu Sep 04, 2014 3:08 pm

aolofsson wrote: Modulo and division operations can be slow (same reason as above).
Wouldn't that make the Parallella a poor candidate for crypto tasks?
toralf
 
Posts: 8
Joined: Thu Nov 07, 2013 3:41 pm

Re: Sobel is ~35 times slower using OpenCL

Postby aolofsson » Thu Sep 04, 2014 4:20 pm

I think the answer is more complicated, as no application is defined by a single operation. Here is an interesting discussion initiated by solardize (who knows what he is talking about :-)).

viewtopic.php?f=8&t=256&hilit=alexander

Also, check out the Epiphany benchmark by openwall on 'bcrypt':

http://www.openwall.com/presentations/P ... Slides.pdf
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Thu Sep 04, 2014 5:36 pm

So here's the experiment I did: I commented out the entire body of the kernel, which dropped the time to ~0.38 seconds per frame. I assume this means there is ~0.38 seconds per frame of overhead (memory movement and so on) and ~0.4 seconds per frame of computation. Does this sound correct? If so, then even ignoring the memory situation it is still slower than the sequential program.

As for where the memory is placed: what I tried to do is place the entire frame in a single buffer in the device shared memory, so the frame is transferred once and each Epiphany core can just read the part it needs (there is no core-to-core communication needed). That was my intent, but perhaps I misunderstand the architecture.
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby dar » Wed Sep 10, 2014 2:52 pm

Kernel launch time was previously measured at 140 msec. This can be a problem for short kernels, but it's the cost of the machinery needed to get everything to work; the eloader requires a filesystem interaction, for example. In such a situation, the best solution is to do something you cannot do in OpenCL but can do with what COPRTHR provides: use persistent kernels. The concept is not new; it's how pthread applications often work. A full discussion goes beyond what I can post here immediately, but I can try to provide something concrete to work from. If you look at one of the blog posts I wrote on the "new low-level coprthr API" you will see the pieces of the puzzle at least. Something to remember is that Parallella is not like a workstation with a graphics card; it's very different, cannot do some things a GPU can, but has other very useful capabilities that a GPU cannot offer.

I noticed you must make a copy of the image into an allocation obtained with clmalloc() from one that I presume is stored within some OpenCV object. Ultimately it would be better to just use a clmalloc'd allocation of shareable device memory directly. The problem people run into, as in this case, is that they do not actually control the allocation. If you can provide OpenCV with an allocator, you can do what we do with std::vector<> and boost::multi_array: provide a clmalloc()-based allocator, and then, like magic, the memory is shareable. (For an even faster design you could use UVA, which is not officially supported, but we had a switch for this in COPRTHR that effectively enabled a unified address space; for absolute speed that would eliminate all offload copies.)

A small point: I noticed you commented out clwait() but kept clflush(); you probably want to reverse this. clflush() was used to help along older GPU SDKs that were not designed to execute kernels as soon as they were enqueued, whereas the COPRTHR design for Epiphany will take up anything in the queue as soon as it shows up. However, since your calls use the NOWAIT flag, you must wait on completion before using the results.

It's useful that your code is posted on GitHub; let me try to take a look and see if any immediate suggestions come to mind as a start.

-DAR
dar
 
Posts: 90
Joined: Mon Dec 17, 2012 3:26 am

Re: Sobel is ~35 times slower using OpenCL

Postby stevenc » Wed Sep 10, 2014 4:46 pm

DAR,

Making the kernel persistent does seem like a good idea, but a concrete example would help. I read through the blog post, but I'm not sure which part would make the kernel persistent or really how to go about doing this.

I have changed the code to use the shared memory as much as possible, so the grayscale image and the output image are placed directly in shared memory without needing to copy. I also swapped which of clflush and clwait is commented out. I didn't notice any difference in behavior with or without either of these calls, but I suppose clwait is a good one to use anyway.

With this the time is down to ~0.67 seconds per frame, so it did help a little.

Thank you for taking the time to look into this
stevenc
 
Posts: 13
Joined: Fri Aug 29, 2014 4:44 pm

Re: Sobel is ~35 times slower using OpenCL

Postby sebraa » Thu Sep 11, 2014 1:45 pm

Keep in mind that multiple cores fighting over the shared memory drags performance down enormously. In one instance, enabling 15 additional cores (executing code from shared memory, with local data) was the difference between "one iteration per second" and "no iterations yet after 20 minutes" for the algorithm I use.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm
