Parallella Community

by **dar** » Fri Sep 12, 2014 1:27 am

Is this variation of the code posted to github on a test branch perhaps? I would like to take a look. Such a slowdown should not happen.

by **stevenc** » Fri Sep 12, 2014 2:52 pm

@sebraa: Even if the data is only being read? Multiple cores will read the same location, but each output location is only written to once by a single core. I will try to make a local data version though, it is definitely worth a shot, but it will require quite a bit of manual memory movement.

@dar: Yes, I've just been committing to master since all the changes have helped, but perhaps I should have used a test branch to avoid confusion. This is the latest host code that I was talking about: https://github.com/sclukey/parallella-o ... llel.c#L66

It uses the shared memory as the buffers for the OpenCV IplImage's data so OpenCV will use the shared memory directly.

The version I started this thread with was this: https://github.com/sclukey/parallella-o ... 148f196f1b

by **sebraa** » Mon Sep 15, 2014 12:16 pm

by **dar** » Mon Sep 15, 2014 11:04 pm

by **stevenc** » Wed Sep 17, 2014 6:48 pm

@dar and @sebraa

I have made a new version of the kernel that uses local memory for the data. It follows dar's pseudo-code except that the bb array is global to the tiles, though local to the kernel. Whether bb was local or global to the tiles made no difference to the speed. These changes have dropped the time to ~0.42 seconds per frame, which is getting closer to the serial speed: it's only about 3.5 times slower now, and about 14 times faster than where I started.
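For readers following along, the local-memory idea can be sketched in plain C. This is not the kernel from the repository: the sizes `W`, `H`, `BAND` and the functions `sobel_px` and `sobel_band` are made up for illustration, and the stack array `aa` merely stands in for core-local memory. Each band of rows, plus one halo row on either side, is copied into `aa` once, so the stencil's repeated reads hit the local copy, while results are written straight to the global `bb`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes, for illustration only. */
enum { W = 8, H = 8, BAND = 4 };   /* image width, height, rows per band */

/* Sobel magnitude |gx| + |gy| (clamped to 255) at column x, given
   pointers to the row above, the current row, and the row below. */
static unsigned char sobel_px(const unsigned char *up,
                              const unsigned char *mid,
                              const unsigned char *dn, int x)
{
    int gx = (up[x + 1] + 2 * mid[x + 1] + dn[x + 1])
           - (up[x - 1] + 2 * mid[x - 1] + dn[x - 1]);
    int gy = (dn[x - 1] + 2 * dn[x] + dn[x + 1])
           - (up[x - 1] + 2 * up[x] + up[x + 1]);
    int m = abs(gx) + abs(gy);
    return (unsigned char)(m > 255 ? 255 : m);
}

/* Process image rows [row0, row0 + BAND) of src into the global bb. */
static void sobel_band(const unsigned char *src, unsigned char *bb, int row0)
{
    unsigned char aa[(BAND + 2) * W];   /* stand-in for core-local memory */

    /* Copy the band plus one halo row above and below, clamping the
       row index at the top and bottom edges of the image. */
    for (int r = 0; r < BAND + 2; r++) {
        int y = row0 - 1 + r;
        if (y < 0) y = 0;
        if (y > H - 1) y = H - 1;
        memcpy(aa + r * W, src + y * W, W);
    }

    /* All stencil reads now come from aa; each bb element is written once. */
    for (int r = 0; r < BAND; r++) {
        int y = row0 + r;
        const unsigned char *up  = aa + r * W;
        const unsigned char *mid = up + W;
        const unsigned char *dn  = mid + W;
        bb[y * W] = bb[y * W + W - 1] = 0;   /* border columns set to zero */
        for (int x = 1; x < W - 1; x++)
            bb[y * W + x] = sobel_px(up, mid, dn, x);
    }
}
```

Calling `sobel_band(src, bb, row0)` for `row0 = 0, BAND, 2*BAND, ...` covers the whole image; on real hardware each band would go to a different core.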

Better yet, if I continue to assume there is about 0.38 seconds of overhead (see post on 9-4), then the kernel computation only takes around 0.04 seconds per frame, which is very good. It seems the problem now is just the kernel overhead. Perhaps the persistent kernel idea would help with this?

This version of the kernel:

by **dar** » Thu Sep 18, 2014 10:42 am

The issue with the bb array is likely that your writes are being done with a simple loop, whereas caching aa gave an optimization because its re-use in the stencil calculation avoids costly reads. Caching both aa and bb would have the most impact if using a DMA read/write, for which we had extensions.

As for making the kernel persistent, I will look into it. The issue is that you might not have access to the necessary API via OpenCL or STDCL at the moment, only through the "low-level" coprthr API, which provides pthreads extended to co-processors. Basically, the best way to do it is to use the calls provided by that API. In the end, the distinctions here are not so dramatic, but more practical. So let me look before making any further recommendation. You could introduce a hack like some of the "Ping-Pong" code you might have seen for Epiphany where mailboxes are used, etc. The idea is that you want the kernel to wait until signaled by the host; then it does the transform, signals the host, and goes back to waiting. And of course you need a way to tell it you are done and it should exit. It's basically pthread programming. This would give you a persistent kernel.
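The wait / transform / signal / exit loop described above can be sketched with ordinary host-side pthreads. This only illustrates the pattern and is not coprthr or Epiphany code: the `mailbox` struct, the command names, and the placeholder `transform()` are all hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Commands the host can post to the worker (hypothetical names). */
enum cmd { CMD_IDLE, CMD_RUN, CMD_EXIT };

struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  wake;    /* signalled whenever cmd or done changes */
    enum cmd        cmd;     /* written by the host, cleared by the worker */
    int             done;    /* set by the worker when a frame is finished */
    int             frames;  /* number of frames processed so far */
};

/* Placeholder for the real per-frame work (e.g. the Sobel transform). */
static void transform(struct mailbox *mb) { mb->frames++; }

/* Persistent worker: started once, then loops until told to exit. */
static void *worker(void *arg)
{
    struct mailbox *mb = arg;
    pthread_mutex_lock(&mb->lock);
    for (;;) {
        while (mb->cmd == CMD_IDLE)            /* wait for the host */
            pthread_cond_wait(&mb->wake, &mb->lock);
        if (mb->cmd == CMD_EXIT)
            break;
        transform(mb);                         /* do one frame */
        mb->cmd = CMD_IDLE;
        mb->done = 1;                          /* signal the host */
        pthread_cond_broadcast(&mb->wake);
    }
    pthread_mutex_unlock(&mb->lock);
    return NULL;
}

/* Host side: post one frame and block until the worker reports done. */
static void host_run_frame(struct mailbox *mb)
{
    pthread_mutex_lock(&mb->lock);
    mb->done = 0;
    mb->cmd = CMD_RUN;
    pthread_cond_broadcast(&mb->wake);
    while (!mb->done)
        pthread_cond_wait(&mb->wake, &mb->lock);
    pthread_mutex_unlock(&mb->lock);
}

/* Host side: tell the worker to exit and wait for it to finish. */
static void host_stop(struct mailbox *mb, pthread_t tid)
{
    pthread_mutex_lock(&mb->lock);
    mb->cmd = CMD_EXIT;
    pthread_cond_broadcast(&mb->wake);
    pthread_mutex_unlock(&mb->lock);
    pthread_join(tid, NULL);
}
```

The host would create the worker once with `pthread_create`, call `host_run_frame` for every frame, and call `host_stop` at shutdown, so the start-up cost is paid a single time rather than per frame.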

by **bytefx** » Thu Sep 25, 2014 3:26 pm

Hi,

I've been following this thread with much interest.

I tried compiling your code at .../local_memory on GitHub, and there seems to be a problem locating the stdcl.h include and lib. I downloaded, compiled, and installed the COPRTHR SDK.

Your code complains about missing packages:
a. coprthr
b. opencv

I would like to benchmark and study the current framework; it sounds efficient for a Sobel edge detector. It would be good to compare a Canny edge algorithm vs. plain vanilla gradient detection. From experience, generating the Gaussian is computationally expensive in Canny edge detection, and I was wondering what other optimisations exist.

Please can you add a README file on how to compile and run your code, either on GitHub or in this thread? My apologies for the distraction.

by **stevenc** » Fri Sep 26, 2014 4:23 pm

So I have not been able to figure out how to do persistent threads with OpenCL, and without them I feel I've maxed out the speed of this kernel. So, I wrote the program using the eSDK (with the Kinect example as a base). Finally, I can achieve ~0.10 seconds per frame. This is slightly faster than the serial version! It seems the algorithm is extremely limited by having to read from 3 different lines of the image to compute each pixel. The memory movement is definitely the slow point.

eSDK version:

@bytefx: I am using pkg-config to locate those two packages. First of all, OpenCV needs to be installed (I built and installed it from source, though the apt-get version *may* work too). The COPRTHR library is installed by default but its pkg-config file is not. The two files I am using for pkg-config are here:

and they need to be put into `/usr/local/lib/pkgconfig/`. You may also have to adjust the files depending on where your libraries are installed.

by **stevenc** » Fri Sep 26, 2014 4:50 pm

@bytefx: I added a README, and the apt-get version of OpenCV works ok, so that makes it a bit easier.

Thanks for all the help everyone. If there is any other way to make the OpenCL version faster I would love to hear it, the eSDK is quite a bit more tedious to work with, so I would rather stay at a higher level if possible.

by **dar** » Fri Sep 26, 2014 5:19 pm

The easiest way to get persistent threads working is with the coprthr API that provides pthreads for co-processors. From OpenCL you might be able to use some of the calls, but it's going to be non-standard anyway. I thought about this, and the issue will be allocating a shared mutex, which really goes outside of OpenCL. I have been working with the coprthr API just this morning. If I can, I will try to follow up with some guidance and sample code to show how to do it. The level will be that of pthreads, which is really not different from OpenCL. And for portability, you will essentially have a pthreads implementation, I think.


Sobel is ~35 times slower using OpenCL
