Very slow OpenCL execution time (with example code).

Moderator: dar

Very slow OpenCL execution time (with example code).

Postby CyberHiPriest » Tue Apr 26, 2016 2:47 pm

Hi there,

I've been playing a bit with this awesome board and wanting to do some raymarching gfx with the 16 cores.

However, I hit a bit of a wall when it comes to OpenCL performance on the Parallella board.
It seems that using clforka() in a loop is very resource-intensive as a simple example runs 100x slower on the Epiphany chip than it does on the dualcore ARM.

I'm probably doing something stupid, but can't really tell what and how to fix it.

I made a simple program to showcase the delays in clforka(), so people smarter than me can take a look at it.
raymarch.tar.gz
raymarching example code
(2.56 KiB) Downloaded 591 times


You can compile & run it on a Linux x86_64 host, Parallella ARM and Epiphany by giving the program different parameters:
Code: Select all
(for the Epiphany)
# ./raymarch -a -nthreads 16
(for the ARM CPUs)
# ./raymarch -c -nthreads 2
(on your host core-i7 or something like that)
# ./raymarch -c -nthreads 8
(currently it does not run out-of-the-box on nvida/amd/intel GPU's, but has support for it)
# ./raymarch -g -nthreads 256


I used the browndeer's coprthr/opencl wrapper at the default location (/usr/local/browndeer) for both the parallella and on my x86_64 PC.
The only extra dependency is libSDL1.2-dev.


My idea is that clforka() does the transfer of RTE + ocl_kernel to all the cores and sets up & starts all the cores behind the scenes. This is not needed to be done every loop in my example, I just want to feed the OpenCL kernel updated parameters.
Server park:
Parallella Z7010 w/ E16G03 (microserver) with Linaro Ubuntu 15.04
PowerPC405 board with custom Linux 2.6.35
PowerPC e300c3 board with custom Linux 2.6.38
Marvell 88F6192 board with Arch Linux
MIPS64r3 board with custom debian Linux 2.2.27
User avatar
CyberHiPriest
 
Posts: 4
Joined: Tue Apr 26, 2016 2:12 pm

Re: Very slow OpenCL execution time (with example code).

Postby jar » Tue Apr 26, 2016 8:46 pm

The OpenCL startup time is problematic because the eSDK serial loader is slow. This is being addressed in COPRTHR 2 (https://arxiv.org/abs/1604.04207). It's going to get much better soon.

Alternatively, you need to you something called "persistent threads" if your clforka() call is inside a loop. Basically, move the loop inside the kernel device code instead of the host code. I have not looked at your code.
User avatar
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Very slow OpenCL execution time (with example code).

Postby CyberHiPriest » Wed Apr 27, 2016 3:05 pm

After reading the other topic on the forum (Sobel 35x slower)
viewtopic.php?f=18&t=1703

I started to investigate the assembly and saw some calls to generic float/double code that was probably executed in the shared DRAM.

'aolofsson' wasn't kidding when he said that such code is 100x slower. More like 200x slower in my example.

Anyway, after adding some 'superfast' routines (sfdiv, sfsin, ...) and making sure all constants are floats, my frame-time went down from 9000(!) milliseconds to an insane 35 milliseconds! Very cool!! :D

Here's the adjusted raymarch_kern.cl code for those who want to see the difference too.

Code: Select all
#ifdef E32_DRAM_ZEROPAGE
int e_dma_copy(void *, void *, unsigned long);
#else
#include <string.h>
#define e_dma_copy(x,y,z) memcpy(x,y,z)
#endif

#define PI   3.141592653589793f
#define PI_2 1.57079632679489661923f
#define PI_4 0.78539816339744830962f

float sfinv(float a)
{
   union fu32_u {
      float f;
      uint32_t u;
   };
   union fu32_u x;
   x.f = a;
   x.u = 0x7eeeeeee - x.u;
   x.f = x.f * (2.0f - a * x.f);
   return x.f;
}

float sfdiv(float a, float b)
{
   return a * sfinv(b);
}

float sffloor(float x)
{
   return (float)((int)x);
}

float sfmod(float x, float y)
{
   return x - y * sffloor(x / y);
}

float sfsin(const float a)
{
   float val = 1.0f;
   float theta = sfmod(a, 2.0f * PI);
   val = 1.0f - theta * theta * 0.083333333f * 0.076923077f * val;
   val = 1.0f - theta * theta * 0.1f * 0.090909091f * val;
   val = 1.0f - theta * theta * 0.125f * 0.111111111f * val;
   val = 1.0f - theta * theta * 0.166666667f * 0.142857143f * val;
   val = 1.0f - theta * theta * 0.25f * 0.2f * val;
   val = 1.0f - theta * theta * 0.5f * 0.333333333f * val;
   return theta * val;
}

float sfcos(const float a)
{
   float val = 1;
   float theta = sfmod(a, 2.0f * PI);
   val = 1.0f - theta * theta * 0.083333333f * 0.090909090f * val;
   val = 1.0f - theta * theta * 0.10000000f * 0.11111111f * val;
   val = 1.0f - theta * theta * 0.12500000f * 0.14285714f * val;
   val = 1.0f - theta * theta * 0.16666667f * 0.20000000f * val;
   val = 1.0f - theta * theta * 0.25000000f * 0.33333333f * val;
   val = 1.0f - theta * theta * 0.50000000f * 1.00000000f * val;
   return val;
}

__kernel void raymarch_kern(float time, uint pitch, uint xres, uint yres, __global uint * frame)
{
   uint line[xres];
   int i, j;
   float2 fres;
   fres.x = (float)xres;
   fres.y = (float)yres;
   int y = get_global_id(0);
   uint *dst = frame + ((yres - y) * xres);
   float2 c = (float2) { 0, y };
   float4 color;
   uint fcolor;
   for (i = 0; i < xres; i++) {
      c.x = (float)i;
      color = (float4) { sfdiv(c.x, fres.x),
               sfdiv(c.y, fres.y),
              0.5f + 0.5f * sfsin(time),
              1.f};
      fcolor = 0xff000000 +
         ((uint) (color.x * 255.0f) << 16) +
         ((uint) (color.y * 255.0f) << 8) +
         ((uint) (color.z * 255.0f));
      line[i] = fcolor;
   }
   e_dma_copy(dst, line, pitch);
}
Server park:
Parallella Z7010 w/ E16G03 (microserver) with Linaro Ubuntu 15.04
PowerPC405 board with custom Linux 2.6.35
PowerPC e300c3 board with custom Linux 2.6.38
Marvell 88F6192 board with Arch Linux
MIPS64r3 board with custom debian Linux 2.2.27
User avatar
CyberHiPriest
 
Posts: 4
Joined: Tue Apr 26, 2016 2:12 pm


Return to OpenCL

Who is online

Users browsing this forum: No registered users and 1 guest

cron