Best size of workgroup

Discussion about Parallella (and Epiphany) Software Development

Post by bird12358 » Sun Nov 01, 2015 7:33 pm

Hello,

I would like more information about choosing the workgroup size.

I have found many examples, and they use many different workgroup sizes.

How can I determine the best size for my workgroup?

As an example, here is a kernel from a Lucas-Kanade motion filter. I am showing only part of the algorithm:
Code: Select all
#define FRAD 4
#define eps 0.0000001f
#define LOCAL_X 16
#define LOCAL_Y 8

__kernel void lkflow(
    __read_only image2d_t I,
    __read_only image2d_t Ix,
    __read_only image2d_t Iy,
    __read_only image2d_t G,
    __read_only image2d_t J_float,
    __global float2 *guess_in,
    int guess_in_w,
    __global float2 *guess_out,
    int guess_out_w,
    int guess_out_h,
    int use_guess )
{
   // declare some shared memory
   __local int smem[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;
   __local int smemIy[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;
   __local int smemIx[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;

   // Create sampler objects.  One is for nearest neighbour, the other for
   // bilinear interpolation
    sampler_t bilinSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_LINEAR ;
    sampler_t nnSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST ;

   // Image indices. Note for the texture, we offset by 0.5 to use the centre
   // of the texel.
    int2 iIidx = { get_global_id(0), get_global_id(1)};
    float2 Iidx = { get_global_id(0)+0.5f, get_global_id(1)+0.5f };

   // load some data into local memory because it will be re-used frequently
   // load upper left region of smem
   int2 tIdx = { get_local_id(0), get_local_id(1) };
   smem[ tIdx.y ][ tIdx.x ] = read_imageui( I, nnSampler, Iidx+(float2)(-FRAD,-FRAD) ).x;
   smemIy[ tIdx.y ][ tIdx.x ] = read_imageui( Iy, nnSampler, Iidx+(float2)(-FRAD,-FRAD) ).x;
   smemIx[ tIdx.y ][ tIdx.x ] = read_imageui( Ix, nnSampler, Iidx+(float2)(-FRAD,-FRAD) ).x;

   // upper right
   if( tIdx.x < 2*FRAD ) {
      smem[ tIdx.y ][ tIdx.x + LOCAL_X ] = read_imageui( I, nnSampler, Iidx+(float2)(LOCAL_X - FRAD,-FRAD) ).x;
      smemIy[ tIdx.y ][ tIdx.x + LOCAL_X ] = read_imageui( Iy, nnSampler, Iidx+(float2)(LOCAL_X - FRAD,-FRAD) ).x;
      smemIx[ tIdx.y ][ tIdx.x + LOCAL_X ] = read_imageui( Ix, nnSampler, Iidx+(float2)(LOCAL_X - FRAD,-FRAD) ).x;

   }
   // lower left
   if( tIdx.y < 2*FRAD ) {
      smem[ tIdx.y + LOCAL_Y ][ tIdx.x ] = read_imageui( I, nnSampler, Iidx+(float2)(-FRAD, LOCAL_Y-FRAD) ).x;
      smemIy[ tIdx.y + LOCAL_Y ][ tIdx.x ] = read_imageui( Iy, nnSampler, Iidx+(float2)(-FRAD, LOCAL_Y-FRAD) ).x;
      smemIx[ tIdx.y + LOCAL_Y ][ tIdx.x ] = read_imageui( Ix, nnSampler, Iidx+(float2)(-FRAD, LOCAL_Y-FRAD) ).x;

   }
   // lower right
   if( tIdx.x < 2*FRAD && tIdx.y < 2*FRAD ) {
      smem[ tIdx.y + LOCAL_Y ][ tIdx.x + LOCAL_X ] = read_imageui( I, nnSampler, Iidx+(float2)(LOCAL_X - FRAD, LOCAL_Y - FRAD) ).x;
      smemIy[ tIdx.y + LOCAL_Y ][ tIdx.x + LOCAL_X ] = read_imageui( Iy, nnSampler, Iidx+(float2)(LOCAL_X - FRAD, LOCAL_Y - FRAD) ).x;
      smemIx[ tIdx.y + LOCAL_Y ][ tIdx.x + LOCAL_X ] = read_imageui( Ix, nnSampler, Iidx+(float2)(LOCAL_X - FRAD, LOCAL_Y - FRAD) ).x;

   }
   barrier(CLK_LOCAL_MEM_FENCE);


LOCAL_X and LOCAL_Y are the dimensions of the workgroup, and FRAD accounts for the size of the filter kernel applied to the images.

Are the values of LOCAL_X and LOCAL_Y really important? Can I change them to 32 and 16?
As I understand it, local memory is used to reduce the cost of data accesses in the work items. If there are too many workgroups, the work items lose too much time copying data from global memory.
So if the workgroup were as large as the image, would less time be lost?

Can someone help me choose these sizes?

Best regards.
bird12358
 

Re: Best size of workgroup

Post by jar » Mon Nov 02, 2015 8:49 pm

On the Epiphany, more than one work item per work group is inefficient. The OpenCL semantics of multiple work items per work group are made for GPU-like architectures, not Epiphany.

You want to use a work group size of 1x16, 2x8, 4x4, 8x2, or 16x1 (i.e. exactly 16 threads). Use an additional loop within your kernel so that each work item processes more than one piece of the data ("persistent threads" is the usual term for this).
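
To make the loop idea concrete, here is a minimal sketch in plain host-side C, not actual kernel code. The names run_worker and process_tile and the tile counts are hypothetical, chosen only for illustration. In a real kernel the worker index would presumably come from get_global_id(0) and the worker count from get_global_size(0).

```c
#include <assert.h>

/* Hypothetical stand-in for the real per-tile work, e.g. filtering one
   LOCAL_X x LOCAL_Y block of the image. Here it only records which
   worker handled the tile. */
static void process_tile(int tile, int worker, int *owner)
{
    owner[tile] = worker;
}

/* Persistent-thread style loop: the worker starts at its own index and
   strides by the number of workers until every tile has been visited.
   Returns how many tiles this worker processed. */
static int run_worker(int worker, int num_workers, int num_tiles, int *owner)
{
    assert(num_workers > 0);
    assert(worker >= 0 && worker < num_workers);

    int done = 0;
    for (int tile = worker; tile < num_tiles; tile += num_workers) {
        process_tile(tile, worker, owner);
        ++done;
    }
    return done;
}
```

Calling run_worker once for each of 16 workers covers every tile exactly once, so the number of tiles can grow with the image while the number of work items stays fixed at 16.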

Code: Select all
#define FRAD 4
#define eps 0.0000001f
#define LOCAL_X 16
#define LOCAL_Y 8
   __local int smem[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;
   __local int smemIy[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;
   __local int smemIx[2*FRAD + LOCAL_Y][2*FRAD + LOCAL_X] ;


With FRAD 4, LOCAL_X 16 and LOCAL_Y 8, this translates to smem[16][24], which is 384 ints of 4 bytes each, or 1536 bytes. You have three arrays, so 4.5 KB.

If you changed LOCAL_X to 32 and LOCAL_Y to 16, that would result in smem[24][40], which is 3840 bytes per array, or 11.25 KB total. Depending on the size of your kernel and other program/data, you may run out of space in the 32 KB of local memory on each core. If you are using multiple work items per work group, you definitely will run out of space.
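
The arithmetic above can be checked with a small host-side C sketch. The helper names tile_bytes and total_local_bytes are made up for this example, and it assumes the three int arrays are the only __local data in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAD 4  /* same value as in the kernel */

/* Bytes used by one [2*FRAD + local_y][2*FRAD + local_x] array of
   32-bit ints (an OpenCL int is always 32 bits). */
static size_t tile_bytes(size_t local_x, size_t local_y)
{
    assert(local_x > 0 && local_y > 0);
    return (2 * FRAD + local_y) * (2 * FRAD + local_x) * sizeof(int32_t);
}

/* Total for the three arrays smem, smemIy and smemIx. */
static size_t total_local_bytes(size_t local_x, size_t local_y)
{
    return 3 * tile_bytes(local_x, local_y);
}
```

total_local_bytes(16, 8) gives 4608 bytes (4.5 KB) and total_local_bytes(32, 16) gives 11520 bytes (11.25 KB). Both have to share the 32 KB with the program code, the stack and any other data.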
jar

