Multi-task execution on Parallella


Postby vanchiramani » Wed Feb 08, 2017 9:54 am

Hello Forum members.

I have an application that employs pipeline parallelism. Suppose this application consists of tasks T0, T1, ..., T15, each executing a different function. My objective is to map these tasks to different Epiphany cores using the COPRTHR-2 and OpenSHMEM libraries.

Q1. Since each thread executes a different function, the program cannot be written like:
Code: Select all
if(tid == 0)
    call_fn0();
else if(tid == 1)
    call_fn1();
:
else
    call_fn15();

as it increases the SPM space requirement. Hence, I want a different kernel code for each thread. I searched but could not find any documentation on how to launch different threads on different cores. Is there any documentation on this?

Q2. Suppose a task group T0-T4 in this application wants to synchronize on a barrier or share a lock variable. How can this be implemented?

Any help is really appreciated. Thanks a lot in advance.

Best regards
V Vanchinathan
vanchiramani
 
Posts: 13
Joined: Tue Mar 29, 2016 8:41 am

Re: Multi-task execution on Parallella

Postby jar » Wed Feb 08, 2017 4:32 pm

V,

This is a tricky request, but I'm gonna take care of you :-)

Understand that the OpenSHMEM programming model is Single Program, Multiple Data streams (SPMD) in Flynn's taxonomy, but it seems like the natural paradigm you're requesting is Multiple Programs, Multiple Data streams (MPMD). We may still be able to make it work, however!

OpenSHMEM uses a Partitioned Global Address Space (PGAS) memory model and requires that each processing element (PE) or core have a symmetric memory footprint. This enables implicit remote address calculation when moving data from one PE to a remote PE. So all of your allocations for each subroutine will need to occur upfront. I hope they're not wildly different for each function and you can re-use allocations between the subroutines.

A1:
The COPRTHR-2 SDK enables a dynamic call feature. You can read about it here: https://arxiv.org/abs/1604.04207

Each of your call_fnN subroutines should be marked with __dynamic_call. Below is an example of how you would do it.

The OpenSHMEM standard says that the calls to shmem_malloc() must be symmetric. I haven't currently forced that within the implementation, but undefined behavior will occur if you break that promise.

A2:

RTFM on shmem_barrier (http://openshmem.org/site/sites/default ... 1.3.pdf#54). You'll see in the code how to do it. I tested it, and it works. For the lock half of your question, the OpenSHMEM spec also defines shmem_set_lock()/shmem_clear_lock() on a symmetric long variable.

Good luck!

test.c:
Code: Select all
#include <coprthr.h>
#include <shmem.h>

long pSync[SHMEM_BARRIER_SYNC_SIZE];

#define FNX(X) \
void __dynamic_call call_fn##X(void) \
{ host_printf("fn%d\n", X); }

#define FNXB(X,START,SIZE) \
void __dynamic_call call_fn##X(void) \
{ \
   shmem_barrier(START,0,SIZE,pSync); \
   host_printf("fn%d barrier\n", X); \
}

FNXB(0,0,5) FNXB(1,0,5) FNXB(2,0,5) FNXB(3,0,5) FNXB(4,0,5)
FNX(5) FNX(6) FNX(7) FNX(8) FNX(9) FNX(10)
FNX(11) FNX(12) FNX(13) FNX(14) FNX(15)

void (*pfn[])(void) = {
   &call_fn0,  &call_fn1,  &call_fn2,  &call_fn3,
   &call_fn4,  &call_fn5,  &call_fn6,  &call_fn7,
   &call_fn8,  &call_fn9,  &call_fn10, &call_fn11,
   &call_fn12, &call_fn13, &call_fn14, &call_fn15
};

int main(void)
{
   for (int i = 0; i < SHMEM_BARRIER_SYNC_SIZE; i++) pSync[i] = SHMEM_SYNC_VALUE;
   shmem_init();
   pfn[shmem_my_pe()]();
   shmem_finalize();
}


Compile with:
Code: Select all
coprcc -fhost -fdynamic-calls -std=c99 -I. -I/usr/local/browndeer/coprthr2/include -I../src/ -L/usr/local/browndeer/coprthr2/lib -lcoprthr2_dev -lcoprthr_hostcall -lesyscall -le-lib -L$PATH_TO_OPENSHMEM/src/ -lshmem test.c -o test.x


Run with:
Code: Select all
coprsh -np 16 ./test.x


Output:
Code: Select all
COPRTHR-2.0.1 (Anthem) build 20170131.1051
fn5
fn6
fn7
fn8
fn9
fn10
fn11
fn12
fn13
fn14
fn15
fn0 barrier
fn1 barrier
fn2 barrier
fn3 barrier
fn4 barrier


Edited for clarity and added a function call table to avoid using a switch or if/else if clause.
jar
 
Posts: 179
Joined: Mon Dec 17, 2012 3:27 am

Re: Multi-task execution on Parallella

Postby vanchiramani » Thu Feb 09, 2017 6:49 am

Dear Jar

Thanks a lot for your reply. I can successfully run the sample code that was provided. However, I have some difficulty in understanding certain concepts.

I have been programming the Parallella board using the traditional COPRTHR-2 stack, in which the host ARM processor generates data, allocates variables using coprthr_mem, copies the generated data, launches threads, and finally obtains the output and verifies that it is correct. Inside the device kernel, I use SHMEM for inter-thread communication. However, in the sample provided, every function is run directly on the Epiphany cores.
Q1. I would like to know how to initialize values and verify the output. The manual that was pointed out mentions that host functions for file I/O can be called directly from Epiphany cores. Is there an example of how to do this?

Since each core has a 32KB SPM, the instructions for each thread can fit in its local memory. Traditional COPRTHR-2 loads these functions directly into local memory before kernel launch. Hence, I was hoping we could achieve something similar, where each thread has its functions loaded in local memory. When I converted an existing void func() to __dynamic_call void func(), the execution time increased from 30 ms to 1200 ms.
Q2. If the above is not possible, must all functions called within each thread be declared using __dynamic_call?
Q3. There are some published numbers on the overhead of __dynamic_call functions. Will the overhead be high when many functions are called within a thread?
Q4. Currently, coprthr_dexec(dd,num_thr,kernel,(void*)&args_mem, 0); allows us to launch num_thr threads, starting from core 0. Even if the number of threads is less than 16, we cannot make use of the remaining cores. Is there a way to launch different e32 kernels on particular core IDs?

Thanks again!
V Vanchinathan

Re: Multi-task execution on Parallella

Postby jar » Thu Feb 09, 2017 3:28 pm

A1:
Keep using your off-chip code to initialize data and do whatever you have to do. The above example was meant to be a demonstration, not your particular solution.

The stdio routines supported are defined in host_stdio.h. As far as I know, they should be functionally equivalent to those in stdio.h. You just attach the prefix 'host_' or 'phost_': 'host_' allows an individual core to call the routine, while 'phost_' implies a parallel operation.

For example, to print:
Code: Select all
host_printf("hello, world\n");

Or an ordered, symmetric, parallel operation (this requires all cores to call it at the same time):
Code: Select all
phost_printf("hello, world\n");


Use the same method for host_fopen, etc.

A2/A3:
If you want true MPMD, you can't also use OpenSHMEM to communicate between the cores. You would need some method to discover the remote address of the resource you want to communicate with, whether it's statically defined or passed at runtime. OpenSHMEM and MPMD are incompatible. But you could use pieces of the library, like the high-performance memcpy (and non-blocking memcpy).

The increase in execution time is a direct result of the routines being copied dynamically at runtime rather than upfront. After the first call dynamically loads a routine, subsequent calls run at full speed.

You don't have to define all of your functions as __dynamic_call. Performance depends on how many dynamic calls there are and how large they are.

A4:
This is something dar is aware of, but it's probably lower on the list of priorities. You can email him directly (drichie + browndeer . com).

I haven't tested the OpenSHMEM library in a configuration like this either. Having two separate parallel workloads probably won't work right now. I would only consider supporting this after COPRTHR-2 does.

Re: Multi-task execution on Parallella

Postby vanchiramani » Fri Feb 10, 2017 8:38 am

Dear Jar

Thanks a lot for patiently answering the questions so far.

For dynamic calls, I run the functions once before measuring time with the timers. However, the execution time is still very high compared to non-dynamic calls. This led to my previous question.

Following is the device code (memory_device.c):
Code: Select all
#include <coprthr.h>
#include <host_stdio.h>
#include "e_lib.h"

typedef struct {
        int n; int* cc;
} my_args_t;

e_ctimer_config_t event_list[] =
{
        E_CTIMER_CLK, E_CTIMER_IDLE, E_CTIMER_IALU_INST, E_CTIMER_FPU_INST,     E_CTIMER_DUAL_INST,
        E_CTIMER_E1_STALLS, E_CTIMER_RA_STALLS, E_CTIMER_EXT_FETCH_STALLS, E_CTIMER_EXT_LOAD_STALLS
};

#pragma GCC push_options
#pragma GCC optimize("O0")
unsigned int mystart(int t)
{
        e_ctimer_set(E_CTIMER_1,  E_CTIMER_MAX);
        e_ctimer_start(E_CTIMER_1, event_list[t]);

        return e_ctimer_get(E_CTIMER_1);
}
unsigned int myend()
{
        return e_ctimer_get(E_CTIMER_1);
}
#pragma GCC pop_options

void __dynamic_call func(int* buf_c, int offset, int m)
{
        int i;
        int j = 111;
        // c = a + b over this thread's m-element local chunk
        // (index the local buffer from 0, not from offset, so we
        // stay within the sz bytes allocated below)
        for(i=0; i<m; i++)
                buf_c[i] += (offset+i)+j;
}

void __entry my_thread( void* p )
{
        int i, j;
        unsigned int t0[9], t1[9], elap[9] = {0};

        int tid = coprthr_get_thread_id();
        my_args_t* pargs = (my_args_t*)p;

        int n = pargs->n;
        int* cc = pargs->cc;

        int m = n/16;
        int offset = m*tid;
        int sz = m*sizeof(int);

        void* memfree = coprthr_tls_sbrk(0);
        // allocate local buffers of size n/16
        int* buf_c = (int*)coprthr_tls_sbrk(sz);

        for(j = 0; j < 100; j++)
                func(buf_c, offset, m);

        t0[0] = mystart(0);
        for(j = 0; j < n; j++)
        {
                func(buf_c, offset, m);
                // copy the local chunk back to its slice of the shared
                // array; cc is an int*, so cc + offset is already scaled
                coprthr_memcopy_align(cc + offset, buf_c, sz, COPRTHR2_M_DMA_0);
        }
        t1[0] = myend();
        elap[0] += t0[0] - t1[0]; // ctimer counts down, so start - end = elapsed

        host_printf("id = %d cycles = %d\n", tid, elap[0]);
        // clean up
        coprthr_tls_brk(memfree);
}


Following is the host code (memory_host.c):
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> /* gettimeofday */

#include "coprthr.h"
#include "coprthr_cc.h"
#include "coprthr_thread.h"

#define SIZE 2048

struct timeval t_startTime, t_stopTime;
unsigned long long elapsedTime;
struct my_args { int n; int* cc; };

int main(int argc, char* argv[])
{
        int i;
        int n = SIZE;

        /* open device for threads */
        int dd = coprthr_dopen(COPRTHR_DEVICE_E32,COPRTHR_O_THREAD);

        /* compile thread function */
        coprthr_program_t prg = coprthr_cc_read_bin("./memory_device.e32",0);
        coprthr_sym_t thr = coprthr_getsym(prg,"my_thread");

        /* allocate memory shared with coprocessor device */
        coprthr_mem_t cc_mem = coprthr_dmalloc(dd,n*sizeof(int),0);
        int* cc = (int*)coprthr_memptr(cc_mem,0);

        /* set args to pass to thread on coprocessor device */
        coprthr_mem_t args_mem = coprthr_dmalloc(dd,sizeof(struct my_args),0);
        struct my_args* pargs = (struct my_args*)coprthr_memptr(args_mem,0);
        pargs->n = n;
        pargs->cc = cc;

        /* initialize A, B, and C arrays */
        for (i=0; i<n; i++) {
                cc[i] = 3;
        }

        gettimeofday(&t_startTime, NULL);

        // Execute kernel on coprocessor device
        coprthr_dexec(dd,16,thr,(void*)&args_mem, 0 );
        coprthr_dwait(dd);

        gettimeofday(&t_stopTime, NULL);
        elapsedTime = (t_stopTime.tv_sec - t_startTime.tv_sec) * 1000000LL + t_stopTime.tv_usec - t_startTime.tv_usec;
        printf("%lld us\n", elapsedTime);

        /* clean up */
        coprthr_dfree(dd,args_mem);
        coprthr_dfree(dd,cc_mem);

        coprthr_dclose(dd);
}


Following is the compilation command:
Code: Select all
cc -O2  -g -I. -I/usr/local/browndeer/coprthr2/include  -c memory_host.c
cc -rdynamic -o memory.x memory_host.o -L/usr/local/browndeer/coprthr2/lib -lcoprthr -lcoprthrcc -lm -ldl
coprcc --info      memory_device.c \
       -o memory_device.e32


Additionally, is the available method for running MPMD on Epiphany to use e_load / e_load_group and remote calls?

Re: Multi-task execution on Parallella

Postby jar » Fri Feb 10, 2017 3:28 pm

Make sure you're compiling with '-fdynamic-calls' as in the example I showed. I think your code is executing the routines out of global DRAM, resulting in slow performance. It's not enough to just mark up your code. I should have been more explicit.

Re: Multi-task execution on Parallella

Postby vanchiramani » Sat Feb 11, 2017 8:43 am

Dear Jar

Thanks a lot for your help. It works now.

Best regards
V Vanchinathan

