Parallel python design

Postby markd » Wed Jan 16, 2013 7:17 am

After looking at some low-level code with the PyMite VM ( http://forums.parallella.org/viewtopic.php?f=9&t=70#p522 ), I would like to explore ideas for the higher-level programming model.

In broad outline, the regular CPython interpreter will run on the host, and it will launch work kernels on the cores. The kernels will be written in a restricted subset of Python and will target the PyMite VM. It's a similar model to OpenCL, except that the kernels are written in Python.

Now, how should this look in more detail?
It would be helpful to construct some examples and computational kernels, and I would appreciate input from others on the forum. The idea is to write the algorithms/kernels in some sort of ideal syntax, and then see how well that can be approximated in Python.
Feel free to borrow from multiprocessing, OpenCL, Coarray Fortran, and any other parallel language/library.

As an example, here's an extremely simple case that sums a list:

Code:
def kernel_sum(arg):
    s = 0.0
    for a in arg:
        s += a
    return s

arg1 = [1,2,3,4,5,6,7,8]

# split the array 'arg1' across 4 instances of 'kernel_sum'
ret = launch_kernel(kernel_sum, arg1, n=4, reduction=sum)

# The previous line could be a convenience function for the following sequence:
# k = create_kernel(kernel_sum, n=4, reduction=sum)
# k.run(arg1)
# ret = k.wait()



Questions to consider:
1. How should the input distinguish between arrays to be split, and arrays/scalars that should be the same on each core? (One possible notation is sketched after this list.)
2. What synchronization elements are needed?
3. How should the shared SDRAM segment be used/accessed? (From reading other forum posts, it's in a separate memory space from the host OS, so we can't simply share a host data structure.)
4. What notation and what data structures should be used for cross core memory accesses?
5. Other?
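
One possible notation for question 1, purely as a sketch: wrap each argument in a marker that tells the launcher whether to scatter it or replicate it. The names Split and Broadcast are hypothetical, not an existing API.

Code:
# Hypothetical argument markers for question 1 -- a sketch, not a real API.
class Split:
    """Scatter this sequence across the cores in equal chunks."""
    def __init__(self, data):
        self.data = data

class Broadcast:
    """Copy this value unchanged to every core."""
    def __init__(self, data):
        self.data = data

# The sum example above would then read:
# ret = launch_kernel(kernel_sum, Split(arg1), n=4, reduction=sum)
# and a kernel taking a shared scale factor might be launched as:
# ret = launch_kernel(kernel_scale_sum, Split(arg1), Broadcast(0.5), n=4, reduction=sum)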

Re: Parallel python design

Postby fredd » Wed Jan 16, 2013 5:45 pm

Just some thoughts on the subject (from someone mostly interested in numerical Python code):

- Two different models for the parallel environment:
-- Each core is an independent Python interpreter and can only access its own set of Python objects. Shared SDRAM is divided into chunks, each belonging to a single interpreter at any given time. Cores communicate with each other through a message-passing interface (and maybe even transfer ownership of Python objects, but each object belongs to exactly one interpreter at all times). Could have shared access to blocks of raw memory (as arrays of ints or floats).
-- OR: the cores are seen as threads in a shared Python object environment. An object on core A could have a direct reference to an object residing on core B, and objects in SDRAM are completely shared. Needs per-object locks instead of a GIL. Still needs a way to transfer objects between two cores and between cores and SDRAM (because indirect memory access is relatively expensive), or some smart caching system.

Personally I prefer the former model (much simpler IMHO), but ideally the system should support both. Explicitly distinguish between "local" and "global" objects, similar to OpenCL, maybe? A minimal sketch of the message-passing model follows below.
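
To make the first model concrete, here is a minimal sketch of what its API might feel like; core_id, recv_msg, and send_msg are hypothetical names, not an existing PyMite or Epiphany API.

Code:
# Message-passing model sketch (all names hypothetical).
# Each core runs its own interpreter and owns its own objects;
# the only way to share data is an explicit message.

def worker():
    data = recv_msg()          # blocks until a message arrives; ownership
                               # of 'data' moves to this interpreter
    result = sum(x * x for x in data)
    send_msg((0, 0), (core_id(), result))   # report back to core (0,0)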

- Cython-style compilation of kernels. By adding a few type annotations to Python code and compiling to C, Cython often gives big speedups for (especially numerical) Python code. But Cython produces pretty big C output (around 10-15 lines of C for a single line of Python) and has long compilation times, so we probably don't want to use Cython itself, but we could reuse many of its concepts. Maybe one could translate Python bytecode to Epiphany assembly, which calls functions in PyMite to manipulate Python objects, and which can also manipulate int/float arrays directly. I will look into PyMite and see if that's a reasonable idea. (A rough sketch of what an annotated kernel might look like is below.)
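
To illustrate, this is roughly what an annotated kernel might look like; the @kernel decorator and its type strings are hypothetical, loosely modeled on Cython's annotations.

Code:
# Hypothetical type annotations for kernel compilation (a sketch, not a real API).
@kernel(args=[('a', 'float[:]'), ('n', 'int')], returns='float')
def sum_kernel(a, n):
    # With the types known at compile time, this loop could become a plain
    # C-style loop over a float array instead of generic PyMite bytecode dispatch.
    s = 0.0
    for i in range(n):
        s += a[i]
    return s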

Re: Parallel python design

Postby fredd » Thu Jan 17, 2013 9:26 pm

A more concrete example, as you asked for, markd! Here is a computational kernel for solving the heat equation (the 2D PDE du/dt = a*Laplacian(u)) on a grid, by slicing the grid into a 4x4 arrangement of subgrids, each fitting on a core. It uses a mock-up API for message passing with neighbours.
Code:
def kernel_laplacian(grid, q, tlen):
    xlen, ylen = grid.shape
    new_grid = np.zeros((xlen,ylen))
    left_edge = np.zeros(ylen)
    right_edge = np.zeros(ylen)
    up_edge = np.zeros(xlen)
    down_edge = np.zeros(xlen)
    for t in range(tlen):
        # we have two DMA engines, so two transfers at a time
        send_array(LEFT, grid[0,:])
        send_array(RIGHT, grid[-1,:])
        receive_array(LEFT, left_edge)
        receive_array(RIGHT, right_edge)
        send_array(UP, grid[:,0])
        send_array(DOWN, grid[:,-1])
        receive_array(UP, up_edge)
        receive_array(DOWN, down_edge)
        wait_all_transfers()
        # 5-point stencil; boundary cells fall back to the halo values from neighbours
        for x in range(xlen):
            for y in range(ylen):
                diff = -4*grid[x,y]
                diff += grid[x-1,y] if x > 0      else left_edge[y]
                diff += grid[x+1,y] if x < xlen-1 else right_edge[y]
                diff += grid[x,y-1] if y > 0      else up_edge[x]
                diff += grid[x,y+1] if y < ylen-1 else down_edge[x]
                new_grid[x,y] = grid[x,y] + q*diff
        grid, new_grid = new_grid, grid
    return grid

# hypothetical API: compile the kernel, declaring the return type and argument types
kern = compile_kernel(kernel_laplacian, np.ndarray, [np.ndarray, float, int])

size = 160
grid = np.zeros((size,size))
fill_with_start_value(grid)
q = a*dt/dx**2   # a: diffusion coefficient; dt, dx: time and space steps (defined elsewhere)
t = 10000

# possible high level API for splitting a grid into subgrids in two dimensions
# start_2dgrid(kernel, work_size, args, returnval)
start_2dgrid(kern, (4,4), [SPLIT_2D(grid), q, t], GATHER_2D(grid))

# the following "low level" calls do the same as the one line above
# initialize a python VM on each core in a (4,4) grid
init_VMs((4,4))

loc_size = size // 4
split_grid = grid.reshape(4,loc_size,4,loc_size)
for x in range(4):
    for y in range(4):
        # allocates an nparray on a core, returns a handle
        local_array = alloc_nparray(on_core=(x,y), shape=(loc_size,loc_size), dtype=float)
        local_array.copy_from_host(split_grid[x,:, y,:])

        # not racy, since the first started kernels wait for the others in communication
        start_kernel((x,y), kern, [local_array, q, t])

for x in range(4):
    for y in range(4):
        local_array = wait_kernel((x,y))
        local_array.copy_to_host(split_grid[x,:, y,:])

# now result is in grid

For simplicity I have excluded handling of boundary conditions :)

The big question is how synchronized the transfers should be. We want to utilize both DMA engines per core. One idea is that "send_array" just posts a pointer in the receiver's mailbox, and "receive_array" then polls the mailbox and initiates a DMA transfer as soon as a DMA engine is free; a sketch of this protocol follows below. (Maybe the real library should follow the API names and synchronization semantics of some established API like MPI, for ease of programming.)
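
A minimal sketch of that mailbox protocol, assuming 4-byte floats; all primitives here (mailbox_post, mailbox_poll, dma_wait_free, dma_start, dma_wait_all, address_of) are hypothetical names, not an existing API.

Code:
# Sketch of the mailbox protocol (all primitives hypothetical).

def send_array(direction, array):
    # Just publish where our data lives; the receiver pulls it via DMA.
    mailbox_post(direction, address_of(array), len(array))

def receive_array(direction, dest):
    # Wait until the sender has posted its pointer...
    src_addr, n = mailbox_poll(direction)
    # ...then start the transfer as soon as one of the two DMA engines is free.
    engine = dma_wait_free()
    dma_start(engine, src=src_addr, dst=address_of(dest), nbytes=n * 4)

def wait_all_transfers():
    dma_wait_all()   # block until both DMA engines are idle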

Re: Parallel python design

Postby markd » Mon Jan 21, 2013 8:05 am

Thanks, fredd, this is exactly the kind of example I was looking for.

This example uses two-sided communication (send/receive).

In general, the calls are
send_array(<target node>, <source data>)
which must be paired with
receive_array(<source node>, <target data>)

Just thinking about some alternate models/syntax:

One-sided communication:
put_array(grid[0,:], LEFT, right_edge)
put_array(grid[-1,:], RIGHT, left_edge)

in general: put_array(<source data>, <target node>, <target data>)

Or alternatively, the get version:
get_array(right_edge, RIGHT, grid[0,:])

in general: get_array(<target data>, <source node>, <source data>)

Or coarrays:
right_edge = grid(RIGHT)[0,:]
left_edge = grid(LEFT)[-1,:]

And all of these methods would require a synchronization call at the end; see the sketch below.
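
For comparison, here is how the halo exchange inside fredd's kernel might look with the put notation; put_array is the notation above, and barrier_neighbours is a hypothetical synchronization call.

Code:
# Halo exchange with one-sided puts (a sketch; barrier_neighbours is hypothetical).
put_array(grid[0,:],  LEFT,  right_edge)   # my left column -> left neighbour's right_edge
put_array(grid[-1,:], RIGHT, left_edge)    # my right column -> right neighbour's left_edge
put_array(grid[:,0],  UP,    down_edge)    # my top row -> upper neighbour's down_edge
put_array(grid[:,-1], DOWN,  up_edge)      # my bottom row -> lower neighbour's up_edge

# No receive calls: each core only writes. One synchronization point
# guarantees all four edge buffers are filled before the stencil runs.
barrier_neighbours()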

Re: Parallel python design

Postby tnt » Mon Jan 21, 2013 8:50 am

Just in case you missed it: Writing data to a neighbor core is _much_ faster than reading from a neighbor core.

Re: Parallel python design

Postby fredd » Mon Jan 21, 2013 3:54 pm

tnt wrote:Just in case you missed it: Writing data to a neighbor core is _much_ faster than reading from a neighbor core.

Is that primarily the latency, or is the _throughput_ much lower for a read (when using DMA for a bigger transfer)? In the latter case the protocol above could simply be reversed: "receive_array" posts a pointer to the sender telling it where the data should be sent, and "send_array" then receives this message and initiates the DMA. (A sketch follows below.)
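
If reads turn out to be the slow direction even for DMA, the reversed protocol might look like this (same hypothetical primitives as the sketch in my previous post):

Code:
# Reversed protocol sketch: the receiver advertises its buffer, and the
# sender pushes with DMA, so all bulk traffic uses the fast write direction.
def receive_array(direction, dest):
    mailbox_post(direction, address_of(dest), len(dest))

def send_array(direction, array):
    dst_addr, n = mailbox_poll(direction)
    engine = dma_wait_free()
    dma_start(engine, src=address_of(array), dst=dst_addr, nbytes=n * 4)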

markd wrote:in general: put_array(<source data>, <target node>, <target data>)

This also looks like nice syntax. The question is how the sender will interpret the "target data" field, since this refers to an object of another interpreter. Either the memory layout has to be known in advance (this would be the implication of coarrays, if I understand correctly: the coarray would simply be at the same address on every core; sketched below). Or the sender has to communicate with the receiver to get the base address, strides & size of the named object; more flexible but more complicated to implement. I think one-sided communication will be useful when the memory layout is fixed, and two-sided communication will be best when arrays are dynamically allocated (then each interpreter minds only its own objects).
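
A sketch of the "layout known in advance" option: if every core performs the same allocations at startup, a named buffer ends up at the same local offset on every core, so the sender can compute the remote address without a handshake. remote_address, dma_start_free_engine, and address_of are hypothetical names.

Code:
# Fixed-layout one-sided put (all names hypothetical).
EDGE_BUF_OFFSET = 0x2000   # placeholder offset, identical on every core

def put_edge(direction, data):
    # On Epiphany, a global address combines a core id with a local
    # offset, so the sender can address the neighbour's buffer directly.
    dst = remote_address(direction, EDGE_BUF_OFFSET)
    dma_start_free_engine(src=address_of(data), dst=dst, nbytes=len(data) * 4)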

Re: Parallel python design

Postby ysapir » Mon Jan 21, 2013 4:10 pm

fredd wrote:
tnt wrote:Just in case you missed it: Writing data to a neighbor core is _much_ faster than reading from a neighbor core.

Is that primarily the latency, or is the _throughput_ much lower for a read (when using DMA for a bigger transfer)? In the latter case the protocol above could simply be reversed: "receive_array" posts a pointer to the sender telling it where the data should be sent, and "send_array" then receives this message and initiates the DMA.


It is indeed the latency. But for load instructions, that latency severely affects the throughput. If you can manage to DMA the data, the throughput is much higher.

Re: Parallel python design

Postby camara » Sat Jan 26, 2013 11:55 pm

Mark,

Have you considered talking to the PyPy folks (pypy.org or #pypy on freenode)? They already have a backend for ARM, I believe for v7, with ongoing work to support v6. They have started work on STM (software transactional memory) for parallel processing, which could potentially be helpful for the Parallella project. A backend for Epiphany would need to be created, which could take a couple of person-months, and then some work would need to be done to bridge between the ARM and Epiphany cores, which I'm sure would be doable given the PyPy architecture.

Anyway, it's just something to consider.

John M. Camara

Re: Parallel python design

Postby henwee » Wed Mar 06, 2013 12:46 pm

Ah, now I get it! So the latency affects the throughput, and by moving the data with DMA we can increase it. I was wondering why writing data to a neighbor core is _much_ faster than reading from a neighbor core. ;)
Henry

