An async messaging lib for parallella,

An async messaging lib for parallella,

Post by joseluisquiroga » Thu Jun 29, 2017 3:56 pm

This is a library I have been working on:

https://messaging-cells.github.io/

Greetings.

JLQ.

Re: An async messaging lib for parallella,

Post by sebraa » Wed Jul 05, 2017 10:06 am

From your web page:
Cell_A (sending) always does local core writing and Cell_B (receiving) always does remote core reading. No message copying.
This is not necessarily a good idea, since remote reads are slow. Rather than local-write/remote-read, you should aim for remote-write/local-read. I assume that your library is written in C++ (the page is not too clear on that)?
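
For what it is worth, a minimal sketch of that pattern on the Epiphany, using the eSDK's e_get_global_address(); the mailbox layout and all names are my own invention, not taken from your library:

Code:
    /* The sender remote-writes into a mailbox that lives in the RECEIVER's
     * local SRAM, so the receiver only ever does fast local reads.
     * mailbox_t and the ready-flag protocol are hypothetical. */
    #include "e_lib.h"

    typedef struct {
        volatile int ready;          /* written last, after the payload */
        int payload;
    } mailbox_t;

    mailbox_t my_box = { 0, 0 };     /* same local address on every core */

    /* sender: posted remote writes, the sending core does not stall */
    void send_to(unsigned row, unsigned col, int value)
    {
        volatile mailbox_t *rbox = e_get_global_address(row, col, &my_box);
        rbox->payload = value;
        rbox->ready = 1;             /* same route, so it lands after the payload */
    }

    /* receiver: spins on its OWN copy, no mesh traffic at all */
    int receive(void)
    {
        while (!my_box.ready) ;
        my_box.ready = 0;
        return my_box.payload;
    }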


I haven't looked at the code, but good luck with your project!

Re: An async messaging lib for parallella,

Post by joseluisquiroga » Wed Jul 05, 2017 3:40 pm

Thanks for the post.

I do not know why I had in mind a 2x write/read speed ratio for on-chip transactions. Maybe I got that idea from the image on page 26 of the Epiphany Architecture Reference Manual. While looking for a passage of the manual to quote in an answer, I found that page 24 says it is 16x.

Maybe you know where the 16x comes from?

It seems like a lot, but for my purposes I think I will stick to local writes. My big-O complexity stays the same.

How do you keep code small enough to run from local memory? This seems to be the way. Local reads of remote-written data mean your code will look like networking code.
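
Just to illustrate the scheme I mean by sticking to local writes (names invented here, this is not the library's actual API): the sender touches only its own local SRAM, and the cost moves to the receiver's remote reads.

Code:
    #include "e_lib.h"

    typedef struct {
        volatile int ready;
        int payload;
    } slot_t;

    slot_t out_slot = { 0, 0 };     /* lives in the sender's local SRAM */

    /* sender: purely local writes, no mesh traffic on the send path */
    void post(int value)
    {
        out_slot.payload = value;
        out_slot.ready = 1;
    }

    /* receiver: every poll is a remote read that stalls the core */
    int fetch_from(unsigned row, unsigned col)
    {
        volatile slot_t *r = e_get_global_address(row, col, &out_slot);
        while (!r->ready) ;         /* remote read per iteration */
        int v = r->payload;         /* another remote read */
        r->ready = 0;               /* remote write releases the slot */
        return v;
    }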

Running code from off-chip memory seems a lot worse than this 16x, because you lose all the parallelism benefits. Off-chip memory is a bottleneck (not to mention its nondeterministic timing). So any program that relies, at run time, on off-chip data or off-chip code will have this issue.

Cheers.

Re: An async messaging lib for parallella,

Post by jar » Wed Jul 05, 2017 4:18 pm

joseluisquiroga wrote:Maybe you know where the 16x comes from?


I haven't benchmarked your code, but below are benchmark times from OpenSHMEM get and put tests. The bandwidth for put (write) peaks at 2.4 GB/s, while get (read) achieves 233 MB/s, or about a 10x difference. There is more variability in reads, since the latency is proportional to the Manhattan distance between the cores: the core stalls while the read request traverses the mesh network. A 16x difference is not necessarily wrong. For small transfers, the setup time of the putmem/getmem routines dominates, and it would be better to use a direct copy with shmem_TYPE_put or shmem_TYPE_get.

SHMEM GetMem times for variable message size:

Bytes   Latency (nanoseconds)
1       99
2       143
4       229
8       401
16      138
32      206
64      368
128     644
256     1194
512     2296
1024    4499
2048    8906
4096    17719
8192    35343


SHMEM PutMem times for variable message size:

Bytes   Latency (nanoseconds)
1       73
2       88
4       118
8       178
16      79
32      83
64      123
128     149
256     203
512     309
1024    523
2048    949
4096    1804
8192    3513
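
To show what I mean by a direct typed copy, here is a minimal sketch against the standard OpenSHMEM C API (the buffers and PE numbers are made up): a typed put with a fixed element size skips the generic putmem path.

Code:
    #include <shmem.h>

    int dest[4];                    /* symmetric: same address on every PE */

    int main(void)
    {
        shmem_init();
        int me = shmem_my_pe();
        int src[4] = { me, me, me, me };

        /* typed put: fixed element size, so none of the generic
         * putmem size-dispatch overhead mentioned above */
        if (me == 0)
            shmem_int_put(dest, src, 4, 1);   /* 4 ints into PE 1 */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }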

Re: An async messaging lib for parallella,

Post by joseluisquiroga » Wed Jul 05, 2017 8:25 pm

Thanks for the data. It helps.

JLQ.

Re: An async messaging lib for parallella,

Post by sebraa » Thu Jul 06, 2017 9:28 am

joseluisquiroga wrote:Maybe you know where the 16x comes from?
The read-request mesh is narrower than the write mesh, which makes read requests travel much more slowly than writes. Also, reads are synchronous (the core stalls until the read has been answered), so you pay both the long read-request latency and the smaller write latency for every single read. I don't know where the 16x comes from, though - the rMesh accounts for 8x, plus the read-reply time, which matches the roughly 10x shown in the measurements.
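
A rough cross-check, using my own arithmetic on the largest transfers in jar's tables (not a figure from the manual):

Code:
    put: 8192 bytes /  3513 ns = ~2.33 GB/s
    get: 8192 bytes / 35343 ns = ~0.23 GB/s
    ratio: ~10x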

joseluisquiroga wrote:My big-O complexity stays the same.
Scalability only matters if the base speed is sufficient. Giving up 10x for no big reason isn't necessarily good.

joseluisquiroga wrote:Local reads of remote-written data mean your code will look like networking code.
It is called "Network-on-chip" for a reason. :)

I would not disregard off-chip memory completely; accesses are insanely slow, but having it available may spell the difference between "slow" and "impossible". Also, the DMA engines may be able to - at least to some extent - cushion the impact of using it. Even execution from off-chip memory may be worth it in some cases (e.g. one-time initialization routines are usually not time-critical).
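
For instance, a hypothetical staging loop like the following (buffer names and sizes are invented; e_dma_copy() is the blocking helper from the eSDK's e-lib) trades many stalled remote reads for one DMA burst per chunk:

Code:
    #include "e_lib.h"

    #define CHUNK 2048                          /* bytes per DMA transfer */
    static int local_buf[CHUNK / sizeof(int)];  /* in fast local SRAM */

    void process_chunk(const int *buf, unsigned n);  /* hypothetical consumer */

    /* offchip_src points into external (off-chip) DRAM */
    void consume(int *offchip_src, unsigned nchunks)
    {
        for (unsigned i = 0; i < nchunks; i++) {
            /* one DMA burst instead of word-by-word stalled reads */
            e_dma_copy(local_buf,
                       offchip_src + i * (CHUNK / sizeof(int)),
                       CHUNK);
            process_chunk(local_buf, CHUNK / sizeof(int));
        }
    }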

Re: An async messaging lib for parallella,

Post by joseluisquiroga » Thu Jul 06, 2017 3:16 pm

sebraa wrote:Giving up 10x for no big reason isn't necessarily good.


I guess that a 10x speedup of undesired (nondeterministic) behavior that leads to cumbersome code means little to me.

It is a base lib. If necessary, for a specific case, you can use it to set up "channels" between cells that write in one direction only, from "cell_A" to "cell_B", in which the address and space are known and a cheap (in time and space) protocol can be set up, as sketched below.
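
A minimal sketch of what such a channel could look like (everything here is invented for illustration, using the eSDK's e_get_global_address(), and is not the library's actual code): cell_A remote-writes into a ring that lives in cell_B's local SRAM, and cell_B only ever does local reads.

Code:
    #include "e_lib.h"

    #define SLOTS 8                    /* power of two */

    typedef struct {
        volatile unsigned head;        /* advanced by cell_A (remote) */
        volatile unsigned tail;        /* advanced by cell_B (local)  */
        int data[SLOTS];
    } chan_t;

    chan_t rx_chan = { 0, 0, { 0 } };  /* lives in cell_B's local SRAM */

    /* cell_A: one remote read of tail per send; everything else is a
     * cheap posted write over the fast write mesh */
    int chan_send(unsigned row, unsigned col, int v)
    {
        static unsigned my_head = 0;   /* sender-local copy of head */
        volatile chan_t *c = e_get_global_address(row, col, &rx_chan);
        if (my_head - c->tail == SLOTS)
            return 0;                  /* full */
        c->data[my_head % SLOTS] = v;
        my_head++;
        c->head = my_head;             /* same route as the payload, so it
                                          should arrive after it */
        return 1;
    }

    /* cell_B: everything here is a fast local access */
    int chan_recv(int *v)
    {
        if (rx_chan.tail == rx_chan.head)
            return 0;                  /* empty */
        *v = rx_chan.data[rx_chan.tail % SLOTS];
        rx_chan.tail++;
        return 1;
    }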

sebraa wrote:It is called "Network-on-chip" for a reason.


I do not think we want TCP running between cores :)

sebraa wrote:I would not disregard off-chip memory completely


Me neither ;)

Thanks for the post.

