An async messaging lib for parallella,

PostPosted: Thu Jun 29, 2017 3:56 pm
by joseluisquiroga
This is a library I have been working on:

https://messaging-cells.github.io/

Greetings.

JLQ.

Re: An async messaging lib for parallella,

PostPosted: Wed Jul 05, 2017 10:06 am
by sebraa
From your web page:
Cell_A (sending) always does local core writing and Cell_B (receiving) always does remote core reading. No message copying.
This is not necessarily a good idea, since remote reads are slow. Rather than local-write/remote-read, you should aim for remote-write/local-read. I assume that your library is written in C++ (the page is not too clear on that)?


I haven't looked at the code, but good luck with your project!

Re: An async messaging lib for parallella,

PostPosted: Wed Jul 05, 2017 3:40 pm
by joseluisquiroga
Thanks for the post.

I do not know why I had in mind a 2x write/read speed ratio for on-chip transactions. Maybe I got that idea from the figure on page 26 of the Epiphany Architecture Reference Manual. While looking for a quote from the manual to answer with, I found that page 24 says it is 16x.

Maybe you know where the 16x comes from?

It seems like a lot, but for my purposes I think I will stick to local writes. My big-O complexity stays the same.

How do you keep the code small enough to run it from local memory? This seems to be the way. Local reads of remote-written data mean your code will look like networking code.

Running it from off-chip memory seems much worse than this 16x, because you lose all the parallelism benefits. Off-chip memory is a bottleneck (not to mention its nondeterministic behavior). So any program that relies, at run time, on off-chip data or off-chip code will have this issue.

Cheers.

Re: An async messaging lib for parallella,

PostPosted: Wed Jul 05, 2017 4:18 pm
by jar
joseluisquiroga wrote:Maybe you know where the 16x comes from?


I haven't benchmarked your code, but below are benchmark times from OpenSHMEM get and put tests. The bandwidth for put (write) peaks at 2.4 GB/s, while get (read) achieves 233 MB/s: about a 10x difference. There is more variability in reads, since the latency is proportional to the Manhattan distance between cores; the core stalls while the read request traverses the mesh network. So a 16x difference is not necessarily wrong. For small transfers, the setup time of the putmem/getmem routines dominates, and it would be better to use a direct copy with shmem_TYPE_put or shmem_TYPE_get.

SHMEM GetMem times for variable message size:

Bytes   Latency (nanoseconds)
1       99
2       143
4       229
8       401
16      138
32      206
64      368
128     644
256     1194
512     2296
1024    4499
2048    8906
4096    17719
8192    35343


SHMEM PutMem times for variable message size:

Bytes   Latency (nanoseconds)
1       73
2       88
4       118
8       178
16      79
32      83
64      123
128     149
256     203
512     309
1024    523
2048    949
4096    1804
8192    3513

Re: An async messaging lib for parallella,

PostPosted: Wed Jul 05, 2017 8:25 pm
by joseluisquiroga
Thanks for the data. It helps.

JLQ.

Re: An async messaging lib for parallella,

PostPosted: Thu Jul 06, 2017 9:28 am
by sebraa
joseluisquiroga wrote:Maybe you know where the 16x comes from?
The read-request mesh is narrower than the write mesh, which makes read requests travel much slower than writes. Also, reads are synchronous (the core stalls until the read has been answered), so you pay both the long read-request latency and the smaller write latency for every single read. I don't know where the 16x comes from, though - the rMesh accounts for 8x, plus the read reply time, which matches the 10x shown in the measurements.

joseluisquiroga wrote:My big O complexity function still the same.
Scalability only matters if the base speed is sufficient. Giving up 10x for no big reason isn't necessarily good.

joseluisquiroga wrote:Local reads of remote written data means your code will look like networking code.
It is called "Network-on-chip" for a reason. :)

I would not disregard off-chip memory completely; accesses are insanely slow, but having them may spell the difference between "slow" and "impossible". Also, the DMA engines may be able to - at least to some extent - cushion the impact of using it. Even execution from off-chip memory may be worth it in some cases (e.g. one-time initialization routines are usually not time-critical).

Re: An async messaging lib for parallella,

PostPosted: Thu Jul 06, 2017 3:16 pm
by joseluisquiroga
Giving up 10x for no big reason isn't necessarily good.


I guess that a 10x speedup of undesired (nondeterministic) behavior that leads to cumbersome code means nothing to me.

It is a base lib. If necessary, for a specific case, you can use it to set up "channels" between cells that write in only one direction, from "cell_A" to "cell_B", in which the address and space are known and a cheap (in time and space) protocol can be set up.

It is called "Network-on-chip" for a reason.


I do not think we want TCP running between cores :)

I would not disregard off-chip memory completely


Me neither ;)

Thanks for the post.