Epiphany hardware-MULTICAST and hardware-BARRIER

Discussion about Parallella (and Epiphany) Software Development

Postby psiegl » Fri Nov 06, 2015 2:10 pm

Even though many people want to use the Epiphany's hardware multicast facility, no example implementation has been published yet.
I've implemented a simple store-based multicast (no DMA!), which is sufficient for my use case.

Since the manual is the only documentation available for the multicast, I'm sharing my implementation here.

It also includes further enhancements that avoid the overhead of the (very slow!) standard Epiphany libraries:
1) a lightweight implementation of the hardware barrier
2) lightweight inline methods for the config/status/counter registers
3) ... and, as described, support for the multicast

My code can be found on GitHub:

Postby aolofsson » Fri Nov 06, 2015 2:40 pm

Nice! Can you submit a pull request here to make sure folks can easily find it? (Many folks don't visit the forum.)

Postby jar » Fri Nov 06, 2015 7:50 pm

This looks great, psiegl, and I'm happy to see the example code.

I'm curious how the hardware performs this multicast. Does it scale as O(1) or O(log2(#cores))?

Might this scheme be able to efficiently multicast a larger buffer (something like 4+ KB) faster than DMAs using a logarithmic distribution and multiple barriers?
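
For reference, the "logarithmic distribution" mentioned above can be pictured as a binomial-tree broadcast: in round r, every core that already holds the buffer forwards it to one more core, so the number of holders doubles each round and about log2(#cores) rounds are needed. The following is only a host-side simulation of that schedule in plain C. The flat core numbering and the function names `bcast_rounds` and `bcast_simulate` are made up for illustration and are not part of any Epiphany library; on real hardware each forward would be a DMA transfer and each round would end with a barrier.

```c
#include <assert.h>

/* Number of rounds of a binomial-tree broadcast over ncores cores:
 * the smallest r with 2^r >= ncores. Assumes ncores <= 2^31. */
unsigned bcast_rounds(unsigned ncores)
{
    unsigned rounds = 0;
    assert(ncores <= (1u << 31));
    while ((1u << rounds) < ncores)
        rounds++;
    return rounds;
}

/* Simulate the schedule on a flat core index 0..ncores-1 (core 0 is the root).
 * has_data must point to at least ncores bytes.
 * Returns how many cores hold the buffer after the last round. */
unsigned bcast_simulate(unsigned ncores, unsigned char *has_data)
{
    unsigned i, r, holders = 0;
    unsigned rounds = bcast_rounds(ncores);

    for (i = 0; i < ncores; i++)
        has_data[i] = (i == 0);          /* only the root starts with the buffer */

    for (r = 0; r < rounds; r++) {
        unsigned stride = 1u << r;
        /* In round r, cores 0..2^r-1 forward to cores 2^r..2^(r+1)-1. */
        for (i = 0; i < stride && i + stride < ncores; i++)
            if (has_data[i])
                has_data[i + stride] = 1;
        /* A barrier between rounds would go here on real hardware. */
    }

    for (i = 0; i < ncores; i++)
        holders += has_data[i];
    return holders;
}
```

With 16 cores this schedule takes 4 rounds and with 64 cores it takes 6, which is the cost a single hardware multicast would have to beat.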

Postby psiegl » Fri Nov 06, 2015 10:52 pm

@aolofsson: Sure, I can do that.

@jar: Yes, I would also be interested in how the multicast scales. But I only have the Kickstarter Parallella with 16 cores, which is not a large enough coarse-grained processing array to show realistic scaling values. As a side note: maybe Adapteva wants to supply research chairs with the 64-core version. I would be willing to do some work at my university :P

So if I understood the manual right, you can run 2048 different multicasts in parallel, as there is an 11-bit wide field for the multicast identifier. Any core that wants to receive a multicast can simply program its field with the specific ID it is interested in.
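
To make the arithmetic concrete: an 11-bit field gives 2^11 = 2048 distinct identifiers, 0 through 2047. The sketch below only models that in plain C. The macro and function names are hypothetical, and the "core accepts a transaction when the IDs are equal" rule is a simplified reading of the description above, not the documented register interface.

```c
#include <assert.h>
#include <stdint.h>

#define MC_ID_BITS  11u                    /* field width stated above */
#define MC_ID_COUNT (1u << MC_ID_BITS)     /* 2^11 = 2048 distinct IDs */
#define MC_ID_MASK  (MC_ID_COUNT - 1u)     /* 0x7FF, the largest valid ID */

/* True if id fits into the 11-bit multicast identifier field. */
int mc_id_is_valid(uint32_t id)
{
    return id <= MC_ID_MASK;
}

/* Hypothetical model of the receive filter: a core accepts a multicast
 * when the ID it programmed equals the ID carried by the transaction. */
int mc_core_accepts(uint32_t programmed_id, uint32_t packet_id)
{
    assert(mc_id_is_valid(programmed_id));
    return mc_id_is_valid(packet_id) && packet_id == programmed_id;
}
```

A core registered for ID 5 would accept a transaction tagged 5 and ignore one tagged 6; any value of 2048 or above does not fit the field at all.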

The only issue I'm having right now has already been mentioned by aolofsson (http://forums.parallella.org/viewtopic.php?f=9&t=650#p4087). The multicast reaches every core of the coarse-grained array only if it is issued by a core in the first row. If a core outside the first row issues it, not all of the cores receive it, even if all of them registered for it. So this is clearly a problem, and I hope it can potentially be solved by the Zynq's PL?

Regarding the transfer of larger buffers: that should be possible, but the DMA is, first, quite fast and, second, relieves the issuing core. Still, as long as the core issues its stores directly back-to-back, it should be at least sufficiently fast.
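
As a rough illustration of "stores directly back-to-back", here is a minimal plain-C copy loop, unrolled by four so that consecutive stores are not separated by loop overhead. The function name `store_words` is made up for this sketch, and `dst` merely stands in for a destination that a multicast setup would cover; the setup itself and any Epiphany-specific addressing are assumptions that are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n 32-bit words with plain stores. dst is volatile so that the
 * compiler emits every store instead of merging or dropping them. */
void store_words(volatile uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Main loop: four stores per iteration, issued one after another. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Tail: the remaining 0 to 3 words. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Whether this gets close to DMA throughput would have to be measured on the hardware.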