Planning parallel accelerated programming structures

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

Planning parallel accelerated programming structures

Postby theover » Sat Jul 12, 2014 12:55 am

Hi all,

I've looked at the various parts of the path Zynq <--> FPGA <--> Epiphany, and came to the conclusion that I cannot easily find out the approximate but bit-accurate bandwidths and latencies of the various paths between a Zynq process/thread and an Epiphany core.

Wouldn't it be good to have some sort of chart that explains the main idea: what the write speed is from dynamic RAM claimed by a Zynq process (possibly sitting in the first- or second-level Zynq cache) to an FPGA buffer (via AXI) or to the RAM shared between the FPGA and the ARMs (I don't know whether that option is used), what the possible clash with HDMI DMA activity is, and what the (current) actual bandwidth from the FPGA to the Epiphany is with the various communication options (DMA or memory-mapped)?

That way, when I want to know whether I can process data in a given limited time frame, I can look up the achievable bandwidth and how long it takes, in the worst case, to get the data from a Zynq process's address space to an Epiphany core starting to process it.
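To make it concrete, something along these lines is roughly the host-side measurement I have in mind (just a sketch of mine using the e-hal calls as I understand them; the buffer size, core coordinates and the 0x2000 local address are placeholders, not claims about real figures):

/* Rough host-side bandwidth probe for the ARM -> Epiphany core path.
 * Buffer size, core (0,0) and local address 0x2000 are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <e-hal.h>

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;
    char buf[4096];                       /* one 4 KB chunk into core-local SRAM */
    struct timespec t0, t1;
    int i, iters = 1000;

    memset(buf, 0xA5, sizeof(buf));

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);
    e_open(&dev, 0, 0, 1, 1);             /* just core (0,0) for this test */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        e_write(&dev, 0, 0, 0x2000, buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("host -> core write: %.2f MB/s\n",
           iters * sizeof(buf) / secs / 1e6);

    e_close(&dev);
    e_finalize();
    return 0;
}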

T.V.
theover
 
Posts: 181
Joined: Mon Dec 17, 2012 4:50 pm

Re: Planning parallel accelerated programming structures

Postby aolofsson » Wed Aug 20, 2014 10:21 pm

T.V,
Does the following help as a start?
https://github.com/adapteva/epiphany-ex ... width-test
Andreas
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Planning parallel accelerated programming structures

Postby theover » Thu Aug 21, 2014 8:48 pm

Hi Andreas,

Of course I've found the examples with some pleasure and ran them to satisfy myself that the board and the chips function, and that the compiler works (which is great to have all in one), so I have certainly run the second-to-latest or so bandwidth test, and felt right at home with specifications of the kind of bytes/sec, etc.

However, in line with the work on parallel programming I did at university (a complicated computer graphics acceleration project in the 90s), it is always a struggle to get the bandwidth up to pleasant specs, and even more so to pin down the latency and the startup time of the various DMAs, etc.

So while I'm glad to have the bandwidth example, can certainly use it for some of the purposes I've set for myself, and will probably use some of the FPGA channels for my own ends by borrowing a bit from the good examples (if that proves to be a manageable job), the achieved bandwidths are fine for moderately fast FPGA communication, which is great, but on the low side for the promised supercomputing specs, even for a cheap board. I understand that it may be a lot of work to debug fast busses, that hopefully the main FPGA <--> Epiphany connection can be brought up to approach the speed (bandwidth) promised in the original project specification, and that a lot of loose ends come up in such complicated projects, but frankly it is a bit disappointing to see how the actual bandwidth figures compare to what was promised long ago.

And I'd like to know the latency as well; I suppose I can figure it out myself, and I started on that a while ago.
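For that, something like this crude host-side probe is what I started from (again only my own sketch; the local address 0x2000 and the iteration count are arbitrary choices of mine):

/* Crude latency probe: time many single-word e_read() round trips to a
 * core's local memory and take the average.  Numbers here are placeholders. */
#include <stdio.h>
#include <time.h>
#include <e-hal.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    e_epiphany_t dev;
    unsigned word = 0;
    int i, iters = 100000;

    e_init(NULL);
    e_reset_system();
    e_open(&dev, 0, 0, 1, 1);

    double t0 = now_sec();
    for (i = 0; i < iters; i++)
        e_read(&dev, 0, 0, 0x2000, &word, sizeof(word));  /* one 4-byte round trip */
    double t1 = now_sec();

    printf("avg host <-> core(0,0) single-word read: %.2f us\n",
           (t1 - t0) / iters * 1e6);

    e_close(&dev);
    e_finalize();
    return 0;
}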

Then there is what I'd call the "system map". It's clear to me that there is a main ARM memory with two cache levels; an FPGA that can access that memory and that can be accessed by the ARM processors (with some form of prioritizing scheme between the two cores); a special shared memory between the ARMs and the FPGA; potential local memory in the FPGA (of two kinds: logic-based and prefab banks); and finally the connection from the FPGA pins to the DMA interface of one of the Epiphany's network links, able to access the local cores and be accessed by them.

What I'd like to know is which of these interfaces become active, and with what latency and bandwidth, in the various examples. And for the highest-bandwidth and/or lowest-latency connection scheme, how a transfer runs, for instance:

ARM(i) --> shared dynamic RAM --> FPGA DMA port so-and-so --> local FPGA memory (or word) this-and-that --> FPGA pins/Epiphany interface --> n routing steps --> local Epiphany core (j) (addresses "A")

and how the supplied library signals that all of this has been done atomically, or has with certainty been executed, or will take a specified amount of time, and how that can be verified with such-and-such a call, etc.
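To make that chain concrete, the shared-DRAM leg as I picture it from the host side would look roughly like this (purely my own sketch; the e_alloc offset and size are made up, and I may well have the semantics wrong, which is exactly why a diagram would help):

/* Sketch of the shared-DRAM leg of the chain: the ARM writes into the
 * external memory segment via e_alloc()/e_write(); an Epiphany core would
 * then pull it into its local SRAM itself (core-side code not shown).
 * The 0x01000000 offset and the 1 MB size are placeholders I made up. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <e-hal.h>

int main(void)
{
    e_mem_t emem;
    char *src = malloc(1 << 20);          /* 1 MB staging buffer on the ARM side */
    memset(src, 0x5A, 1 << 20);

    e_init(NULL);
    e_reset_system();

    /* Map a window of the shared external DRAM (row/col are ignored for e_mem_t). */
    if (e_alloc(&emem, 0x01000000, 1 << 20) != E_OK) {
        fprintf(stderr, "e_alloc failed\n");
        return 1;
    }

    e_write(&emem, 0, 0, 0, src, 1 << 20);   /* ARM -> shared DRAM */
    /* ... an e-core program would now DMA or read this region into its local
     * SRAM, which is where the e-link and routing latencies come in ... */

    e_free(&emem);
    free(src);
    e_finalize();
    return 0;
}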

All of this would fit in a nice one-page image (PDF?), and would make clear a lot that I cannot readily read from the library and system manuals or the examples!

Regards,

Theo Verelst
http://www.theover.org
theover
 
Posts: 181
Joined: Mon Dec 17, 2012 4:50 pm

