Memory transfer benchmark

Any technical questions about the Epiphany chip and Parallella HW Platform.

Moderator: aolofsson

Re: Memory transfer benchmark

Postby ysapir » Thu Jun 20, 2013 5:50 am

shodruk wrote: Is this understanding of the eLink correct?

The eLink packet size is always 104 bits
(data: 32, src_address: 32, dst_address: 32, control: 8).

Writing 32 bits costs 104 bits of link bandwidth.

Reading 32 bits costs 208 bits of link bandwidth
(104 bits for the request, 104 bits for the response).


Not exactly. The 104 bits together constitute a single 32-bit transaction. They all move in the same clock cycle, but only the actual data payload is counted toward the bandwidth figure.

Re: Memory transfer benchmark

Postby shodruk » Thu Jun 20, 2013 6:18 am

Isn't that about the eMesh? I'm talking about the eLink.
As far as I can see, the eLink has only two 8-bit serial links (one in, one out) and no address bus...

Re: Memory transfer benchmark

Postby aolofsson » Thu Jun 20, 2013 4:11 pm

Is this what you are looking for? (data from Yaniv)

Using memcpy() at 600 MHz:

Core -> SRAM: write speed = 504.09 MBps, clocks = 9299
Core <- SRAM: read speed = 115.65 MBps, clocks = 40531
Core -> ERAM: write speed = 142.99 MBps, clocks = 32782
Core <- ERAM: read speed = 4.19 MBps, clocks = 1119132

Using dma_copy():

Core -> SRAM: write speed = 1949.88 MBps, clocks = 2404
Core <- SRAM: read speed = 480.82 MBps, clocks = 9749
Core -> ERAM: write speed = 493.21 MBps, clocks = 9504
Core <- ERAM: read speed = 154.52 MBps, clocks = 30336

Re: Memory transfer benchmark

Postby shodruk » Fri Jun 21, 2013 9:00 am

Significant improvement! (especially the ERAM write speed) :o
But now I'm a little more confused.
What caused the difference?
What I want to know is the formula for the theoretical data bandwidth of the eLink.
I want to know the exact specifications of the Epiphany, because without them we can't determine where and how to optimize for it.
I have read the reference manual, the datasheet, and the HDL source code, but the description of the eLink did not feel sufficient.

I have some ideas for memory optimization, so I need to know the details of the eLink
(using a designated core as a memory-management unit, complex gather/scatter, prefetching, assigning/feeding data to another core, etc.).

Re: Memory transfer benchmark

Postby aolofsson » Fri Jun 21, 2013 1:15 pm

Thanks for analyzing the system bandwidth and pointing out the documentation deficiency. We will beef up the section on the link in the datasheet.

Getting data transfer right takes careful design to ensure that there are no bottlenecks in the system and/or program. We are still working on optimizing the FPGA logic and software architecture to boost performance.

In the meantime, some pointers about the Epiphany link hardware.

-The elink has an "automatic" burst mode that kicks in only for streams of sequential 64-bit transactions with an 8-byte stride, e.g. addresses 0x0, 0x8, 0x10, 0x18, etc.
-In this burst mode the elink transfer stream becomes: 32-bit address, 64-bit data, 64-bit data, 64-bit data, ... getting us very close to the peak theoretical bandwidth for large buffers.
-In all other cases (reads; byte, short, or word writes; non-sequential addresses) each transfer is 104 bits, of which only 8-32 bits are "useful" link bandwidth.
-To maximize bandwidth, the cores should access off-chip resources through the link in an orderly fashion (not randomly). This is similar to the DRAM access constraints one would usually employ to avoid page thrashing. Still, we wish we had put something in the link to make this burst mode more automatic... next version of the chip ;)

This will be documented in the datasheet...

Andreas

Re: Memory transfer benchmark

Postby shodruk » Sat Jun 22, 2013 5:31 am

Thank you very much for the detailed explanation.

aolofsson wrote: In this burst mode the elink transfer stream becomes: 32-bit address, 64-bit data, 64-bit data, 64-bit data, ... getting us very close to the peak theoretical bandwidth for large buffers.


I'm glad to hear that!
That's just what I wanted! :D

Now I have learned these things:

    We should use 64-bit transfers as much as possible to maximize transfer speed.

    The theoretical data write bandwidth is
    [Epiphany clock] bytes/sec
    (about 636 MB/s at 667 MHz).

I suspect the burst mode can be disrupted by arbitrary external memory accesses from multiple cores, so my idea (using ONE core as a memory-management core that serializes the other cores' memory accesses) may be suitable for such a case.

Now I want to know which core has the minimal latency and hop count for external memory access.

At the moment reads are slower than writes, but this may also be overcome with these methods:

    An eCore sends a user-defined block-read command to the host
    (the command carries a source address, a destination address, and a transfer size).

    The host stores these commands in a command queue (to prevent blocking).

    The host reads the queue, then sends the block data to the Epiphany using burst transfer mode.

Re: Memory transfer benchmark

Postby ticso » Wed Jul 03, 2013 8:18 am

What options are available to copy from the host into Epiphany memory without the Epiphany issuing read requests?

So far I've seen memcpy() on the ARM, which was benchmarked in this thread.
Was that generic ARM memcpy or NEON-enhanced code?

I've heard about DMA. Does the ARM or the current FPGA design have a generic-transfer DMA?
Another thread, however, makes me believe there is no such thing right now and that it would need to be implemented in the FPGA.

Re: Memory transfer benchmark

Postby shodruk » Wed Jul 03, 2013 9:44 am

Maybe this page can help you, though it doesn't look very easy:

http://www.wiki.xilinx.com/Zynq+Linux+pl330+DMA

Re: Memory transfer benchmark

Postby tnt » Wed Jul 03, 2013 9:55 am

The bad performance when writing data from the ARM to the Epiphany is most likely not due to the ARM itself, but to the fact that this datapath goes through the GP AXI slave interface, which is really not meant for high-performance transfers. I'm not sure that using the DMA would help all that much there.

The best option would be a DMA engine on an HP AXI port that reads data from DDR and writes it directly to the e-link, skipping a lot of layers of the interconnect.

Re: Memory transfer benchmark

Postby shodruk » Wed Jul 03, 2013 12:21 pm

Is it possible for the host or an eCore to kick off a DMA transfer from ERAM to SRAM?
