
Re: Memory transfer benchmark

PostPosted: Thu Jun 20, 2013 5:50 am
by ysapir
shodruk wrote: Is this understanding of the eLink correct?

The eLink packet size is always 104 bits.
(data: 32, src_address: 32, dst_address: 32, control: 8)

Writing 32 bits costs 104 bits of bandwidth.

Reading 32 bits costs 208 bits of bandwidth.
(104 bits for the request, 104 bits for the response)


Not exactly. The 104 bits are all counted as a single 32-bit transaction. They all move in the same clock cycle, but only the actual data is counted towards the bps count.

Re: Memory transfer benchmark

PostPosted: Thu Jun 20, 2013 6:18 am
by shodruk
Isn't that about the eMesh? I'm talking about the eLink.
AFAICS the eLink only has two (in/out) 8-bit serial links, with no address bus...

Re: Memory transfer benchmark

PostPosted: Thu Jun 20, 2013 4:11 pm
by aolofsson
Is this what you are looking for? (data from Yaniv)

Using memcpy() @ 600 MHz:

Core -> SRAM: Write speed =  504.09 MBps, clocks = 9299
Core <- SRAM: Read speed  =  115.65 MBps, clocks = 40531
Core -> ERAM: Write speed =  142.99 MBps, clocks = 32782
Core <- ERAM: Read speed  =    4.19 MBps, clocks = 1119132

Using dma_copy():

Core -> SRAM: Write speed = 1949.88 MBps, clocks = 2404
Core <- SRAM: Read speed  =  480.82 MBps, clocks = 9749
Core -> ERAM: Write speed =  493.21 MBps, clocks = 9504
Core <- ERAM: Read speed  =  154.52 MBps, clocks = 30336
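For anyone who wants to reproduce numbers like these on an eCore, here is a minimal sketch of one measurement. It assumes the e-lib core-timer and DMA calls (e_ctimer_set/start/get/stop, e_dma_copy), an 8 KB buffer, and a 600 MHz core clock; the buffer size and pointers are assumptions, not necessarily the exact setup behind the table above.

#include <string.h>
#include "e_lib.h"

#define BUF_SIZE 8192                        /* bytes per copy (assumed)      */
#define CORE_HZ  600000000.0f                /* 600 MHz core clock (assumed)  */

float measure_copy_mbps(void *dst, const void *src)
{
    /* Count elapsed core clocks with CTIMER0 (it counts down from the preset). */
    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);

    memcpy(dst, src, BUF_SIZE);              /* swap in e_dma_copy(dst, (void *)src, BUF_SIZE)
                                                to time the DMA engine instead */

    unsigned clocks = E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);
    e_ctimer_stop(E_CTIMER_0);

    float seconds = (float)clocks / CORE_HZ;
    return ((float)BUF_SIZE / seconds) / (1024.0f * 1024.0f);   /* MBps, binary MB */
}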

Re: Memory transfer benchmark

PostPosted: Fri Jun 21, 2013 9:00 am
by shodruk
Significant improvement (especially the ERAM write speed)! :o
But now I'm a little more confused.
What was the cause of that?
What I want to know is the formula for the theoretical data bandwidth of the eLink.
I want to know the correct specifications of the Epiphany, because without them we can't determine where and how to optimize for it.
I have read the reference manual, the datasheet, and the HDL source code, but the description of the eLink did not seem sufficient.

I have some ideas for memory optimization, so I need to know the details of the eLink
(using a dedicated core as a memory management unit, complex gather/scatter, prefetch, assigning/feeding data to other cores, etc.).

Re: Memory transfer benchmark

PostPosted: Fri Jun 21, 2013 1:15 pm
by aolofsson
Thanks for analyzing the system bandwidth and pointing out the documentation deficiency. We will beef up the section on the link in the datasheet. Getting data transfer right takes careful design to ensure that there are no bottlenecks in the system and/or program. We are still working on optimizing the FPGA logic and software architecture to boost performance.

In the meantime, some pointers about the Epiphany link hardware.

-The elink has an "automatic" burst mode that only kicks in for 64-bit data streams of sequential transactions with the following stride, e.g. 0x0, 0x8, 0x10, 0x18, etc.
-In this burst mode the elink transfer stream becomes: 32-bit address, 64-bit data, 64-bit data, 64-bit data, ... getting us very close to the peak theoretical bandwidth for large buffers.
-In all other transfer cases (reads; byte, short, and word transfers; non-sequential addresses), the transfers are 104 bits (of which only 8-32 bits are "useful" link bandwidth).
-To maximize bandwidth, the cores should access off-chip resources through the link in an orderly fashion (not randomly); see the sketch after this list. This is similar to the DRAM access constraints one would usually employ to avoid page thrashing. Still, we wish we had put something in the link to make this burst mode more automatic... next version of the chip ;)
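To make the burst-mode point concrete, here is a minimal sketch of a burst-friendly copy loop on an eCore. It simply issues sequential, 8-byte-aligned 64-bit stores at a stride of 0x8, the pattern described above; the function name and buffers are illustrative, and whether the compiler actually emits doubleword stores depends on alignment and optimization settings.

#include <stdint.h>

/* dst points at an off-chip buffer (e.g. ERAM), src at core-local SRAM;
 * both must be 8-byte aligned, n is the number of 64-bit words.        */
void burst_friendly_copy(volatile uint64_t *dst, const uint64_t *src, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];   /* sequential doubleword writes: 0x0, 0x8, 0x10, 0x18, ... */
}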

This will be documented in the datasheet...

Andreas

Re: Memory transfer benchmark

PostPosted: Sat Jun 22, 2013 5:31 am
by shodruk
Thank you very much for the detailed explanation.

aolofsson wrote: In this burst mode the elink transfer stream becomes: 32-bit address, 64-bit data, 64-bit data, 64-bit data, ... getting us very close to the peak theoretical bandwidth for large buffers.


I'm glad to hear that!
That's just what I wanted! :D

Now I have learned the following.

    We should use 64-bit transfers as much as possible to maximize transfer speed.

    The theoretical data write bandwidth is
    [Epiphany clock] bytes/sec
    (636 MB/s at 667 MHz); see the arithmetic below.
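Spelling out the number in parentheses (assuming, as in the formula above, one byte of payload per link clock):

667 MHz x 1 byte/clock = 667 x 10^6 bytes/s, which divided by 1024 x 1024 is about 636 MB/s.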

I guess burst mode may be interfered with by arbitrary external memory accesses from multiple cores, so my idea (using ONE core as a memory management core that serializes the other cores' memory accesses) may be suitable for such a case.

Now I want to know which core has the minimum latency and hop count for external memory access.

At the moment, reads are slower than writes, but this may also be overcome with methods like the following (a rough data-structure sketch follows the list).

    An eCore sends a (user-defined) block-read command to the host
    (the command carries the source address, destination address, and transfer size).

    The host stores these commands in a command queue (to prevent blocking).

    The host reads the queue, then sends the block of data to the Epiphany using burst transfers.
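To make the idea concrete, a hypothetical layout for such a command queue in shared external memory might look like the sketch below; all names, field widths, and the queue depth are illustrative, not part of the Epiphany SDK.

#include <stdint.h>

#define CMD_QUEUE_DEPTH 16

typedef struct {
    uint32_t src_addr;      /* source address (e.g. in ERAM)            */
    uint32_t dst_addr;      /* destination address (e.g. in core SRAM)  */
    uint32_t nbytes;        /* transfer size in bytes                   */
    uint32_t valid;         /* set by the eCore, cleared by the host    */
} block_read_cmd;

typedef struct {
    volatile block_read_cmd slot[CMD_QUEUE_DEPTH];
    volatile uint32_t tail; /* next slot an eCore fills in              */
    volatile uint32_t head; /* next slot the host will service          */
} cmd_queue;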

Re: Memory transfer benchmark

PostPosted: Wed Jul 03, 2013 8:18 am
by ticso
What options are available to copy from host to Epiphany memory without issuing read requests from the Epiphany?

So far I've seen memcpy on the ARM, which was benchmarked in this thread.
Was it generic ARM memcpy or NEON-enhanced code?

I've heard about DMA - does the ARM or the current FPGA code have a generic transfer DMA?
Another thread, however, makes me believe there is no such thing right now and that it needs to be implemented in the FPGA.

Re: Memory transfer benchmark

PostPosted: Wed Jul 03, 2013 9:44 am
by shodruk
Maybe this page can help you, but it doesn't seem very easy.

http://www.wiki.xilinx.com/Zynq+Linux+pl330+DMA

Re: Memory transfer benchmark

PostPosted: Wed Jul 03, 2013 9:55 am
by tnt
The bad performance when writing data from the ARM to the Epiphany is most likely not due to the ARM itself, but simply because on this datapath you're going through the GP AXI slave interface, which is really not meant for high-performance transfers. I'm not sure that using the DMA would help all that much there.

The best option is DMA logic on an HP AXI port that reads data from DDR and writes directly to the e-link, skipping a lot of layers of the interconnect.

Re: Memory transfer benchmark

PostPosted: Wed Jul 03, 2013 12:21 pm
by shodruk
Is it possible for the host or an eCore to kick off a DMA from ERAM to SRAM?