Parallella Memory

Any technical questions about the Epiphany chip and Parallella HW Platform.

Moderator: aolofsson

Parallella Memory

Postby grzeskob » Mon Nov 17, 2014 10:09 pm

Hi,

I am trying to understand the memory architecture of the Parallella. Unfortunately I have little knowledge of VHDL and cannot read the details out of the HW files myself. I would be grateful for corrections if I have misunderstood something.

According to the documentation and forum posts, I would summarize it as follows.
There are three sorts of RAM available in the system [2]:
1. Host RAM
Q1. Is this the "O/S DRAM" from Figure 4 of the Reference Manual (marked as Off-Chip), or the On-Chip Memory described in Ch. 29 of the Zynq-7000 Technical Reference Manual? Where can I find the details about this memory (size, address, accessibility etc.)?
2. Shared RAM
Shared RAM - 1 GB, 32-bit - accessible to both the dual-core ARM and each Epiphany core.
Described as SHARED DRAM in Figure 4 of the Reference Manual.
3. Epiphany local RAM
16 x 32 kB, accessible to the dual-core ARM and each Epiphany core.
Described as Epiphany in Figure 4 of the Reference Manual.

Dataflow:
Q2. Data between Epiphany cores is exchanged via the rMesh and cMesh. It could also be exchanged via the shared RAM, but that would make little sense given the poor performance.

"The Epiphany coprocessor is connected to the Zynq SOC via the 48-pin eLink interface" [1].
Q3. Does the eLink interface have anything to do with the eMesh NOC? If not, how is data transferred from the eLink to a particular Epiphany core?
Q4. Does data exchange between the Epiphany and the shared RAM go through eLink <-> FPGA <-> memory controller <-> shared RAM, without any involvement of the dual-core ARM?
Q5. Are these all of the data exchange possibilities between the dual-core ARM and the Epiphany?
- The ARM can send a binary (code and data) to the Epiphany via the eLink
- The ARM can read the Epiphany local RAM via the eLink
- The ARM can read/write the shared memory via the memory controller
- The Epiphany can read/write the shared memory as described in Q4
- The Epiphany cannot read/write from/to the host RAM

References:
[1] Epiphany Architecture Reference
[2] (http://forums.parallella.org/viewtopic.php?f=13&t=1670&p=10459&hilit=RAM#p10459)
grzeskob
 
Posts: 12
Joined: Mon Nov 17, 2014 8:36 pm

Re: Parallella Memory

Postby sebraa » Mon Nov 17, 2014 11:55 pm

Host DRAM is probably the "O/S DRAM", although I have not checked this detail; I don't think it matters. It is 1 GB in size. The shared DRAM is located in that same host RAM - it is just a 32 MB part of it. The local RAM is 32 KB in size and sits inside each Epiphany core.

The Epiphany cores exchange data via the rMesh, cMesh and xMesh only. The external hardware interface to the mesh (available on the pins) is called the eLink.

The eLink interface is implemented in the FPGA logic of the Xilinx Zynq and connected to the AXI bus of the ARM cores inside the Xilinx Zynq. There is probably no software involved when transmitting between Epiphany and Shared DRAM.

This is my understanding of it. If I'm wrong, please correct me.
Last edited by sebraa on Mon Nov 24, 2014 1:57 pm, edited 1 time in total.
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: Parallella Memory

Postby greytery » Tue Nov 18, 2014 4:28 pm

Hi grzeskob,

First, the Epiphany is not the Parallella. (It's a bit like using the spec of an AMD graphics chip to determine how a PC's memory is managed).
The Epiphany manual is written from the view of the Epiphany which could be housed in any number of systems, of which the Parallella-16 is one example.
I suggest you check the Parallella manual. That will give you a better overview of how the Parallella memory works as a whole, from the point of view of the ARM/Linux/Host code and from the Epiphany viewpoint. On Parallella, the key is how the Zynq maps the memory.

Also, please read this post by timpart, which I found very useful.

Adding to sebraa's answer, my take is as follows ...
Q1. See Parallella manual.

Q2. If the dataset is so large that it exceeds the size of the Epiphany memory, then one way would indeed be to use the 32 MB shared RAM. And yes, the access speeds are considerably slower, which would need to be taken into account when designing Epiphany programs - and may not be worth the effort in some cases. (Note that the next version of the SDK is supposed to include a 'paging' facility for the Epiphany, so clearly there are expected to be cases where that is useful.)

Q3. There is an eLink interface at each of the N,S,E & W edges which effectively terminates the eMesh. The eLink is supposed to be the interface between Epiphany chips and would pass the data through to the eMesh, which continues the routing according to the (row,col) coordinates. The Zynq implements an interface to the East eLink (the West eLink is not connected, and usually, nothing is connected to the North or South either). When the ARM writes to an address which maps to an Epiphany core the Zynq routes it to the East eLink, which passes it through to the eMesh - and then it's a question of what (row,col) coordinates are set.

Q4. Correct. The Zynq maps the addresses used by the Epiphany onto the address range of the 32 MB shared memory.

Q5. Correct - if you expand 'eLink' to include the Zynq FPGA interface to the East eLink.
Also, "the Epiphany cannot read/write from/to the host RAM" refers to the DRAM left over after reserving the top 32 MB for shared memory. The way the Zynq/eLink interface remaps the addresses coming from the Epiphany limits its access to that top 32 MB. The translation code is mentioned here.

Hope that helps,
tery
greytery
 
Posts: 205
Joined: Sat Dec 07, 2013 12:19 pm
Location: ^Wycombe, UK

Re: Parallella Memory

Postby jrambo316 » Tue Nov 25, 2014 6:29 pm

greytery wrote:Q2. If the dataset is so large that it exceeds the size of the Epiphany memory, then one way would indeed be to use the 32 MB shared RAM. And yes, the access speeds are considerably slower, which would need to be taken into account when designing Epiphany programs - and may not be worth the effort in some cases. (Note that the next version of the SDK is supposed to include a 'paging' facility for the Epiphany, so clearly there are expected to be cases where that is useful.)


I have been reading the posts on the issue of programs that are too big to fit into the local memory, and I have not seen reference to a couple of other options:
1. In the case of single program multiple data, it seems like it would be possible to have one copy of the program/data spread across multiple cores' local memory and then just point all the cores to it (each core with a separate stack though). It seems like that would be faster than external memory, but I wonder how the mesh would handle that.
2. What about the block RAM on the FPGA? The 7020 is said to have 512 KB of block RAM. Would it be possible to put one copy of the program in there and point all the cores to it? Would that be faster than the DRAM access?
Be gentle, I am a software guy!
jrambo316
 
Posts: 4
Joined: Tue Nov 25, 2014 6:16 pm

Re: Parallella Memory

Postby grzeskob » Wed Nov 26, 2014 8:17 am

@greytery, @sebraa, thank you for the very useful answers.

jrambo316 wrote:1. In the case of single program multiple data, it seems like it would be possible to have one copy of the program/data spread across multiple cores' local memory and then just point all the cores to it (each core with a separate stack though). It seems like that would be faster than external memory, but I wonder how the mesh would handle that.


Does that mean each core would request a memory access (via the mesh) every clock cycle to fetch the next instruction? I really like this idea - it seems like a good benchmark, a stress test for the mesh network.
Could you share the links to the corresponding posts?

Re: Parallella Memory

Postby sebraa » Wed Nov 26, 2014 11:19 am

Reading from the mesh is much slower than writing. Although it is possible to execute code from anywhere in the address space, you should aim for local execution at least. You could copy the code to the local core first, though.

Re: Parallella Memory

Postby greytery » Wed Nov 26, 2014 11:24 am

Need to think of the data aspect too. All code and no data makes a dull, usually pointless, program.
jrambo316 wrote:1. In the case of single program multiple data, it seems like it would be possible to have one copy of the program/data spread across multiple cores' local memory and then just point all the cores to it (each core with a separate stack though). It seems like that would be faster than external memory, but I wonder how the mesh would handle that.

The instruction fetch in each core doesn't work like that, and even if it did, the mesh would be running flat out and you'd have no bandwidth left to shovel the data. Also, the mesh uses a weak memory-ordering model, so there would be no guarantee of the order of instructions - so, no go. But reading and writing data on other cores is possible.
jrambo316 wrote:2. What about the block RAM on the FPGA? The 7020 says it has 512 KB block RAM. Would it be possible to put one copy of the program in there and point all the cores to it? Faster than the DRAM access?

Reading up on the Zynq makes me realise just what a bargain the Parallella board actually is! All sorts of weird possibilities here, including using that RAM.
There's less block RAM available on a 7010, but it is there. Some of it is (probably) used by Zynq IP blocks anyway (buffers, etc.), depending on what's configured, so who knows just how much 'spare' there is. But access to the Zynq block RAM from the Epiphany would be constrained by the eLink interface - so it would be just as slow as the shared memory.
A good bit of FPGA logic would be required to stitch the blocks of RAM together. I consider that to be firmware rather than software.

Cheers,
tery

Re: Parallella Memory

Postby cmcconnell » Wed Nov 26, 2014 2:53 pm

I've been meaning to ask about the meaning of the following in the SDK documentation -
5.6 Memory Management Examples

The Epiphany SDK gives the programmer complete control over data and code placement through section attributes that can be embedded in the source code. Memory management attributes placed in the source code will be ignored by the standard Linux GCC compiler.

The following examples illustrate some attributes that can be placed inside the source code or in a stub ‘*.c’ file that gets compiled with the rest of the application source file. The general attribute should be placed outside the main routine.

1. How to specify the core where the executable will run:

asm(".global __core_row__;");
asm(".set __core_row__,0x20;");
asm(".global __core_col__;");
asm(".set __core_col__,0x24;");


If I'm reading it right, this seems to be a suggested method of arranging for all of the code in a given program to be placed in the memory of one specific core. (I.e., does 'where the executable will run' mean 'from where the executable will be run'?)

Presumably, then, each core that will actually be executing the program will still get its own small startup routine, responsible for jumping to the shared main function. (Or have I completely misunderstood?)
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Parallella Memory

Postby jrambo316 » Wed Nov 26, 2014 6:10 pm

Thanks for the informative responses. I did not know whether the slowness of external memory was due to the FPGA interface or something beyond it. If the weak memory model of the mesh is a problem for instruction ordering, how can programs be run solely from external memory, e.g. with the legacy LDF?

Re: Parallella Memory

Postby jrambo316 » Wed Nov 26, 2014 6:21 pm

grzeskob wrote:@greytery, @sebraa, thank you for very useful answers.

jrambo316 wrote:1. In the case of single program multiple data, it seems like it would be possible to have one copy of the program/data spread across multiple cores' local memory and then just point all the cores to it (each core with a separate stack though). It seems like that would be faster than external memory, but I wonder how the mesh would handle that.


Does that mean each core would request a memory access (via the mesh) every clock cycle to fetch the next instruction? I really like this idea - it seems like a good benchmark, a stress test for the mesh network.
Could you share the links to the corresponding posts?


Sorry, I was referring to the posts on the subject in general; I had not seen my proposal mentioned. If all the cores run the same program, I did not see why there should be multiple copies, unless the program is small enough to run locally on each core. I assumed that the flat address space would allow one copy spread across the space. How is that really different from running in external RAM, except faster? Obviously I need to understand the memory model and the instruction-fetch mechanism better. I have not seen docs describing the latter. Any pointers?
