Re: increase shared memory

Posted: Wed May 21, 2014 11:55 am
by timpart
greytery wrote:Tim - 'addressing memory on non-existent rows' .. Scary!
But what exactly do you mean?
And instead of moving to the left, what if we move to the right?

I'm not a HW/chip engineer- old ex-mainframe OS kernel hacker.
I don't understand how the FPGA is configured to map the E16 East-wards memory requests to the external DRAM, but it clearly does. There's some MMU magic/logic at the Zynq's Bank 35 interface to the E16.
OTOH, the FPGA also maps the ARM's memory requests onto the same DRAM - but represents that same physical space as different addresses for the host.
Seems that the memory can be as contiguous - or not - as the FPGA logic allows, or makes it appear to be. It's acting as a good old virtual paging kernel - like we had back in the good old '70s.

It took me ages to get this straight in my head, so no need to feel bad. The FPGA logic is really quite simple, it just routes stuff. (Hence the three lines!) The Epiphany chip on the other hand is more sophisticated.

The Epiphany chip has 16 or 64 cores. It only has a concept of physical memory. No MMU. No remapping of addresses. It runs programs, but doesn't have an operating system. (Or at least no one has written one yet.)
The Zynq chip that hosts the FPGA also contains a pair of ARM cores; they do have an MMU, and they run an operating system with virtual memory.

When an Epiphany core detects an off-core address, it hands the request to the eMesh for routing to the correct core. The eMesh first checks whether the address is in the same column as the current location; if it isn't, the request is sent East or West as appropriate. (The column is the second-highest 6 bits of the address, bits 25:20.) Once the column matches exactly, the request goes North or South until it reaches the correct row, then it enters that core. If a request reaches the edge of a chip, it goes into the eLink on that side to travel to the next Epiphany chip. The East and West eLinks are the most commonly used, because routing goes in that direction first.
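
The column-first routing described above can be sketched in a few lines. This is only an illustration, not the eMesh implementation: the function names are made up, the starting coordinate (32,8) is the default E16 placement mentioned elsewhere in this thread, and it assumes row numbers grow towards the South.

```python
# Illustrative sketch of column-first eMesh routing (hypothetical names).
# Address layout as described in the post: bits 31:26 = row,
# bits 25:20 = column, bits 19:0 = offset inside the core.

def decode(addr):
    """Split a 32-bit Epiphany address into (row, col, offset)."""
    return (addr >> 26) & 0x3F, (addr >> 20) & 0x3F, addr & 0xFFFFF

def next_hop(here_row, here_col, addr):
    """Direction in which a request leaves the node at (here_row, here_col)."""
    row, col, _ = decode(addr)
    if col != here_col:
        # The column is resolved first, so East/West comes before North/South.
        return "east" if col > here_col else "west"
    if row != here_row:
        # Assumption: higher row numbers lie to the South.
        return "south" if row > here_row else "north"
    return "local"

# A core at (32, 8) touching 0x8e000000 (row 35, column 32) sends it East.
print(decode(0x8E000000), next_hop(32, 8, 0x8E000000))
```

Because the column is compared first, any address whose column is higher than the chip's own columns leaves through the East eLink, whatever its row.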

On the Parallella board, the East eLink is actually connected to the FPGA, which pretends it is another Epiphany chip. In fact all it does is map DRAM to addresses where the other chip(s) would be and make the reads and writes work. The Epiphany chip is oblivious to this. It just sends requests to off-chip addresses and things happen as it expects. The West eLink from the Epiphany isn't connected to anything. That was a design choice for the Parallella. So if the Epiphany did try to access addresses to the West those requests would get lost (and if it was a read the core would be stuck waiting forlornly for a reply that will never come.) The North and South eLinks are wired up to the expansion connectors on the board.

On to the FPGA. The physical DRAM occupies a fixed set of addresses. So the FPGA has to translate Epiphany address space to physical address space. No MMU involved here, just those three lines of code. 0x8e000000 to 0x8fffffff become 0x1e000000 to 0x1fffffff by changing the top bits. This happens transparently to the eLink.
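
As a rough model of that translation (a sketch only; the two nibble values are simply read off the example addresses in the paragraph above, and the function name is invented):

```python
# Sketch of the "change the top bits" translation done by the FPGA.
VIRT_TOP = 0x8  # assumed: top nibble seen by the Epiphany (0x8e000000...)
PHYS_TOP = 0x1  # assumed: top nibble of the physical DRAM (0x1e000000...)

def epiphany_to_phys(addr):
    """Swap the top four address bits; the low 28 bits pass through untouched."""
    if (addr >> 28) == VIRT_TOP:
        return (PHYS_TOP << 28) | (addr & 0x0FFFFFFF)
    return addr

print(hex(epiphany_to_phys(0x8E000000)), hex(epiphany_to_phys(0x8FFFFFFF)))
```

Nothing is looked up or paged here; it is a fixed substitution of four bits, which is why no MMU is needed.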

Now for the ARM cores. They are running a virtual memory operating system. The first thing to do is to tell the memory manager that it can't use addresses 0x1e000000 to 0x1fffffff as physical RAM to put virtual pages into. Secondly that address range is made available to the memory map with the virtual address being the same as the physical one. This second bit allows programs running on the ARM host to access that bit of DRAM and share information with the Epiphany. I'm not sure how physical caching is set. (Caching not allowed?)

greytery wrote:The FPGA provides the East eLink interface, then it must be aware of the Epiphany network routing protocol rules. For the default E16 placement of (32,8), then the FPGA may say : if the column number of an address request is less than 7, ship further East, else check row, etc. That gives a finite set of East and then North/South addresses. That is, along rows and then up/down columns that may have no physical cores on.
The further West/right we move the E16, there is larger FPGA virtual mapping space to the East because there are more columns. The potential for addresses sent to the FPGA interface by the E16 is limited to 4 rows (or they wouldn't get routed to the East interface), but the columns can be from 0 to 59. Similarly for the E64, but with 8 rows and 55 columns.
tery


By moving West I meant moving to the Left, but yes if you give more address space to the East then you can fit in more contiguous RAM. (The contiguous range ends when FFFFF flips over to 00000 and the address represents column 0 in the next higher row.) The Epiphany chip knows where it is in the address space because it has some pins which set it. These can be overridden with certain PEC power connector pins.
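
To make the flip-over concrete, here is a small sketch (helper names are hypothetical) that builds addresses from row, column and offset, and shows what happens one byte past the end of column 63:

```python
# Sketch: where a contiguous run of addresses ends within one mesh row.

def encode(row, col, offset=0):
    """Build a 32-bit address from a mesh coordinate and a 20-bit offset."""
    return (row << 26) | (col << 20) | offset

def decode(addr):
    """Split a 32-bit address into (row, col, offset)."""
    return (addr >> 26) & 0x3F, (addr >> 20) & 0x3F, addr & 0xFFFFF

def contiguous_bytes_east(first_col):
    """Bytes from column first_col up to column 63, at 1 MB per column."""
    return (64 - first_col) << 20

last = encode(35, 63, 0xFFFFF)       # last byte of row 35
print(hex(last), decode(last + 1))   # one byte further lands in row 36, column 0
print(contiguous_bytes_east(32) >> 20, "MB from column 32 eastwards")
```

So the further East the first usable column sits, the shorter the contiguous run in that row.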

Hope this helps,

Tim

Re: increase shared memory

Posted: Mon May 26, 2014 8:28 pm
by greytery
timpart wrote: Hope this helps

Sorry for delayed response, but Yes thank you. As illustrated by https://www.youtube.com/watch?v=pQHX-SjgQvQ , sometimes the manual just can't help.
Not sure I understood all that you, shodruk and Gravis wrote but, after several passes, I think I got there.

It seems that schmurfy is not the only one in need of access to increased memory - preferably contiguous.
In http://forums.parallella.org/viewtopic.php?t=1101&p=6938#p6944 ,
Andreas wrote:Question for all: How big of an issue is it to not support large contiguous areas of physical memory for software developers?

Now, see the topic on BOINC, http://forums.parallella.org/viewtopic.php?f=22&t=40&start=50#p7926 . It appears that the data space required for the BBC (Birthday Beer Challenge) is 3*2^22 (words) = 48MB. Not sure if that's 48MB in and 48MB out - but it's already bigger than the default 32MB buffer slice. So, looks like Andreas urgently needs a bigger slice of memory - otherwise it sounds like it might be a sad, dry, lonely, boring birthday party!!

I've been trying (- on paper, my boards haven't even got to Customs yet -) to place the E16/E64 at different row+col locations - left/right/up/down - but there is always a hole in the external memory map, as seen from the Epiphany, because the Epiphany can't use row+col addresses that conflict with those it occupies. A daughter board to re-base the chip location doesn't really fix that. So, (as shodruk mentioned I think) some form of hashing will be needed for larger contiguous memory space.
Also, note that the current e-lib DMA and data manipulation routines are not entirely safe, since there are no checks on when ranges flip-over at the col edges. Maybe memcpy() does - in e_read() and e_write() - but the e_dma routines have a warning against them in the SDK manual. So that needs to be, er, addressed.

I'm working on that hashing algorithm - much more to do - and my C is very rusty, and I'd never even heard of Verilog before. If it works:
- two external memory options; 256MB and 512MB; using different build options.
- the external memory model appears to be flat, no holes.
- the addresses used in ARM and Epiphany map to the same range; i.e. 0x30000000-0x3FFFFFFF (256MB) or 0x20000000-0x3FFFFFFF (512MB).
- the hashing involves flipping bits in the top byte of the address; should be 'trivial' in both the Epiphany outbound, and the FPGA inbound address translation.
- same code for E16 and E64.

"If it works.." :D

tery

Re: increase shared memory

Posted: Mon May 26, 2014 9:25 pm
by greytery
Larger external contiguous memory - high level proposal.

<NB: Work-in-progress - which should be obvious>

An Epiphany/Parallella program always knows when it's dealing with external memory, and calls (new) e-lib routines 'e_extmem_read()', etc. These map contiguous (virtual) external memory addresses to 32MB slices, so that they will traverse and exit the eMesh eastwards. The current e-lib data routines could be used for on-chip data movements, but probably should be replaced with 'e_intmem_read()', etc., so that they can be simplified and made more robust.
Once in the FPGA, the 32MB address slices are mapped back onto the contiguous physical DRAM.

A. Linux: the contiguous physical memory is allocated in the device tree file (as per Gravis and 9600 above). Linux loads and fits into the remainder without further re-config. That is:
Set "reg = <0x00000000 0x30000000>;" for 256MB, or
Set "reg = <0x00000000 0x20000000>;" for 512MB.
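
A quick check of what those two "reg" sizes leave over, assuming the board has 1GB of DRAM starting at physical address 0 (the 1GB figure is the one quoted elsewhere in this thread; the function name is made up):

```python
# Sketch: the DRAM left above the range handed to Linux in the device tree.
DRAM_SIZE = 0x40000000  # assumption: 1 GB of DRAM based at physical address 0

def shared_window(linux_reg_size, dram_size=DRAM_SIZE):
    """Return (base, size) of the DRAM that Linux does not claim."""
    return linux_reg_size, dram_size - linux_reg_size

for reg_size in (0x30000000, 0x20000000):
    base, size = shared_window(reg_size)
    print(hex(base), size >> 20, "MB")
```

With those assumptions the leftover windows start at 0x30000000 (256MB) and 0x20000000 (512MB), which matches the ranges proposed for the Epiphany side.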

B. Utilities: e-hal, e-loader, etc., need to be made aware of the expanded memory. I don't have a full list of constants, but there are not that many. (There are some //TODO comments on e_loader scalability, among others.)

C. Epiphany: the base external memory address (also) starts at 0x30000000 (256MB) or 0x20000000 (512MB). The 'e_extmem_read()', etc., routines handle the src/dst addresses by remapping the row+col fields in the top 12 bits of an address, i.e. they change the row by row_flip.
row_flip is defined as 4 for 256MB builds or 8 for 512MB builds.
In principle:

Code:
      if (col < 32) then
              col = col + 32;
              row = row - row_flip;
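
Spelled out on whole 32-bit addresses, the rule above might look like the sketch below. The function name is hypothetical (it is not an existing e-lib call), and the range check is an extra safety net that is not part of the proposal.

```python
# Sketch of the proposed Epiphany-side (outbound) flip on a full address.

def flip_outbound(addr, row_flip):
    """Move addresses in the western half of a row onto a lower row, eastern half."""
    row = (addr >> 26) & 0x3F      # bits 31:26
    col = (addr >> 20) & 0x3F      # bits 25:20
    offset = addr & 0xFFFFF        # bits 19:0
    if col < 32:
        if row < row_flip:
            raise ValueError("row would go negative")  # added guard, not in the proposal
        col += 32
        row -= row_flip
    return (row << 26) | (col << 20) | offset

print(hex(flip_outbound(0x30000000, 4)))  # 256MB build, first slice
print(hex(flip_outbound(0x32000000, 4)))  # already in the eastern half
```

Either way the resulting column is 32 or higher, so the request heads East.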


Checks are made for edge conditions, where a range crosses the col 64/0 boundary, and the dma or data movement may be split if necessary.
Note, this is on-core, before exiting to the eMesh, so all addresses are either flipped to become east, or remain as east.
Address ranges are mapped as follows:

Code:
Epiphany Address maps to  -> eMesh Address (256MB)
0x30000000-0x31FFFFFF     -> 0x22000000-0x23FFFFFF   // flip
0x32000000-0x33FFFFFF     -> 0x32000000-0x33FFFFFF   // as is
0x34000000-0x35FFFFFF     -> 0x26000000-0x27FFFFFF   // flip
0x36000000-0x37FFFFFF     -> 0x36000000-0x37FFFFFF   // as is
....                                                 // etc
0x3C000000-0x3DFFFFFF     -> 0x2E000000-0x2FFFFFFF   // flip
0x3E000000-0x3FFFFFFF     -> 0x3E000000-0x3FFFFFFF   // as is

Epiphany Address maps to  -> eMesh Address (512MB)
0x20000000-0x21FFFFFF     -> 0x02000000-0x03FFFFFF   // flip
0x22000000-0x23FFFFFF     -> 0x22000000-0x23FFFFFF   // as is
0x24000000-0x25FFFFFF     -> 0x06000000-0x07FFFFFF   // flip
0x26000000-0x27FFFFFF     -> 0x26000000-0x27FFFFFF   // as is
.....
0x30000000-0x31FFFFFF     -> 0x12000000-0x13FFFFFF   // flip
0x32000000-0x33FFFFFF     -> 0x32000000-0x33FFFFFF   // as is
.....
0x3C000000-0x3DFFFFFF     -> 0x1E000000-0x1FFFFFFF   // flip
0x3E000000-0x3FFFFFFF     -> 0x3E000000-0x3FFFFFFF   // as is
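
The edge-condition handling mentioned before the tables could be sketched like this: a transfer is cut wherever it crosses a 32MB slice boundary, and each piece is mapped on its own. All names are hypothetical, and this is one possible way to do the split rather than the e-lib design.

```python
# Sketch: split a contiguous transfer at 32 MB slice boundaries, then map each piece.
SLICE = 32 << 20  # 32 MB = half a mesh row (32 columns of 1 MB)

def flip_outbound(addr, row_flip):
    """Proposed outbound rule: col < 32 -> col + 32, row - row_flip."""
    row, col, off = (addr >> 26) & 0x3F, (addr >> 20) & 0x3F, addr & 0xFFFFF
    if col < 32:
        col, row = col + 32, row - row_flip
    return (row << 26) | (col << 20) | off

def split_at_slices(start, length):
    """Yield (start, length) pieces that never cross a 32 MB slice boundary."""
    while length > 0:
        piece = min(SLICE - (start % SLICE), length)
        yield start, piece
        start += piece
        length -= piece

def map_transfer(start, length, row_flip):
    """List of (mesh_start, length) pieces for one contiguous transfer."""
    return [(flip_outbound(s, row_flip), n) for s, n in split_at_slices(start, length)]

# 32 bytes straddling the 0x31FFFFFF / 0x32000000 boundary (256MB build)
for mesh_start, n in map_transfer(0x31FFFFF0, 0x20, 4):
    print(hex(mesh_start), n)
```

Inside one slice the mapped addresses stay contiguous, so only transfers that straddle a slice boundary need more than one piece.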


D. FPGA: the destination address received from the eMesh needs to be flipped back if required, to point to either 256MB or 512MB of physical DRAM.
<working> :?
The reverse mapping depends on whether the row is less than the start_row, which is defined as 12 (256MB) or 8 (512MB).
Code:
      if (row < start_row) then
            row = row + row_flip;
            col = col - 32;

The challenge here is to keep the FPGA logic as neat as those magic three lines - but my Verilog is not very good. I believe that two bits need to be flipped - but that's as far as I've got tonight ...
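
As a software model of the FPGA side (not Verilog, and purely a sketch with invented names), the reverse mapping can be checked against the outbound rule by a round trip over every 1MB page of the window:

```python
# Sketch: model of the proposed FPGA (inbound) un-flip, with a round-trip check.

def split(addr):
    return (addr >> 26) & 0x3F, (addr >> 20) & 0x3F, addr & 0xFFFFF

def join(row, col, off):
    return (row << 26) | (col << 20) | off

def flip_outbound(addr, row_flip):
    """Epiphany side: col < 32 -> col + 32, row - row_flip."""
    row, col, off = split(addr)
    if col < 32:
        row, col = row - row_flip, col + 32
    return join(row, col, off)

def flip_inbound(addr, row_flip, start_row):
    """FPGA side: row < start_row -> row + row_flip, col - 32."""
    row, col, off = split(addr)
    if row < start_row:
        row, col = row + row_flip, col - 32
    return join(row, col, off)

def round_trip_ok(base, row_flip, start_row):
    """True if every 1 MB page from base up to 0x3FFFFFFF survives out-and-back."""
    return all(flip_inbound(flip_outbound(a, row_flip), row_flip, start_row) == a
               for a in range(base, 0x40000000, 1 << 20))

print(round_trip_ok(0x30000000, 4, 12), round_trip_ok(0x20000000, 8, 8))
```

The model assumes the FPGA only ever sees addresses produced by the outbound rule, i.e. with a column of 32 or more.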

Half-baked, I know, but if anybody can help here - even if proving that this is just a bovine byproduct - comments gratefully received.

tery

Re: increase shared memory

Posted: Mon May 26, 2014 10:09 pm
by timpart
greytery wrote:
Andreas wrote:Question for all: How big of an issue is it to not support large contiguous areas of physical memory for software developers?

Now, see the topic on BOINC, http://forums.parallella.org/viewtopic.php?f=22&t=40&start=50#p7926 . It appears that the data space required for the BBC (Birthday Beer Challenge) is 3*2^22 (words) = 48MB. Not sure if that's 48MB in and 48MB out - but it's already bigger than the default 32MB buffer slice. So, looks like Andreas urgently needs a bigger slice of memory - otherwise it sounds like it might be a sad, dry, lonely, boring birthday party!!


Well, he needs a way of breaking the problem down. I think it might be worse than you say above. They say it is a real-to-complex transform, so I think that means two words per entry on the output: the real and imaginary parts of the value.

I'll read the rest of your posts when I'm feeling more awake. They look interesting but complicated.

Tim

Re: increase shared memory

Posted: Tue May 27, 2014 4:44 am
by shodruk
It is possible to implement address remapping in software, so I think there is no problem with the shared memory system.

Re: increase shared memory

Posted: Tue May 27, 2014 5:30 pm
by greytery
timpart wrote: ... interesting but complicated.

This all came from trying to understand your comment about mapping addresses onto non-existent rows, which I found 'scary'. I'll settle for 'interesting'. And the 'complicated' is because my Verbosity Switch broke years ago - and they can't get the parts anymore. ;)

I've just realised it can be simplified, a bit, if it helps.
There's no need to have different address mapping adjustments for 256MB or 512MB between the Epiphany and the FPGA.
It means that contiguous external memory can be allocated up to 512MB in 32MB increments - but that really would make it complicated to keep all those configuration constants aligned!
It also means that you would just need one version of the FPGA for up to 512MB external memory (the current default mapping in the SDK/e-lib needs to be re-based though - worth the headache?).

Epiphany:
Code:
      if (col < 32) then
              col = col + 32;
              row = row - 8;


FPGA:
<still working> :?

Code:
      if (row < 8) then
            row = row + 8;
            col = col - 32;


This involves flipping rows of addresses in 32MB chunks, but it boils down to just two bits ( [29],[25] ) of the ext_mem address.
After passing the test, Epiphany flips them from 1,0 to 0,1 - and the FPGA flips them from 0,1 to 1,0.
So I hope that, in practice, that's a bit of assembler in the Epiphany, and no more than three magic lines of Verilog.
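
The two-bit view can be tried out with a single XOR mask. The sketch below assumes addresses stay inside 0x20000000-0x3FFFFFFF, so bit 29 is set on the way in; the names are made up.

```python
# Sketch: the flip-down rule expressed as an XOR of address bits 29 and 25.
FLIP_MASK = (1 << 29) | (1 << 25)

def epiphany_flip(addr):
    """Outbound: bits (29,25) go from (1,0) to (0,1) when col < 32 (bit 25 clear)."""
    if ((addr >> 25) & 1) == 0:
        return addr ^ FLIP_MASK
    return addr

def fpga_unflip(addr):
    """Inbound: bits (29,25) go from (0,1) back to (1,0) when row < 8 (bit 29 clear)."""
    if ((addr >> 29) & 1) == 0:
        return addr ^ FLIP_MASK
    return addr

def round_trip_ok():
    """True if every 1 MB page of the 512MB window survives out-and-back."""
    return all(fpga_unflip(epiphany_flip(a)) == a
               for a in range(0x20000000, 0x40000000, 1 << 20))

print(hex(epiphany_flip(0x3C000000)), hex(fpga_unflip(0x1E000000)), round_trip_ok())
```

Toggling both bits with one XOR is the same as "row - 8, col + 32" on the way out and "row + 8, col - 32" on the way back, as long as the test guarantees the bits start in the expected state.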

"If it works " :D

tery

Re: increase shared memory

Posted: Wed May 28, 2014 6:33 am
by timpart
I'm not sure I understand why you need to change the FPGA code to make the mapping go to contiguous physical DRAM. Why not just let it go to chunks? I would imagine that the Linaro memory mapping could be configured to make it look contiguous from Linaro's point of view.

Tim

Re: increase shared memory

Posted: Wed May 28, 2014 10:23 am
by greytery
timpart wrote:I'm not sure I understand why you need to change the FPGA code to make the mapping go to contiguous physical DRAM. Why not just let it go to chunks? I would imagine that the Linaro memory mapping could be configured to make it look contiguous from Linaro's point of view.

I take it that you go along with Epiphany programs seeing a contiguous bank of memory via e_extmem() routines.

Configuring Linaro memory management to remap the chunks back to contiguous memory for host programs sounds like messing with the kernel. Now, that's REALLY scary! :shock:
Didn't know you could do it. Just how complicated is that?

However it's done, the method would then need to be applied to all Linux builds (surely?).
The FPGA change would be common and completely transparent to all Linux versions.

Changing the code at the FPGA eLink interface looks trivial (been dreaming about it all night, but still "working").
Change management, i.e., making all the various components of the build (dts, e-hal) come together, is the tricky part.

tery

Re: increase shared memory

Posted: Thu May 29, 2014 11:59 am
by greytery
ysapir wrote:Theoretically, there is no prevention for the Epiphany to access ALL of the DRAM on the board. Technically, this is really a shared, flat memory space.

In practice, the FPGA filters and limits the addressable shared memory to 32MB. So it's not actually currently possible to access "ALL of the DRAM on the board". Not that you should, because the host OS needs some space (and maybe some protection from a rogue Epiphany program).

With apologies for re-working this idea across several posts, just a small, itsy-bitsy change to the above means that ALL of the DRAM on the board can be seen from the Epiphany. Just flip the addresses onto different rows (up rather than down), as follows...

Epiphany e_extmem_() routines:
Code:
      if (col < 32) then
              col = col + 32;
              row = row + 16;


FPGA magic three lines:
Code:
      if (row >= 16) then     // [30] == 1
            row = row - 16;   // [30] = 0
            col = col - 32;   // [25] = 0


As before, this involves flipping rows of addresses in 32MB chunks, but it boils down to just two bits ( [30],[25] ) of the ext_mem address.
After passing the test, Epiphany e_extmem_() flips them from 0,0 to 1,1 - and the FPGA flips them from 1,1 to 0,0.
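
Working the flip-up rule through every 1MB page of a 1GB DRAM gives a quick sanity check. This is a sketch with invented names; the 1GB size and the E16 occupying rows 32 upwards are assumptions taken from elsewhere in this thread. It checks that each mapped address has a column of 32 or more, stays below row 32, differs from the original only in bits 30 and 25, and maps back correctly.

```python
# Sketch: flip-up mapping checked over the whole (assumed 1 GB) DRAM.

def split(addr):
    return (addr >> 26) & 0x3F, (addr >> 20) & 0x3F, addr & 0xFFFFF

def join(row, col, off):
    return (row << 26) | (col << 20) | off

def extmem_to_mesh(addr):
    """Epiphany side: col < 32 -> col + 32, row + 16."""
    row, col, off = split(addr)
    if col < 32:
        row, col = row + 16, col + 32
    return join(row, col, off)

def mesh_to_dram(addr):
    """FPGA side: row >= 16 -> row - 16, col - 32."""
    row, col, off = split(addr)
    if row >= 16:
        row, col = row - 16, col - 32
    return join(row, col, off)

def check_whole_dram(dram_size=0x40000000):
    for page in range(0, dram_size, 1 << 20):
        mesh = extmem_to_mesh(page)
        row, col, _ = split(mesh)
        if col < 32 or row >= 32 or mesh_to_dram(mesh) != page:
            return False
        if mesh != page and (mesh ^ page) != ((1 << 30) | (1 << 25)):
            return False
    return True

print(hex(extmem_to_mesh(0x00000000)), check_whole_dram())
```

For the pages that are flipped, both bits start at 0 and end at 1, and the FPGA side clears them again.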

It means that the amount of contiguous shared memory could be up to, say, 768MB - leaving 256MB for the OS.
Any use? I think so.
In a Cluster, headless environment, there really is no excuse for carrying a full load of Ubuntu bloat on each board.
Linux (minus X, LibreOffice, dev tools, etc) can happily run in 256MB (or less). There are Arch and Debian ARM builds that would fit in 256MB, I'm sure.
Not sure how much space the (unofficial) Debian build http://elinux.org/Parallella_Debian occupies, but I note that that also has many extras (such as ARM toolchain, Epiphany SDK and other -dev libs) that seem superfluous for a 'working' Parallella cluster.
And that's before considering other OS's such as FreeRTOS, which already has a port to the Zynq 7000.

BTW: The above change to the FPGA is 'standalone'. It allows all host OS builds - subject to their particular config rules - to share as much of the 1GB DRAM as they need.

<edit> #2 NB:
NEITHER flip-down nor flip-up algorithms are compatible with the current 32MB default memory and FPGA mapping.


I'm (almost) sure it works! :D

tery

Re: increase shared memory

Posted: Mon Jun 02, 2014 6:10 pm
by greytery
aolofsson wrote:The "magical" address translation is contained in these two files. 3 lines of code in total :-)

Code:
assign ext_mem_access = (elink_dstaddr_tmp[31:28] == `VIRT_EXT_MEM) & ~(elink_dstaddr_tmp[31:20] == `AXI_COORD);
assign elink_dstaddr_inb[31:28] = ext_mem_access ? `PHYS_EXT_MEM : elink_dstaddr_tmp[31:28];
assign elink_dstaddr_inb[27:0] = elink_dstaddr_tmp[27:0];



Can someone explain what that AXI_COORD test is for, please?

From fpga_constants.v :
Code:
`ifdef TARGET_E16
 `define AXI_COORD       12'h810
`elsif TARGET_E64
 `define AXI_COORD       12'h820
`endif
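
For anyone wanting to poke at this, here is a software model of what the three assign statements compute, together with a decode of the two AXI_COORD values. It is only a sketch: the values used for VIRT_EXT_MEM and PHYS_EXT_MEM are assumptions taken from the 0x8e000000 to 0x1e000000 example earlier in the thread, not from the source files, and the model does not settle why the exclusion exists.

```python
# Sketch: Python model of the three-line address translation (assumed constants).
VIRT_EXT_MEM = 0x8   # assumption, not read from the Verilog headers
PHYS_EXT_MEM = 0x1   # assumption, not read from the Verilog headers
AXI_COORD_E16 = 0x810
AXI_COORD_E64 = 0x820

def coord(core_id):
    """Split a 12-bit coordinate into (row, col)."""
    return (core_id >> 6) & 0x3F, core_id & 0x3F

def translate(addr, axi_coord):
    """Mirror of the assigns: swap bits 31:28 unless bits 31:20 equal axi_coord."""
    ext_mem_access = (addr >> 28) == VIRT_EXT_MEM and (addr >> 20) != axi_coord
    top4 = PHYS_EXT_MEM if ext_mem_access else (addr >> 28)
    return (top4 << 28) | (addr & 0x0FFFFFFF)

print(coord(AXI_COORD_E16), coord(AXI_COORD_E64))
print(hex(translate(0x8E000000, AXI_COORD_E16)), hex(translate(0x81000000, AXI_COORD_E16)))
```

With these assumed values, the 1MB page at AXI_COORD shares the top nibble 0x8 with the external memory window, so without the extra test it would be rewritten like any other external memory address.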

I think it's saying that the FPGA recognises that the eLink interface may send an address which is specifically row,col [32,16] (or [32,32] on E64).
Is this a safety check, a legacy leftover, or a bug?
Or is this going to be another Doh! moment for me...

tery