Parallella Community

by **GreggChandler** » Mon Mar 27, 2017 6:23 pm

Although the eSDK exposes an interface to map External Shared Memory to the address space of the host processor, does the hardware interface support mapping of an individual processor's memory to the host address space? I understand how to use e_read() and e_write, however, I want to move some shared data structures from external memory into the core's direct address space and access them directly from C/C++ code on the host much as I do the external shared memory. My algorithms work with shared memory, however, my testing indicates that accessing the data in shared external memory from an eCore is between ten and twenty times slower than accessing core data local to the core or in another core. Secondarily, if such a mapping is technically possible, how will the processor resolve access contention?

by **peteasa** » Tue Mar 28, 2017 2:26 pm

Hi Gregg,

Yes this is possible. Each core maps to a different base address in the arm processor memory, for example when an application is launched on the e-core the host first saves the program data to local e-core memory then writes a structure to tell the e-core which e-core it is in the group then causes the e-core to start running. All this happens because the arm processor can write to local e-core memory. Using the provided e_read / e_write gives you access to the appropriate mapped memory in arm memory. Also from the e-core side its e_read and e_write that give you access to the same memory but note the difference in api between hal and elib library routines (I created a simple channel definition that allowed both arm and e-core side code to use the same api (see https://github.com/peteasa/examples/blo ... /echnmem.c). The e-core can also access the local memory of another e-core. The e-core can also access the shared memory that is shared with e-cores and arm processor.

Next contention.. well the read and the write operation will be atomic but if you do anything clever like read modify write you have to lock the memory access. Multiple choices about how to do this. In my code I explored the possibility of having separate flag for each e-core and host.. for example a simple busy bit so that cpu 1 reads the location sees that the memory is busy and backs off till cpu 2 makes the memory active. Then cpu 2 backs off till cpu 1 has freed the memory. Using interrupts and putting the e-core cpu to sleep seemed like the best approach (for a very brief description https://peteasa.github.io/static_pages/ ... queue_cons). Another is to use memory barriers where the cpu spin till all the cpu are finished. Another is simple mutex again implemented with one of the cpu spinning till the other has unlocked....

Now for speed differences.. well speed of access of local memory and shared memory will be much the same. The limitation is the e-link connection. If you look at the epiphany architecture spec you will see that off chip accesses all go via a separate bus that is at least a factor of 8 slower than the on chip read and write buses. Since all off chip accesses have to go via that separate bus they are all impacted by the same performance reduction. Or looking at this for another angle the on chip accesses are not slowed down by the off chip accesses that will always be slower because of the physics of off chip accesses. Because the off chip accesses go via separate fifo queues then the on chip network on a chip is not slowed down. The way to speed that up is to have more of them. So I looked at connecting 2 or more epiphany chips via the separate chip to chip links.. this also works but you have to do more to get that going and you need additional hardware to connect the parallella boards together and the data appears on a different parallella!

Hope this helps! Best thing is to write something simple and try it out.. I started with the hello world example code and created several such simple examples to learn about the architecture.

Peter.

by **upcFrost** » Tue Mar 28, 2017 3:31 pm

If i understood correctly, the matter with the memory access to other cores is currently under discussion.

by **sebraa** » Tue Mar 28, 2017 5:46 pm

by **jar** » Tue Mar 28, 2017 7:36 pm

by **GreggChandler** » Tue Mar 28, 2017 9:52 pm

I don't see where the eSDK supports anything other than e_read() and e_write() to access core memory from the ARM host--at least in the document that I have (5.13.09.10). The option to map external memory, e_alloc(), does not include an option to map core memory. There don't appear to be any other documented calls in the indicated eSDK that will map core memory to the Arm address space either. I was trying to avoid modifying the eSDK, and stick to the published interfaces.

I have already written C/C++ code that shares and operates on data structures natively: Arm<->ExtMem, eCoreN<->ExtMem, eCoreN<->eCoreM. They do so without resorting to e_read() and e_write() once pointers have been correctly initialized and structures have been appropriately packed and aligned. I am trying to implement Arm<->eCoreN. My mutex and barrier algorithms work between the various eCores and Arm elegantly--without resorting to e_read() and e_write().

As for preventing code execution from a different core, I think that would be a mistake. I can envision a configuration where the kernel of an operating system might reside in one of the cores, and calling code on/in that core would be an efficient interface. I know it could be done with traps, etc., however, the direct call has a definite appeal--even if it would be slower.

As for memory access timings, my results so far indicate that eCore<->ExtMem is approximately twenty times slower than eCoreN<->eCoreN and ten times slower than eCoreN<->eCoreM. In other words, accessing external memory from an eCore is much slower than accessing another eCore's memory even if they do both traverse the eLink mesh.

by **peteasa** » Wed Mar 29, 2017 8:39 am

Hi Gregg,

I look at the published documents first and then that gives me a handle to look at the eSDK code that is built on the system. The code is the master document and the published documents are a guide to reading the code.

In my implementation of map reduce I dma data from arm memory to ecore memory and from ecore memory to ecore memory, I also allocate and free chunks of shared memory in arm code. To do that I have to know the physical address of shared and local memory on the e-core. I also have to know the virtual mapped address of the ecore memory from the arm code. The differences between the various accesses from different e-core/ arm cores is embodied in the structures e_memseg_t, e_group_config_t, and for arm side e_mem_t, e_epiphany_t. e_read() and e_write() use these to calculate the same physical addresses the I use with dma and message memory allocation. Agreed that this is not documented in the api but it is in the open source code and as we have full control over the code (actually kernel driver and eSDK libraries) it is not an issue for me.

On your memory access timing can I assume that your baseline eCoreN<->eCoreN is one e-core reading and writing to its own local memory and not traversing the network on a chip? I would expect that to be the fastest because no network on a chip. eCoreN<->eCoreM is traversing the network on a chip to local memory, and eCore<->ExtMem is between the eCore and the arm memory and would then be the slowest. From the other side arm <-> ExtMem would not traverse the network on a chip and arm <-> eCoreN would so the latter would be the slowest. Further I would expect arm <-> eCoreN and eCore<->ExtMem to be about the same speed, but this in part depends upon the details of the arm axi bus accesses. Arm <-> eCoreN accesses are driven by the arm side (fpga is axi slave) so will have different characteristics to eCore<->ExtMem that is driven by the fpga side (fpga is the axi master). Did I say that we also have full control over the fpga as the oh hdl code is also open source (this means that you can, and I have, simulate the accesses or even monitor the accesses if you so wish).

As to the scale of the differences between the various accesses I have left that for others to measure. I think that you found the network on a chip speed (does your measurement include latency.. another interesting topic) is about 10 times slower than local memory move without traverse of the network and that a further 10 times reduction occurs off chip (ie eCore<->ExtMem 20 times slower eCoreN<->eCoreN). Since 10 times slower and at least 7 times slower are about the same order of magnitude then this is what I would expect.

Hope this helps with understanding of what is going on under the hood!

Peter.

by **sebraa** » Wed Mar 29, 2017 12:53 pm

by **olajep** » Sat Apr 01, 2017 1:36 pm

by **GreggChandler** » Tue Apr 04, 2017 12:14 am

Thanks Ola, this is exactly what I was looking for! I understand it is not supported, however, it is enough of a performance win to code it up and try my code.

Parallella Community

HW support for mapping core memory to host address space

HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Re: HW support for mapping core memory to host address space

Who is online