Hi Gregg,
Yes, this is possible. Each e-core's local memory maps to a different base address in the ARM processor's address space. For example, when an application is launched on an e-core, the host first copies the program data into the e-core's local memory, then writes a structure telling the e-core which member of the group it is, and then starts the e-core running. All of this works because the ARM processor can write directly to e-core local memory. The provided e_read / e_write routines give you access to the appropriately mapped memory from the ARM side. From the e-core side it's also e_read and e_write that give you access to the same memory, but note the difference in API between the hal and elib library routines (I created a simple channel definition that allowed both ARM-side and e-core-side code to use the same API; see
https://github.com/peteasa/examples/blo ... /echnmem.c). An e-core can also access the local memory of another e-core, as well as the external memory that is shared between the e-cores and the ARM processor.
Next, contention. A single read or a single write will be atomic, but if you do anything clever like a read-modify-write you have to lock the memory location. There are multiple ways to do this. In my code I explored having a separate flag for each e-core and for the host, for example a simple busy bit: CPU 1 reads the location, sees that the memory is busy, and backs off until CPU 2 frees it; then CPU 2 backs off until CPU 1 has freed the memory. Using interrupts and putting the e-core CPU to sleep seemed like the best approach (for a very brief description see
https://peteasa.github.io/static_pages/ ... queue_cons). Another option is to use barriers, where each CPU spins until all the CPUs have finished. Another is a simple mutex, again implemented with one CPU spinning until the other has unlocked it.
Now for speed differences. The raw speed of access to local memory and to shared memory will be much the same; the limitation is the e-link connection. If you look at the Epiphany architecture spec you will see that off-chip accesses all go via a separate bus that is at least a factor of 8 slower than the on-chip read and write buses. Since all off-chip accesses have to go via that separate bus, they are all subject to the same performance reduction. Looking at it from another angle, on-chip accesses are not slowed down by off-chip accesses, which will always be slower because of the physics of going off chip. Because off-chip accesses go via separate FIFO queues, the on-chip network-on-chip is not slowed down. The way to speed this up is to have more off-chip links, so I looked at connecting two or more Epiphany chips via the separate chip-to-chip links. This also works, but you have to do more to get it going, you need additional hardware to connect the Parallella boards together, and the data appears on a different Parallella!
Hope this helps! The best thing is to write something simple and try it out. I started with the hello world example code and created several simple examples like it to learn about the architecture.
Peter.