e_read and e_write... I don't get it

Discussion about Parallella (and Epiphany) Software Development


e_read and e_write... I don't get it

Postby nickoppen » Thu Apr 20, 2017 7:45 am

I've been writing a program to demonstrate how DMA works but in my research it seems that for a transfer of less than about 300 bytes, e_read and e_write are quicker (e.g. http://blog.codu.in/parallella/epiphany/ebsp/2016/03/02/benchmarking-the-parallella.html). I thought that I'd include a worked example of e_read and e_write to cover the topic more broadly.

The documentation shows the prototype:

void *e_write(void *remote, const void *src, unsigned row, unsigned col, void *dst, size_t bytes);

I think I understand how it works for on-chip (core to core) communication.

For off-chip communication, the interaction between the arguments remote and dst is described thus: If the remote parameter is e_emem_config, then the destination address is given relative to the External Memory base address.

I have not found any example or any description of how to provide a remote address relative to the "External Memory base address". I'm not even sure what memory is being referred to here. I assume shared memory?

I have 256 bytes shared memory that I want to use on every core:

Code:
uint8_t map[256];

int sizeOfMap = 256 * sizeof(uint8_t); /// not really necessary but I'm being neat
coprthr_mem_t eMap = coprthr_dmalloc(dd, sizeOfMap);

args.g_map = (void*)coprthr_memptr(eMap);  /// where args are the arguments passed to the kernel


Inside the kernel I want to copy pArgs->g_map into a local array.

How do I translate the address into one relative to the external memory base address?

Any assistance would be most appreciated.

nick
Sharing is what makes the internet Great!
nickoppen
 
Posts: 262
Joined: Mon Dec 17, 2012 3:21 am
Location: Sydney NSW, Australia

Re: e_read and e_write... I don't get it

Postby sebraa » Thu Apr 20, 2017 8:58 am

Hi,

this is an excerpt from my host program, which uses e_read and e_write to access shared memory. Notice that "my_shm_struct" is allocated in shared memory with an offset of 16 MB (the lower 16 MB are reserved for libc-related things, see the linker script), but that I use an offset of 0 to access the first member (which is a 32 bit pollflag telling me whether I need to copy the full struct into local memory). For brevity, I have removed all error handling.

Code:
#include <e-hal.h>
#include <stdint.h>

/* Declarations implied by the text above (flag values are illustrative) */
typedef struct { uint32_t pollflag; /* ... payload ... */ } my_shm_struct;
enum { POLL_BUSY = 0, POLL_DONE = 1 };
my_shm_struct my_shm_data;
uint32_t pollflag;

int main() {
    e_epiphany_t dev;
    e_mem_t mem;

    e_init(NULL);
    e_reset_system();
    e_open(&dev, 0, 0, 4, 4);
    e_alloc(&mem, 0x01000000, sizeof(my_shm_struct));
    e_write(&mem, 0, 0, (off_t)0, &my_shm_data, sizeof(my_shm_struct));

    for(int row = 0; row < 4; row++)
        for(int col = 0; col < 4; col++)
            e_load("kernel.elf", &dev, row, col, E_TRUE);

    while(1) {
        while(1) {
            /* first element in shm is a flag - poll it */
            e_read(&mem, 0, 0, (off_t)0, &pollflag, sizeof(uint32_t));
            if(pollflag != POLL_BUSY) break;
        }

        /* read full shm structure and break if done */
        e_read(&mem, 0, 0, (off_t)0, &my_shm_data, sizeof(my_shm_struct));
        if(pollflag == POLL_DONE) break;

        /* reset flag */
        pollflag = POLL_BUSY;
        e_write(&mem, 0, 0, (off_t)0, &pollflag, sizeof(uint32_t));

        /* one iteration is done, handle data in my_shm_data */
    }

    e_free(&mem);
    e_close(&dev);
    e_finalize();
    return 0;
}


I hope this helps (and serves as a usable example for you).

Best Regards,
sebraa
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: e_read and e_write... I don't get it

Postby GreggChandler » Fri Apr 21, 2017 4:25 am

I don't use e_read() or e_write() in my code at all. Once I use e_alloc() on the host, as sebraa did in his sample, I then use the '.base' member of 'e_mem_t' to determine where the mapped window appears within the host memory address space. As necessary, I then construct/cache pointers to the data structures that I care about in shared memory. Thus, e_read() and e_write() turn into "normal" pointer operations. With Ola's help, I also created an extension to the eSDK that facilitates the same functionality for Epiphany core memory.

Similarly, on the Epiphany side of things, I use e_get_global_address() to create pointers to memory in other cores. Additionally, I use 'e_emem_config.base' to determine the base address of the external memory from the Epiphany perspective. Thus my e_read() and e_write() are again replaced by pointer operations.

Reducing things to pointer operations creates three problems. The first, common to both approaches (e_read/e_write and pointer arithmetic), is synchronization. I implemented some primitives that let me create a common mutex across the Epiphany cores and the ARM cores. The second is remembering the 'weak memory order model' when accessing memory across core/chip boundaries. The third is that pointers stored in memory shared between the ARM and the Epiphany are problematic: the ARM's pointer to a location in shared memory or Epiphany core memory differs from the Epiphany's pointer to that same location. I solved this by storing offsets from the respective base pointers, which can then be the same on both sides. I wrap all of this in C++ classes, and conditionally compile each class to use whichever base is appropriate--the ARM base or the Epiphany base. My class library also supports dynamically allocating/referencing shared memory from either side of the interface, and when I get time I plan to add dynamic allocation/referencing of memory in other Epiphany cores.

I hope my explanation is not too confusing. The end result of the libraries I have written, and the way that I have written them, is that the code I write for the Epiphany looks very similar to the code I write for the ARM. It also avoids hard-coded offsets to blocks of external or Epiphany core memory. My code looks like C or C++ code, rather than code with lots of calls to e_read() and e_write(). I push some of the less frequently used code into external memory so that I can reserve Epiphany core memory for the 'good' stuff.

http://parallella.org/forums/viewtopic. ... 4ae9651cc8
GreggChandler
 
Posts: 66
Joined: Sun Feb 12, 2017 1:56 am

Re: e_read and e_write... I don't get it

Postby nickoppen » Fri Apr 21, 2017 10:56 am

Wow! That's a bigger can of worms than I expected.

Thank you both for your comments. I'll give Gregg's method a try. If I just do a read, that will at least avoid the synchronization issues.

My objective is to write a blog post about host<=>Epiphany data transfer that demonstrates reliable and reasonably efficient ways of writing parallel programs. If I can't get to a solution that I can explain succinctly, I'll stick to using DMA (although not really the best for small amounts of data) or something like memcpy that is at least well understood.

nick

Re: e_read and e_write... I don't get it

Postby GreggChandler » Fri Apr 21, 2017 1:00 pm

Well, I hope that I didn't make it sound more complicated than it really is. Although I've been writing C code for over forty years, I jumped on the C++ band-wagon quite a few years ago, and C++ makes some of this stuff pretty easy! It is also easily doable in C--just a bit more verbose. (Templates are your friend here.)

In my case, sharing data between the host and eCore is more a matter of just putting it in the external shared memory in a way that the other processor can easily find it. To that end, my external memory allocator supports (requires) naming of the allocated segments. Think of it like a named malloc(). The other side can then do a lookup based upon the name and some other parameters. The returned value works for the side querying the address. Thus, I can write library code that dynamically allocates and uses shared memory without all of the hard coded constants that need to be carefully maintained.

If a module has a number of variables that need to be shared, I generally put them in a 'struct' which is allocated in shared memory. The allocator/creator of the 'struct' effectively publishes the data, and can do so in place in external memory. (Think about fread() directly into an external memory buffer on the host, etc.) The other side constructs its own pointer to the external 'struct' via a library that does a name-based lookup. (I actually use some additional parameters such as size, core group id, etc.) It then uses the pointer to access the data fairly efficiently. I avoid sharing some of the fancy C++ data types, and use a buffer of chars rather than a 'std::string', or simple C arrays rather than 'std::vector's, although these could be made to work.

A potential problem with the shared 'structs' that one must be careful to avoid is data misalignment. One must ensure that the 'structs' are laid out identically in memory on each side, despite the two different compiler/machine architectures. GCC supports attributes that pack and align them to facilitate this. Also, by doing all of this work in libraries, I only have to do it once, and the application code becomes much simpler. The complex code must only be debugged once.

Re: e_read and e_write... I don't get it

Postby sebraa » Sun Apr 23, 2017 1:29 pm

GreggChandler wrote:I don't use e_read() or e_write() in my code at all.
For this I blame Adapteva's eSDK documentation in 2014, when I wrote most of this code. The only officially documented way to access Epiphany cores or shared memory was through the e_read/e_write functions. Also, I wanted to stay as close to the eSDK as possible, so I avoided changing the linker script (leading to the single shared, packed shm structure outlined above) and any tricky build systems. I paid a lot for both decisions, but then, hindsight is always better.

GreggChandler wrote:My code looks like C or C++ code, rather than code with lots of calls to e_read() and e_write().
The code I posted just polls the Epiphany for events (I don't think it is possible to avoid polling) and synchronizes in those events, which was good enough for my purposes. All other code is free of Epiphany references.

nickoppen wrote:or something like memcopy that is at least well understood.
Keep in mind that memcpy is slow for small amounts of data. For copying 64 bits using memcpy, I have measured ~160 cycles of overhead.

Re: e_read and e_write... I don't get it

Postby GreggChandler » Mon Apr 24, 2017 1:34 pm

sebraa wrote:I blame Adapteva's eSDK documentation


I agree with sebraa; the eSDK documentation is not very clear here. The sample programs that I looked at didn't deal with most of these issues at all, but rather hard-coded external memory accesses with constants, and inserted long sleeps waiting for some code to complete or stop. Hopefully, sharing some of the ideas that I wished I had known at the start will save others some time.

sebraa wrote:(I don't think it is possible to avoid polling


I also agree that polling is one way to accomplish some of this synchronization, although I do it via synchronization primitives. I have not experimented with the Epiphany interrupt system, however, it would be an alternative to polling--although I am not sure whether the Epiphany can interrupt the ARM. From the eSDK, it is clear that the ARM can interrupt the Epiphany.

As an aside, this thread inspired me to code a DMA vs memcpy() benchmark. In my benchmark, I timed single core numbers, and also numbers when other cores were also doing DMA/memcpy(). The numbers further inform one's understanding of the Epiphany architecture.

Re: e_read and e_write... I don't get it

Postby nickoppen » Wed Apr 26, 2017 12:04 am

Greg, how did the two methods compare?

I've read a few different analyses comparing e_read and e_write with DMA but they are mostly silent about memcpy.

I wanted to use e_read/e_write because it seems that for less than 300 bytes they are quickest if you can use burst mode. I believe that burst mode is used when you are copying consecutive chunks of 8 bytes, which I am.

nick

Re: e_read and e_write... I don't get it

Postby jar » Wed Apr 26, 2017 3:53 am

The fastest device-side method I've found to copy data has been included in the ARL OpenSHMEM library in shmemx_memcpy.c: https://github.com/USArmyResearchLab/op ... x_memcpy.c

If you're just copying data from a core to a remote core or DRAM and not trying to perform asynchronous DMAs with other computation, try this. In general, I've found trying to overlap computation and communication (via asynchronous DMA) not to be worth it. There are two hardware bugs that throttle the DMA engine in E3, preventing it from achieving peak performance (it's closer to 1.4 GB/s). Properly unrolled double-word loads and stores are faster -- approaching 2.4 GB/s (8 bytes every 2 cycles). See below.

This routine is a general memcpy routine which will also accept misaligned data and arrays that are offset 1-7 bytes and have remainders of 1-7 bytes. I spent quite some time on it and validated it for all different sizes. The routine has also been optimized for instruction size (something like 176 bytes, IIRC). The routine is used as inline assembly so that the OpenSHMEM library can be used as a header-only library, leading to generally faster code and smaller program sizes. A lot of work went into the library in the past couple months to improve the usability and stability.

The ARL OpenSHMEM for Epiphany library should be able to be compiled for those of you using the eSDK and not COPRTHR. I would appreciate feedback (bugs/requests) on GitHub.

Epiphany has excellent "put" performance, but crummy "get" performance since the request has to traverse the mesh network and then back, stalling the core until the result is returned. There is an interprocessor interrupt that can be used to force a remote core to write data to the local core. The overhead for this routine is surprisingly small (see below).

The following benchmarks were compiled with SHMEM_USE_HEADER_ONLY and are available in the ./test directory of the repository.

Put (peak ~2.24 GB/s):
Code:
$ coprsh -np 1 ./put.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM PutMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      73
     2      88
     4     118
     8     179
    16      79
    32      83
    64     124
   128     151
   256     204
   512     313
  1024     531
  2048     966
  4096    1836
  8192    3574


Non-blocking Put which subsequently blocks (peak ~1.41 GB/s):
Code:
$ coprsh -np 1 ./put_nb.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM Non-Blocking PutMem times for variable message size
# Bytes   Latency (nanoseconds)
     1     108
     2     108
     4     108
     8     108
    16     108
    32     108
    64     108
   128     108
   256     211
   512     408
  1024     751
  2048    1448
  4096    2864
  8192    5673


Get (peak ~305 MB/s):
Code:
$ coprsh -np 1 ./get.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM GetMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      91
     2     124
     4     194
     8     328
    16     121
    32     169
    64     293
   128     501
   256     909
   512    1728
  1024    3359
  2048    6623
  4096   13153
  8192   26221


Get using Interprocessor Interrupt (peak ~2.1 GB/s):
Code:
$ coprsh -np 1 ./get_ipi.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM GetMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      93
     2     128
     4     194
     8     329
    16     124
    32     171
    64     293
   128     394
   256     448
   512     558
  1024     774
  2048    1208
  4096    2078
  8192    3818
jar
 
Posts: 288
Joined: Mon Dec 17, 2012 3:27 am

Re: e_read and e_write... I don't get it

Postby GreggChandler » Wed Apr 26, 2017 5:41 am

My tests are from the Epiphany perspective and demonstrate, as expected, that a significant factor in these measurements is other-core activity: the DMA appears to execute through the mesh. I used WAND for my test barrier and the e_dma_copy() eSDK function. I also wrote my own locmemcpy() in C++ which did 64-bit reads and writes, but these results use the standard memcpy() executing in core memory. Where possible, I used e_read()/e_write(), also executing from core memory. I ran tests for each of the cores in the group (group size is configurable via a switch), and also with the entire group executing the tests simultaneously. I used E_CTIMER_0 in E_TIMER_CLK mode and disabled interrupts for the duration of the test.

I was unable to achieve 64 bits per instruction cycle. I presume that implies a problem with my methodology. What have I missed?

test.txt
(3.64 KiB) Downloaded 54 times
