e_read and e_write... I don't get it

Discussion about Parallella (and Epiphany) Software Development


e_read and e_write... I don't get it

Postby nickoppen » Thu Apr 20, 2017 7:45 am

I've been writing a program to demonstrate how DMA works but in my research it seems that for a transfer of less than about 300 bytes, e_read and e_write are quicker (e.g. http://blog.codu.in/parallella/epiphany/ebsp/2016/03/02/benchmarking-the-parallella.html). I thought that I'd include a worked example of e_read and e_write to cover the topic more broadly.

The documentation shows the prototype:

void *e_write(void *remote, const void *src, unsigned row, unsigned col, void *dst, size_t bytes);

I think I understand how it works for on-chip (core to core) communication.

For off-chip communication, the interaction between the arguments remote and dst is described thus: If the remote parameter is e_emem_config, then the destination address is given relative to the External Memory base address.

I have not found any example or any description of how to provide a remote address relative to the "External Memory base address". I'm not even sure what memory is being referred to here. I assume shared memory?

I have 256 bytes shared memory that I want to use on every core:

Code:
uint8_t map[256];

int sizeOfMap = 256 * sizeof(uint8_t); /// not really necessary but I'm being neat
coprthr_mem_t eMap = coprthr_dmalloc(dd, sizeOfMap);

args.g_map = (void*)coprthr_memptr(eMap);  /// where args are the arguments passed to the kernel


Inside the kernel I want to copy pArgs->g_map into a local array.

How do I translate the address into one relative to the external memory base address?

Any assistance would be most appreciated.

nick
Sharing is what makes the internet Great!
nickoppen
 
Posts: 262
Joined: Mon Dec 17, 2012 3:21 am
Location: Sydney NSW, Australia

Re: e_read and e_write... I don't get it

Postby sebraa » Thu Apr 20, 2017 8:58 am

Hi,

this is an excerpt from my host program, which uses e_read and e_write to access shared memory. Notice that "my_shm_struct" is allocated in shared memory with an offset of 16 MB (the lower 16 MB are reserved for libc-related things, see the linker script), but that I use an offset of 0 to access the first member (which is a 32 bit pollflag telling me whether I need to copy the full struct into local memory). For brevity, I have removed all error handling.

Code:
#include <e-hal.h>
#include <stdint.h>

/* Declarations implied by the text above (flag values are illustrative) */
typedef struct { uint32_t pollflag; /* ... payload ... */ } my_shm_struct;
enum { POLL_BUSY = 0, POLL_DONE = 1 };
my_shm_struct my_shm_data;
uint32_t pollflag;

int main() {
    e_epiphany_t dev;
    e_mem_t mem;

    e_init(NULL);
    e_reset_system();
    e_open(&dev, 0, 0, 4, 4);
    e_alloc(&mem, 0x01000000, sizeof(my_shm_struct));
    e_write(&mem, 0, 0, (off_t)0, &my_shm_data, sizeof(my_shm_struct));

    for(int row = 0; row < 4; row++)
        for(int col = 0; col < 4; col++)
            e_load("kernel.elf", &dev, row, col, E_TRUE);

    while(1) {
        while(1) {
            /* first element in shm is a flag - poll it */
            e_read(&mem, 0, 0, (off_t)0, &pollflag, sizeof(uint32_t));
            if(pollflag != POLL_BUSY) break;
        }

        /* read full shm structure and break if done */
        e_read(&mem, 0, 0, (off_t)0, &my_shm_data, sizeof(my_shm_struct));
        if(pollflag == POLL_DONE) break;

        /* reset flag */
        pollflag = POLL_BUSY;
        e_write(&mem, 0, 0, (off_t)0, &pollflag, sizeof(uint32_t));

        /* one iteration is done, handle data in my_shm_data */
    }

    e_free(&mem);
    e_close(&dev);
    e_finalize();
    return 0;
}


I hope this helps (and serves as a usable example for you).

Best Regards,
sebraa
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: e_read and e_write... I don't get it

Postby GreggChandler » Fri Apr 21, 2017 4:25 am

I don't use e_read() or e_write() in my code at all. Once I use e_alloc() on the host, as sebraa did in his sample, I then use the '.base' member of 'e_mem_t' to determine where the mapped window appears within the host memory address space. As necessary, I then construct/cache pointers to the data structures that I care about in shared memory. Thus, e_read() and e_write() turn into "normal" pointer operations. With Ola's help, I also created an extension to the eSDK that facilitates the same functionality for Epiphany core memory.

Similarly, on the Epiphany side of things, I use e_get_global_address() to create pointers to memory in other cores. Additionally, I use 'e_emem_config.base' to determine the base address of the external memory from the Epiphany perspective. Thus my e_read() and e_write() are again replaced by pointer operations.

Reducing things to pointer operations creates three problems. The first, common to both approaches (e_read/e_write and pointer arithmetic), is synchronization. I implemented some primitives that let me create a common mutex across the Epiphany cores and the ARM cores. The second is remembering the 'weak memory order model' when accessing memory across core/chip boundaries. The third is that pointers stored in memory shared between the ARM and the Epiphany are problematic: the ARM's pointer to a location in shared memory or Epiphany core memory differs from the Epiphany's pointer to that same location. I solved this by storing offsets from the respective base pointers, which can then be the same on both sides. I wrap all of this in C++ classes, and conditionally compile each class to use whichever base is appropriate--the ARM base or the Epiphany base. My class library also supports dynamically allocating/referencing shared memory from either side of the interface, and when I get time I plan to add dynamic allocation/referencing of memory in other Epiphany cores.

I hope my explanation is not too confusing. The end result of the libraries I have written, and the way that I have written them, is that the code I write for the Epiphany looks very similar to the code I write for the ARM. It also avoids hard-coded offsets to blocks of external or Epiphany core memory. My code looks like C or C++ code, rather than code with lots of calls to e_read() and e_write(). I push some of the less frequently used code into external memory so that I can reserve Epiphany core memory for the 'good' stuff.

http://parallella.org/forums/viewtopic. ... 4ae9651cc8
GreggChandler
 
Posts: 66
Joined: Sun Feb 12, 2017 1:56 am

Re: e_read and e_write... I don't get it

Postby nickoppen » Fri Apr 21, 2017 10:56 am

Wow! That's a bigger can of worms than I expected.

Thank you both for your comments. I'll give Gregg's method a try. If I just do a read, that will at least avoid the synchronization issues.

My objective is to write a blog post about host<=>Epiphany data transfer that demonstrates reliable and reasonably efficient ways of writing parallel programs. If I can't get to a solution that I can explain succinctly, I'll stick to using DMA (although not really the best for small amounts of data) or something like memcpy that is at least well understood.

nick

Re: e_read and e_write... I don't get it

Postby GreggChandler » Fri Apr 21, 2017 1:00 pm

Well, I hope that I didn't make it sound more complicated than it really is. Although I've been writing C code for over forty years, I jumped on the C++ band-wagon quite a few years ago, and C++ makes some of this stuff pretty easy! It is also easily doable in C--just a bit more verbose. (Templates are your friend here.)

In my case, sharing data between the host and eCore is more a matter of just putting it in the external shared memory in a way that the other processor can easily find it. To that end, my external memory allocator supports (requires) naming of the allocated segments. Think of it like a named malloc(). The other side can then do a lookup based upon the name and some other parameters. The returned value works for the side querying the address. Thus, I can write library code that dynamically allocates and uses shared memory without all of the hard coded constants that need to be carefully maintained.

If a module has a number of variables that need to be shared, I generally put them in a 'struct' which is allocated in shared memory. The allocator/creator of the 'struct' effectively publishes the data, and can do so in place in external memory. (Think about fread() directly into an external memory buffer on the host, etc.) The other side constructs its own pointer to the external 'struct' via a library that does a name-based lookup. (I actually use some additional parameters such as size, core group id, etc.) It then uses the pointer to access the data fairly efficiently. I avoid sharing some of the fancy C++ data types, and use a buffer of chars rather than a 'std::string', or simple C arrays rather than 'std::vector's, although these could be made to work.

A potential problem with the shared 'structs' that one must be careful to avoid is data misalignment. One must ensure that the 'structs' are laid out identically in memory on each side, despite the two different compiler/machine architectures. GCC supports attributes that pack and align them to facilitate this. Also, by doing all of this work in libraries, I only have to do it once, and the application code becomes much simpler. The complex code must only be debugged once.

Re: e_read and e_write... I don't get it

Postby sebraa » Sun Apr 23, 2017 1:29 pm

GreggChandler wrote:I don't use e_read() or e_write() in my code at all.
For this I blame Adapteva's eSDK documentation in 2014, when I wrote most of this code. The only officially documented way to access Epiphany cores or shared memory was through the e_read/e_write functions. Also, I wanted to stay as close to the eSDK as possible, so I avoided changing the linker script (leading to the single shared, packed shm structure outlined above) and any tricky build systems. I paid a lot for both decisions, but then, hindsight is always better.

GreggChandler wrote:My code looks like C or C++ code, rather than code with lots of calls to e_read() and e_write().
The code I posted just polls the Epiphany for events (I don't think it is possible to avoid polling) and synchronizes in those events, which was good enough for my purposes. All other code is free of Epiphany references.

nickoppen wrote:or something like memcopy that is at least well understood.
Keep in mind that memcpy is slow for small amounts of data. For copying 64 bits using memcpy, I have measured ~160 cycles of overhead.

Re: e_read and e_write... I don't get it

Postby GreggChandler » Mon Apr 24, 2017 1:34 pm

sebraa wrote:I blame Adapteva's eSDK documentation


I agree with sebraa; the eSDK documentation is not very clear here. The sample programs that I looked at didn't deal with most of these issues at all, but rather hard-coded external memory accesses with constants, and inserted long sleeps waiting for some code to complete or stop. Hopefully, sharing some of the ideas that I wished I had known at the start will save others some time.

sebraa wrote:(I don't think it is possible to avoid polling


I also agree that polling is one way to accomplish some of this synchronization, although I do it via synchronization primitives. I have not experimented with the Epiphany interrupt system, however, it would be an alternative to polling--although I am not sure whether the Epiphany can interrupt the ARM. From the eSDK, it is clear that the ARM can interrupt the Epiphany.

As an aside, this thread inspired me to code a DMA vs memcpy() benchmark. In my benchmark, I timed single core numbers, and also numbers when other cores were also doing DMA/memcpy(). The numbers further inform one's understanding of the Epiphany architecture.

Re: e_read and e_write... I don't get it

Postby nickoppen » Wed Apr 26, 2017 12:04 am

Greg, how did the two methods compare?

I've read a few different analyses comparing e_read and e_write with DMA but they are mostly silent about memcpy.

I wanted to use e_read/e_write because it seems that for less than 300 bytes they are quickest if you can use burst mode. I believe that burst mode is used when you are copying consecutive chunks of 8 bytes, which I am.

nick

Re: e_read and e_write... I don't get it

Postby jar » Wed Apr 26, 2017 3:53 am

The fastest device-side method I've found to copy data has been included in the ARL OpenSHMEM library in shmemx_memcpy.c: https://github.com/USArmyResearchLab/op ... x_memcpy.c

If you're just copying data from a core to a remote core or DRAM and not trying to perform asynchronous DMAs with other computation, try this. In general, I've found trying to overlap computation and communication (via asynchronous DMA) not to be worth it. There are two hardware bugs that throttle the DMA engine in E3, preventing it from achieving peak performance (it's closer to 1.4 GB/s). Properly unrolled double-word loads and stores are faster -- approaching 2.4 GB/s (8 bytes every 2 cycles). See below.

This routine is a general memcpy routine which will also accept misaligned data and arrays that are offset 1-7 bytes and have remainders of 1-7 bytes. I spent quite some time on it and validated it for all different sizes. The routine has also been optimized for instruction size (something like 176 bytes, IIRC). The routine is used as inline assembly so that the OpenSHMEM library can be used as a header-only library, leading to generally faster code and smaller program sizes. A lot of work went into the library in the past couple months to improve the usability and stability.

The ARL OpenSHMEM for Epiphany library should be able to be compiled for those of you using the eSDK and not COPRTHR. I would appreciate feedback (bugs/requests) on GitHub.

Epiphany has excellent "put" performance, but crummy "get" performance since the request has to traverse the mesh network and then back, stalling the core until the result is returned. There is an interprocessor interrupt that can be used to force a remote core to write data to the local core. The overhead for this routine is surprisingly small (see below).

The following benchmarks were compiled with SHMEM_USE_HEADER_ONLY and are available in the ./test directory of the repository.

Put (peak ~2.24 GB/s):
Code:
$ coprsh -np 1 ./put.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM PutMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      73
     2      88
     4     118
     8     179
    16      79
    32      83
    64     124
   128     151
   256     204
   512     313
  1024     531
  2048     966
  4096    1836
  8192    3574


Non-blocking Put which subsequently blocks (peak ~1.41 GB/s):
Code:
$ coprsh -np 1 ./put_nb.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM Non-Blocking PutMem times for variable message size
# Bytes   Latency (nanoseconds)
     1     108
     2     108
     4     108
     8     108
    16     108
    32     108
    64     108
   128     108
   256     211
   512     408
  1024     751
  2048    1448
  4096    2864
  8192    5673


Get (peak ~305 MB/s):
Code:
$ coprsh -np 1 ./get.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM GetMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      91
     2     124
     4     194
     8     328
    16     121
    32     169
    64     293
   128     501
   256     909
   512    1728
  1024    3359
  2048    6623
  4096   13153
  8192   26221


Get using Interprocessor Interrupt (peak ~2.1 GB/s):
Code:
$ coprsh -np 1 ./get_ipi.x
COPRTHR-2-BETA (Anthem) build 20160630.1527
# SHMEM GetMem times for variable message size
# Bytes   Latency (nanoseconds)
     1      93
     2     128
     4     194
     8     329
    16     124
    32     171
    64     293
   128     394
   256     448
   512     558
  1024     774
  2048    1208
  4096    2078
  8192    3818
jar
 
Posts: 288
Joined: Mon Dec 17, 2012 3:27 am

Re: e_read and e_write... I don't get it

Postby GreggChandler » Wed Apr 26, 2017 5:41 am

My tests are from the Epiphany perspective and demonstrate, as expected, that a significant factor in these measurements is other-core activity: the DMA appears to execute through the mesh. I used WAND for my test barrier and the e_dma_copy() eSDK function. I also wrote my own locmemcpy() in C++ which did 64-bit reads and writes, but these results use the standard memcpy() executing in core memory. Where possible, I used e_read()/e_write(), also executing from core memory. I ran tests for each of the cores in the group (group size is configurable via a switch), and also with the entire group executing the tests simultaneously. I used E_CTIMER_0 in E_TIMER_CLK mode and disabled interrupts for the duration of the test.

I was unable to achieve 64 bits per instruction cycle. I presume that implies a problem with my methodology. What have I missed?

test.txt
(3.64 KiB) Downloaded 54 times
