Parallella Community

Posted: **Wed Apr 26, 2017 7:05 am**

Well, the numbers that I published are from the built-in memcpy(), which one would hope has been optimized at least a little. (Although my ten-minute hack is pretty close to the library function.) I also noticed, after publishing my data, that something strange was going on in the last run. All observations should have 160 samples, but not all of the trials were being displayed. When I switched back to my own barrier, the trials appeared correctly. Apparently the WAND instruction is "experimental". Since I found no documentation of what is "experimental" about WAND, i.e. what works and what doesn't, I will modify my barriers to idle and wait for interrupts. The current polling in the barriers significantly impacts the DMA measurements because it generates mesh traffic. Although, perhaps that better reflects a real-world benchmark? The reason I tested with all of the cores executing similar tests is that, hopefully, it would be rare for only one core to be working on a problem.

As an aside, my C++ locmemcpy() did generate ldrd/strd instructions.

Posted: **Wed Apr 26, 2017 1:07 pm**

The default memcpy that I'm looking at uses 32-bit loads and stores, and only when the memory is word-aligned. It does not use a hardware loop, so it wastes cycles decrementing the loop counter and branching. It doesn't load data at least two clock cycles before the store, causing the pipeline to stall for one cycle after each load. It seems it doesn't even use the post-modify increment/decrement instructions on the loads and stores, requiring additional clock cycles to compute the index offset. That's four obvious performance issues, if we're counting. It is not well optimized, and there's no reason to expect better performance from it than from vanilla C code (from which it is compiled). The routine is also larger and slower than the one I have provided. In summary, the default memcpy is a very poor implementation, but go ahead and use it if that's what you want.
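To make two of those points concrete (the word-aligned fast path, and issuing loads ahead of the stores that consume them), here is a minimal portable C sketch. It is not the library memcpy and not the routine referred to in this post; `copy_words_sketch` and `word_t` are hypothetical names, and the `may_alias` attribute is a GCC/Clang extension assumed to be available. Whether the compiler actually emits post-modify loads and stores for the `*p++` accesses depends on the compiler and flags, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (assumed available): allows reading and writing
   arbitrary buffers as 32-bit words without breaking strict aliasing. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical sketch: word-aligned fast path with loads issued before stores. */
void *copy_words_sketch(void *dest, const void *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));

    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Take the word path only when both pointers are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;
        size_t words = n / 4;

        /* Two words per iteration: both loads come before either store,
           so no store has to wait on the load immediately preceding it. */
        while (words >= 2) {
            word_t a = *sw++;
            word_t b = *sw++;
            *dw++ = a;
            *dw++ = b;
            words -= 2;
        }
        if (words) {
            *dw++ = *sw++;          /* odd word left over */
        }

        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
        n &= 3u;                    /* 0 to 3 trailing bytes remain */
    }

    /* Byte path: trailing bytes, or the whole copy if unaligned. */
    while (n--) {
        *d++ = *s++;
    }
    return dest;
}
```

Calling `copy_words_sketch(dst, src, len)` behaves like memcpy for non-overlapping buffers, including returning `dst`.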

I've had success with the WAND barrier, but it only works if all 16 Epiphany cores participate. It is actually an optional experimental feature in the OpenSHMEM library for shmem_barrier_all (you must enable it; the default is a software barrier).

Posted: **Wed Apr 26, 2017 4:57 pm**

I am not using memcpy(); however, I was asked to compare against it. I think the numbers clearly indicate that one can improve upon e_read() and e_write(), which probably weren't optimized for large transfers. (I haven't examined the source to verify this conjecture, however.)

I agree that it is surprising that auto increment is not used in the compiled code. I found the same to be true of my compiled code. The register optimizations in generated code could be better as well, but that is hopefully always true when comparing against hand-optimized code. It was clear from the compiled code for my locmemcpy() that hand coding in assembly would improve it. I am not sure whether I want to learn yet another instruction set, but it may be advantageous here. My C++ locmemcpy() code appears to max out at 2 bytes per cycle.

My locmemcpy() does appear to beat DMA at times, which surprises me, especially with 1024-byte buffers. I had read the 300-byte transfer number that nickoppen referred to somewhere before, though I don't remember where. Consequently, I expected DMA to always beat my locmemcpy() with 1024-byte buffers. It doesn't. That led me to suspect my test methodology.

The current code idles the non-active processors while the single core tests are executed, and wakes them up at the end of the test. Each test is repeated ten times on each individual core, and the combined core test is also repeated ten times.
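For anyone reproducing this, a small host-side check along these lines would catch the missing-trials problem mentioned earlier in the thread: 16 cores times ten repeats should give 160 samples per observation. This is only a sketch; `summary_t`, `summarize`, and the constant names are hypothetical and are not taken from the attached test code.

```c
#include <assert.h>
#include <stddef.h>

/* Figures taken from the thread: 16 cores, each test repeated ten times. */
enum { NUM_CORES = 16, REPEATS = 10, EXPECTED_SAMPLES = NUM_CORES * REPEATS };

/* Hypothetical summary record for one observation (one test configuration). */
typedef struct {
    size_t   count;     /* samples actually received */
    unsigned min;       /* fastest run, in cycles */
    unsigned max;       /* slowest run, in cycles */
    double   mean;      /* average, in cycles */
    int      complete;  /* 1 if count matches the expected 160 samples */
} summary_t;

/* Summarize cycle counts and flag observations with missing trials. */
summary_t summarize(const unsigned *cycles, size_t count)
{
    assert(cycles != NULL || count == 0);

    summary_t r = { count, 0, 0, 0.0, count == (size_t)EXPECTED_SAMPLES };
    if (count == 0) {
        return r;
    }

    double sum = 0.0;
    r.min = r.max = cycles[0];
    for (size_t i = 0; i < count; i++) {
        if (cycles[i] < r.min) r.min = cycles[i];
        if (cycles[i] > r.max) r.max = cycles[i];
        sum += cycles[i];
    }
    r.mean = sum / (double)count;
    return r;
}
```

Checking the `complete` field before trusting the statistics makes a dropped trial visible instead of silently skewing the averages.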

Attachment: test.txt (5.04 KiB)

Posted: **Wed Apr 26, 2017 9:12 pm**

e_read() and e_write() are using the slow memcpy (after some address translation for the destination):
https://github.com/adapteva/epiphany-li ... mem_read.c
https://github.com/adapteva/epiphany-li ... em_write.c

Just so it's clear, the source code for memcpy is here:
https://github.com/adapteva/epiphany-gc ... c/memcpy.c
It is unoptimized, but the compiler does generate a "fast" word-aligned code path as I described. If the entire transfer were ldrb/strb byte operations, it would be particularly bad.

The DMA engine has some setup time which cannot be completely mitigated. At no point will a DMA transfer beat shmemx_memcpy because the Epiphany-III DMA engine is hobbled by hardware errata. It should not surprise you that your locmemcpy beats it.
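A simple way to reason about the setup cost is a linear model: the DMA transfer pays a fixed setup and then moves bytes at its own rate, while a software copy has no setup but its own per-byte cost. The sketch below computes the transfer size above which DMA would win under that model. The function name and parameters are hypothetical, the model ignores mesh contention and the errata mentioned above, and no measured values are implied; plug in your own numbers.

```c
#include <assert.h>
#include <math.h>   /* INFINITY */

/* Hypothetical linear cost model (an assumption, not a measurement):
     t_dma(n) = setup_cycles + n / dma_bytes_per_cycle
     t_cpu(n) = n / cpu_bytes_per_cycle
   Returns the size n (in bytes) above which t_dma(n) < t_cpu(n),
   or INFINITY if the DMA rate is not higher than the software copy rate,
   in which case DMA never wins. */
double dma_breakeven_bytes(double setup_cycles,
                           double dma_bytes_per_cycle,
                           double cpu_bytes_per_cycle)
{
    assert(setup_cycles >= 0.0);
    assert(dma_bytes_per_cycle > 0.0 && cpu_bytes_per_cycle > 0.0);

    if (dma_bytes_per_cycle <= cpu_bytes_per_cycle) {
        return INFINITY;
    }
    /* setup + n/bd = n/bc  =>  n = setup / (1/bc - 1/bd) */
    return setup_cycles /
           (1.0 / cpu_bytes_per_cycle - 1.0 / dma_bytes_per_cycle);
}
```

Under this model, a software copy that sustains at least the DMA engine's effective rate gives an infinite break-even size, which is consistent with the claim that DMA never beats the faster copy routine.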

The shmemx_memcpy does not return the dest address value to r0 like memcpy, so it's not exactly a drop-in replacement in some cases where the return value is required (it's generally not).
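If the return value is needed, a thin wrapper restores memcpy-like behaviour. In this sketch `copy_no_return` is a hypothetical stand-in for any copy routine that returns nothing; it is not the real shmemx_memcpy and no claim is made here about that function's exact signature. `memcpy_compat` is likewise a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a copy routine with no return value. */
static void copy_no_return(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--) {
        *d++ = *s++;
    }
}

/* Wrapper that returns dest, as the standard memcpy does. */
void *memcpy_compat(void *dest, const void *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));
    copy_no_return(dest, src, n);
    return dest;
}
```

With the wrapper, code that chains on the result, such as `char *p = memcpy_compat(buf, msg, len);`, keeps working.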

I could fix the e-lib implementation so that it would be higher performance in order to benefit people using it. But it's a poor interface that I don't believe in. It's not a standard API for any level of portability, uses a clunky/fragile 2D row/col interface to specify cores, doesn't manage memory, doesn't handle collective operations, doesn't provide atomic operations, doesn't guarantee memory ordering, and doesn't provide a simple DMA interface. If I were to re-define the interface to be better, it would end up looking a lot like the OpenSHMEM API.

Posted: **Thu Apr 27, 2017 2:28 am**

I was wrong! When I pulled out the instruction reference PDF and pulled up the generated assembly code for my locmemcpy() function (I used -S with a verbose option rather than e-objdump), I determined that for the most heavily unrolled loop, which moves 8 x 64 bits per iteration, the compiler was using auto-incrementing 64-bit ldrd/strd instructions. Furthermore, it was clever enough to place some of the other housekeeping calculations between the loads and stores. It also adjusted subsequent instructions after the housekeeping was done. It started off with a base address, but once the 8 x 64-bit increment had been applied (between the loads and stores), it generated instructions that automatically subtracted from the base. Very clever. I was impressed. Hats off to whoever wrote that code generator! It appears to understand the pipeline. I used -O3 to generate the code; I may not have used any optimizations on my prior compiles. Unfortunately, it ran out of other calculations to interleave, so the generated code probably still has stalls. I think I will try to capture that metric.

The section of code that didn't use auto increment was the tail, where the remaining 7 to 1 x 64-bit transfers are executed. It did, however, put the add instructions between the ldr and str instructions again, which was somewhat clever. I am not familiar enough with the instruction set yet to see whether it used two 16-bit instructions there. Hand optimization of the generated code would only be slightly better, and the payoff isn't that great in the tail. The last tail, where the remaining 7 to 1 x 8-bit transfers are done, was also sub-optimal; however, the optimizer was likely confused by the somewhat non-standard, and somewhat optimal, C/C++ code. A little hand optimization would probably help there as well. It still was reasonably clever.
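For readers following along without the source, the three-stage structure described above (an unrolled 8 x 64-bit loop, a 64-bit tail, then a byte tail) looks roughly like the portable C sketch below. This is not the actual locmemcpy() code, which was not posted in this thread; `staged_copy_sketch` and `dword_t` are hypothetical names, and the `may_alias` attribute is a GCC/Clang extension assumed to be available. Whether the compiler turns the 64-bit accesses into ldrd/strd depends on the target and optimization flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (assumed available): alias-safe 64-bit access type. */
typedef uint64_t __attribute__((__may_alias__)) dword_t;

/* Hypothetical three-stage copy: 8 x 64-bit blocks, 64-bit tail, byte tail. */
void *staged_copy_sketch(void *dest, const void *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));

    unsigned char *d = dest;
    const unsigned char *s = src;

    /* The 64-bit stages need both pointers 8-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 7u) == 0) {
        dword_t *dd = (dword_t *)d;
        const dword_t *sd = (const dword_t *)s;

        /* Stage 1: unrolled blocks of 8 x 64 bits (64 bytes).
           All eight loads are issued before the stores. */
        for (; n >= 64; n -= 64) {
            dword_t t0 = sd[0], t1 = sd[1], t2 = sd[2], t3 = sd[3];
            dword_t t4 = sd[4], t5 = sd[5], t6 = sd[6], t7 = sd[7];
            dd[0] = t0; dd[1] = t1; dd[2] = t2; dd[3] = t3;
            dd[4] = t4; dd[5] = t5; dd[6] = t6; dd[7] = t7;
            sd += 8;
            dd += 8;
        }

        /* Stage 2: tail of 7 to 1 x 64-bit transfers. */
        for (; n >= 8; n -= 8) {
            *dd++ = *sd++;
        }

        d = (unsigned char *)dd;
        s = (const unsigned char *)sd;
    }

    /* Stage 3: tail of 7 to 1 bytes (or the whole copy if unaligned). */
    while (n--) {
        *d++ = *s++;
    }
    return dest;
}
```

Compiling a routine like this with -O3 and -S is a quick way to see which stages get post-modify addressing and which do not.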

My conclusion is that the optimizations were pretty good. A person can probably do better, but the compiler still does a respectable job. This is much better than the early (late '70s and early '80s) MIT compilers and assemblers I worked with. https://en.wikipedia.org/wiki/NuMachine

e_read and e_write... I don't get it
