
Re: Object Detection

Posted: Mon Aug 26, 2013 9:19 am
by notzed
Just posted another update after some hacking on the weekend. Although I still seem to have bugs, they are the same in both versions, so I finally have some comparative performance numbers. A single epu core is coming in at around 1/2 the speed of an ARM one, although i think i can improve both and close the gap between them.

http://a-hackers-craic.blogspot.com.au/2013/08/a-bit-more-vj-progress-thoughts-on-elf.html

Re: Object Detection

Posted: Mon Aug 26, 2013 10:54 am
by aolofsson
notzed,

Great blog post! Really enjoyed reading it.

*It's possible to run gdb on code running on a core by first launching the 'e-server' and then launching e-gdb with your ecore (epu if you like) elf. It gets harder when you have a cooperative host/slave program.

*Not sure if you are using byte/short data formats at all, but there is a pretty big performance hit for loading anything non-32-bit from local memory.

*The A9 ARM core is significantly more advanced than the Epiphany cores, so I suppose we shouldn't be too surprised. Still disappointed though. :-) On straight up floating point "filter type code", the results should be closer.

*Is the Parallella time including data transfer time to and from the Epiphany?

Thanks,
Andreas

Re: Object Detection

Posted: Wed Aug 28, 2013 5:45 am
by notzed
Thanks for the gdb info - i only need to test smallish routines which can be made standalone. Now i'm more familiar with the ISA I can at least get stuff running.

Last night I also updated the post with some new numbers from using a dword aligned data structure - including from an optimised assembly implementation. I kept poking after writing the post and knocked another nearly 10% off the time shown.

The assembly implementation only uses ldrd for every access and where needed 16-bit values are extracted by shift/mask (apart from a couple of ldr's where data is padded for alignment). The c version compiles to code that does some ldrh's and calls some routine which doesn't seem necessary, i think for implementing "foo += float < float ? float : float" (looking it up: ___gtesf2), and anyway the format isn't really optimised for C address calculations since it includes the *4 for a float array offset. I will try a more c-friendly format at some point.
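As a rough illustration of the shift/mask extraction described above (not the actual assembly, and the record layout here is made up for the example), a 64-bit value as fetched by a single ldrd can be split into four 16-bit fields like this:

```python
import struct

def unpack_u16_fields(dword):
    """Split one 64-bit value into four unsigned 16-bit fields, lowest bits first."""
    return [(dword >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]

# Hypothetical record: four 16-bit values packed little-endian into 8 bytes,
# standing in for whatever the real dword-aligned structure holds.
raw = struct.pack("<4H", 3, 500, 65535, 12)

# One 64-bit read of the whole record, analogous to a single ldrd.
(dword,) = struct.unpack("<Q", raw)

fields = unpack_u16_fields(dword)
print(fields)
```

The point is that one wide, aligned load replaces several narrow ones; the cost is a shift and a mask per field.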

The timing is done host side in a way that might not be terribly accurate, but it shouldn't be out by more than a few (10s of?) ms - it brackets a polled mailbox start/complete cycle.

The timing doesn't include copying the (precalculated) data to the shared memory space from the ARM side, but it does include all the dma/loading from system memory to LDS (local data storage). All dma is dwords. The image data dma is not double-buffered yet, but from my timing code it spends only about 33M cycles waiting for it - which is what, 50ms? That is for 1600 32x32xfloat transfers.
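To sanity-check that estimate, here is a small sketch of the arithmetic. The 600 MHz core clock is an assumption (the thread doesn't state the clock rate); the cycle count and transfer sizes are the ones given above.

```python
# Assumed Epiphany core clock; not stated anywhere in this thread.
CLOCK_HZ = 600_000_000

def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1000.0

wait_cycles = 33_000_000           # cycles spent waiting on the image dma
transfers = 1600                   # number of image window transfers
bytes_per_transfer = 32 * 32 * 4   # 32x32 floats, 4 bytes each

wait_ms = cycles_to_ms(wait_cycles)
total_bytes = transfers * bytes_per_transfer
cycles_per_transfer = wait_cycles // transfers

print(f"wait: about {wait_ms:.0f} ms")
print(f"data moved: {total_bytes} bytes")
print(f"wait per transfer: {cycles_per_transfer} cycles")
```

At the assumed clock this comes out in the same ballpark as the 50ms guess, for roughly 6.5MB of image data.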

One reason i'm looking at viola-jones detection is the algorithm is kinda shitty as far as hardware is concerned (bandwidth intensive, naively cache unfriendly, simd unfriendly), and by thinking about how to fit it onto epiphany i think i ended up with a much better arm version than i started with - so the comparison to the arm version isn't that bad.

Once i've tried a few more things i'll put together a standalone GPL3 demo and upload it somewhere.

Re: Object Detection

Posted: Wed Aug 28, 2013 12:15 pm
by notzed
Ahh yeah, so because i kinda forgot about it at first, and then didn't think it would matter as it's such a small amount of memory compared to the rest of the calculation ... i wasn't dmaing in the variance-adjustment data. This is a float value which is read once per window location, or approx (512-20)*(512-20)~=242K times ... so ok, maybe it is a fair amount after all.
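Spelling that count out as a quick sketch: the 512x512 image and 20-pixel window are read off the expression above rather than stated outright, and the function name is just for illustration.

```python
def window_positions(image_w, image_h, window_w, window_h):
    """Approximate number of window locations, using the same
    (size - window) expression as in the post."""
    return (image_w - window_w) * (image_h - window_h)

# Assumed dimensions, inferred from "(512-20)*(512-20)".
count = window_positions(512, 512, 20, 20)

# One 4-byte float of variance data is read per window location.
variance_bytes = count * 4

print(count, variance_bytes)
```

So that's close to 1MB of reads in total, even though the underlying table is small.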

After adding a 12x12 dma chained with the image window load: assembly version is under 1s, and the c version is just over 2s.

Yep, 1 epiphany cpu is now beating 1 arm core, even if it took hand-written assembly to do it ...

I also tried a bunch of other stuff like double-buffering and shaved a bit more off it, but it's getting a bit academic and none of the numbers are really verified.

FWIW I tried aborting the read-ahead dma, despite your warning on the other thread. I write 0 to the dma config register immediately followed by the new descriptor. They're writing to the same buffer. It appears to work and shaves another 0.06s off the running time (but as it's very tricky to validate i haven't verified the total calculation; it may just be luck it appears to work with my test case).

Re: Object Detection

Posted: Wed Aug 28, 2013 8:25 pm
by LamsonNguyen
This is really nice. Looking forward to the demo!

Re: Object Detection

Posted: Fri Aug 30, 2013 3:40 am
by notzed
Made some improvements to memory usage and generality and posted another update about it:

http://a-hackers-craic.blogspot.com.au/2013/08/now-this-looks-better.html

PS the dma-abort hack did end up causing problems, but i found a way to not need it.

Re: Object Detection

Posted: Wed Sep 04, 2013 12:19 pm
by notzed
Well this is a problem that just keeps on giving. Another (provisional!) jump in performance - this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while (just not as sharp as i once was).

http://a-hackers-craic.blogspot.com.au/2013/09/that-scheduling.html

I have some multi-core numbers now too, although they don't mean much other than in relative terms.

Re: Object Detection

Posted: Wed Sep 04, 2013 4:42 pm
by Gravis
notzed wrote: this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while

i've found the dual scheduling to be a fun challenge because i've never heard of this type of constraint. then again, i haven't worked with FPUs using asm and certainly not a chip with two unique IALU pipelines.

notzed's blog wrote:Well the compiler can only improve ...

not true! compilers sometimes apply multiple optimizations on top of each other or adjacent which can actually end up making it slower as a whole! i'm hoping gcc has done something in the last decade (last i checked) to fix this issue. it's a tricky issue, so i wouldn't be surprised if it still happens on occasion.

notzed's blog wrote:In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms.

i was curious as to whether your code could scale to lots and lots of cores (64 core chip and multi chip configurations).

it would be awesome if you could provide a C wrapper/header for your code too. :)

dare to dream.

Re: Object Detection

Posted: Wed Sep 04, 2013 11:07 pm
by notzed
Gravis wrote:
notzed wrote: this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while

i've found the dual scheduling to be a fun challenge because i've never heard of this type of constraint. then again, i haven't worked with FPUs using asm and certainly not a chip with two unique IALU pipelines.

Really? NEON (cortex-a8 at least) has dual issue. CELL BE's SPUs have dual issue - with the added constraint that instructions for each pipeline must be in the correct instruction slot. IBM provided a timing tool that dumped a static analysis of the code though, which made it easier to work with. With the added complication of SIMD it was possible to get 10x out of good code vs naive.

notzed's blog wrote:Well the compiler can only improve ...

not true! compilers sometimes apply multiple optimizations on top of each other or adjacent which can actually end up making it slower as a whole! i'm hoping gcc has done something in the last decade (last i checked) to fix this issue. it's a tricky issue, so i wouldn't be surprised if it still happens on occasion.

It's only software ... of course it can improve. 10 years is an absolute æon in software even for something as mature as gcc.

But hey still have a way to go and maybe the epiphany gcc needs tuning.

notzed's blog wrote:In total elapsed time terms these are something like 1.8s, 0.88s, and 0.60s from slowest to fastest on a single core. I only have a multi-core driver for the assembly versions. On 1 column of cores best is 201ms vs improved at 157ms. With all 16 cores ... identical at 87ms.

i was curious as to whether your code could scale to lots and lots of cores (64 core chip and multi chip configurations).

it would be awesome if you could provide a C wrapper/header for your code too. :)

dare to dream.


I'll find out when i get a 64-core chip. With the current tuning, mesh contention throttles the performance for anything beyond 2 columns - so maybe it'll handle 16 cores on that.

The algorithm doesn't scale well due to the tables required being so large. Apart from the low-level code optimisation almost the entire effort of this endeavour has been to try to mitigate this problem. I'm not done yet and even the current implementation has tunables.
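For a rough sense of how that shows up in the numbers, here's a sketch that turns the elapsed times quoted from the blog into speedup and per-core efficiency. Two things are assumptions rather than anything confirmed in this thread: that the 0.60s single-core time pairs with the 157ms and 87ms multi-core times, and that one column means 4 cores.

```python
def speedup(t_single, t_parallel):
    """How many times faster the parallel run is than the single-core run."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, cores):
    """Speedup divided by core count; 1.0 would be perfect scaling."""
    return speedup(t_single, t_parallel) / cores

# Elapsed times in seconds, as quoted from the blog post.
T_SINGLE = 0.60                  # fastest single-core time (assumed pairing)
runs = {4: 0.157, 16: 0.087}     # cores -> elapsed; 4 = one column (assumed)

results = {
    cores: (speedup(T_SINGLE, t), efficiency(T_SINGLE, t, cores))
    for cores, t in runs.items()
}

for cores, (s, e) in results.items():
    print(f"{cores:2d} cores: speedup {s:.2f}x, efficiency {e:.0%}")
```

Under those assumptions one column scales nearly linearly, while the full 16 cores manage well under half of ideal.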

I'll publish a demo but likely won't be creating a library for it - i'd rather move on to the next thing.

Re: Object Detection

Posted: Wed Sep 04, 2013 11:34 pm
by aolofsson
notzed wrote: Well this is a problem that just keeps on giving. Another (provisional!) jump in performance - this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while (just not as sharp as i once was).

http://a-hackers-craic.blogspot.com.au/2013/09/that-scheduling.html

I have some multi-core numbers now too, although they don't mean much other than in relative terms.


Nice! It almost looks like you are enjoying this exercise. :D Not sure if this is related, but note that the MOVTS/MOVFS instructions go through the mesh (not ideal :( ) and will appear as ext_stalls (could be affecting the library timer calls...).

Andreas