
Re: Object Detection

PostPosted: Wed Sep 04, 2013 11:44 pm
by Gravis
notzed wrote:
Gravis wrote:i've found the dual scheduling to be a fun challenge because i've never heard of this type of constraint. then again, i haven't worked with FPUs using asm and certainly not a chip with two unique IALU pipelines.

Really? NEON (cortex-a8 at least) has dual issue. CELL BE's SPUs have dual issue - with the added constraint that instructions for each pipeline must be in the correct instruction slot. IBM provided a timing tool that dumped a static analysis of the code, though, which made it easier to work with. With the added complication of SIMD it was possible to get 10x out of good code vs naive.

i didn't know that about NEON or CELL BE SPUs, which is in part because i don't write assembly unless it's going to run on a system without an OS, and ARM9 chips have no NEON. i really do love assembly, but i like reduced development time and portability much more.

Re: Object Detection

PostPosted: Mon Sep 09, 2013 1:38 am
by notzed
aolofsson wrote:
notzed wrote:Well this is a problem that just keeps on giving. Another (provisional!) jump in performance - this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while (just not as sharp as i once was).

http://a-hackers-craic.blogspot.com.au/2013/09/that-scheduling.html

I have some multi-core numbers now too, although they don't mean much other than in relative terms.


Nice! It almost looks like you are enjoying this exercise. :D Not sure if this is related, but note that the MOVTS/MOVFS instructions go through the mesh (not ideal :( ) and will appear as ext_stalls (which could be affecting the library timer calls...).

Andreas


I mostly only got the thing for entertainment value tbh, so if i wasn't having some fun i'd go find something else to do - the fun is all in besting the challenges along the way. Although it wasn't exactly cheap, it's far less than i spent on my last bicycle or television, for something so unique.

The movts thing must be the cause as the C compiler litters them around (for the float/int switch I guess). It isn't a major influence on the runtime at least.
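
For reference, the timer the library calls use is the per-core ctimer, which is configured through those special registers. A minimal sketch of timing a section of code with it, assuming the SDK's e-lib ctimer interface (e_ctimer_set/e_ctimer_start/e_ctimer_get/e_ctimer_stop) and a hypothetical some_work() function:

Code:
#include <e_lib.h>

extern void some_work(void);   /* hypothetical function being timed */

/* Count the clock cycles spent in some_work(); the ctimer counts down
   from whatever value it was set to. */
unsigned time_some_work(void)
{
    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);    /* load the counter        */
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);  /* count core clock cycles */

    some_work();

    unsigned remaining = e_ctimer_get(E_CTIMER_0);
    e_ctimer_stop(E_CTIMER_0);

    return E_CTIMER_MAX - remaining;           /* elapsed cycles */
}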

Re: Object Detection

PostPosted: Tue Sep 24, 2013 12:19 pm
by notzed
Small update ...

I had a pile of code to sew together in order to make a (more) complete implementation and i finally sat down to it tonight.

I'm just using the cpu to do the scaling and sat (summed area table) generation synchronously, and then using the epiphany to do the detection. I'm using 5 resample scales from 1/2 to 1/4 inclusive on a 512x512 source, and probing every location.

Single-core arm only is about 600ms; using one epiphany core for the detection stage is a little bit faster, and 4 cores is about 2x faster, but by then it's spending 60% or so just on the scaling/sat tables.

So I'll have to move more onto the epu before there's anything worth talking about. It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.
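
As an aside, the sat here is a summed area table (integral image), which lets any rectangular window be summed with four lookups. A minimal single-threaded C sketch of the idea, with my own names rather than anything from the actual code:

Code:
#include <stdint.h>

/* Build a summed area table: sat[y*width+x] holds the sum of src over the
   rectangle (0,0)..(x,y) inclusive.  Single pass using a running row sum
   plus the row above. */
void sat_generate(const uint8_t *src, uint32_t *sat, int width, int height)
{
    for (int y = 0; y < height; y++) {
        uint32_t rowsum = 0;
        for (int x = 0; x < width; x++) {
            rowsum += src[y * width + x];
            sat[y * width + x] = rowsum + (y > 0 ? sat[(y - 1) * width + x] : 0);
        }
    }
}

/* Sum over the inclusive window (x0,y0)..(x1,y1) using four lookups. */
uint32_t sat_window(const uint32_t *sat, int width,
                    int x0, int y0, int x1, int y1)
{
    uint32_t a = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * width + (x0 - 1)] : 0;
    uint32_t b = (y0 > 0) ? sat[(y0 - 1) * width + x1] : 0;
    uint32_t c = (x0 > 0) ? sat[y1 * width + (x0 - 1)] : 0;
    uint32_t d = sat[y1 * width + x1];
    return d - b - c + a;
}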

Re: Object Detection

PostPosted: Wed Sep 25, 2013 1:20 am
by Gravis
notzed wrote:It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.


there's always assembly! :)

a great thing about pipelining large algos is that it reduces the amount of code any one core has to run which gives you more local memory to play with.
a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.

Re: Object Detection

PostPosted: Wed Sep 25, 2013 4:13 am
by notzed
Gravis wrote:
notzed wrote:It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.

there's always assembly! :)

a great thing about pipelining large algos is that it reduces the amount of code any one core has to run which gives you more local memory to play with.
a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.


It's not problems with the compiler per se, it's having to load different code into different cores and resolve addresses so they can then talk to each other. Even just loading the code gets a bit messy.

Resolving addresses with a single binary is a bit of a pain, but still workable using compile-time scripts. But it's just going to be wasting my time to try the same approach for multiple binaries. I have some ideas but they'll take a few days of hacking to explore.

I know the current solution is just to hardcode everything using fixed addresses... but that's way too Commodore 64 for me.
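
To give an idea of the address-resolution problem: a pointer that is local to one core has to be turned into a global mesh address before any other core can use it. A rough sketch of the bit-level composition, assuming the usual Epiphany layout of a 12-bit core id (6-bit row, 6-bit column) above a 20-bit local offset; the helper name is mine:

Code:
#include <stdint.h>

/* Turn a core-local address into a globally routable mesh address.
   Top 12 bits: core id (6-bit row, 6-bit column); low 20 bits: local offset.
   row/col are absolute mesh coordinates, not group-relative ones. */
static inline void *mesh_global_address(unsigned row, unsigned col, void *local)
{
    uint32_t coreid = ((row & 0x3f) << 6) | (col & 0x3f);
    uint32_t offset = (uint32_t)(uintptr_t)local & 0x000fffff;
    return (void *)((coreid << 20) | offset);
}

The hard part isn't the arithmetic, of course - it's knowing what the local address in the other core's binary actually is, which is exactly what the linking/loading machinery has to resolve.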

Re: Object Detection

PostPosted: Wed Sep 25, 2013 1:09 pm
by ysapir
gravis wrote:a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.


You are also paying for data transfer times between pipeline stages. But, in a balanced system, this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.
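
A minimal sketch of the receiving side of such a double-buffered stage, assuming a simple flag-based handshake (names and protocol are illustrative only, not from any particular SDK):

Code:
#include <stdint.h>

#define CHUNK 256

/* Ping-pong receive buffers for one pipeline stage: the upstream core writes
   into buf[i] and then sets ready[i]; this stage consumes the other buffer
   in the meantime, so transfer latency is hidden at the cost of 2x buffer
   memory. */
volatile int32_t ready[2] = { 0, 0 };
int32_t buf[2][CHUNK];

void stage_consume(void (*process)(const int32_t *chunk, int n))
{
    int which = 0;
    for (;;) {
        while (!ready[which])      /* wait for the upstream writer */
            ;
        process(buf[which], CHUNK);
        ready[which] = 0;          /* hand the buffer back */
        which ^= 1;                /* switch to the other buffer  */
    }
}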

Re: Object Detection

PostPosted: Fri Oct 04, 2013 8:17 pm
by Greg Zuro
ysapir wrote:You are also paying for data transfer times between pipeline stages. But, in a balanced system, this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.


What is the expected latency for adjacent core data transfers?

My hope for this board is exactly this sort of approach: Flow processing, etc.

I don't anticipate double buffering being a huge issue since data chunks will be quite small.

Re: Object Detection

PostPosted: Wed Oct 09, 2013 2:34 am
by notzed
Greg Zuro wrote:
ysapir wrote:You are also paying for data transfer times between pipeline stages. But, in a balanced system, this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.


What is the expected latency for adjacent core data transfers?

My hope for this board is exactly this sort of approach: Flow processing, etc.

I don't anticipate double buffering being a huge issue since data chunks will be quite small.


In the manual it states 1.5 clock cycles per hop, and since writes are "non-blocking" you should be able to pipeline at something approaching maximum efficiency without having to marshal locally first and send with DMA. This lets you save memory on the sender since you only need to buffer on the receiving end, and it also saves the read+write for the buffering.

I was going to look at this kind of thing next but got sidetracked with the elf stuff and some outside distractions. And laziness.
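
A sketch of the sending side under that scheme, writing results straight into the downstream core's buffers through global pointers (e.g. composed as in the earlier address sketch) instead of staging locally and DMAing; names and the flag handshake are illustrative only:

Code:
#include <stdint.h>

#define CHUNK 256

/* Producer half of the ping-pong handshake: each result is written directly
   into the downstream core's local buffer via a global (mesh) pointer, so no
   staging buffer or DMA descriptor is needed on the sending side. */
void stage_produce(volatile int32_t *remote_buf[2],
                   volatile int32_t *remote_ready[2],
                   int32_t (*next_value)(void))
{
    int which = 0;
    for (;;) {
        while (*remote_ready[which])        /* downstream still busy */
            ;
        for (int i = 0; i < CHUNK; i++)     /* non-blocking remote writes */
            remote_buf[which][i] = next_value();
        *remote_ready[which] = 1;           /* signal the chunk is complete */
        which ^= 1;
    }
}

In practice you'd want to check what ordering the mesh guarantees between the data writes and the flag write; if they aren't guaranteed to arrive in order, a readback or similar is needed before raising the flag.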