Object Detection

Re: Object Detection

Postby Gravis » Wed Sep 04, 2013 11:44 pm

notzed wrote:
Gravis wrote:i've found the dual scheduling to be a fun challenge because i've never heard of this type of constraint. then again, i haven't worked with FPUs using asm and certainly not a chip with two unique IALU pipelines.

Really? NEON (cortex-a8 at least) has dual issue. CELL BE's SPUs have dual issue - with the added constraint that instructions for each pipeline must be in the correct instruction slot. IBM provided a timing tool that dumped a static analysis of the code, though, which made it easier to work with. With the added complication of SIMD it was possible to get 10x out of good code vs naive.

i didn't know that about NEON or CELL BE SPUs, which is in part because i don't write assembly unless it's going to run on a system without an OS, and ARM9 chips have no NEON. i really do love assembly but i like reduced development time and portability much more.
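As a rough illustration of the dual-issue scheduling being discussed (my sketch, not code from either poster): a single-accumulator loop is one long dependency chain, so the second pipeline sits idle; splitting the work into two independent chains gives the scheduler instructions it can pair up each cycle.

```c
#include <assert.h>

/* Hypothetical sketch: sum_single() is one serial dependency chain,
 * so a dual-issue core can't pair its adds. sum_paired() splits the
 * reduction into two independent chains, giving the scheduler two
 * adds with no dependency between them on every iteration. */
float sum_single(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];              /* each add waits on the previous one */
    return s;
}

float sum_paired(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f; /* two independent chains */
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];             /* no dependency between these two, */
        s1 += a[i + 1];         /* so they can issue together */
    }
    if (n & 1)
        s0 += a[n - 1];         /* odd tail element */
    return s0 + s1;
}
```

The same idea generalises to interleaving IALU and FPU work so both pipelines have something to do every cycle.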
User avatar
Gravis
 
Posts: 445
Joined: Mon Dec 17, 2012 3:27 am
Location: East coast USA.

Re: Object Detection

Postby notzed » Mon Sep 09, 2013 1:38 am

aolofsson wrote:
notzed wrote:Well this is a problem that just keeps on giving. Another (provisional!) jump in performance - this time simply through better instruction scheduling of just one loop. Although i must say they were the hardest 33 "lines" of code i've written in a while (just not as sharp as i once was).

http://a-hackers-craic.blogspot.com.au/2013/09/that-scheduling.html

I have some multi-core numbers now too, although they don't mean much other than in relative terms.


Nice! It almost looks like you are enjoying this exercise. :D Not sure if this is related, but note that the MOVTS/MOVFS instructions go through the mesh (not ideal :( ) and will appear as ext_stalls (could be affecting the library timer calls...).

Andreas


I mostly only got the thing for entertainment value tbh, so if i wasn't having some fun i'd go find something else to do - the fun is all in besting the challenges along the way. Although it wasn't exactly cheap, it's far less than i spent on my last bicycle or television, for something so unique.

The movts thing must be the cause as the C compiler litters them around (for the float/int switch I guess). It isn't a major influence on the runtime at least.
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Object Detection

Postby notzed » Tue Sep 24, 2013 12:19 pm

Small update ...

I had a pile of code to sew together in order to make a (more) complete implementation and i finally sat down to it tonight.

I'm just using the cpu to do the scaling and sat generation synchronously and then using the epiphany to do the detection. I'm using 5 resample scales from 1/2 to 1/4 inclusive and a 512x512 source and probing every location.

single-core arm only is about 600ms, using one epiphany core for the detection stage is a little bit faster, 4 cores is about 2x faster but by then it's spending 60% or so just on the scaling/sat tables.

So I'll have to move more onto the epu before there's anything worth talking about. It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.
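For reference, the "sat" tables mentioned above are summed-area tables (integral images), the standard acceleration structure for Viola-Jones-style detectors. A minimal sketch of building and querying one (my illustration, not the poster's code, with hypothetical W/H dimensions):

```c
#include <assert.h>

/* Minimal summed-area table (integral image) sketch.
 * sat[y][x] holds the sum of all pixels in the rectangle (0,0)..(x,y),
 * so any axis-aligned rectangle sum is at most 4 table lookups. */
#define W 4
#define H 3

static unsigned sat[H][W];

void build_sat(unsigned char img[H][W])
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            sat[y][x] = img[y][x]
                      + (x ? sat[y][x - 1] : 0)
                      + (y ? sat[y - 1][x] : 0)
                      - (x && y ? sat[y - 1][x - 1] : 0);
}

/* Sum of the inclusive rectangle (x0,y0)..(x1,y1). */
unsigned rect_sum(int x0, int y0, int x1, int y1)
{
    unsigned s = sat[y1][x1];
    if (x0) s -= sat[y1][x0 - 1];
    if (y0) s -= sat[y0 - 1][x1];
    if (x0 && y0) s += sat[y0 - 1][x0 - 1];
    return s;
}
```

Building one of these per resample scale is the serial cost that dominates once the detection stage itself is parallelised.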

Re: Object Detection

Postby Gravis » Wed Sep 25, 2013 1:20 am

notzed wrote:It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.


there's always assembly! :)

a great thing about pipelining large algos is that it reduces the amount of code any one core has to run which gives you more local memory to play with.
a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.
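The bottleneck point above can be made concrete with a back-of-envelope model (my sketch, with made-up stage costs): once a pipeline is full it completes one item per "slowest stage" interval, so an unbalanced work distribution wastes every other core's cycles.

```c
#include <assert.h>

/* Back-of-envelope pipeline timing: total time for n items is the
 * fill time (sum of all stage costs, for the first item) plus
 * (n - 1) intervals of the slowest stage. Balancing stage costs is
 * exactly the "careful work distribution" being described. */
unsigned pipeline_cycles(const unsigned *stage_cost, int stages, int items)
{
    unsigned fill = 0, slowest = 0;
    for (int i = 0; i < stages; i++) {
        fill += stage_cost[i];
        if (stage_cost[i] > slowest)
            slowest = stage_cost[i];
    }
    return fill + (unsigned)(items - 1) * slowest;
}
```

With stages costing {10, 30, 20} cycles, 4 items take 60 + 3*30 = 150 cycles; rebalanced to {20, 20, 20}, the same work takes 60 + 3*20 = 120.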

Re: Object Detection

Postby notzed » Wed Sep 25, 2013 4:13 am

Gravis wrote:
notzed wrote:It's an opportunity to play with pipelined processing anyway but I think I will come up against sdk limitations.

there's always assembly! :)

a great thing about pipelining large algos is that it reduces the amount of code any one core has to run which gives you more local memory to play with.
a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.


It's not problems with the compiler per se, it's having to load different code into different cores and resolving addresses so they can then talk to each other. Even just loading the code gets a bit messy.

Resolving addresses with a single binary is a bit of a pain, but still workable using compile-time scripts. But it's just going to be wasting my time to try the same approach for multiple binaries. I have some ideas but they'll take a few days of hacking to explore.

I know the current solution is just to hardcode everything using fixed addresses... but that's way too Commodore 64 for me.
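For context on what those fixed addresses look like: the Epiphany architecture reference gives each core a 1MB window in the global address space, selected by its 12-bit core id (6-bit mesh row, 6-bit mesh column) in the top address bits. A sketch of the mapping (my illustration of the documented scheme, not the SDK's loader):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of Epiphany global addressing: a core's local 1MB window is
 * aliased at a global address whose top 12 bits are the core id
 * ((row << 6) | col), so one core can write into a neighbour's local
 * memory directly once this address is computed. */
uint32_t global_addr(unsigned row, unsigned col, uint32_t local)
{
    uint32_t coreid = (row << 6) | col;       /* 12-bit core id */
    return (coreid << 20) | (local & 0xFFFFF); /* 20-bit local offset */
}
```

On the Parallella the first core sits at mesh coordinates (32, 8), core id 0x808, so local offset 0x2000 on that core appears globally at 0x80802000. Hardcoding such constants works, hence "too Commodore 64" - the nicer approach is resolving them at load time.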

Re: Object Detection

Postby ysapir » Wed Sep 25, 2013 1:09 pm

gravis wrote:a bad thing is you have to carefully design the work distribution to prevent/fix bottlenecking.


You are also paying for data transfer times between pipeline stages. But in a balanced system this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.
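The double-buffering cost being described can be sketched as follows (my illustration, with a hypothetical CHUNK size): while one buffer is being consumed, the other is being filled by the previous stage, so every chunk costs twice its size in local memory.

```c
#include <assert.h>

/* Double-buffering sketch: the consumer works on buf[active] while
 * the producer (DMA, or a neighbouring core writing directly) fills
 * buf[1 - active]. The price is two copies of every chunk resident
 * in local memory at once. */
#define CHUNK 256

static float buf[2][CHUNK];
static int active = 0;

float *working_buffer(void) { return buf[active]; }
float *filling_buffer(void) { return buf[1 - active]; }
void   swap_buffers(void)   { active = 1 - active; }
```

On a 32KB local store, two 256-float buffers already claim 2KB before any code or tables - which is why buffering pressure bites before code size does.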
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Re: Object Detection

Postby Greg Zuro » Fri Oct 04, 2013 8:17 pm

ysapir wrote:You are also paying for data transfer times between pipeline stages. But in a balanced system this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.


What is the expected latency for adjacent-core data transfers?

My hope for this board is exactly this sort of approach: Flow processing, etc.

I don't anticipate double buffering as a huge issue since data chunks will be quite small.
Greg Zuro
 
Posts: 14
Joined: Sun Jan 06, 2013 10:58 am

Re: Object Detection

Postby notzed » Wed Oct 09, 2013 2:34 am

Greg Zuro wrote:
ysapir wrote:You are also paying for data transfer times between pipeline stages. But in a balanced system this is more of a latency problem than a throughput one, although the need for (double) buffering data arrays will eat your local memory faster than repeated code will.


What is the expected latency for adjacent-core data transfers?

My hope for this board is exactly this sort of approach: Flow processing, etc.

I don't anticipate double buffering as a huge issue since data chunks will be quite small.


The manual states 1.5 clock cycles per hop, and since writes are "non-blocking" you should be able to pipeline at something approaching maximum efficiency without having to marshal locally first and send with DMA. This saves memory on the sender, since you only need to buffer on the receiving end, and it also saves the read+write that local marshalling would cost.
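Taking the 1.5-cycles-per-hop figure at face value, a rough latency estimate for a write between two cores is just the Manhattan distance between them times 1.5 (my sketch; returned in half-cycles to stay in integer arithmetic):

```c
#include <assert.h>
#include <stdlib.h>

/* Rough write-latency model from the 1.5 clocks/hop figure: a write
 * crosses one router hop per row step plus one per column step.
 * Returns half-cycles so the 0.5 doesn't need floating point. */
unsigned write_latency_half_cycles(int r0, int c0, int r1, int c1)
{
    unsigned hops = (unsigned)(abs(r1 - r0) + abs(c1 - c0));
    return hops * 3;   /* 1.5 cycles = 3 half-cycles per hop */
}
```

So an adjacent core is ~1.5 cycles away, and the far corner of a 4x4 array (6 hops) is ~9 cycles - small enough that, as above, direct remote writes with receive-side buffering should pipeline well.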

I was going to look at this kind of thing next but got sidetracked with the elf stuff and some outside distractions. And laziness.
