Parallella Community

by **mhonman** » Tue Sep 24, 2013 5:15 pm

This feels like something of a silly suggestion even as I write, but there are a couple of annoyances in the present system architecture (host + Epiphany co-processor).

Firstly, the host cannot participate on an equal basis - it has to busy-wait on communication with Epiphany cores, and cannot participate in mutex operations.
Secondly, there are some application activities (map/reduce) which, while they could be added to the workload of one of the cores, may unbalance the workload across the Epiphany cores, i.e. because this core has a bigger workload the others may hang around waiting for it.

So here's the slightly daft idea... In a 64-core Epiphany it would be handy to have an optional 65th core that can act as an agent of the host, and be available for coordination of the other cores. I guess more like a service processor than anything else - but it should have the same programming model and participate fully in the communication mesh. And of course it should be possible to disable the 5th wheel when more than one Epiphany chip is used.

by **over9000** » Tue Sep 24, 2013 6:13 pm

by **notzed** » Mon Sep 30, 2013 4:55 am

1/64 isn't much different to 1/65, so if you want a supervisor core you can still do it without losing much in the way of performance. I'm not sure if it's that useful either way though because there's still variable latency across the whole chip and any solutions I can think of which address that will work fairly well arm-side.

There are lots of ways to implement concurrent-safe code and imo a mutex is about the worst one for highly parallel workloads because any serialisation == no concurrency and (lots of) wasted flops. Job queues scale much better (which can be used to implement 'coroutines' in newspeak), and in a lot of cases you have deterministic processing time so static scheduling will be near optimal, but that obviously doesn't cover every case.

over9000: I came up with the same idea a couple of days ago and even coded it up, but haven't tried it yet. It will definitely work though. I started with the "local copy" version and I think this is absolutely necessary to avoid flooding the mesh network with useless read/write transactions. There's no reason it shouldn't work between ecore<>arm as well.

The only caveat is that if you want n-to-m producers-to-consumers you need n*m queues and must iterate them on each end, and the producer determines the scheduling. At least it can decide where to send the work based on queue load, so with shallow queues one should be able to get fairly decent dynamic scheduling.
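As a rough illustration of the n*m arrangement described above, here is a minimal single-process sketch in C. It is not notzed's code: the sizes, the names (`job_queue`, `submit_job`, `fetch_job`) and the "pick the least-loaded queue" policy are assumptions made for the example, and on real hardware the queues would live in core-local or shared memory rather than in one address space.

```c
#include <assert.h>

#define N_PRODUCERS 2   /* hypothetical sizes, chosen for illustration */
#define M_CONSUMERS 3
#define QUEUE_DEPTH 4   /* shallow queues; a power of two */

/* One single-producer/single-consumer ring per (producer, consumer) pair,
 * so each index has exactly one writer and no mutex is required. */
typedef struct {
    int      jobs[QUEUE_DEPTH];
    unsigned head;  /* total jobs pushed; written only by the producer */
    unsigned tail;  /* total jobs popped; written only by the consumer */
} job_queue;

static job_queue queues[N_PRODUCERS][M_CONSUMERS];

static unsigned queue_load(const job_queue *q) { return q->head - q->tail; }

/* Producer side: send the job to the consumer whose queue is least loaded.
 * Returns the chosen consumer index, or -1 if all of this producer's
 * queues are full. */
int submit_job(int producer, int job)
{
    assert(producer >= 0 && producer < N_PRODUCERS);
    int best = -1;
    unsigned best_load = QUEUE_DEPTH;
    for (int c = 0; c < M_CONSUMERS; c++) {
        unsigned load = queue_load(&queues[producer][c]);
        if (load < best_load) {
            best_load = load;
            best = c;
        }
    }
    if (best < 0)
        return -1;
    job_queue *q = &queues[producer][best];
    q->jobs[q->head % QUEUE_DEPTH] = job;
    q->head++;
    return best;
}

/* Consumer side: iterate over the queues from every producer and take
 * the first job found. Returns 1 and stores the job, or 0 if idle. */
int fetch_job(int consumer, int *job)
{
    assert(consumer >= 0 && consumer < M_CONSUMERS);
    for (int p = 0; p < N_PRODUCERS; p++) {
        job_queue *q = &queues[p][consumer];
        if (queue_load(q) > 0) {
            *job = q->jobs[q->tail % QUEUE_DEPTH];
            q->tail++;
            return 1;
        }
    }
    return 0;
}
```

With shallow queues the load seen by the producer is never far out of date, which is what makes this simple policy a plausible stand-in for dynamic scheduling.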

Another complication is external DMA which might need to be serialised for best performance.

AFAICT the epiphany and the shared memory area are mmapped on the arm with no caching, so there are no coherency issues from the arm side (this has to be so as otherwise none of my code would be working). And e-hal does no cache flushing or anything in e_read/write either.

The arm can raise interrupts just by writing to the ILAT in the same way any other core raises them on another one. This is how programs are currently started. Until the fpga logic includes it, it's only polling on the arm side though. Unless you're running microsecond-long tasks on the ecores a sleep/poll loop isn't that bad (and if they were that short, may as well run them on the arm).
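A sleep/poll loop of the kind mentioned above might look like the following sketch. It assumes a POSIX host; `wait_for_done`, the "non-zero means done" convention and the parameters are hypothetical, and `status` merely stands in for a word in the uncached shared mapping.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Check a status word up to max_polls times, sleeping sleep_us
 * microseconds after each unsuccessful check.
 * Returns the number of sleeps taken before the word became non-zero,
 * or -1 if it was still zero after max_polls checks. */
int wait_for_done(const volatile uint32_t *status,
                  long sleep_us, int max_polls)
{
    assert(status != NULL);
    struct timespec delay = {
        .tv_sec  = sleep_us / 1000000L,
        .tv_nsec = (sleep_us % 1000000L) * 1000L
    };
    for (int polls = 0; polls < max_polls; polls++) {
        if (*status != 0)
            return polls;
        nanosleep(&delay, NULL);  /* yield the CPU instead of spinning */
    }
    return -1;
}
```

The sleep interval is the tuning knob: as long as it is short compared with the task running on the ecores, the extra latency is small.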

Don't forget that one never gets 100% efficiency in any design. Just because it's explicitly visible with epiphany doesn't necessarily mean it's worse.

There's a broader issue here too of how to handle multiple workloads, but that probably deserves its own thread.

by **stealthpaladin** » Mon Sep 30, 2013 12:17 pm

Hi; I haven't got a board yet either but I thought I'd add my findings from design where I'm having some (hopefully useful) thoughts.

This may not work if you don't have a small enough data package; but for my solutions I've been aligning my queues with streams of 4KB chunks, subdivided into 1KB chunks.
These can 'unroll' onto the core stacks in a predictable manner. This gives me more of a frame-rate instead of GHz and flops; which at least in design has helped make things easy enough to keep parts moving.

The downside to this approach is that if you are switching data/functionality a lot, it's probably not so great. Once you 'thread' the stream through once, you can operate at high efficiency as long as you keep going. The beginning and end of a stream are a bit of a different story; I can go ahead and barrier there in my situation since I'm processing similar data for long periods at a time.

by **mhonman** » Wed Oct 02, 2013 9:08 pm

notzed: I must agree - personally I really dislike any kind of global synchronisation point but occasionally they are necessary & because they are so expensive I'd like them to be over with as quickly as possible.

The sort of application I'm thinking of is a typical grid-decomposition problem, where a 2D domain can be mapped onto the Epiphany mesh - in which case it becomes inconvenient to map the problem onto all-but-one of the processors. Chunks of the data that are logical neighbours cannot all be mapped to neighbouring cores, in which case there are some multi-hop communication paths and those may create bubbles of idleness for the rest of the mesh (depending on how much buffering can be done).

As things stand it looks like the ARM cores are something of a poor relation and interaction between the host and Epiphany is pretty much guaranteed to slow the latter down.

There is an associated question: assuming the shared DRAM is not dual-ported, is any polling of locations in this storage going to reduce the DRAM bandwidth available to the Epiphany?

Combining the ideas posted by over9000 and notzed, it looks like a good way of arranging message passing between host and cores is to use the circular buffers in DRAM, and where the core must wait for the host, use either the ILAT write or have the core poll a local memory word to which the ARM writes to signal that it has reached the synchronisation point.
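To make that arrangement concrete, here is a minimal sketch in plain C, runnable in a single process. Everything named here (`msg_ring`, `ring_send`, `ring_recv`, `host_signal`, `core_reached`, the slot count and message size) is an assumption for illustration; in practice the ring would sit in the shared DRAM window and the sync word in the core's local memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8   /* hypothetical depth; a power of two */
#define MSG_BYTES  32  /* hypothetical fixed message size */

/* Ring layout intended for shared DRAM. Each index has a single writer,
 * so one producer and one consumer need no mutex. */
typedef struct {
    volatile uint32_t head;  /* messages written; producer-owned */
    volatile uint32_t tail;  /* messages read; consumer-owned */
    uint8_t slots[RING_SLOTS][MSG_BYTES];
} msg_ring;

/* Copy len bytes into the next free slot. Returns 1, or 0 if full. */
int ring_send(msg_ring *r, const void *msg, size_t len)
{
    assert(len <= MSG_BYTES);
    if (r->head - r->tail == RING_SLOTS)
        return 0;
    memcpy(r->slots[r->head % RING_SLOTS], msg, len);
    /* On real hardware a write barrier may be needed here so that the
     * payload is visible before the index update. */
    r->head = r->head + 1;
    return 1;
}

/* Copy len bytes out of the oldest slot. Returns 1, or 0 if empty. */
int ring_recv(msg_ring *r, void *msg, size_t len)
{
    assert(len <= MSG_BYTES);
    if (r->head == r->tail)
        return 0;
    memcpy(msg, r->slots[r->tail % RING_SLOTS], len);
    r->tail = r->tail + 1;
    return 1;
}

/* Sync word meant to live in the core's local memory: the host writes
 * it remotely, the core polls it with local reads only. */
void host_signal(volatile uint32_t *sync_word, uint32_t epoch)
{
    *sync_word = epoch;
}

int core_reached(const volatile uint32_t *sync_word, uint32_t epoch)
{
    return *sync_word == epoch;
}
```

Polling the sync word locally keeps the waiting core off the DRAM bus, which bears on the bandwidth question above; the ring indices themselves would still be read from DRAM in this sketch.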

Time to go try some things... thanks!

by **Gravis** » Thu Oct 03, 2013 2:11 pm

it's my understanding that the purpose of the zynq on the parallella is simply to act as an interface to read and write to the epiphany, and that it was not intended to be an active processing participant.

frankly, i wish the epiphany was on a PCIe card so that it would be easier to interface and be slightly less expensive. maybe even have the epiphany chips on modules so that you could buy modules to link together several chips to get more cores. though it would be nice to have the option to have a very long board that has lots of epiphany chips right on it (32 chips!).

dare to dream

by **LamsonNguyen** » Fri Oct 04, 2013 12:18 am

That's one sweet dream, I would love to have that.

by **over9000** » Fri Oct 04, 2013 12:46 am

by **notzed** » Wed Oct 09, 2013 2:51 am

by **petr_cvek** » Fri Oct 11, 2013 12:51 am

Spare wheel (umm core)
