Obscure comparison of the day : Rockwell SPPD

Postby jadeblaquiere » Mon Feb 16, 2015 7:05 pm

So, back in the day I spent a few years developing signal processing algorithms for a rather similar architecture: the Rockwell SPPD processor module (some detail available from ). The scale was of course somewhat larger, but depending on how you wanted to package it, the SPPD was composed of 12-16 TMS320C30 DSPs, each with 32K words of local SRAM. Up to 4 processors shared a processor bus (P-Bus), and the P-Bus modules could talk to each other via the mezzanine bus (M-Bus) backplane. For our application space was at a premium, so this implementation was packaged into two multi-chip modules (MCMs), which were mounted together in a single package (roughly 4" square).

There were several features of the platform that I am finding very similar to the Epiphany CPU:
1. Local memory was very fast (no contention) and global memory (M-Bus) was very slow (high contention). Generally this meant you wanted to double-buffer your input stream and use DMA to move data in the background while the CPU did the interesting stuff. Also, anything you could do to reduce your data set up front paid you back quite a bit by reducing the amount of data you needed to move.
2. Local memory was limited. There were 8K bytes (2K words) of on-CPU memory and 32K bytes of local memory on the C30 expansion bus, but the smallest data type of the C30 is 32 bits, so it was not exceptionally efficient for processing lower-precision input data. Then you split your memory four ways to account for double buffering and input/output buffers, and pretty quickly you realized you were never going to make the whole problem fit in local RAM. Once you accept that, you find you can break your problem up so that you don't need to make it all fit at once. More RAM would be nice (it sounds like this is potentially in the works for a future generation of Epiphany), but ultimately there will always be a limit.
3. The processors allow for a certain amount of internal parallelism, but compilers often can't see the algorithm behind the code well enough to make these kinds of optimizations (e.g. a rotate can be implemented as a SHR plus an IMADD, but in your library it will usually be written as (x >> y) | (x << (32 - y))). Ultimately you'll need to find the bottlenecks in your algorithm and understand how you can either give the compiler a hint or break out the assembly.
4. The compiler knows nothing about the composite "machine". This is likely to keep the entity between the keyboard and the chair busy. In our case we were frequently balancing throughput against latency: doing things in parallel is usually faster (lower latency), but doing the work in pipeline stages can yield more overall throughput. It just depends on the type of problem you're trying to solve. With 16 cores there were limited options, but I suspect that as core counts get large, the design of pipelined parallel operations will get complex. On the Epiphany, topology (essentially algorithm layout) also matters a great deal, as sending data next door is much cheaper than sending it halfway across the chip.
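The double-buffering pattern from point 1 can be sketched in plain C. This is only an illustration of the control flow, not any real DMA API: `dma_start`/`dma_wait` are hypothetical names, and here `dma_start` is simulated with a blocking `memcpy` so the sketch runs anywhere. On real hardware the start call would return immediately and the compute would overlap the transfer.

```c
#include <string.h>

#define CHUNK 256   /* words per buffer, sized to fit fast local SRAM */

/* Stand-in for a background DMA transfer. On real hardware this would
 * kick off the DMA engine and return immediately; here it just copies. */
static void dma_start(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof *dst);
}
static void dma_wait(void) { /* block until the transfer completes */ }

/* Per-chunk compute in fast local memory (placeholder: scale by 2). */
static void process(float *buf, int n) {
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;
}

/* Double-buffered pipeline: DMA fills one buffer while the CPU works
 * on the other, hiding the latency of slow, contended global memory. */
void stream(const float *global_in, float *global_out, int nchunks) {
    static float bufs[2][CHUNK];   /* the two local buffers */
    int cur = 0;

    dma_start(bufs[cur], global_in, CHUNK);   /* prime buffer 0 */
    for (int c = 0; c < nchunks; c++) {
        dma_wait();                           /* current input chunk ready */
        if (c + 1 < nchunks)                  /* prefetch the next chunk */
            dma_start(bufs[cur ^ 1], global_in + (c + 1) * CHUNK, CHUNK);
        process(bufs[cur], CHUNK);            /* compute on local data */
        memcpy(global_out + (size_t)c * CHUNK, bufs[cur],
               CHUNK * sizeof(float));        /* drain result */
        cur ^= 1;                             /* swap buffers */
    }
}
```

The same shape works for the output side too (a second pair of buffers drained by DMA), which is where the four-way split of local memory mentioned in point 2 comes from.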
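On the "give the compiler a hint" side of point 3: a rotate written in the shift-and-OR idiom, with the shift counts masked, is the form modern compilers pattern-match into a single rotate instruction where the ISA has one (the SHR+IMADD trick on the C30 was the hand-rolled analogue). A minimal sketch:

```c
#include <stdint.h>

/* Rotate right by y bits. Written in the idiom compilers recognize and
 * lower to one rotate instruction; masking with & 31 keeps the shift
 * counts in range and avoids undefined behaviour at y == 0 or y == 32. */
static inline uint32_t rotr32(uint32_t x, unsigned y) {
    y &= 31;
    return (x >> y) | (x << ((32 - y) & 31));
}
```

The unmasked form `(x >> y) | (x << (32 - y))` is the one you usually find in libraries, but it shifts by 32 when y is 0, which is undefined in C and can defeat the compiler's pattern matching.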

There were a few things about the C30/SPPD implementation which we found really useful (perhaps somebody might consider them for a future Epiphany):
A. The C30 had a tiny (64-word) instruction cache, so you could link library routines to reside in global memory and, as long as the code was tight enough, they were still pretty fast. Even local routines benefited, since there were no program fetch operations getting in the way of data access.
B. There was intermediate memory (512K words) on the P-Bus, accessible only to the processors on that bus. This memory was shared by 3-4 CPUs, so it saw roughly a quarter of the contention of the global memory store, yet it could hold longer data streams. Essentially this would be like having on-chip memory in the Epiphany that was addressable only by one row. In our application we gave up one CPU per row and got P-Bus memory chips instead.
C. The bus ASICs implemented a simple mailbox system and synchronization logic that let any processor interrupt another in such a way that the target knew where the message came from; it also supported mutex implementation. There are surely ways to do this elegantly in software, and I've been looking for a lightweight implementation for Epiphany but haven't come across one yet.
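A lightweight software version of that mailbox idea can be sketched with C11 atomics: one slot per sender, so the slot index itself tells the receiver where the message came from, as the bus ASICs did. All names here are hypothetical, and this is a generic sketch, not Epiphany code; on the Epiphany the natural layout would put each receiver's slot array in its own fast local memory so that senders do the remote writes.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NCORES 16

/* One mailbox slot per sender: a full/empty flag plus a payload word. */
typedef struct {
    _Atomic int full;     /* 0 = empty, 1 = message waiting */
    uint32_t    payload;
} mailbox_t;

static mailbox_t mbox[NCORES];   /* slot s holds messages from core s */

/* Sender: if our slot is free, write the payload, then publish it.
 * The release store makes the payload visible before the flag flips. */
int mbox_send(int from, uint32_t msg) {
    if (atomic_load_explicit(&mbox[from].full, memory_order_acquire))
        return -1;                        /* slot still busy */
    mbox[from].payload = msg;
    atomic_store_explicit(&mbox[from].full, 1, memory_order_release);
    return 0;
}

/* Receiver: scan the slots; a set flag identifies the sender.
 * Returns the sender's core id, or -1 if nothing is waiting. */
int mbox_recv(uint32_t *msg) {
    for (int s = 0; s < NCORES; s++) {
        if (atomic_load_explicit(&mbox[s].full, memory_order_acquire)) {
            *msg = mbox[s].payload;
            atomic_store_explicit(&mbox[s].full, 0, memory_order_release);
            return s;
        }
    }
    return -1;
}
```

With single-producer slots the flag doubles as a one-entry flow-control token, and the same acquire/release flag pattern is the building block for a simple mutex; what software can't easily replicate is the hardware's ability to raise an interrupt on the receiving core, so this sketch polls instead.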

Of course the Epiphany CPU is 30+ times faster, 100+ times smaller and a zillion times more accessible. It's also 10 zillion times "funner". At least 10 zillion.
jadeblaquiere
 
Posts: 3
Joined: Mon Feb 16, 2015 3:59 pm
