Parallella Community

Posted: **Tue Mar 07, 2017 3:14 am**

how 'big' could you make individual e-cores before the design ethos is wasted,

or to put another way,

where is the 'point of diminishing returns' in adding transistors to a single core, with respect to throughput?
(e.g. how much superscalarity, deep pipelining, etc. can you add before adding more cores would have been better for overall throughput?)

(I seem to remember it's already a dual-issue design)

I note the Epiphany 5 has 64-bit registers and hence has gone to SIMD for 32-bit throughput (and I don't know, but would guess the 'deep-learning extensions' might involve 16- and 8-bit packed datatypes too).
Was this a sweet spot? It doesn't place extra program complexity on double-precision scientific code, whilst the SIMD idea might be unavoidable for maximum efficiency with low-precision datatypes (for video/AI). Was there ever any consideration of moving to a CELL-like setup with 128-bit registers (which might have scared people off re: the dual complexity of many-core *and* SIMD)? I know packed SIMD is largely seen as inflexible.
There have been machines with 128-bit registers and component-oriented ISAs (e.g. dot and cross-product instructions, broadcast/swizzles on multiplies). CELL itself was a bit crazy in *only* having 128-bit load/stores, and in some pathological cases people were advised to pad smaller types up.
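To make the packed-SIMD question concrete, here is a minimal Python sketch of what "two float32 lanes in one 64-bit register" means, plus the lane counts for narrower types. It is only an illustration of the general idea: the helper names (`pack_f32x2`, `unpack_f32x2`, `simd_add_f32x2`, `lanes`) and the choice of putting lane 0 in the low 32 bits are assumptions of this sketch, not anything taken from the Epiphany or CELL documentation.

```python
import struct

def pack_f32x2(lane0, lane1):
    """Pack two float32 values into one 64-bit integer (lane 0 in the low bits, by assumption)."""
    return int.from_bytes(struct.pack("<2f", lane0, lane1), "little")

def unpack_f32x2(reg):
    """Split a 64-bit integer back into its two float32 lanes."""
    return struct.unpack("<2f", reg.to_bytes(8, "little"))

def simd_add_f32x2(a, b):
    """Lane-wise add of two packed registers; each result is rounded back to float32 on packing."""
    sums = [x + y for x, y in zip(unpack_f32x2(a), unpack_f32x2(b))]
    return pack_f32x2(*sums)

def lanes(register_bits, element_bits):
    """How many packed elements of a given width fit in one register."""
    return register_bits // element_bits

a = pack_f32x2(1.5, 2.0)
b = pack_f32x2(0.25, 4.0)
result = unpack_f32x2(simd_add_f32x2(a, b))
print(result)
print(lanes(64, 32), lanes(64, 16), lanes(64, 8), lanes(128, 32))
```

The lane arithmetic is the inflexible part: the wider the register relative to the element, the more data has to be arranged in exactly that layout to keep every lane busy.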

Posted: **Tue Mar 07, 2017 5:38 am**

The architecture tradeoffs are often application-dependent. The 64-bit operations on E3/E4 use two 32-bit registers. For example, the double-word load/store accepts an even-numbered register and implicitly uses the subsequent odd-numbered register. One could imagine the 64-bit operations on E5 behaving similarly, with the circuitry for two single-precision FPUs being used for 64-bit operations. That wouldn't be packed SIMD, but in one way it does make full utilization of the single-precision FPUs more challenging, since you can't overlap memory operations in the second pipeline because it's being used by the second FPU.
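The even/odd register-pair convention can be sketched in a few lines of Python. This is a toy model, not the actual hardware behaviour: the function names are made up for illustration, the 64-entry register file is just a convenient size, and placing the low word in the even register is an assumption of this sketch.

```python
MASK32 = 0xFFFFFFFF

def write_pair(regs, rd, value64):
    """Store a 64-bit value across registers rd (even) and rd + 1 (odd)."""
    if rd % 2 != 0:
        raise ValueError("double-word operations need an even-numbered register")
    regs[rd] = value64 & MASK32               # low word in the even register (assumed ordering)
    regs[rd + 1] = (value64 >> 32) & MASK32   # high word in the implied odd register

def read_pair(regs, rd):
    """Read a 64-bit value back from the pair starting at even register rd."""
    if rd % 2 != 0:
        raise ValueError("double-word operations need an even-numbered register")
    return regs[rd] | (regs[rd + 1] << 32)

regs = [0] * 64  # illustrative file of 32-bit registers
write_pair(regs, 2, 0x1122334455667788)
print(hex(regs[2]), hex(regs[3]), hex(read_pair(regs, 2)))
```

The point of the sketch is that only one register number is encoded in the instruction; the second register is implied, so a 64-bit value costs two architectural registers rather than needing a wider register file.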

Personally, I hate writing code for packed SIMD.

Posted: **Tue Mar 07, 2017 10:31 pm**

oh, register-pair approach .. interesting (I had been assuming the 64-bit chip just had 64-bit registers like other RISCs)

>> Personally, I hate writing code for packed SIMD.

Sure, I'm aware how awkward it can be. I can see why the GPU vector idea took over (in the mainstream).

>>but in one way it does make full utilization of the single-precision FPUs more challenging since you can't overlap memory operations in the second pipeline because it's being used by second FPU.

Interesting, if it was just the ability to dual-issue 2 FPU operations (and indeed that would be down the road I was inquiring about: could future iterations just extend the amount of superscalar issue?). But I seem to remember reading that the E5 uses SIMD for maximum float32 throughput:

https://www.parallella.org/docs/e5_1024core_soc.pdf
• SIMD 32-bit IEEE floating point support

.. that's why I assumed it might be 64-bit registers (I can see it might be register pairs though).

Posted: **Tue Mar 07, 2017 10:38 pm**

You could be right about SIMD. If there are indeed SIMD instructions, you could still dual-issue 32-bit FPU instructions and overlap with a memory or IALU operation. Not sure.
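As a rough way to see why the overlap matters, here is a toy in-order dual-issue model in Python. Everything in it is an assumption made for illustration: instructions are just the labels "fpu" or "mem" (with "mem" standing in for a memory or IALU operation), there are no dependencies or stalls, and the `fpu_blocks_second_slot` flag is a hypothetical switch for the case where an FPU operation occupies both pipelines. It says nothing about how E5 actually issues instructions.

```python
def count_cycles(stream, fpu_blocks_second_slot=False):
    """Count issue cycles for an in-order stream of 'fpu' / 'mem' operations.

    Two adjacent operations share a cycle only if one is 'fpu' and the other
    is 'mem', and only if the FPU operation leaves the second slot free.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1
        first = stream[i]
        i += 1
        if i < len(stream):
            pairable = {first, stream[i]} == {"fpu", "mem"}
            if pairable and not fpu_blocks_second_slot:
                i += 1  # the second operation issues in the same cycle
    return cycles

loop_body = ["fpu", "mem"] * 4
print(count_cycles(loop_body))                               # overlapped issue
print(count_cycles(loop_body, fpu_blocks_second_slot=True))  # no overlap possible
```

In this toy model a balanced stream takes twice as many cycles once the FPU operation blocks the second slot, which is the utilization concern raised earlier in the thread.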

e-cores cores, complexity tradeoff