(about my little project Paralle2)
I still believe Parallella is a wonderful development platform.
I was impressed to get a 10x10 Eternity 2 solver running as fast as 50 Mn/s, quite easily actually.
Some CPU cores don't go much faster: the performance is now a good 50 % of one of my high-end CPU cores.
And porting the C code was nothing special at first.
So I expected to write a 2nd tutorial...
I thought I had understood e_read() / e_write(),
since I could use them in my Paralle2 GitHub repo;
but I desperately need a tutorial myself... too many issues currently.
And yes, I've read *many* docs and tested as best I could.
1°) compiler option '-msmall16'
2°) Epiphany compiler's limits
a) support for INTEGERS
b) slow bit routines
c) no automatic optimization of local offsets
d) limited NOP feature
***
1°) compiler option '-msmall16' breaks with everything I've tried.
There's a summary in the zipped 'bug2.7z' I added to my work GitHub
(https://github.com/DonQuichotteComputers/paralle2) - hoping for help!
-msmall16 is not compatible with the SDRAM / eCore library e_lib.h, or most certainly with my brain :/
2°) Epiphany compiler not optimizing as much as it could
a) not enough support for INTEGERS
(from SDK 2015.1 downloaded late March 2016, 7020 headless package, gcc 4.8.2)
Before diving into assembly, I wanted to get the best out of plain C and compiler options.
This is how I learnt about -mshort-calls, -msmall16, -m1reg-r63... all must-haves IMO.
Especially because (did I tell you?) I'm only interested in integers.
And because Paralle2's speed gained one order of magnitude after dropping the 'fast' linker script for the 'internal' one.
'internal' or 'infernal', we have to choose. Paralle2 will be 'internal' or it won't be.
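For the record, the kind of compile line I converge on might look like this (a sketch from my setup: -msmall16 is left out since it breaks for me, see 1°), and the linker-script path will differ on your install):

```shell
# Hypothetical e-gcc invocation combining the flags discussed in this post.
# internal.ldf keeps code and data in on-chip SRAM (the 'internal' layout).
e-gcc -Ofast -mshort-calls -m1reg-r63 -mfp-mode=int \
      -T ${EPIPHANY_HOME}/bsps/current/internal.ldf \
      solver_core.c -o solver_core.elf -le-lib
```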
Well, all in all:
the compiler does a pretty good job with LD/ST/ALU operations, avoiding many stalls... nice work honestly.
And tools like the ESDK, objdump, gcc explorer, notzed's tools... very nice.
It would be very interesting, though, if the compiler could detect code that never uses the FPU.
I've never (never in a month) seen a single generated IALU2 instruction in my code.
-> no automatic detection of 'no FPU code' :/
-> no change with the compiler option -mfp-mode=int ?
-> no change with forcing IALU2 through the ad hoc register O_O
-> not even the easy change of an 'ADD some_regs' into an 'IADD some_regs' to reduce the cycle count o_O
So here we are, integer guys, condemned to a 50 % CPU yield :/
Or... coding in assembly. Not the way Adapteva will attract many people IMO; a rock-solid gcc or clang compiler, yes.
b) slow bit routines
I've read somewhere that gcc does not care about micro-benchmarks, just macro-benchmarks.
They focus on general performance boosts, not negligible local optimizations like, say, bit routines.
Unless there is high demand from a well-known program that is full of bit routines (Paralle2!? ^^), things won't change.
This is particularly true of the x86 'BT' instruction: they haven't really cared, for more than 10 years now.
Fortunately clang behaves well in most cases, and both are free tools.
It's just a fact of life: gcc targets so many platforms that it cannot optimize everything, every time.
Did you know the Xeon Phi added 5,000 new mnemonics... OK, I'll shut up.
All that, to say that I've rewritten __builtin_popcount and __builtin_ctz.
The original ctz was, iirc, something like 38 instructions with one branch after the 19th... wow.
Added to my work Paralle2 GitHub; hopefully nobody sues me lol -- I quote my sources.
c) no automatic optimization of local offsets
There are a lot of useless 'movt rX, 0x0' instructions when loading local-variable offsets, even with the -Ofast compilation option.
OK, there is this -msmall16 option, but it gives me headaches, see 1°).
Couldn't this be done automatically instead?
Since we work with the 'internal' layout,
and since the preceding 'mov' overwrites the whole 32-bit register (from the docs and my own tests),
this frequent 'movt rX, 0x0' is useless: a waste of space and time.
Even forcing a 'data_bank3' section location does not avoid this unoptimized behavior; mov r0, 0x6000 then movt r0, 0x0.
Out of the box, those dummy instructions make up 16 % of the code.
d) limited NOP feature
I come from the x86 world, where every NOP is automatically given any useful length (great work indeed; iirc it works even with an 11-byte encoding).
I see the compiler only uses the 2-byte NOP, which is bad especially for the tiny hardware loop... OK, OK, assembly may do the work, I agree... partially agree.
An assembly macro may do the job... as for the compiler, it would have to deal with the cc clobber condition... I should give it a try...
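A minimal sketch of such a macro in GNU as syntax (untested on my side; PAD_NOPS is a made-up name, and the alignment requirements of the hardware loop still have to be checked against the docs):

```asm
/* Pad with n 16-bit NOPs, e.g. to align a hardware-loop body. */
.macro PAD_NOPS n
.rept \n
    nop
.endr
.endm

    PAD_NOPS 3      /* emits three 2-byte NOPs = 6 bytes of padding */
```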
Did anybody try to compile gcc 5 or 6.1? Any change?
... Paralle2 = 69 Mn/s anyway, in progress