(about my little project Paralle2)
I still believe Parallella is a wonderful development platform.
I was impressed to get a 10x10 Eternity 2 solver running as fast as 50 Mn/s, quite easily actually.
Some CPU cores don't go much faster: the performance is now a good 50 % of one of my high-end CPU cores.
And porting the C code was nothing special at first.
So I expected to write a 2nd tutorial...
I thought I had understood e_read() / e_write(),
since I could use them in my Paralle2 GitHub repo;
but I desperately need a tutorial myself... too many issues currently.
And yes, I've read *many* docs and tested as best I could.
1°) compiler option '-msmall16'
2°) Epiphany compiler's limits
a) support for INTEGERS
b) slow bit routines
c) no automatic optimization of local offsets
d) limited NOP feature
***
1°) compiler option '-msmall16' breaks with everything I've tried.
There's a summary in the zipped 'bug2.7z' I added to my work GitHub
(https://github.com/DonQuichotteComputers/paralle2) - hoping for help!
-msmall16 is not compatible with the SDRAM / eCore library e_lib.h, or most certainly with my brain :/
2°) Epiphany compiler not optimizing as much as it could
a) not enough support for INTEGERS
(from SDK 2015.1 downloaded late March 2016, 7020 headless package, gcc 4.8.2)
Before diving into assembly, I wanted to get the best out of plain C and compiler options.
This is how I learnt about -mshort-calls, -msmall16, -m1reg-r63... all must-haves IMO.
Especially because (did I tell you?) I'm only interested in integers.
And because Paralle2's speed gained one order of magnitude after dropping the 'fast' linker script for the 'internal' one.
'internal' or 'infernal', we have to choose. Paralle2 will be 'internal' or it won't be.
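For the record, the kind of compile line I converge on might look like this (a sketch from my setup: -msmall16 is left out since it breaks for me, see 1°), and the linker-script path will differ on your install):

```shell
# Hypothetical e-gcc invocation combining the flags discussed in this post.
# internal.ldf keeps code and data in on-chip SRAM (the 'internal' layout).
e-gcc -Ofast -mshort-calls -m1reg-r63 -mfp-mode=int \
      -T ${EPIPHANY_HOME}/bsps/current/internal.ldf \
      solver_core.c -o solver_core.elf -le-lib
```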
Well, all in all:
the compiler does a pretty good job with LD/ST/ALU operations, avoiding many stalls... nice work honestly.
And tools like the ESDK, objdump, gcc explorer, notzed's tools... very nice.
It would be very interesting, though, if the compiler could detect code that never uses the FPU.
I've never (never in a month) seen a single generated IALU2 instruction in my code.
-> no automatic detection of 'no FPU code' :/
-> no change with the compiler option -mfp-mode=int ?
-> no change with forcing IALU2 through the ad hoc register O_O
-> not even the easy change of an 'ADD some_regs' into an 'IADD some_regs' to reduce the cycle count o_O
So here we are, integer guys, condemned to a 50 % CPU yield :/
Or... coding in assembly. Not the way Adapteva will attract many people IMO; a rock-solid gcc or clang compiler, yes.
b) slow bit routines
I've read somewhere that gcc does not care about micro-benchmarks, just macro-benchmarks.
They focus on general performance boosts, not negligible local optimizations like, say, bit routines.
Unless there is high demand from a well-known program that is full of bit routines (Paralle2!? ^^), things won't change.
This is particularly true of the x86 'BT' instruction: they haven't really cared, for more than 10 years now.
Fortunately clang behaves well in most cases, and both are free tools.
It's just a fact of life: gcc targets so many platforms that it cannot optimize everything, every time.
Did you know the Xeon Phi added 5,000 new mnemonics... OK, I'll shut up.
All that, to say that I've rewritten __builtin_popcount and __builtin_ctz.
The original ctz was, iirc, something like 38 instructions with one branch after the 19th... wow.
Added to my work Paralle2 GitHub; hopefully nobody sues me lol -- I quote my sources.
c) no automatic optimization of local offsets
There are a lot of useless 'movt rX, 0x0' instructions when loading local-variable offsets, even with the -Ofast compilation option.
OK, there is this -msmall16 option, but it gives me headaches, see 1°).
Couldn't this be done automatically instead?
Since we work with the 'internal' layout,
and since the preceding 'mov' overwrites the whole 32-bit register (from the docs and my own tests),
this frequent 'movt rX, 0x0' is useless: a waste of space and time.
Even forcing a 'data_bank3' section location does not avoid this unoptimized behavior; mov r0, 0x6000 then movt r0, 0x0.
Out of the box, those dummy instructions make up 16 % of the code.
d) limited NOP feature
I come from the x86 world, where every NOP is automatically given any useful length (great work indeed; iirc it works even with an 11-byte encoding).
I see the compiler only uses the 2-byte NOP, which is bad especially for the tiny hardware loop... OK, OK, assembly may do the work, I agree... partially agree.
An assembly macro may do the job... as for the compiler, it would have to deal with the cc clobber condition... I should give it a try...
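A minimal sketch of such a macro in GNU as syntax (untested on my side; PAD_NOPS is a made-up name, and the alignment requirements of the hardware loop still have to be checked against the docs):

```asm
/* Pad with n 16-bit NOPs, e.g. to align a hardware-loop body. */
.macro PAD_NOPS n
.rept \n
    nop
.endr
.endm

    PAD_NOPS 3      /* emits three 2-byte NOPs = 6 bytes of padding */
```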
Did anybody try to compile gcc 5 or 6.1? Any change?
... Paralle2 = 69 Mn/s anyway, in progress