Requests for improving the C compiler

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

Re: Requests for improving the C compiler

Post by Olaf » Wed May 18, 2016 9:01 pm

DonQuichotte wrote:- if successful I'll write a tutorial for coding in Epiphany assembly

I guarantee you that this will make me very happy.

Writing assembler has become a lost art, but back in my day I wrote code natively in assembler, and when it is done right you basically build your own language from scratch.

Ten years ago, when I did image processing, I learned to modify my C++ code in such a way that it produced even faster code.
I am not familiar with RISC processors, but in C++ the first 4 parameters you passed to a C function were stored in 4 registers; the fifth ended up on the stack. Anything on the stack costs time because it has to go to RAM and be read back later.
The trick was to avoid functions that need more than 4 parameters when they are intended for speed.
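
As a hedged illustration of the parameter-count trick: one common way to stay under the register limit is to bundle the arguments into a struct and pass a single pointer. The struct and function names below are hypothetical, the number of arguments that travel in registers depends on the ABI, and whether this is actually faster has to be measured on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument bundle; all names here are illustrative. */
typedef struct {
    const uint8_t *src;   /* input pixels */
    uint8_t *dst;         /* output pixels */
    size_t width;         /* pixels per line */
    size_t height;        /* number of lines */
    size_t stride;        /* bytes from one line to the next */
    uint8_t threshold;    /* cut-off value */
} threshold_args;

/* Takes one pointer instead of six separate arguments. */
void threshold_image(const threshold_args *a)
{
    for (size_t y = 0; y < a->height; ++y) {
        const uint8_t *in = a->src + y * a->stride;
        uint8_t *out = a->dst + y * a->stride;
        for (size_t x = 0; x < a->width; ++x)
            out[x] = (in[x] >= a->threshold) ? 255 : 0;
    }
}
```

The trade-off is that the fields are now read through a pointer, so this only pays off when the call overhead matters more than those loads.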

Over time I did not have to resort to assembler anymore to get the best speed. I learned how to write the C methods in such a way that they always produced optimal assembler output.

One trick I learned was that you can help the compiler understand what you are trying to do by creating additional variables.
The key here is that when "YOU" decide that this must be this type of pointer, the compiler follows your guideline instead of guessing.
That additional temporary variable gives a lot of developers the wrong impression that it does more work, while in reality the compiler would have created that temporary variable anyway, only optimized worse because it chose a more generic approach.
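
A minimal sketch of what such an explicit temporary can look like, assuming a hypothetical 16-bit image struct (the type and function names are not from the post). The row pointer and the width are each computed once and given a precise type, instead of being re-derived from the struct inside the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image descriptor, for illustration only. */
typedef struct {
    uint16_t *data;
    size_t width;    /* pixels per line */
    size_t height;   /* number of lines */
    size_t stride;   /* elements from one line to the next */
} image16;

uint32_t row_sum(const image16 *img, size_t y)
{
    /* Explicit temporaries: the type and the address are decided here, once. */
    const uint16_t *row = img->data + y * img->stride;
    const size_t width = img->width;

    uint32_t sum = 0;
    for (size_t x = 0; x < width; ++x)
        sum += row[x];
    return sum;
}
```

A modern optimizer may well produce the same code without the temporaries, so this is worth checking in the generated assembly rather than assuming.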

When I find more time, maybe I will try to explain how you can build faster C code yourself instead of relying on compiler optimizations.
I recall that when I had a for loop nested inside another for loop, I wrote it in such a way that there was no second loop; it was replaced by a pointer that I manipulated myself.
That pointer manipulation again prevented the compiler from creating temporary stack values.
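
The loop-flattening idea can be sketched as follows, assuming a tightly packed 8-bit image where one line directly follows the previous one. Both function names are hypothetical; the first is the plain nested version, the second walks a single pointer over the whole buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: two nested loops and an index computation per pixel. */
void invert_nested(uint8_t *pixels, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            pixels[y * width + x] = (uint8_t)(255 - pixels[y * width + x]);
}

/* Flattened version: one loop, one pointer that is advanced by hand. */
void invert_flat(uint8_t *pixels, size_t width, size_t height)
{
    uint8_t *p = pixels;
    uint8_t *const end = pixels + width * height;
    while (p != end) {
        *p = (uint8_t)(255 - *p);
        ++p;
    }
}
```

This only works as written when the lines are contiguous; with padded lines the pointer has to skip the padding at the end of each line.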

I also had special methods that were designed for special cases, such as when it was divisible by 2. If it was not divisible by 2, I called the original (slower) function.
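
A hedged sketch of that dispatch pattern, using a byte sum as a stand-in workload (the post does not say which operations were specialized, and all names are hypothetical). The even-length variant handles two elements per iteration; every other length falls back to the generic routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic (slower) routine: works for any length. */
uint32_t sum_generic(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += p[i];
    return sum;
}

/* Special case: length divisible by 2, two elements per iteration. */
uint32_t sum_even(const uint8_t *p, size_t n)
{
    assert(n % 2 == 0);
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i += 2) {
        a += p[i];
        b += p[i + 1];
    }
    return a + b;
}

/* Dispatcher: pick the special case when it applies, else the generic one. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    return (n % 2 == 0) ? sum_even(p, n) : sum_generic(p, n);
}
```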

Of course, I have no experience yet with Parallella development, or even with Linux development. It all depends on whether I can find time to toy with it.
Posts: 37
Joined: Sun May 08, 2016 8:47 pm

Re: Requests for improving the C compiler

Post by DonQuichotte » Thu May 19, 2016 7:33 am

Thanks Olaf

Yes, I do some tasks myself instead of letting the C compiler guess the whole thing; it may help.
It is a great help with structures aligned on powers of 2, etc.

Here is my answer to the "perfectly aligned 8-byte code" :D - the long version, one heavily commented C file lol

#include "e-lib.h" // mandatory even for a minimalist design -- e_get_coreid(), e_read(), e_write()

/*
 * DonQuichotteComputers (at) gmail (dot) com: 2016/05/19 testing whether we need a better code align(8) macro
 * There are 4 cases for align(8): we need either 0, 2, 4 or 6 bytes of padding
 * Tested with SDK 2015.1 on a Fedora Core 23
 * Compile with:
 * e-gcc -T ${ELDF} -O0 src/testalign.c -o testalign.elf -le-lib
 * with *your* ELDF path, something like:
 * ELDF=/home/ylav/dev/parallella/buildroot/esdk.2015.1/bsps/current/internal.ldf
 * Trace with:
 * e-run -t testalign.elf
 */


void finalsolution(void);

int main(void) {
// e_coreid_t coreid;
int row, col, cmdI;
int fn1, fn2;

#define ALIGN(x) __attribute__ ((aligned (x)))

int ALIGN(8) var=3;
asm volatile (".balignw 8, 0x01a2"); // (1) 0x01a2 is the 2-byte NOP; in theory e-as pads up to the 8-byte boundary with these 2-byte sequences
asm volatile (".balignw 8, 0x01a2"); // (2) already aligned => 0 bytes missing => nothing to pad, e-as handles this
asm volatile ("gid"); // (3) 2 bytes used => 6 bytes missing => 3 NOPs
asm volatile (".balignw 8, 0x01a2"); //
asm volatile ("mov r41, #3");
asm volatile (".balignw 8, 0x01a2"); // (4) 4 bytes used => 4 bytes missing
asm volatile ("gie");
asm volatile ("add r41, r41, #4"); // (5) 6 bytes used => 2 bytes missing
asm volatile (".balignw 8, 0x01a2");
asm volatile ("gid");
asm volatile (".balignw 8"); // (6) no 2nd parameter? the 'as' doc says it pads with '0'; what does e-as do?
asm volatile ("mov r62, #0xDEAF");

asm volatile ("mov r63, r63");

return var;
}

int testalign4(void) {
int ALIGN(8) var=3;
// we'll try fcef fc02 = mov r63,r63 => no updated flag, no change, minimal one-instruction penalty
// bad syntax // asm volatile (".balign 8, 0xfceffc02");
// bad syntax // asm volatile (".p2alignl 3, 0xfcef, 0xfc02");
// bad syntax // asm volatile (".p2alignl 3, 0xfcef 0xfc02");
// bad syntax // asm volatile (".p2alignl 3, 0xfceffc02");
// bad syntax // asm volatile (".p2alignl 3, 0xfceffc02UL");
// grep parallella_examples with p2align... nothing
// lots of imagination, huh ? ... bad syntax though // asm volatile (".p2alignl 3, .byte 0xfc, .byte 0xef, .byte 0xfc, .byte 0x02");
// what I "love" with the 'as' documentation is the lack of a simple example for .p2alignl: it would have been too simple. Trial and error is much more 'fun' probably ^^
// or maybe they work for google... since we have to find the syntax somewhere... my bad, I have no internet at home ; I am a knight errant...
// grep parallella_examples with align... too many results ; some .align before function names... not my concern ; some mysterious ".balign 4" ?!
// hey, this .balign is also documented in the 'as' documentation ! ... but it's just a copy/paste of the .p2align, no .balignl explained :'(
// grep parallella_examples with balign... .balign 4, no explanation ; and .balignw 8,0x01a2 exclusively... nothing new

// O_O bingo ! asm volatile (".balignl 8, 0xfc02fcef"); it is a 4-byte NOP lol at last I can go to bed !

asm volatile (".p2alignl 3"); // it complains but at least compiles :P
asm volatile ("mov r62, #0xABC");
asm volatile (".p2alignl 3"); // expecting 4 '0'
asm volatile ("mov r62, #0xDEF");
asm volatile (".balignl 8, 0xfc02fcef");
/* yes, as expected:
86e: 01a2 nop
870: d78b e0a2 mov r62,0xabc
874: 0000 beq 874 <_testalign4+0x14>
876: 0000 beq 876 <_testalign4+0x16>
878: ddeb e0d2 mov r62,0xdef
87c: fcef fc02 mov r63,r63 // yes ! my beloved 4-byte NOP !
*/

asm volatile ("mov r62, #0xBED"); // go to bed... I deserved it
// OK... the fill pattern ('patron de remplissage') may be longer than 4 bytes, good news :) asm volatile (".balignl 8, 0xfc02fcef01a201a2");

// OK... eureka... 2 steps will do the task: first, 4-byte align ; second, 8-byte align
// lack 0 => 1st step = nothing, 2nd step = nothing      => 0 op, optimal
// lack 2 => 1st step = NOP,     2nd step = nothing      => 1 op, optimal
// lack 4 => 1st step = nothing, 2nd step = mov r63, r63 => 1 op, optimal
// lack 6 => 1st step = NOP,     2nd step = mov r63, r63 => 2 op, optimal


return var;
}

// we test the expected solution to a perfectly optimized 8-byte code alignment
#define PERFECT_ALIGN8 asm volatile (".balignw 4, 0x01a2"); asm volatile (".balignl 8, 0xfc02fcef");

void finalsolution(void) {
asm("mov r62,r62");
asm("mov r61,r61");
asm("B 0xBED");
}

/*
00000800 <_main>:
800: 775c 2700 str fp,[sp],-0x6
804: 74ef 2402 mov fp,sp
808: 0063 mov r0,0x3
80a: 0e5c 0400 str r0,[fp,+0x4]
80e: 01a2 nop // (1) OK, align 8, 1 NOP as expected
810: 0392 gid // (2) OK, 'as' is smart, nothing done as expected since we are already 8-byte aligned
812: 01a2 nop // (3) OK, 3 NOPs :/ suboptimal, don't you think so ?
814: 01a2 nop
816: 01a2 nop
818: 206b a002 mov r41,0x3
81c: 01a2 nop // (4) OK, 2 NOPs
81e: 01a2 nop
820: 0192 gie
822: 261b b400 add r41,r41,4
826: 01a2 nop // (5) OK, 1 NOP
828: 0392 gid
82a: 01a2 nop
82c: 0000 beq 82c <_main+0x2c> // (6) e-as complains that the fill pattern is missing ('patron de remplissage', nice French translation by the way :D). Yes, these are zeroes, "BEQ Program Counter"...
82e: 0000 beq 82e <_main+0x2e> // This is harmless here: the Z flag is clear, so each beq to its own address is not taken and execution falls through to the next instruction, as e-run will confirm
830: d5eb ede2 mov r62,0xdeaf
834: 0e4c 0400 ldr r0,[fp,+0x4]
838: 774c 2400 ldr fp,[sp,+0x6]
83c: b41b 2403 add sp,sp,24
840: 194f 0402 rts
844: 0000 beq 844 <_main+0x44>
*/

/* e-run -t testalign.elf
0x000800 --- _main str fp,[sp],-0x6 - memaddr <- 0x7ff0, memory <- 0x0, registers <- 0x7fd8
0x000804 --- _main mov fp,sp - registers <- 0x7fd8
0x000808 --- _main mov.b r0,0x3 - registers <- 0x3
0x00080a --- _main str r0,[fp,+0x4] - memaddr <- 0x7fe8, memory <- 0x3
0x00080e --- _main nop -
0x000810 --- _main gid - gidisablebit <- 0x1
0x000812 --- _main nop -
0x000814 --- _main nop -
0x000816 --- _main nop -
0x000818 --- _main mov.l r41,0x3 - registers <- 0x3
0x00081c --- _main nop -
0x00081e --- _main nop -
0x000820 --- _main gie - gidisablebit <- 0x0
0x000822 --- _main add.l r41,r41,4 - cbit <- 0x0, vbit <- 0x0, vsbit <- 0x0, registers <- 0x7, zbit <- 0x0, nbit <- 0x0
0x000826 --- _main nop -
0x000828 --- _main gid - gidisablebit <- 0x1
0x00082a --- _main nop -
0x00082c --- _main beq.s 0x000000000000082c -
0x00082e --- _main beq.s 0x000000000000082e -
0x000830 --- _main mov.l r62,0xdeaf - registers <- 0xdeaf
0x000834 --- _main ldr r0,[fp,+0x4] - memaddr <- 0x7fe8, registers <- 0x3
0x000838 --- _main ldr fp,[sp,+0x6] - memaddr <- 0x7ff0, registers <- 0x0
0x00083c --- _main add.l sp,sp,24 - cbit <- 0x0, vbit <- 0x0, vsbit <- 0x0, registers <- 0x7ff0, zbit <- 0x0, nbit <- 0x0
0x000840 --- _main jr lr - pc <- 0x6d8
*/

/*
000008ac <_finalsolution>:
8ac: 765c 2700 str fp,[sp],-0x4
8b0: 74ef 2402 mov fp,sp
8b4: fcef fc02 mov r63,r63
8b8: 0392 gid
8ba: 01a2 nop
8bc: fcef fc02 mov r63,r63 // 6 bytes, 2 op => success
8c0: d8ef fc02 mov r62,r62
8c4: fcef fc02 mov r63,r63 // 4 bytes, 1 op => success
8c8: 0192 gie
8ca: b4ef fc02 mov r61,r61
8ce: 01a2 nop // 2 bytes, 1 op => success
8d0: f6e8 0005 b 14bc <__HALF_BANK_SIZE_+0x4bc> // 0 byte, 0 op => success
8d4: 764c 2400 ldr fp,[sp,+0x4]
8d8: b41b 2402 add sp,sp,16
8dc: 194f 0402 rts
*/

/*
 * My conclusion? Yes, I wanted better handling of 8-byte code alignment for C on Epiphany.
 * I come from an x86 background where gcc and other compilers handle this issue perfectly well.
 * I did not find anything on this subject in the parallella examples or on the forum.
 * From now on, I can use PERFECT_ALIGN8 for my needs and that's my small gift to the Parallella community.
 * We'll talk about Don Quichotte's exploits for centuries, for sure ;)
 * Last words...
 * Some will say it's a "nearly perfect" solution, since updating r63 can create RAW or WAW sequences that prevent dual issue.
 * Nothing prevents you from choosing an unused register, or having a second macro with another register - fp, r28, lr... - to avoid these improbable situations.
 * You can even write a macro that takes the register of your choice as a parameter... I consider this trivial and my problem solved :)
 *
 * Now my next challenge will be a macro for automagically forcing a 4-byte B<cond> instead of the standard 2-byte B<cond>, with -Ofast as usual.
 * It should be a decisive step towards producing exclusively 32-bit instructions, don't ask me why. I guess this challenge will be easier this time :P
 */
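
The padding cost of the two approaches can be checked on a host machine with a small model. This is only a sketch: it assumes code addresses are always 2-byte aligned, that ".balignw 8, 0x01a2" emits one 2-byte NOP per missing halfword, and that the two-step macro emits at most one NOP followed by at most one 4-byte "mov r63,r63", as the listings above show. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Padding instructions emitted by a single ".balignw 8, 0x01a2":
 * one 2-byte NOP for every missing halfword. */
unsigned pad_ops_nop_only(uint32_t addr)
{
    assert(addr % 2 == 0);
    uint32_t missing = (8 - addr % 8) % 8;
    return missing / 2;
}

/* Padding instructions emitted by the two-step approach:
 * ".balignw 4, 0x01a2" followed by ".balignl 8, 0xfc02fcef". */
unsigned pad_ops_two_step(uint32_t addr)
{
    assert(addr % 2 == 0);
    unsigned ops = 0;
    if (addr % 4 != 0) {   /* step 1: one 2-byte NOP */
        addr += 2;
        ++ops;
    }
    if (addr % 8 != 0) {   /* step 2: one 4-byte mov r63,r63 */
        addr += 4;
        ++ops;
    }
    assert(addr % 8 == 0);
    return ops;
}
```

For the address 0x812 from the main listing, the NOP-only padding costs 3 instructions while the two-step padding costs 2; for 0x81c the counts are 2 and 1.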
Posts: 46
Joined: Fri Apr 29, 2016 9:58 pm

Re: Requests for improving the C compiler

Post by Olaf » Sat May 21, 2016 12:36 am

I am not familiar with RISC and RISC assembly yet, but I am learning. So I do not understand your code completely :-)
But I see that you also use the tricks that I used to get faster than the Intel C++ compiler (back in 2001-2006) and their special image processing library.

Back in my x86 image processing days I also heard about byte, word, dword, ... alignment to prevent stalls when loading from memory.
But back in those days I could speed things up by 10% on the CPU by aligning data to the cache lines, which were 32 bytes wide.
Back in those days I was literally hitting the limits of the CPU caches, and I had to bypass them to be better than the competition ;-)

Now if you load a one-dimensional data stream, then a single compiler alignment would be OK.
However, a two-dimensional data stream, such as an image whose line size is not divisible by 32 bytes, would use the CPU cache less efficiently.
I increased image processing speed by making sure that every image line started on a 32-byte boundary.

Example: x = fill byte that makes sure "A" lands on a 32-byte boundary matching the CPU cache line.

Source 2D image:

Pre-processed 2D image, ready for processing:

In memory this 2D image would be represented like this (single dimension):

Of course the code became more complex, but I optimized the data transfer from RAM to the CPU cache.
No C++ compiler optimization was smart enough to give that result.
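
The row padding described above comes down to rounding each line length up to a multiple of the cache line size. A minimal sketch, assuming 8-bit pixels, a 32-byte cache line as in the post, and zero as the fill byte. The function names are hypothetical, and the caller is assumed to supply a destination buffer that itself starts on a 32-byte boundary (for example from C11 aligned_alloc).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32u   /* assumption: 32-byte cache lines */

/* Round a line length in bytes up to the next multiple of CACHE_LINE. */
size_t padded_stride(size_t width_bytes)
{
    return (width_bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Copy a tightly packed image into dst so that every line starts at a
 * multiple of CACHE_LINE bytes from the start of dst. The gap at the end
 * of each line is filled with zeros. dst must hold at least
 * padded_stride(width) * height bytes. Returns the stride that was used. */
size_t pad_rows(uint8_t *dst, const uint8_t *src, size_t width, size_t height)
{
    assert(dst != NULL && src != NULL);
    const size_t stride = padded_stride(width);
    for (size_t y = 0; y < height; ++y) {
        uint8_t *line = dst + y * stride;
        memcpy(line, src + y * width, width);
        memset(line + width, 0, stride - width);
    }
    return stride;
}
```

Processing code then steps from line to line by the returned stride instead of by the image width.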

I did not reach these conclusions by logical deduction; I reached them by measuring.
Posts: 37
Joined: Sun May 08, 2016 8:47 pm

Re: Requests for improving the C compiler

Post by DonQuichotte » Sat May 21, 2016 11:30 pm

What kind of competition? Did you code for video games or graphics?

Anyway, I started new topics "Assembly class" and "Assembly snippets" in the "Assembly" forum; you're welcome, Olaf :)
Posts: 46
Joined: Fri Apr 29, 2016 9:58 pm

Re: Requests for improving the C compiler

Post by Olaf » Sun May 22, 2016 12:01 pm

DonQuichotte wrote: What kind of competition ? did you code for video games or graphics ?

Anyway. I started new topics "Assembly class" and "Assembly snippets" in the "Assembly" forum ; you're welcome Olaf :)

I developed (monochrome) image processing code for scientific research :-)
Best times of my life, but that ended 10 years ago.

Designing software that was faster than the competition to process scientific results made the difference between selling the scientific equipment or not.

I really love assembler. For me it is not harder to program than C code. But RISC processors are completely new to me.
Posts: 37
Joined: Sun May 08, 2016 8:47 pm

