Assembly class HERE

Postby DonQuichotte » Sat May 21, 2016 10:26 pm


I'm opening this topic to teach Epiphany assembly, interfaced to C, to whoever wants to learn.
Teaching and being taught: I am a beginner too, and there are lots of questions I still cannot answer.
I think teaching will reinforce my own learning and, hopefully, help a few others.

I intend to start next week.
The subject is "writing recursive functions in assembly".
:o OK I know... some of you are already whispering that's two hard problems...
I agree, :roll: you'd better NOT begin with this topic :mrgreen:
As Andreas said, and I mostly agree: there is a lot of high-quality assembly code in the Parallella examples.
I particularly appreciated e_fft_asm.S... so happy reading :) and you're welcome any time.

Why or when should we write in assembly?
Apart from speed concerns: never. It's not productive, not portable, not readable, etc.
Speed... when I bought the Parallella, speed (per watt) was my only expectation.
So if I can cut the execution time by 10 %, I'm in.
The fact is, the assembly version we'll see next week is about 25 % smaller than the C compiler's output.

And I am quite convinced everybody should "think assembly" at least once to improve their C source code:
looking at the C compiler's output helps you reorganize data more efficiently or find out where the output is not optimal yet (it's a robust but young compiler).
Look at the __builtin_ctz output for example, or think about a 64-bit addition.
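
To make the 64-bit addition hint concrete: with 32-bit registers, a 64-bit add has to be split into two 32-bit adds plus a carry. The sketch below is a small Python model of that arithmetic only. It is not Epiphany code and not the compiler's actual output; the name add64 and the (low, high) argument layout are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # width of one 32-bit register


def add64(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values held as (low, high) pairs of 32-bit words."""
    lo = (a_lo + b_lo) & MASK32
    # If the low sum wrapped around, it is smaller than the operand: carry = 1.
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


if __name__ == "__main__":
    lo, hi = add64(0xFFFFFFFF, 0x00000000, 0x00000001, 0x00000000)
    print(hex(hi), hex(lo))
```

Counting the steps (add, carry test, add, add) gives an idea of what to look for in the compiler's output for the same operation.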

That's why I'm opening a second topic, let's call it "Assembly snippets".
It may give hints for the compiler guys, or we can simply integrate these snippets into our own source files.

Homework ;) for next week:
- read the docs if not done - especially the instruction set in the architecture reference on
Recommended: install esdk 2015.1 on your regular computer if possible (the Linux x86-64 esdk for example). Even if it takes several hours to get,
you'll have nice tools to debug things faster than on the Parallella itself.
The esdk doc is fairly good.

Re: Assembly class HERE

Postby Olaf » Sun May 22, 2016 12:41 pm

I think it would be interesting to have an explanation about the RISC used in Parallella.
I have no background in RISC

Parallella Desktop model
So it has a Zynq SoC (FPGA + ARM A9)
and an Epiphany III processor.

This ARM A9 is a Cortex-A9 processor:
The ARM A9 is basically a 32-bit ARMv7-A core with some extras added to it.

The Epiphany III is a second processor, a coprocessor.
It is also a RISC architecture, but it has a different assembler instruction set from the ARM A9.

The dual-core ARM A9 in the Parallella is used to run Linux.
Linux itself does not use the Epiphany III.
So it is the job of the developer to create custom code for the Epiphany III RISC, to be uploaded to and executed on each Epiphany core.

Re: Assembly class HERE

Postby ImperialTurbineSaint » Sun May 22, 2016 11:24 pm

Just got my Parallella running today, and I am definitely interested in the assembly side of things. What I would love to do is build a run-time translator from x86/x86-64 assembly to Epiphany assembly, so foreign programs could be used (i.e. run anything that can fit and run).

Edit: Sorry for the confusion DonQuichotte, writing isn't my strongest skill and I often make simple mistakes of perspective. So let me try again. I want to take code that was written for a Windows computer and turn it into an Epiphany equivalent, i.e. take a program and re-assemble it for use on the Epiphany, independent of its original platform (e.g. take Tetris made on a Windows computer and convert it to run on the Parallella without Windows).
Last edited by ImperialTurbineSaint on Mon May 23, 2016 5:39 pm, edited 1 time in total.

Re: Assembly class HERE

Postby DonQuichotte » Mon May 23, 2016 11:04 am

Thanks Olaf for this nice overview :D

Hi ImperialTurbineSaint :)
I'm not sure I understood your wish.
For sure, the Parallella cannot be seen as a coprocessor; it's rather an "SBC", a Single Board Computer.
Between your regular CPU and the Parallella there is a gap; the Porcupine may help, but that's another topic.
Maybe you were talking about the ARM... please explain again, I don't understand.

That said, I've changed my mind: I'll explain the LDR/STR instructions before anything else.
They are an important foundation of Epiphany assembly.


(Apart from DMA) you cannot operate directly on memory:
everything must be done in the r0-r63 registers.
r0 to r63 are basically your 32-bit "work registers".
That's good news for the x86 guys: it's much simpler than AH/AL/AX/EAX/RAX/BH/.../R15/etc. :)

Imagine you want to increment a value in memory, here is an example:

Code: Select all
ldr r0, [ mydata ] // LoaD mydata into Register r0
add r0, r0, #1     // better place a '#' on each immediate value 
str r0, [ mydata ] // STore Register r0 to mydata

Did you notice the general syntax, if like me you come from an x86 background?
Overall it's Intel notation, which is nice because most people prefer it to AT&T:
Code: Select all
  <operation> <unique_dest> <src>

Here, str is the only notable exception in the whole instruction set, I think.
An x86 guy would expect to write
Code: Select all
  ldr r0, [ mydata ] // OK
  str [ mydata ], r0 // KO

Maybe it comes from ARM or is made to look like ARM...
maybe it was made to emphasize that memory operations are "special"?!
Well, I don't like this syntax exception but we have to live with it. It's a detail anyway.

Now let's have an overview of the 5 ldr flavours that we can find:

Code: Select all
// basic load from memory to regA
ldr regA, [regB]

// load with indexing or displacement
ldr regA, [regB, regIdx]
ldr regA, [regB, #(-)IMMx]

// load with post-modify
ldr regA, [regB], regPostModify
ldr regA, [regB], #(-)IMMx

From this you immediately see you can read/write up to 3 registers:
- regA is the mandatory dest (destination)
- regB is a mandatory src (source) and it is possible to update it with the post-modify syntax
- regIdx may complement regB, it's an index register
- regPostModify will update regB *after* the normal load instruction - hey it's called "PostModify"...
Here is an example:
Code: Select all
  ldr r0, [ r1 ], #-1
  // does the same as those 2 operations:
  ldr r0, [ r1 ]  // x86 taste: mov eax, dword[ebx]
  sub r1, r1, #4  //            sub ebx, 4

It's time to talk about the operand size.
If you weren't falling asleep, the previous example should have shocked you: in other words, "r1 = r1 - 4;", although the immediate was #-1.
Yes indeed, a common mistake for beginners like us comes from the different behaviour of IMMx operands.
<!> When handling registers, their raw value is used, nothing more.
<!> When handling IMMx values, their raw value is multiplied by the operand's byte size.

Code: Select all
  ldr r0, [ r1 ], #-1 // r0=[r1]; r1-=4;

is different from
Code: Select all
  mov r2, #1
  ldr r0, [ r1 ], r2  // r0=[r1]; /* bug is coming ! */ r1+=1;

In the latter case, another ldr r0, [ r1 ], r2 will throw a misalignment exception
- since Epiphany expects a 4-byte alignment for loading (and storing) 4-byte values.
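
The two rules above (scaled immediate, raw register value) can be played with in a tiny Python model. This is only a sketch of the arithmetic described in this post, not an emulator: ToyCore, MisalignedAccess and the method names are hypothetical, and only 4-byte ldr is modelled.

```python
WORD_BYTES = 4  # operand size of a plain ldr


class MisalignedAccess(Exception):
    """Raised by the model when a word load is not 4-byte aligned."""


class ToyCore:
    """Hypothetical model: 64 registers and a few bytes of local memory."""

    def __init__(self, mem_bytes=64):
        self.r = [0] * 64
        self.mem = bytearray(mem_bytes)

    def _load_word(self, addr):
        if addr % WORD_BYTES:
            raise MisalignedAccess(hex(addr))
        return int.from_bytes(self.mem[addr:addr + WORD_BYTES], "little")

    def ldr_post_imm(self, rd, rn, imm):
        """ldr rd, [rn], #imm : the immediate is scaled by the operand size."""
        self.r[rd] = self._load_word(self.r[rn])
        self.r[rn] += imm * WORD_BYTES

    def ldr_post_reg(self, rd, rn, rm):
        """ldr rd, [rn], rm : the register value is added as-is."""
        self.r[rd] = self._load_word(self.r[rn])
        self.r[rn] += self.r[rm]


if __name__ == "__main__":
    core = ToyCore()
    core.r[1] = 8
    core.ldr_post_imm(0, 1, -1)   # r1 goes from 8 to 4
    core.r[2] = 1
    core.ldr_post_reg(0, 1, 2)    # r1 goes from 4 to 5: no longer aligned
    try:
        core.ldr_post_reg(0, 1, 2)
    except MisalignedAccess as exc:
        print("misaligned load at", exc)
```

The first register-based load still works; it is the next one that faults, which is why this kind of bug shows up one iteration late.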

:twisted: Hey ! Where is this operand size, I cannot see anything ! ldr loads 32-bit values only !
:) correct.
    ldrb -> 1-byte load
    ldrs -> 2-byte load
    ldrw -> 4-byte load
    ldrd -> 8-byte load
ldr is 32-bit by default, B/S/W/D are the suffixes we'll see again with str family.

:cry: Hey ! What's this damn IMMx you talk about ?
:) (skipping an implementation detail of the encoding, it's basically an...) ...11-bit unsigned IMMediate, for indexing or post-modify. Remember you can negate it with "-".
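
Taking the 11-bit magnitude above at face value, and the B/S/W/D sizes listed earlier, the reachable byte offsets can be worked out. A minimal sketch; reach and immediate_for are names invented here, and the shorter encodings hinted at as "an implementation detail" are ignored.

```python
SIZE = {"b": 1, "s": 2, "w": 4, "d": 8}  # suffix -> operand size in bytes
IMM_BITS = 11                            # magnitude width quoted in this post


def reach(suffix):
    """Lowest and highest byte offset reachable with a scaled immediate."""
    largest = ((1 << IMM_BITS) - 1) * SIZE[suffix]
    return -largest, largest


def immediate_for(byte_offset, suffix):
    """Return the #imm to write for a wanted byte offset, or raise ValueError."""
    size = SIZE[suffix]
    if byte_offset % size:
        raise ValueError("offset is not a multiple of the operand size")
    imm = byte_offset // size
    if abs(imm) >= (1 << IMM_BITS):
        raise ValueError("offset is out of immediate range")
    return imm


if __name__ == "__main__":
    for suffix in SIZE:
        print(suffix, reach(suffix))
    print(immediate_for(-4, "w"))
```

So under this assumption a word access with an immediate reaches a bit less than 8 KB on either side of the base register; anything further needs a register index.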

:o Hey ! There's an 8-byte operand size ! Cool !
:) Agreed. It's essential for 64-bit "push" and "pop" stack operations.
Just remember Epiphany wants strict alignment for ANY item size you handle:
    - 2-byte array? Align it on 2 bytes or you get an exception
    - 4-byte array? Align it on 4 bytes or you get an exception
    - 8-byte array? Align it on 8 bytes or you get an exception
Likewise, ldrd/strd must use an EVEN register to load or store data:
Code: Select all
  ldrd r0, [ r2 ] // OK, even register
  strd r0, [ r2 ] // OK, even register
  ldrd r1, [ r2 ] // KO, odd  register
  strd r1, [ r2 ] // KO, odd  register

If you have the arch. doc REV 14.03.11, ignore the typo on page 104: <size> Byte(B), Half(H), Word(), or Double(D)
and read instead: <size> Byte(B), Short(S), Word(W), or Double(D)
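
The two ldrd/strd constraints (even register, 8-byte aligned address) can be written down as a small Python model. Assumptions not stated in this post: the 8 bytes go to the register pair rd/rd+1, and the low word lands in the even register. The function names simply mirror the mnemonics; this is a sketch, not the hardware's definition.

```python
def _check(rd, addr):
    if rd % 2:
        raise ValueError("ldrd/strd need an even register")
    if addr % 8:
        raise ValueError("8-byte access must be 8-byte aligned")


def ldrd(regs, mem, rd, addr):
    """Model of 'ldrd rd, [addr]': fills rd and rd+1 (assumed pairing)."""
    _check(rd, addr)
    regs[rd] = int.from_bytes(mem[addr:addr + 4], "little")          # low word
    regs[rd + 1] = int.from_bytes(mem[addr + 4:addr + 8], "little")  # high word


def strd(regs, mem, rd, addr):
    """Model of 'strd rd, [addr]': writes rd and rd+1 (assumed pairing)."""
    _check(rd, addr)
    mem[addr:addr + 4] = regs[rd].to_bytes(4, "little")
    mem[addr + 4:addr + 8] = regs[rd + 1].to_bytes(4, "little")


if __name__ == "__main__":
    regs, mem = [0] * 64, bytearray(16)
    regs[0], regs[1] = 0x55667788, 0x11223344
    strd(regs, mem, 0, 8)
    ldrd(regs, mem, 2, 8)
    print(hex(regs[2]), hex(regs[3]))
```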

:) Hey! Will you talk about the stack? What about push... pop... sp/esp/rsp... bp...
Not much to say. I just wanted to talk about ldr/str. I'll say the bare minimum for our purpose (recursive functions, remember?)

    - The stack starts at the end of the local memory, 0x7FFC iirc.
    - It grows downwards.
    - You're responsible for its size.
    - You're responsible for its alignment. We should always target 8-byte alignment to benefit from the fast 8-byte ldrd/strd instructions.
    - The register sp is a synonym for r13. We need a register for the stack, so let's respect the current EABI implementation here.
    - There is no push/pop: it's just ldr<SIZE>/str<SIZE>
    - fp is, for example, for debugging purposes; I don't care about it, I just need speed
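
Since there is no push/pop, here is one way the points above fit together, as a Python model of a descending stack with 8-byte slots. It is a sketch under assumptions: the class name is invented, the top address 0x7FF0 is picked only because it is 8-byte aligned, and "store at sp, then move sp down by 8" is just one possible convention (a store with a post-modify of #-1 on an 8-byte operand), not necessarily the EABI's.

```python
class DescendingStack:
    """Hypothetical 8-byte-slot stack; sp always stays 8-byte aligned."""

    def __init__(self, top=0x7FF0):
        if top % 8:
            raise ValueError("stack top must be 8-byte aligned")
        self.sp = top
        self.mem = {}  # address -> 64-bit value

    def push(self, value64):
        self.mem[self.sp] = value64  # store at sp ...
        self.sp -= 8                 # ... then post-modify: -1 * 8 bytes

    def pop(self):
        self.sp += 8                 # step back to the last stored slot
        return self.mem.pop(self.sp)


if __name__ == "__main__":
    stack = DescendingStack()
    stack.push(0x1111)
    stack.push(0x2222)
    print(hex(stack.sp), hex(stack.pop()), hex(stack.pop()), hex(stack.sp))
```

Each recursion level that saves one register pair costs exactly 8 bytes in this model, which makes it easy to budget the stack size yourself.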


What is important to keep in mind ? Let's sum up this LDR/STR lesson.

    - no memory to memory operations (DMA is another topic) ; use the r0-r63 registers
    - strict data alignment
    - # notation is a good habit for immediates
    - Intel syntax except str <src>, [ <dest> ]
    - apart from the basic ldr/str syntax, you can add the indexing syntax OR the post-modify syntax, NOT BOTH
    - whenever you use IMMx indexing or post-modify, the effective added offset is (IMMx * operand_byte_size),
    with the B/S/W/D suffixes standing for 1/2/4/8 byte(s): Byte/Short/Word/Double
    - whenever you use register-based indexing or post-modify, the effective added offset is the raw value of the register
    - no push/pop, but ldrd/strd; an 8-byte stack alignment should be kept for performance reasons
    - RISC simplicity :)
    - Epiphany is great :D

Any typos... mistakes... questions? They are all welcome.

Re: Assembly class HERE

Postby aolofsson » Mon May 23, 2016 2:41 pm

Great writeup!

Re: Assembly class HERE

Postby DonQuichotte » Thu May 26, 2016 3:24 pm


I've reached 81 Mn/s with full assembly... that's the good news.

The fresh update is under
and all the new files under the *2016...05...26...tar...gz archive

First, you should read the README.
I'm open to any question.
I've included several versions of the C and S files: I think assembly should be produced in several steps.

Last edited by DonQuichotte on Mon May 30, 2016 9:44 am, edited 1 time in total.

Re: Assembly class HERE

Postby MiguelTasende » Fri May 27, 2016 12:53 pm

- The stack starts at the end of the local memory, 0x7FFC iirc.
- It grows downwards.

One curious detail: I think it doesn't strictly grow "downwards", but "mostly downwards". That could be important to know if you're running out of memory... I was, and wrote a small program to watch the stack behaviour at runtime. It jumps downwards, then goes upwards until it covers its jump, then downwards again, or so...
Also, to test it I used a recursive function (which is more "on topic" ;) )

Great Assembly tutorial, btw :)

The result is below (this stack was configured to begin at 0x3FF0). What is printed is the address of a variable defined inside a recursive function. The function calls itself 100 times. Just an interesting fact.
Code: Select all
      FuncionR(100): 0x3fc0
      FuncionR(99): 0x3fc8
      FuncionR(98): 0x3fcc
      FuncionR(97): 0x3f98
      FuncionR(96): 0x3fa0
      FuncionR(95): 0x3fa4
      FuncionR(94): 0x3f70
      FuncionR(93): 0x3f78
      FuncionR(92): 0x3f7c
      FuncionR(91): 0x3f48
      FuncionR(90): 0x3f50
      FuncionR(89): 0x3f54
      FuncionR(88): 0x3f20
      FuncionR(87): 0x3f28
      FuncionR(86): 0x3f2c
      FuncionR(85): 0x3ef8
      FuncionR(84): 0x3f00
      FuncionR(83): 0x3f04
      FuncionR(82): 0x3ed0
      FuncionR(81): 0x3ed8
      FuncionR(80): 0x3edc
      FuncionR(79): 0x3ea8
      FuncionR(78): 0x3eb0
      FuncionR(77): 0x3eb4
      FuncionR(76): 0x3e80
      FuncionR(75): 0x3e88
      FuncionR(74): 0x3e8c
      FuncionR(73): 0x3e58
      FuncionR(72): 0x3e60
      FuncionR(71): 0x3e64
      FuncionR(70): 0x3e30
      FuncionR(69): 0x3e38
      FuncionR(68): 0x3e3c
      FuncionR(67): 0x3e08
      FuncionR(66): 0x3e10
      FuncionR(65): 0x3e14
      FuncionR(64): 0x3de0
      FuncionR(63): 0x3de8
      FuncionR(62): 0x3dec
      FuncionR(61): 0x3db8
      FuncionR(60): 0x3dc0
      FuncionR(59): 0x3dc4
      FuncionR(58): 0x3d90
      FuncionR(57): 0x3d98
      FuncionR(56): 0x3d9c
      FuncionR(55): 0x3d68
      FuncionR(54): 0x3d70
      FuncionR(53): 0x3d74
      FuncionR(52): 0x3d40
      FuncionR(51): 0x3d48
      FuncionR(50): 0x3d4c
      FuncionR(49): 0x3d18
      FuncionR(48): 0x3d20
      FuncionR(47): 0x3d24
      FuncionR(46): 0x3cf0
      FuncionR(45): 0x3cf8
      FuncionR(44): 0x3cfc
      FuncionR(43): 0x3cc8
      FuncionR(42): 0x3cd0
      FuncionR(41): 0x3cd4
      FuncionR(40): 0x3ca0
      FuncionR(39): 0x3ca8
      FuncionR(38): 0x3cac
      FuncionR(37): 0x3c78
      FuncionR(36): 0x3c80
      FuncionR(35): 0x3c84
      FuncionR(34): 0x3c50
      FuncionR(33): 0x3c58
      FuncionR(32): 0x3c5c
      FuncionR(31): 0x3c28
      FuncionR(30): 0x3c30
      FuncionR(29): 0x3c34
      FuncionR(28): 0x3c00
      FuncionR(27): 0x3c08
      FuncionR(26): 0x3c0c
      FuncionR(25): 0x3bd8
      FuncionR(24): 0x3be0
      FuncionR(23): 0x3be4
      FuncionR(22): 0x3bb0
      FuncionR(21): 0x3bb8
      FuncionR(20): 0x3bbc
      FuncionR(19): 0x3b88
      FuncionR(18): 0x3b90
      FuncionR(17): 0x3b94
      FuncionR(16): 0x3b60
      FuncionR(15): 0x3b68
      FuncionR(14): 0x3b6c
      FuncionR(13): 0x3b38
      FuncionR(12): 0x3b40
      FuncionR(11): 0x3b44
      FuncionR(10): 0x3b10
      FuncionR(9): 0x3b18
      FuncionR(8): 0x3b1c
      FuncionR(7): 0x3ae8
      FuncionR(6): 0x3af0
      FuncionR(5): 0x3af4
      FuncionR(4): 0x3ac0
      FuncionR(3): 0x3ac8
      FuncionR(2): 0x3acc
      FuncionR(1): 0x3a98
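
The listing above follows a regular pattern: inside each group of three calls the address goes up by 8 and then by 4, and each new group starts 0x28 (40) bytes lower than the previous one. The Python sketch below only reproduces that arithmetic from the printed numbers; the function name is invented, and any explanation of why the compiler lays the frames out like this (for instance by handling three recursion levels per frame) would be a guess that the listing alone cannot confirm.

```python
FIRST_ADDR = 0x3FC0        # address printed for FuncionR(100)
GROUP_STEP = 0x28          # each group of three calls sits 40 bytes lower
OFFSETS = (0x0, 0x8, 0xC)  # position of the variable inside a group


def printed_address(n, depth=100):
    """Address the listing shows for FuncionR(n), per the observed pattern."""
    group, pos = divmod(depth - n, 3)
    return FIRST_ADDR - group * GROUP_STEP + OFFSETS[pos]


if __name__ == "__main__":
    for n in (100, 99, 98, 97, 1):
        print(f"FuncionR({n}): {printed_address(n):#x}")
```

On average that is 40 / 3, so a bit over 13 bytes of stack per recursion level, and the overall direction is indeed downwards.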

Re: Assembly class HERE

Postby DonQuichotte » Fri May 27, 2016 2:25 pm

:) Thanks Miguel
Well that's right, you do what you want with the stack, it may grow upwards if you like.

Epiphany starts with 0x7ff0 or the like, but we can make our own stack(s). All of this is purely conventional, I mean a habit among programmers, some EABI statements...
By the way, I'm afraid I did not respect the current EABI; for sure the chosen registers should be renamed...
but my code is not for the Linux kernel yet, or NASA :lol:

It may seem superfluous to use prefixed variables as I did (what do YOU think?):
    "I_..." // immediates
    "O_..." // offsets
    "R_..." // critical register values
    "S_..." // stackable
As far as I am concerned, writing assembly is time-consuming and it's very easy to make a mistake with recursive functions, hence this methodology.
I wrote an Othello endgame solver a few years ago; bugs can bubble up from unknown depths...
and the Eternity II solver (a backtracker actually) has NO leaf, except one solution among an eternity of nodes.

Dual issue profiling is not yet done... maybe another time ;)
Have a nice weekend everybody

Maybe you could share the full code with us or send it to me, Miguel - I don't understand its behaviour :?

Re: Assembly class HERE

Postby Olaf » Sat May 28, 2016 9:01 pm

DonQuichotte, I see you have been busy.

Deadlines at work, so almost no time for Parallella, and too tired to do mental stuff.
But your Epiphany assembler explanation comes just in time: I was preparing to learn the assembly and did not know where to start yet.

Re: Assembly class HERE

Postby Olaf » Sat May 28, 2016 9:27 pm

I have a question:

On x86 every register has an intended purpose:

* AX --> Accumulator
* BX --> Base
* CX --> Counter
* DX --> Data

How are the RISC registers used? Is there a certain convention for how they are used?

Also the stack is not very clear: r13 is used, but how is something pushed and popped?

