Current status

Moderator: Hoernchen

Postby upcFrost » Tue Apr 04, 2017 4:05 pm

I'll probably use this topic to update the backend status from time to time

So, today I've managed to compile and run the dotproduct demo. Yay. The instruction selection quality is, well, meh. I still can't figure out how to get rid of the load/store pairs generated by the default calling convention. I'll probably need to adjust the frontend a bit so that it at least knows about the number of registers.
Anyway, it works. I'll try to compile the other demos tomorrow. The biggest PITA is the FPU config flag, as setting it requires an additional pass, and that pass is far from perfect.
Current LLVM backend for Epiphany: https://github.com/upcFrost/Epiphany. Commits and code reviews are welcome
upcFrost
 
Posts: 21
Joined: Wed May 28, 2014 6:37 am

Re: Current status

Postby jar » Wed Apr 05, 2017 5:33 am

In general, it would help if you included the commands needed to install the code rather than bullet-point descriptions of them. More specifically, applying the patch was not as simple as it could be.

LLVM requires a recent version of cmake. It seems most package managers are way behind. Have I mentioned I hate cmake? I've now tried to build cmake on three separate platforms without success. I'm running out of steam and will try some other time.

I am not a compiler guy, but I'm interested in this work. In the context of Parallella, a feature I would really like to see in a compiler is the ability to target multiple architectures with a function attribute. This would enable a monolithic code base and move away from the co-design model we have today with Epiphany and other coprocessors. Something like this would be a nice start:

Code: Select all
void __attribute__((targets(ARM,E32))) foo(void* p)
{
   // ...
}
int main(void)
{
   launch_thread_function(ARM,&foo,&args);
   launch_thread_function(E32,&foo,&args);
}
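For context, here is roughly what that attribute would automate, sketched with plain C function pointers. This is a hypothetical stand-in: the `Target` enum and `launch_thread_function` here are invented for illustration (real Epiphany dispatch would go through the e-hal loader); the sketch only mimics the call shape of the proposal above.

Code: Select all
#include <assert.h>

typedef enum { TARGET_ARM, TARGET_E32 } Target;

static int last_target = -1;

/* Hypothetical dispatcher: with the proposed attribute, the compiler
 * would emit one body per target and the runtime would pick one.
 * Here we just record which target was requested and call the
 * function on the host. */
static void launch_thread_function(Target t, void (*fn)(void *), void *args)
{
    last_target = (int)t;
    fn(args);   /* on real hardware the E32 case would load an ELF onto a core */
}

static void foo(void *p) { *(int *)p += 1; }

int main(void)
{
    int x = 0;
    launch_thread_function(TARGET_ARM, &foo, &x);
    launch_thread_function(TARGET_E32, &foo, &x);
    assert(x == 2 && last_target == (int)TARGET_E32);
    return 0;
}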
jar
 
Posts: 212
Joined: Mon Dec 17, 2012 3:27 am

Re: Current status

Postby upcFrost » Wed Apr 05, 2017 11:17 pm

jar wrote:In general, it would help if you included the commands needed to install the code rather than bullet point descriptions of them. More specifically, applying the patch was not as simple as it could be.


Yeah, I know, I'll update it. The reason the patch is outdated is that it was made for LLVM 3.9.1, and I have since migrated to LLVM 4.0.0. I also realised I'll need to make some modifications to Clang: namely, I need to specify the Epiphany target as 32-bit, otherwise an x86_64 clang screws up function calls that use size_t (it uses the native size_t). I'll update the LLVM patch, publish the Clang patch, and update the instructions once I'm finished with this stuff. And yes, I'll make packages available, so don't worry about compiling from scratch.
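To illustrate the mismatch (a minimal host-side sketch, not the actual backend or Clang code): on a 64-bit host clang, size_t is 8 bytes, while on the 32-bit Epiphany target it is 4, so caller and callee disagree on argument width for any call passing a size_t unless the frontend knows the target is 32-bit.

Code: Select all
#include <stdio.h>
#include <stddef.h>
#include <assert.h>

/* On Epiphany (a 32-bit target) size_t is 4 bytes. An x86_64-hosted
 * clang that doesn't know the target is 32-bit lays out calls
 * assuming the native 8-byte size_t, so the two sides of a call
 * disagree on how wide that argument slot is. */
int main(void)
{
    printf("host size_t:     %zu bytes\n", sizeof(size_t));
    printf("epiphany size_t: %u bytes\n", 4u);
    /* on common hosts size_t matches the pointer width */
    assert(sizeof(size_t) == sizeof(void *));
    return 0;
}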

jar wrote:the ability to use target multiple architectures with a function attribute.


That's a frontend-specific part, and a complicated one. And it might not be the best option available, either. I'd rather teach GPGPU/Hydra/whatever about Epiphany, so that it can automatically generate kernels and insert loader code during compilation. That way it would be possible to use the Epiphany CPU for minor everyday tasks, such as processing archives or serving web pages, without rewriting the target application (and I doubt anyone wants to mess with, say, the Nginx source).
The benefit is that those middlewares work with LLVM IR, so it won't matter what language you're using (otherwise we'd need to rework every single frontend).
IIRC, GPGPU aka ppcg was nominated for this year's GSoC, and I've also heard about some Zurich guys moving in this direction. Yes, it all works for CUDA and OpenCL today, but since the Epiphany ELF loading code is not that complex, it shouldn't be hard to have at least some parts of the main code transformed into kernels automatically.

So, yeah, today's update: found this size_t bug, and I'll need to fix it on the frontend side. Not the best option, imo, but it seems to be the only way around it. Also tested the e_bandwidth_test, e_led_test and hello_world examples. The last one fails because of size_t; the other two run just fine.

Upd.: found another way: compiling with the -m32 flag works just fine.

Re: Current status

Postby upcFrost » Tue Apr 11, 2017 3:55 pm

Today I finally got through the "Hello World" example. Actually, it's not as simple as it might seem. Since e_write takes 6 arguments and we only have 4 scratch registers, memory placement for the last 2 arguments was failing. Fixed now.
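A minimal host-side sketch of that situation (the function below is a hypothetical stand-in with e_write's six-argument shape, not the real e-hal call): with only four argument registers, the caller has to place arguments 5 and 6 in memory on the stack, which is exactly the lowering that was failing.

Code: Select all
#include <assert.h>
#include <stddef.h>

/* Hypothetical 6-argument function mirroring e_write()'s shape.
 * Under an r0-r3 argument convention the first four arguments
 * travel in registers; the caller must spill args 5 (buf) and
 * 6 (size) to the stack. */
static long fake_write(void *dev, unsigned row, unsigned col,
                       long to_addr, const void *buf, size_t size)
{
    (void)dev; (void)row; (void)col; (void)to_addr; (void)buf;
    return (long)size;   /* pretend we wrote `size` bytes */
}

int main(void)
{
    char buf[16];
    long n = fake_write(NULL, 0, 0, 0x100, buf, sizeof buf);
    assert(n == 16);
    return 0;
}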
Also tried running the basic_math example. After some jumping around, it compiled and even managed to complete half of the tests correctly. Performance-wise, still... meh :roll:

Re: Current status

Postby upcFrost » Thu Apr 13, 2017 9:50 am

The basic_math example works now. The results are comparable with e-gcc, except for division, as I'm currently using the standard __divsf2 rather than the __fast_recipsf2 optimized for E16. Some difference in results comes from LLVM scheduling (+- 4 cycles).
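For the curious, fast reciprocal routines of the __fast_recipsf2 kind are typically Newton-Raphson refinements of a cheap initial guess. Here is a minimal C sketch of that technique; the bit-trick seed constant and the three-iteration count are illustrative assumptions, not the actual E16 library code.

Code: Select all
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Newton-Raphson reciprocal: y_{n+1} = y_n * (2 - x * y_n).
 * The integer-subtraction seed is a well-known approximation trick;
 * the real __fast_recipsf2 may seed and iterate differently. */
static float fast_recip(float x)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = 0x7EF127EAu - i;            /* rough 1/x seed via the exponent bits */
    memcpy(&y, &i, sizeof y);
    for (int n = 0; n < 3; n++)     /* each step roughly squares the error */
        y = y * (2.0f - x * y);
    return y;
}

int main(void)
{
    assert(fabsf(fast_recip(2.0f) - 0.5f) < 1e-5f);
    assert(fabsf(fast_recip(3.0f) * 3.0f - 1.0f) < 1e-5f);
    return 0;
}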
With this I'd say the basic functionality is in place, so it's time for some optimization and bugfixes.

Also, I've updated the patch and readme in the 64bit branch. I'll probably merge it into the main branch in a couple of days, maybe even today.

Re: Current status

Postby jar » Thu Apr 13, 2017 1:54 pm

Some thoughts on compiler optimizations that I would like to see, based on my experience with GCC:

1) Load/store postmodify should be used in array lookups and loops instead of an arithmetic operation to increment/decrement an index register (an unnecessary instruction and an extra clock)
2) Mask operations that can be replaced with bitwise operations for smaller/faster code
3) Hardware loops are fun, but I think this may be a challenge to get right since the code layout has to be just right. They often mean larger code and are actually slower for very small loops, so they must be used judiciously. GCC doesn't touch them.
4) Preferential use of r0-r3 in leaf functions for smaller code, since those registers enable 16-bit instructions. GCC seems to just use the higher registers as if they were valued the same as r0-r3.
5) Dual-issue of loads/stores with FPU operations. You can sometimes move instructions around to improve performance. You can also zero-initialize registers early with the FPU (fsub rx, rx, rx) rather than a mov instruction (but that may be a 32-bit instruction instead of 16 bits in some cases).
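To illustrate point 1: a pointer-bumping loop is exactly the shape a backend can fold into load/store-with-postmodify, so the address update rides along with the memory access instead of needing a separate add on an index register.

Code: Select all
#include <assert.h>
#include <string.h>

/* Pointer-bumping copy: each iteration's load and store pair
 * naturally with the address increment, the pattern a backend can
 * lower to load/store-with-postmodify (ldr rd, [rp], #4 style)
 * instead of a separate index-register add. */
static void copy_words(int *dst, const int *src, int n)
{
    while (n--)
        *dst++ = *src++;   /* both accesses are postmodify candidates */
}

int main(void)
{
    int a[4] = {1, 2, 3, 4}, b[4] = {0};
    copy_words(b, a, 4);
    assert(memcmp(a, b, sizeof a) == 0);
    return 0;
}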

In hand-writing some assembly routines, I have copied one of the r0-r3 registers to a higher register in order to free it up and save instruction space (despite costing one 32-bit instruction and one clock cycle). The design tradeoff space is huge despite this being a RISC architecture.

Good luck

Re: Current status

Postby upcFrost » Thu Apr 20, 2017 1:09 pm

jar wrote:Load/Store Postmodify in array lookups and loops should be used instead of an arithmetic operation to increment/decrement an index register (an unnecessary instruction and extra clock)

Will do. Postmodify support is actually integrated into LLVM, though the instruction spec for it is a bit tricky.

jar wrote:Mask operations that can be replaced with bitwise operations for smaller/faster code

Not always. And the generalization might get nasty.

jar wrote:Hardware loops are fun, but I think this may be a challenge to get right since the code layout has to be just right. It's often larger code and actually slower for very small loops so it must be used judiciously. GCC doesn't touch them.

Not that hard, actually. There's an internal loop analyzer in LLVM, so it's quite easy to check the loop length and trip count and see whether we can benefit from a HW loop.

jar wrote:Preferential use of r0-r3 in leaf functions for smaller code. GCC seems to just use the higher registers as if they were valued the same as r0-r3, which enable 16-bit instructions.

Ugh... that's a bit hard. Honestly, I'd also prefer using the higher regs, because even for the "basic_math" example the register pressure gets high. Also, r4-r7 are callee-saved, which means they cost both stack space and a load/store pair to use. Not critical, though. There's also a small issue in LLVM's instruction selector: it tends to grow register pressure really fast by putting constraints on the RegAlloc. I think I'll default to all 64 regs and 32-bit instructions with r0-r7 preferred, and then run an additional pass swapping instructions to 16-bit where possible.

jar wrote:Dual-issue loads/stores and FPU operations. You can sometimes move around instructions to improve performance. You can also zero initialize registers early with the FPU (fsub rx, rx, rx) rather than a mov instruction (but that may be a 32-bit instruction instead of 16-bits in some cases).

Yeah, maybe I'll try. It should probably go together with an IALU-to-IALU2 pass to optimize the pipeline.


And a small update: I've reworked the stack, added the fast division call, and fixed a bunch of bugs. Performance is now on par with GCC. There are still some issues, but in general it works on most of the examples provided.

