Hello jar, thanks for the answer. I'll try to explain.

OK, point by point (I won't comment on everything here, it's too long; that will come when [and if] I have actually better results):

1) I am already (almost) fully overlapping communication with computation (the ARM handles input communication for now, while the Epiphany computes [and it is done in chunks of 128x128x2 on the input]; output communication is irrelevant in this implementation)

2) I am using e_write, for now, with an experimental bandwidth of about 45MB/s (which probably gives the 1.4GFLOPS limit; I've done the math before but won't recheck it now; it matches your 2.88GFLOPS/2 [for 90MB/s] in any case, so it sounds right). I know I can use ERAM, and possibly double it, but I am working on the other parts first.

3) I know the limits of multiplying 128x128 matrices. That multiplication is done "on-chip" and, as you say, can't achieve much better performance (it can improve with the bandwidth but not by other means). It is possible to go up to 180x180 matrices on chip (I've done it), but that's not the interesting point (little improvement). It is also possible to accumulate the results as you move along "the k dimension", to reduce the weight of the "post-processing" (I am doing that, and post-processing goes down to less than 2%).

4) Now the interesting part: I am using an algorithm that "extends nicely" to matrices bigger than the chip's storage. For example, I have one implementation of the "pure SUMMA" algorithm, for A(512x512), B(512x512), in which the input transfer time is totally irrelevant. As a side effect, the output grows a lot (and that is O(N^2), so you normally don't want it...). But that is with "pure SUMMA + ARM post-processing". The interesting thing is:

"pure SUMMA" = little input/ big output (a lot of ARM post-processing, and Epiphany->ARM comm)

"pure classical" = big input/little output (a lot of ARM->Epiphany comm)

What I am doing is mixing both, hoping to find a good equilibrium that, with a bit of luck, will get close to the computation speed bounds (hopefully with some help on the bandwidth side too, which seems to be possible thanks to Andreas; but the point is that I think it is possible to be more efficient on the outer-algorithm side as well).

I expect to have an answer to this "luck" issue this week or the next (I hope to finish the "outer part" of my matmul; the inner function optimization is still missing, but I will already be able to predict the "data transfer" limitations. For the "inner function" I may borrow the "matmul_optimized" assembly code that comes with the Parallella examples [the authors are from the Australian National University]).

5) Post-processing is important for me, as I am not writing just a "matmul" but a complete "sgemm" kernel to be used in BLAS. That means taking into account alpha, beta, cs_a, rs_a, cs_b, rs_b, cs_c, rs_c, etc. (I've been learning about those on the run, hehe, but now I seem to handle them OK).

Finally: if I get a good kernel, I already have the needed "surrounding technology" to build a "kernel-powered BLAS" within an afternoon, and I have a non-standard test I can run on it. But I want Linpack! (imagine the voice of a crying baby saying that), and the (great) HPL version of Linpack I have is for doubles... that's the situation for now. (I also have plans to improve the "double" version with "extended precision", as [I think it was you, in fact] suggested, but that will probably take more time [unless there is a volunteer... it's totally parallel work: optimizing an already existing ASM function to work with extended precision]. I still haven't even tested the ASM compiler once, so I would need at least some time to get used to it, install the tools, etc.)