Postby MiguelTasende » Thu Jul 16, 2015 9:03 pm

I wanted to know... has anyone run the linpack with a parallella cluster?
I saw there was a challenge about this, some time ago. Anyone succeeded?

By now, I've run the linpack on the ARM cluster using OpenBLAS.
Now working on adding the Epiphany.

If anyone is interested in sharing thoughts, experiences, or Linpack results, I'd be glad to exchange them.

Right now, I have instantiated the BLAS via BLIS on a virtual cluster, modified a "gemm" kernel, built the linpack on top of it, and run it successfully (and checked that my kernel was being run by the linpack... with some "printf"s).
That is... I got to manage a container for a super-Epiphany powered matmul kernel. Now trying to fill it with an actual Epiphany powered matmul kernel :)
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Linpack

Postby aolofsson » Sun Aug 02, 2015 11:19 pm

Very cool! How did things go with Epiphany BLIS?
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Linpack

Postby MiguelTasende » Thu Aug 06, 2015 4:47 pm

Still not there...
I managed to offload "dotproduct" to the Epiphany (in a terribly unoptimized way) and built the BLAS that uses it.
I completely rethought the Matmul algorithm in a way that I think may solve all the performance problems (uuuhhh... well, you have to be optimistic... we'll see), and have implemented it partly: for now I'm offloading half to the Epiphany and calculating the other half on the ARM.

Tasks for today:
- study the linker scripts to allocate memory efficiently
- offload the second part of the algorithm to the Epiphany

After that I would have an initial prototype that I hope will be far more efficient than previous implementations (though I may be wrong).

More tasks, for after the first prototype test:
- generalize the kernel to non-square matrices (an optimization: BLIS lets you use only "square kernels" and divides the original matrix accordingly, but if you let it use arbitrary matrices it can optimize further)
- create a double buffer
- optimize algorithm
- optimize code
- try to find a way to avoid hardware-resetting the Epiphany every time a call to the BLIS kernel is made

We'll be sending news...

Re: Linpack

Postby aolofsson » Thu Aug 06, 2015 5:46 pm

That's awesome!

I doubt you will get good performance sending dot-product to the Epiphany (too bandwidth limited).

You should never have to mess around with linker descriptor files. :-(
I recommend combining your work with this (or something similar) if you will be doing offloading from the ARM/ ... ter/cannon


Re: Linpack

Postby MiguelTasende » Fri Aug 07, 2015 2:28 pm

Yes, the "dotproduct" part was only to show that I could combine all the pieces, get a BLAS library that does some real calculations on the Epiphany, and get correct results when running Linpack. No good performance was expected.
In my actual algorithm I am not using any dot product at all.

After looking at the linker descriptor files I decided to use internal.ldf for now and not make any changes... but I may still need to understand how it works better (so as not to put some variables in the wrong place). At the moment I can multiply 64x64 matrices using 2 memory banks to store the local results (I use 1 for the code and 1 for "small variables"). I'm being conservative there; besides, adding 1 more bank doesn't improve things much (it would allow 78x78).
I have an idea that could enhance it 16 times, because I could theoretically use 2 banks for each core, reusing space for partial-partial-partial-...-final results (short explanation... :) ), but that will be a bit more complicated...

I'll look at the code you mention, but it seems to use "coprthr", and I am not (for now). I'm using only the eSDK for this first prototype. Later it can be improved if necessary.

Re: Linpack

Postby MiguelTasende » Thu Mar 03, 2016 5:56 pm

These are the first Linpack benchmark results... still a lot to optimize (it's a poor result for now):

NOTE 1: It was necessary to run the "Epiphany caller" algorithm from a separate process, for two reasons:
1) To avoid mmap() problems when the kernel is called many times
2) To avoid performance losses due to initialization times

NOTE 2: To accommodate the HPL Linpack benchmark, which uses double precision, I created a hybrid version of the kernel: BLIS calls to dgemm go to a custom dgemm kernel that in fact does some casting and calls the sgemm inner kernel.

A(192x4096), B(4096x256), C(192x256)
Operation: C_out = alpha * A * B + beta * C_in

---My own measurements--------------------------------------------------------------------------------------------------------------
My Matmul algorithm (run once, don't count initialization times): 3.4 GFLOPS
Same Matmul algorithm modified to run as a separate process (run from the same process): 3.04 GFLOPS
Same Matmul (run from another process: includes transfers to shared memory and synchronization times): 2.17 GFLOPS
Possible improvements (in this phase): the algorithm itself (great potential in improving "e_read" times from the ARM to shared RAM), adopting the new e-link, and improving inter-process communication.

---BLIS measurements--------------------------------------------------------------------------------------------------------------
sgemm kernel (M=192, N=256, K=4096): 2.63 GFLOPS (run from separate process)
sgemm complete operation (M=N=K=4096): Between 2.035 GFLOPS and 2.456 GFLOPS (depending on the transpose, conjugate, etc. operations requested)
"false dgemm" kernel (M=192, N=256, K=4096): 2.073 GFLOPS
"false dgemm" complete operation (M=N=K=4096): Between 1.575 GFLOPS and 1.829 GFLOPS
Possible improvements (in this phase): I think the BLIS process is very efficient as it is and is not wasting many FLOPS at this point.

---HPL Linpack benchmark--------------------------------------------------------------------------------------------------------------
Linpack (N = 4608, NB = 768, many other options tweaked...): 495.7 MFLOPS
Possible improvements (in this phase): The many parameters of the HPL configuration file change the way the algorithm is run and have a great impact on performance. Also, HPL is calling the "false dgemm" of the BLIS-BLAS library (it would be better if I had a native single-precision algorithm to compile against the BLIS-BLAS).
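For anyone wanting to reproduce the setup, the N and NB above correspond to lines in HPL's input file; the relevant fragment of HPL.dat looks roughly like this (all other lines of the file omitted here):

```
1            # of problems sizes (N)
4608         Ns
1            # of NBs
768          NBs
```

NB is the blocking factor passed down to dgemm, so it directly controls the panel sizes the BLIS kernel sees.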

That's it for now; I hope to improve what I can. I'll also run it on a cluster (maybe that first, and then improve other things).

Re: Linpack

Postby snim2 » Thu Aug 04, 2016 9:24 pm

Hi @MiguelTasende this looks really interesting! Is your version of Linpack available publicly at all?


Posts: 53
Joined: Mon Feb 03, 2014 5:02 pm

Re: Linpack

Postby MiguelTasende » Mon Aug 15, 2016 7:17 pm

I am really sorry. The problem is that I am very inexperienced in these matters (and maybe I am on shaky ground, too...).
The work was published at a conference (last week): IEEE DataCom 2016 (Auckland).
The original paper was 8 pages long, but I had to reduce it to 4 pages for publication. I would like to publish the 8-page version (more detailed) on arXiv or similar, but I am asking the IEEE copyright section whether that is possible.

It is the first paper I have published, and also the first time I will try to release software code from within the company I work for, so I am lost in a legal mess :)
The code is still not available.
By the end of this month I will have news (or die trying... :P ). In principle, the company will support releasing the code.

Any general advice (from experienced authors) would also help.

For now I can tell you the title of the paper: "Generation of the Single Precision BLAS library for the Parallella platform, with Epiphany co-processor acceleration, using the BLIS
I still can't find it in IEEE Xplore, or anywhere else on the web (maybe it is too soon).

Re: Linpack

Postby jar » Tue Aug 16, 2016 3:45 am


You should be able to post papers submitted to IEEE on a preprint, personal, or company server in order to collaborate. I have done this. See the IEEE FAQ:

Does IEEE consider authors posting their articles on preprint servers or on their companies' web sites to be a form of prior publication, which may then disqualify the articles from further editorial consideration?
No. IEEE policy allows authors to submit previously posted articles to IEEE publications for consideration as long as authors are able to transfer copyright to IEEE, i.e., they had not transferred copyright to another party prior to submission.

Does the policy affect how authors post their articles on preprint servers such as ArXiv?
Yes. The IEEE recognizes that many authors share their unpublished articles on public sites. Once articles have been accepted for publication by IEEE, authors are required to post an IEEE copyright notice on their preprints. Upon publication, authors must replace the preprints with either 1) the full citation to the IEEE works with Digital Object Identifiers (DOI) or 2) the accepted versions only (not the IEEE-published versions) with the DOI. IEEE journals will make available to each author the accepted version of the article that the author can post that includes the DOI, IEEE copyright notice, and a notice indicating the article has been accepted for publication by IEEE. IEEE conference authors are free to post their own version of their articles, as accepted by an IEEE conference.

I'm going to take this opportunity to plug two of my papers which were recently published at OpenSHMEM 2016 (with manuscripts on arXiv below):
An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Linpack

Postby MiguelTasende » Fri Aug 19, 2016 1:22 am

OK, thanks.
After dealing with some arXiv issues (initial problems with "endorsement"...), it is now done.
Can be downloaded here:

(It is the 8-page manuscript. It later had to be reduced to 4 pages for the final submission to the conference. The final version is still not published, as far as I know.)

About the code, I hope to be able to release it soon.

