Single precision Linpack benchmark code

Re: Single precision Linpack benchmark code

Postby jar » Wed Feb 24, 2016 3:24 am

Good job! That sounds like a very good result.

Have you tried the new e-link interface?
jar
 
Posts: 294
Joined: Mon Dec 17, 2012 3:27 am

Re: Single precision Linpack benchmark code

Postby sebraa » Wed Feb 24, 2016 4:31 pm

jar wrote:Have you tried the new e-link interface?
Are there any images using the new eLink available? The web page still lists the images from 2015 (headless) and 2014 (HDMI).
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: Single precision Linpack benchmark code

Postby MiguelTasende » Wed Feb 24, 2016 7:29 pm

sebraa wrote:
jar wrote:Have you tried the new e-link interface?
Are there any images using the new eLink available? The web page still lists the images from 2015 (headless) and 2014 (HDMI).


sebraa, I don't think there are "easy images" to copy, but peteasa at least got it running from the oh repository, I think. Look here:

https://www.parallella.org/forums/viewt ... 0&start=10


jar, I still haven't tried it. I am now building the BLAS, polishing some details such as strides for all matrices, padding for k % Kkernel != 0, etc.

In that process I found that the "e_init" function takes away some performance. I hadn't been counting that function, and I think for good reasons: it is not part of the "data-operations-data" flow, and in principle the BLAS acceleration could run as a daemon that has already called it. Alternatively, I could find an "init" section in the full BLIS sgemm (not the kernel) to run "e_init", or stop using e-hal altogether.

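For reference, here is a minimal sketch (my illustration, not the actual BLAS code) of what I mean by keeping "e_init" out of the timed region; it assumes the standard e-hal calls e_init(NULL) / e_finalize() and uses clock_gettime() for timing:

Code: Select all
/* Minimal sketch: pay the e_init() cost once, outside the timed region,
 * the way a resident daemon would. */
#include <e-hal.h>
#include <stdio.h>
#include <time.h>

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* One-time setup: a daemon would have done this already. */
    if (e_init(NULL) != E_OK) {
        fprintf(stderr, "e_init failed\n");
        return 1;
    }

    double t0 = now_s();
    /* ... load and run the sgemm kernel on the Epiphany here ... */
    double t1 = now_s();
    printf("timed region: %.3f s (e_init excluded)\n", t1 - t0);

    e_finalize();
    return 0;
}
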
I have come to think that it could be very beneficial to use the Zynq with Asymmetric Multiprocessing (AMP): one core running Linux, one core running bare-metal. The bare-metal core would communicate with the Epiphany (giving much more freedom there). But of course, that still has to be done... (I read that someone here was using that configuration, but without the Epiphany; I think the project was the "Herbert robot".)

P.S.: As for the original question ("how do I run single precision Linpack on a cluster?"), I think I'll try to "trick" the HPL code: take its dgemm calls, downcast, compute in single precision, upcast, and set the error tolerance to 2^(mantissa difference between double and single). It may work... we'll see. :P
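A rough sketch of that "casting trick" (illustrative only: fake_dgemm is a made-up name, cblas_sgemm stands in for the accelerated single-precision core, and contiguous column-major storage is assumed):

Code: Select all
/* "Casting trick": serve a double-precision gemm call with a
 * single-precision core by downcasting, computing in float, upcasting. */
#include <cblas.h>
#include <stdlib.h>

void fake_dgemm(int m, int n, int k,
                double alpha, const double *A, const double *B,
                double beta, double *C)
{
    float *As = malloc((size_t)m * k * sizeof *As);
    float *Bs = malloc((size_t)k * n * sizeof *Bs);
    float *Cs = malloc((size_t)m * n * sizeof *Cs);

    /* Downcast the operands to single precision. */
    for (size_t i = 0; i < (size_t)m * k; i++) As[i] = (float)A[i];
    for (size_t i = 0; i < (size_t)k * n; i++) Bs[i] = (float)B[i];
    for (size_t i = 0; i < (size_t)m * n; i++) Cs[i] = (float)C[i];

    /* Compute in single precision (this is where the Epiphany sgemm goes). */
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                (float)alpha, As, m, Bs, k, (float)beta, Cs, m);

    /* Upcast the result back into the double-precision output. */
    for (size_t i = 0; i < (size_t)m * n; i++) C[i] = (double)Cs[i];

    free(As); free(Bs); free(Cs);
}

Since double has 52 mantissa bits and single has 23, HPL's residual tolerance would have to be relaxed by roughly a factor of 2^29.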
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Single precision Linpack benchmark code

Postby MiguelTasende » Wed Mar 02, 2016 6:18 pm

Instead of the "bare-metal", I learned some Linux programming...
I made the "daemon" (it's just an independent process run from a second ssh terminal, by now. Can be "daemonized" easily but it's not necessary for tests).

I am using shared memory and semaphores to communicate processes, and after my (still rough) interprocess communication, I lost some GFLOPS, but the results are not that bad (and can always be improved). I get about 2,6GLFOPS for the kernel, and about 2,0GFLOPS for complete sgemm, according to BLIS testsuite (I copy the results):

Code: Select all
%
% test gemm_ukr seq front-end?    1
% gemm_ukr k                      4096
% gemm_ukr operand params         (none)
%

% blis_<dt><op>_<stor>                           m      n      k        gflops     resid      result
blis_sgemm_ukr_c          (   1, 1:5 ) = [   192   256  4096    2.630  1.18e-07 ]; % PASS

% --- gemm ---
%
% test gemm seq front-end?    1
% gemm m n k                  4096 4096 4096
% gemm operand params         ??
%

% blis_<dt><op>_<params>_<stor>            m      n       k       gflops    resid        result
blis_sgemm_nn_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.381  4.52e-07 ]; % PASS
blis_sgemm_nc_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.381  4.79e-07 ]; % PASS
blis_sgemm_nt_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.455  4.77e-07 ]; % PASS
blis_sgemm_nh_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.456  4.65e-07 ]; % PASS
blis_sgemm_cn_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.381  4.69e-07 ]; % PASS
blis_sgemm_cc_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.381  4.75e-07 ]; % PASS
blis_sgemm_ct_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.455  4.67e-07 ]; % PASS
blis_sgemm_ch_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.455  4.59e-07 ]; % PASS
blis_sgemm_tn_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.034  4.50e-07 ]; % PASS
blis_sgemm_tc_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.036  4.64e-07 ]; % PASS
blis_sgemm_tt_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.090  4.55e-07 ]; % PASS
blis_sgemm_th_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.094  4.89e-07 ]; % PASS
blis_sgemm_hn_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.035  4.67e-07 ]; % PASS
blis_sgemm_hc_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.037  4.69e-07 ]; % PASS
blis_sgemm_ht_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.090  4.69e-07 ]; % PASS
blis_sgemm_hh_ccc         (   1, 1:5 ) = [  4096  4096  4096    2.094  4.63e-07 ]; % PASS

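For reference, the host-side IPC is roughly like this (a simplified sketch, not the released code: the shared-memory and semaphore names and the request_sgemm() helper are illustrative):

Code: Select all
/* Caller side: hand one sgemm request to the server process that owns the
 * Epiphany (and has already called e_init), via POSIX shared memory and
 * named semaphores.  Compile with -lrt -pthread. */
#include <fcntl.h>
#include <semaphore.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME  "/sgemm_shm"
#define SEM_REQ   "/sgemm_req"
#define SEM_DONE  "/sgemm_done"
#define SHM_BYTES ((size_t)64 << 20)   /* shared buffer for A, B and C */

int request_sgemm(const float *A, const float *B, float *C,
                  size_t bytes_a, size_t bytes_b, size_t bytes_c)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0666);
    if (fd < 0) return -1;
    char *buf = mmap(NULL, SHM_BYTES, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { close(fd); return -1; }

    sem_t *req  = sem_open(SEM_REQ, 0);
    sem_t *done = sem_open(SEM_DONE, 0);
    if (req == SEM_FAILED || done == SEM_FAILED) return -1;

    memcpy(buf, A, bytes_a);                     /* stage the operands */
    memcpy(buf + bytes_a, B, bytes_b);

    sem_post(req);                               /* wake the server */
    sem_wait(done);                              /* block until it finishes */

    memcpy(C, buf + bytes_a + bytes_b, bytes_c); /* collect the result */

    munmap(buf, SHM_BYTES);
    close(fd);
    return 0;
}

The server side is the mirror image: it calls e_init() once at startup, then loops on sem_wait(req), runs the Epiphany sgemm on the shared buffer, and posts "done".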

The "casting trick" does work indeed, but I only tested it with an old implementation, by now (very bad performance). I am now going to test it with this one, to get the "first reasonable linpack result". If I get a 2 GFLOPS linpack result I will be happy for an initial "prototype" (I have many ideas to improve it later).

P.S.: I copy the results from the "false dgemm" (dgemm with an sgemm core and the "casting trick"). It loses some performance (due to the casting process, I suppose). Precision improves a bit (the external gemm operations are done in double precision). The "FAILURE" results were expected and are OK: the residuals are around 1e-8, which is fine for single precision (I don't know how to change BLIS's precision requirements; I can do it easily with Linpack).

Code: Select all
%
% test gemm_ukr seq front-end?    1
% gemm_ukr k                      4096
% gemm_ukr operand params         (none)
%

% blis_<dt><op>_<stor>               m     n     k   gflops   resid      result
blis_dgemm_ukr_c          (   1, 1:5 ) = [   192   256  4096    2.073  9.33e-09 ]; % FAILURE

% --- gemm ---
%
% test gemm seq front-end?    1
% gemm m n k                  4096 4096 4096
% gemm operand params         ??
%

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.785  1.30e-08 ]; % FAILURE
blis_dgemm_nc_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.785  1.28e-08 ]; % FAILURE
blis_dgemm_nt_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.829  1.32e-08 ]; % FAILURE
blis_dgemm_nh_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.828  1.28e-08 ]; % FAILURE
blis_dgemm_cn_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.784  1.30e-08 ]; % FAILURE
blis_dgemm_cc_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.783  1.29e-08 ]; % FAILURE
blis_dgemm_ct_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.828  1.28e-08 ]; % FAILURE
blis_dgemm_ch_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.828  1.29e-08 ]; % FAILURE
blis_dgemm_tn_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.580  1.27e-08 ]; % FAILURE
blis_dgemm_tc_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.578  1.29e-08 ]; % FAILURE
blis_dgemm_tt_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.613  1.28e-08 ]; % FAILURE
blis_dgemm_th_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.611  1.26e-08 ]; % FAILURE
blis_dgemm_hn_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.579  1.29e-08 ]; % FAILURE
blis_dgemm_hc_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.575  1.29e-08 ]; % FAILURE
blis_dgemm_ht_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.615  1.31e-08 ]; % FAILURE
blis_dgemm_hh_ccc         (   1, 1:5 ) = [  4096  4096  4096    1.614  1.28e-08 ]; % FAILURE
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Single precision Linpack benchmark code

Postby mjvbhaskar1000 » Fri Jun 10, 2016 8:10 pm

What were the final values you got?
Did you post the code somewhere?
mjvbhaskar1000
 
Posts: 11
Joined: Tue Jun 07, 2016 3:04 pm

Re: Single precision Linpack benchmark code

Postby MiguelTasende » Mon Jun 27, 2016 2:48 pm

mjvbhaskar1000 wrote:What were the final values you got?
Did you post the code somewhere?


I have submitted a paper (accepted as a "poster") to IEEE DataCom 2016; it will be published in August in the conference proceedings. Before that, I will try to see if I am allowed to publish the full version on arXiv.
I am still trying to get authorization to release the code (this work was part of my job). I hope I will get it soon; I don't think there are "commercial" reasons not to, I just need to get through some "paperwork".

The "final" results, are more or less the ones here: viewtopic.php?f=32&t=3095
The matrix multiplication algorithm improved a bit with the new e-link, and also made some further changes (a "streaming" version of the algorithm), but not yet completed (I had to switch to other projects... hope to come back, maybe).
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Single precision Linpack benchmark code

Postby MiguelTasende » Wed Aug 24, 2016 4:29 pm

Good news.
The code is released under the Mozilla Public License 2.0.
It is not perfectly "polished" (there may be some unused files included, and you'll find many comments and variable names in Spanish, among other things), but it works (at least here... I hope to hear from other people testing it).

The link to GitHub is here:

https://github.com/mtasende/BLAS_for_Parallella

IMPORTANT NOTE: Most of the use cases require running two processes on the Linux host. That is explained in the README.txt, but it is different from a regular program and easy to forget.

The link to the paper manuscript explaining the code is here:

https://arxiv.org/abs/1608.05265
https://arxiv.org/pdf/1608.05265v1.pdf
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm
