
paper on extended "double-single" precision

PostPosted: Sat Jun 14, 2014 11:05 am
by notzed
The paper "Extended-Precision Floating-Point Numbers for GPU Computation" by Andrew Thall may be of use for the Epiphany.

It outlines the basic flops for a double-single format that I presume should be faster (and smaller) than a software IEEE-double library. All operations are decomposed into 32-bit flops, so they can run on the hardware. The format roughly doubles the mantissa precision but doesn't extend the exponent range.

I downloaded it from here: http://andrewthall.net/papers/df64_qf128.pdf

Re: paper on extended "double-single" precision

PostPosted: Sat May 27, 2017 11:28 pm
by upcFrost
Actually it might be worth trying to implement. I'll probably give it a try, at least the basic ops.

Re: paper on extended "double-single" precision

PostPosted: Sun May 28, 2017 12:42 am
by jar
I had experimented with this some time ago and didn't find it very worthwhile. Consider the df64_mult routine, which requires 9 multiplications, 9 subtractions, and 6 additions by my count. It should be used very selectively.

Code: Select all
float2 df64_mult(float2 a, float2 b) { // 9 mul + 9 sub + 6 add
   float2 p;
   p = twoProd(a.x, b.x); // 7 mul + 7 sub + 3 add
   p.y += a.x * b.y;
   p.y += a.y * b.x;
   p = quickTwoSum(p.x, p.y); // 2 sub + 1 add
   return p;
}

float2 quickTwoSum(float a, float b) { // 2 sub + 1 add
   float s = a + b;
   float e = b - (s - a);
   return float2(s, e);
}

float2 twoProd(float a, float b) { // 7 mul + 7 sub + 3 add
   float p = a * b;
   float2 aS = split(a); // 1 mul + 3 sub
   float2 bS = split(b); // 1 mul + 3 sub
   float err = ((aS.x * bS.x - p) + aS.x * bS.y + aS.y * bS.x) + aS.y * bS.y;
   return float2(p, err);
}

float2 split(float a) { // 1 mul + 3 sub
   const float splitter = 4097; // (1<<12)+1, for a 24-bit mantissa
   float t = a * splitter;
   float ahi = t - (t - a);
   float alo = a - ahi;
   return float2(ahi, alo);
}