I've found a situation where a plain loop is faster than certain vectorized vDSP functions for double-precision values, at least on my late-2013 15.4″ base-model rMBP running OS X 10.9.4. With single-precision values, the vDSP functions are faster.
I am using the Box-Muller transform to generate normally-distributed values. I'll attach my full code, but here is what transpired. I wrote a straightforward, non-vectorized implementation, and it works fine. Then I wanted to see if I could speed it up with vectorized functions, so I wrote that version, and it is indeed faster (it runs in about three-fourths the time).
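For reference, the transform takes two independent uniform values u1, u2 in (0, 1] and produces two independent standard normal values:

Code:
z0 = sqrt(-2 ln u1) * cos(2*pi*u2)
z1 = sqrt(-2 ln u1) * sin(2*pi*u2)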
For the heck of it, I decided to try replacing some vectorized function calls, notably multiplication, with loops instead. I kept the vectorized versions of complicated things like square roots and natural logs, but did the multiplication in loops; it's sort of a hybrid of the two implementations. And strangely enough, the hybrid version actually runs faster (about 10-15% less time) than the fully vectorized version.
Finally, I re-implemented all three versions using single-precision floats rather than doubles. But with floats, the vectorized version runs fastest (about 3-5% less time than the hybrid version). So I am not really sure what is going on with the double-precision case. Here are some typical run-times for generating 2^24 (approx. 16 million) normal values:
Float (seconds):
Looped: 0.290083
Vector: 0.161175
Hybrid: 0.166553

Double (seconds):
Looped: 0.461659
Vector: 0.332181
Hybrid: 0.298434
Any explanation for why vDSP multiplication of doubles runs slower than a loop in this situation?
Here is the code for the simple, non-vectorized, looped version:
Code:
#include <stdlib.h>   // random()
#include <math.h>     // log(), sqrt(), sin(), cos(), M_PI

#define SCALE_D 0x1p-53
// 53 random bits from two random() calls (each yields 31 bits; the low
// 9 bits of the second call are discarded).
#define RAND_53 ((random() << 22) | (random() >> 9))
#define RAND_D ((double)RAND_53)
// -log of a uniform value in (0, 1]; the +1 avoids log(0).
#define EXP_D (-log(SCALE_D * (double)(RAND_53 + 1L)))

double randNormD(double * const x) {
    // Box-Muller transform: radius r and angle t give two normal values.
    const double r = sqrt(2.0 * EXP_D);
    const double t = (2.0 * (double)M_PI * SCALE_D) * RAND_D;
    *x = r * cos(t);
    return r * sin(t);
}

void loopNormD(double * const v, const int n) {
    // Fill in pairs; handle an odd final element separately.
    for (int i = 0; i < n-1; i += 2) v[i] = randNormD(&v[i+1]);
    if ((n % 2) && (n > 0)) v[n-1] = randNormD(&v[n-1]);
}
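For anyone who wants a quick standalone check, something like this (a minimal sketch using the definitions above; the buffer size is arbitrary) verifies that the sample mean comes out near 0 and the sample variance near 1:

Code:
// Minimal sanity check (sketch): uses loopNormD and the macros above.
#include <stdio.h>

int main(void) {
    enum { N = 1 << 20 };
    static double v[N];           // static keeps the big buffer off the stack
    loopNormD(v, N);
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < N; i++) { sum += v[i]; sumsq += v[i] * v[i]; }
    const double mean = sum / N;
    printf("mean = %f, variance = %f\n", mean, sumsq / N - mean * mean);
    return 0;
}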
The vectorized version:
Code:
#include <Accelerate/Accelerate.h>

#define BATCH_SIZE 1024
#define HALF_BATCH 512

void vecNormHelpD(double * const v);

void vecNormD(double * v, const int n) {
    for (int i = n/BATCH_SIZE; i-- > 0; v += BATCH_SIZE) vecNormHelpD(v);
    loopNormD(v, n % BATCH_SIZE);  // handle the leftover partial batch
}

void vecNormHelpD(double * const v) {
    const int h = HALF_BATCH;
    const double s = SCALE_D;
    const double p = 2.0 * M_PI * SCALE_D;
    const double negtwo = -2.0;
    double temp[h];
    // Radius: temp = sqrt(-2 log(u1)).
    for (int i = 0; i < h; i++) temp[i] = RAND_D;
    vDSP_vsmsaD(temp, 1, &s, &s, temp, 1, h);  // temp = s*temp + s, in (0, 1]
    vvlog(temp, temp, &h);
    vDSP_vsmulD(temp, 1, &negtwo, temp, 1, h);
    vvsqrt(temp, temp, &h);
    // Angle: v = 2*pi*u2.
    for (int i = 0; i < h; i++) v[i] = RAND_D;
    vDSP_vsmulD(v, 1, &p, v, 1, h);
    vvsincos(v, v+h, v, &h);             // sine into v, cosine into v+h
    vDSP_vmulD(v, 1, temp, 1, v, 1, h);  // scale both halves by the radius
    vDSP_vmulD(v+h, 1, temp, 1, v+h, 1, h);
}
And the hybrid version:
Code:
void vecNormHelpD2(double * const v);

void vecNormD2(double * v, const int n) {
    for (int i = n/BATCH_SIZE; i-- > 0; v += BATCH_SIZE) vecNormHelpD2(v);
    loopNormD(v, n % BATCH_SIZE);  // handle the leftover partial batch
}

void vecNormHelpD2(double * const v) {
    const int h = HALF_BATCH;
    const double s = SCALE_D;
    const double p = 2.0 * M_PI * SCALE_D;
    double * const w = v + h;
    double temp[h];
    // Radius: vectorized log and sqrt, but the multiply by -2 is a plain loop.
    for (int i = 0; i < h; i++) temp[i] = RAND_D;
    vDSP_vsmsaD(temp, 1, &s, &s, temp, 1, h);
    vvlog(temp, temp, &h);
    for (int i = 0; i < h; i++) temp[i] *= -2.0;
    vvsqrt(temp, temp, &h);
    // Angle: the scale by 2*pi is folded into the fill loop.
    for (int i = 0; i < h; i++) v[i] = p * RAND_D;
    vvsincos(v, w, v, &h);  // sine into v, cosine into w = v + h
    // Scale both halves by the radius with a plain loop.
    for (int i = 0; i < h; i++) {
        v[i] *= temp[i];
        w[i] *= temp[i];
    }
}
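To isolate just the multiply step that the vectorized and hybrid versions do differently, a micro-benchmark along these lines (my own sketch, not from the attached code; the buffer size, repetition count, and clock()-based timing are arbitrary choices) could compare vDSP_vmulD against a plain loop:

Code:
// Micro-benchmark sketch: vDSP_vmulD vs. a plain multiply loop.
#include <Accelerate/Accelerate.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const int h = 512;            // same as HALF_BATCH
    const int reps = 1000000;
    static double a[512], b[512], c[512];
    for (int i = 0; i < h; i++) { a[i] = i + 1.0; b[i] = 1.0 / (i + 1.0); }
    double sink = 0.0;            // checksum to keep the stores live

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        vDSP_vmulD(a, 1, b, 1, c, 1, h);
        sink += c[r % h];
    }
    clock_t t1 = clock();
    for (int r = 0; r < reps; r++) {
        for (int i = 0; i < h; i++) c[i] = a[i] * b[i];
        sink += c[r % h];
    }
    clock_t t2 = clock();

    printf("vDSP_vmulD: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("plain loop: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("(checksum: %f)\n", sink);  // prevents dead-code elimination
    return 0;
}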
Full code attached. I linked against the Accelerate framework and compiled in Xcode 5.1.1 as a command-line tool with the default optimizations for a release version, and ran it in Terminal.
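For anyone reproducing this outside Xcode, the equivalent command line should be roughly the following (assuming the source file is named normals.c; Xcode 5's default release optimization for C is -Os, if I recall):

Code:
clang -Os normals.c -framework Accelerate -o normals
./normals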