105. Benchmarking Vector API
Benchmarking the Vector API can be accomplished via JMH. Let’s consider three Java arrays (x, y, z), each of 50,000,000 integers, and the following computation:
z[i] = x[i] + y[i];
w[i] = x[i] * z[i] * y[i];
k[i] = z[i] + w[i] * y[i];
So, the final result is stored in a Java array named k. Next, let’s consider the following benchmark containing four different implementations of this computation (using a mask, no mask, unrolled, and plain scalar Java with arrays):
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode({Mode.AverageTime, Mode.Throughput})
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Benchmark)
@Fork(value = 1, warmups = 0,
  jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public class Main {
  private static final VectorSpecies<Integer> VS
    = IntVector.SPECIES_PREFERRED;
  …
  @Benchmark
  public void computeWithMask(Blackhole blackhole) {…}
  @Benchmark
  public void computeNoMask(Blackhole blackhole) {…}
  @Benchmark
  public void computeUnrolled(Blackhole blackhole) {…}
  @Benchmark
  public void computeArrays(Blackhole blackhole) {…}
}
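The method bodies are elided in the listing above, so the following is only a minimal sketch of the arithmetic being measured, not the benchmark's actual code. It assumes plain scalar loops over small sample arrays; the class name ScalarBaseline and the method compute() are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

public class ScalarBaseline {

    // Hypothetical helper (not from the benchmark): applies the three-step
    // computation element by element and returns the final array k.
    static int[] compute(int[] x, int[] y) {
        int n = x.length;
        int[] z = new int[n];
        int[] w = new int[n];
        int[] k = new int[n];
        for (int i = 0; i < n; i++) {
            z[i] = x[i] + y[i];
            w[i] = x[i] * z[i] * y[i];
            k[i] = z[i] + w[i] * y[i];
        }
        return k;
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3};
        int[] y = {4, 5, 6};
        // prints [85, 357, 981]
        System.out.println(Arrays.toString(compute(x, y)));
    }
}
```

A scalar version like this is a handy reference when checking that the vectorized variants produce the same values before comparing their timings.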
Running this benchmark on an Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz machine with Windows 10 has produced the following results:

Figure 5.9 – Benchmark results
Overall, executing the computation using data-parallel capabilities performs the best, having the highest throughput and the best average time.
106. Applying Vector API to compute FMA
In Java Coding Problems, First Edition, Chapter 1, Problem 38, we covered Fused Multiply Add (FMA). In a nutshell, FMA is the mathematical computation (a*b) + c, which is heavily exploited in matrix multiplications.
Implementing FMA via the Vector API can be done via the fma(float b, float c) or fma(Vector<Float> b, Vector<Float> c) operation. The latter is the one used here.
Let’s assume that we have the following two arrays:
float[] x = new float[]{1f, 2f, 3f, 5f, 1f, 8f};
float[] y = new float[]{4f, 5f, 2f, 8f, 5f, 4f};
Computing FMA(x, y) can be expressed as the following sequence: 4 + 0 = 4 → 10 + 4 = 14 → 6 + 14 = 20 → 40 + 20 = 60 → 5 + 60 = 65 → 32 + 65 = 97. So, FMA(x, y) = 97. Expressing this sequence via the Vector API can be done as in the following code:
public static float vectorFma(float[] x, float[] y) {
  int upperBound = VS.loopBound(x.length);
  FloatVector sum = FloatVector.zero(VS);
  int i = 0;
  for (; i < upperBound; i += VS.length()) {
    FloatVector xVector = FloatVector.fromArray(VS, x, i);
    FloatVector yVector = FloatVector.fromArray(VS, y, i);
    sum = xVector.fma(yVector, sum);
  }
  if (i <= (x.length - 1)) {
    VectorMask<Float> mask = VS.indexInRange(i, x.length);
    FloatVector xVector = FloatVector.fromArray(
      VS, x, i, mask);
    FloatVector yVector = FloatVector.fromArray(
      VS, y, i, mask);
    sum = xVector.fma(yVector, sum);
  }
  float result = sum.reduceLanes(VectorOperators.ADD);
  return result;
}
Have you noticed the code line, sum = xVector.fma(yVector, sum)? Arithmetically, this is equivalent to sum = xVector.mul(yVector).add(sum), except that a fused multiply-add rounds the result only once.
The novelty here consists of the line:
float result = sum.reduceLanes(VectorOperators.ADD);
This is an associative cross-lane reduction operation (see Figure 5.6). Before this line, for a species with four float lanes, the sum vector looks as follows:
sum = [9.0, 42.0, 6.0, 40.0]
By applying reduceLanes(VectorOperators.ADD), we sum the values of this vector and reduce it to the final result, 97.0. Cool, right?!
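To see where the per-lane values and the final 97.0 come from, the lane-wise accumulation can be emulated in plain Java. This is only a sketch: it does not use the Vector API, it assumes four lanes (LANES is a hypothetical constant), and the class and method names are made up for illustration. Each element is folded into lane i % 4 with Math.fma(), and the lanes are then added together as reduceLanes(VectorOperators.ADD) would do.

```java
import java.util.Arrays;

public class FmaLanes {

    // Assumption: a species with four float lanes
    static final int LANES = 4;

    // Scalar running sum: one fused multiply-add per element
    static float scalarFma(float[] x, float[] y) {
        float sum = 0f;
        for (int i = 0; i < x.length; i++) {
            sum = Math.fma(x[i], y[i], sum);
        }
        return sum;
    }

    // Emulates the per-lane accumulation: element i lands in lane i % LANES
    static float[] laneSums(float[] x, float[] y) {
        float[] sum = new float[LANES];
        for (int i = 0; i < x.length; i++) {
            int lane = i % LANES;
            sum[lane] = Math.fma(x[i], y[i], sum[lane]);
        }
        return sum;
    }

    // Emulates the cross-lane ADD reduction
    static float reduceAdd(float[] lanes) {
        float total = 0f;
        for (float v : lanes) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 5f, 1f, 8f};
        float[] y = {4f, 5f, 2f, 8f, 5f, 4f};
        System.out.println(scalarFma(x, y));                 // 97.0
        System.out.println(Arrays.toString(laneSums(x, y))); // [9.0, 42.0, 6.0, 40.0]
        System.out.println(reduceAdd(laneSums(x, y)));       // 97.0
    }
}
```

The scalar chain and the lane-wise version agree here because the products are small whole numbers. With arbitrary floats, the different summation order can produce slightly different rounding.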