Prefetching Panels
First Attempt
Check out the demo-sse-all-asm-try-prefetching branch:
$shell> git branch -a
  demo-naive-sse-with-intrinsics
  demo-naive-sse-with-intrinsics-unrolled
  demo-pure-c
* demo-sse-all-asm
  demo-sse-asm
  demo-sse-asm-unrolled
  demo-sse-asm-unrolled-v2
  demo-sse-asm-unrolled-v3
  demo-sse-intrinsics
  demo-sse-intrinsics-v2
  demo-sse-intrinsics-v3
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-sse-all-asm-try-prefetching remotes/origin/demo-sse-all-asm-try-prefetching
Switched to a new branch 'demo-sse-all-asm-try-prefetching'
Branch demo-sse-all-asm-try-prefetching set up to track remote branch demo-sse-all-asm-try-prefetching from origin.
Then we compile the project
$shell> make
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
make[1]: Nothing to be done for `all'.
make -C test
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Code Modifications
- The macro kernel now also passes pointers to the next panels of A and B to the micro kernel.
- In the micro kernel we add prefetch instructions.
- TODO: More details ...
- TODO: Add link to course material ...
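The two modifications listed above can be sketched in plain C. This is only a minimal sketch and not the code on the branch: the 4x4 micro-tile size, the packed panel layout, the function names `micro_kernel_sketch` and `macro_kernel_sketch`, and the use of the compiler builtin `__builtin_prefetch` (GCC and clang) in place of hand-written prefetch instructions are all assumptions made for illustration.

```c
#include <assert.h>

enum { MR = 4, NR = 4 };  /* assumed micro-tile size, not taken from the branch */

/*
 * Hypothetical micro kernel: AB = A*B for one packed MR x kc panel of A
 * (stored column by column) and one packed kc x NR panel of B (stored
 * row by row).  AB is an MR x NR tile in column-major order.
 */
void
micro_kernel_sketch(long kc, const double *A, const double *B,
                    const double *nextA, const double *nextB, double *AB)
{
    assert(kc >= 0);

    /* Hint that the next call will read these panels (read access, high
       temporal locality).  A prefetch never changes the result. */
    __builtin_prefetch(nextA, 0, 3);
    __builtin_prefetch(nextB, 0, 3);

    for (int i = 0; i < MR*NR; ++i) {
        AB[i] = 0.0;
    }
    for (long l = 0; l < kc; ++l) {        /* one rank-1 update per step */
        for (int j = 0; j < NR; ++j) {
            for (int i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[i] * B[j];
            }
        }
        A += MR;
        B += NR;
    }
}

/*
 * Hypothetical macro kernel: C += A*B for packed blocks, where mc and nc
 * are assumed to be multiples of MR and NR.  For every micro kernel call
 * it also works out which panels the following call will use.
 */
void
macro_kernel_sketch(long mc, long nc, long kc,
                    const double *A, const double *B, double *C, long ldC)
{
    assert(mc % MR == 0 && nc % NR == 0);
    const long mp = mc / MR;
    const long np = nc / NR;
    double AB[MR*NR];

    for (long j = 0; j < np; ++j) {
        for (long i = 0; i < mp; ++i) {
            const double *nextA;
            const double *nextB = &B[j*kc*NR];

            if (i < mp-1) {
                nextA = &A[(i+1)*kc*MR];   /* next panel of A, same panel of B */
            } else {
                nextA = A;                 /* wrap around to the first panel of A */
                nextB = (j < np-1) ? &B[(j+1)*kc*NR] : B;
            }
            micro_kernel_sketch(kc, &A[i*kc*MR], &B[j*kc*NR], nextA, nextB, AB);

            for (int jj = 0; jj < NR; ++jj) {
                for (int ii = 0; ii < MR; ++ii) {
                    C[(i*MR+ii) + (j*NR+jj)*ldC] += AB[ii + jj*MR];
                }
            }
        }
    }
}
```

The point of the sketch is the bookkeeping in the macro kernel: the micro kernel itself cannot know which panels come next, so the caller has to tell it. Whether a prefetch pays off depends on the hardware, which is why the benchmarks below are needed.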
Benchmark Results
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst -N 100 2000 100 > report
and filter out the results for the demo-sse-all-asm-try-prefetching branch:
$shell> grep PASS report > demo-sse-all-asm-try-prefetching
$shell> cat demo-sse-all-asm-try-prefetching
 0 N N  100  100  100 1.0 2000 2000 1.0 2000 0.00 6006.0 3.31 PASS
 1 N N  200  200  200 1.0 2000 2000 1.0 2000 0.00 7407.4 3.81 PASS
 2 N N  300  300  300 1.0 2000 2000 1.0 2000 0.01 7712.1 3.82 PASS
 3 N N  400  400  400 1.0 2000 2000 1.0 2000 0.02 7681.7 3.81 PASS
 4 N N  500  500  500 1.0 2000 2000 1.0 2000 0.03 7940.0 3.98 PASS
 5 N N  600  600  600 1.0 2000 2000 1.0 2000 0.05 8089.4 6.15 PASS
 6 N N  700  700  700 1.0 2000 2000 1.0 2000 0.08 8113.9 7.90 PASS
 7 N N  800  800  800 1.0 2000 2000 1.0 2000 0.13 8124.6 7.69 PASS
 8 N N  900  900  900 1.0 2000 2000 1.0 2000 0.18 8004.8 7.53 PASS
 9 N N 1000 1000 1000 1.0 2000 2000 1.0 2000 0.24 8268.6 7.96 PASS
10 N N 1100 1100 1100 1.0 2000 2000 1.0 2000 0.32 8360.5 7.84 PASS
11 N N 1200 1200 1200 1.0 2000 2000 1.0 2000 0.42 8292.0 7.62 PASS
12 N N 1300 1300 1300 1.0 2000 2000 1.0 2000 0.52 8372.4 7.71 PASS
13 N N 1400 1400 1400 1.0 2000 2000 1.0 2000 0.65 8425.8 7.67 PASS
14 N N 1500 1500 1500 1.0 2000 2000 1.0 2000 0.80 8445.0 7.69 PASS
15 N N 1600 1600 1600 1.0 2000 2000 1.0 2000 0.98 8394.5 7.61 PASS
16 N N 1700 1700 1700 1.0 2000 2000 1.0 2000 1.16 8444.4 7.62 PASS
17 N N 1800 1800 1800 1.0 2000 2000 1.0 2000 1.38 8472.7 7.66 PASS
18 N N 1900 1900 1900 1.0 2000 2000 1.0 2000 1.61 8501.7 7.68 PASS
19 N N 2000 2000 2000 1.0 2000 2000 1.0 2000 1.89 8471.1 7.26 PASS
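The plots in this section use columns 4 and 13 of these result lines. As a sketch of what that selection means, the following C function pulls the same two fields out of one line. The column meanings given in the comment (matrix dimension in column 4, MFLOPS in column 13) are an assumption inferred from the plot labels, and `parse_bench_row` is a hypothetical helper, not part of the benchmark suite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The two fields that the plots use from each result line. */
struct bench_row {
    long   m;        /* column 4: assumed to be the matrix dimension M */
    double mflops;   /* column 13: assumed to be the MFLOPS rate       */
};

/*
 * Hypothetical helper: parse one line of the filtered report.
 * Assumed column order: test number, transA, transB, M, N, K, alpha,
 * lda, ldb, beta, ldc, time, MFLOPS, speedup, result.
 * Returns 1 on success and 0 if the line does not match.
 */
int
parse_bench_row(const char *line, struct bench_row *row)
{
    assert(line != NULL && row != NULL);

    /* "%*..." reads a field and discards it, so only columns 4 and 13
       are stored. */
    int n = sscanf(line,
                   "%*d %*s %*s %ld %*d %*d %*f %*d %*d %*f %*d %*f %lf",
                   &row->m, &row->mflops);
    return n == 2;
}
```

For the last line of the table above, this would yield the pair (2000, 8471.1), which is one point of the curve for this branch.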
With the gnuplot script
set output "bench14.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", \
     "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", \
     "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", \
     "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", \
     "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", \
     "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", \
     "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", \
     "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", \
     "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", \
     "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", \
     "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm", \
     "demo-sse-all-asm-try-prefetching" using 4:13 with linespoints lt 14
we feed gnuplot
$shell> gnuplot bench14.gps
and get the plot bench14.svg.
Second Attempt: Making the Code Size of the kb-Loop Body a Few Bytes Smaller
Check out the demo-sse-all-asm-try-prefetching-v2 branch:
$shell> git branch -a
  demo-naive-sse-with-intrinsics
  demo-naive-sse-with-intrinsics-unrolled
  demo-pure-c
  demo-sse-all-asm
* demo-sse-all-asm-try-prefetching
  demo-sse-asm
  demo-sse-asm-unrolled
  demo-sse-asm-unrolled-v2
  demo-sse-asm-unrolled-v3
  demo-sse-intrinsics
  demo-sse-intrinsics-v2
  demo-sse-intrinsics-v3
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-sse-all-asm-try-prefetching-v2 remotes/origin/demo-sse-all-asm-try-prefetching-v2
Switched to a new branch 'demo-sse-all-asm-try-prefetching-v2'
Branch demo-sse-all-asm-try-prefetching-v2 set up to track remote branch demo-sse-all-asm-try-prefetching-v2 from origin.
Then we compile the project
$shell> make
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
make[1]: Nothing to be done for `all'.
make -C test
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Code Modifications
- We replace the assembler instruction movapd with movaps. Both instructions move 128 bits between aligned memory and an SSE register, but the encoding of movaps lacks the 0x66 prefix byte, so each replaced instruction makes the code one byte smaller.
- We also replace addq with subq. Instead of adding a positive constant we subtract its negative value. This saves another byte.
- TODO: More details ...
- TODO: Add link to course material ...
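To make the size argument concrete, the following C sketch lists the machine code bytes of one instance of each instruction and compares their lengths. The operands are assumptions chosen for illustration (the register %rbx, the address (%rax) and the constant 128 are not taken from the kernel), so the exact number of bytes saved in the real loop body may differ. The addq/subq trick relies on -128 fitting into a sign-extended 8-bit immediate while +128 does not.

```c
#include <assert.h>
#include <stddef.h>

/* movapd (%rax), %xmm0 : 66 0F 28 /r  (0x66 prefix selects packed double) */
const unsigned char enc_movapd[] = { 0x66, 0x0F, 0x28, 0x00 };

/* movaps (%rax), %xmm0 : 0F 28 /r     (same move, no prefix byte) */
const unsigned char enc_movaps[] = { 0x0F, 0x28, 0x00 };

/* addq $128, %rbx  : REX.W 81 /0 imm32  (+128 does not fit in an imm8) */
const unsigned char enc_addq_128[] = { 0x48, 0x81, 0xC3, 0x80, 0x00, 0x00, 0x00 };

/* subq $-128, %rbx : REX.W 83 /5 imm8   (-128 does fit in an imm8) */
const unsigned char enc_subq_m128[] = { 0x48, 0x83, 0xEB, 0x80 };

/* Number of bytes saved when an encoding of length `before` is replaced
   by one of length `after`. */
long
bytes_saved(size_t before, size_t after)
{
    assert(before >= after);
    return (long)(before - after);
}
```

Calling `bytes_saved(sizeof enc_movapd, sizeof enc_movaps)` gives 1 for the move. For the assumed operands the subq form is shorter as well. With constants that already fit into 8 bits, such as 32, the replacement would not change the size at all.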
Benchmark Results
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst -N 100 2000 100 > report
and filter out the results for the demo-sse-all-asm-try-prefetching-v2 branch:
$shell> grep PASS report > demo-sse-all-asm-try-prefetching-v2
$shell> cat demo-sse-all-asm-try-prefetching-v2
 0 N N  100  100  100 1.0 2000 2000 1.0 2000 0.00 6172.8 3.41 PASS
 1 N N  200  200  200 1.0 2000 2000 1.0 2000 0.00 7626.3 3.88 PASS
 2 N N  300  300  300 1.0 2000 2000 1.0 2000 0.01 7988.2 3.95 PASS
 3 N N  400  400  400 1.0 2000 2000 1.0 2000 0.02 7853.7 3.78 PASS
 4 N N  500  500  500 1.0 2000 2000 1.0 2000 0.03 8141.5 3.94 PASS
 5 N N  600  600  600 1.0 2000 2000 1.0 2000 0.05 8188.5 7.15 PASS
 6 N N  700  700  700 1.0 2000 2000 1.0 2000 0.08 8360.2 8.38 PASS
 7 N N  800  800  800 1.0 2000 2000 1.0 2000 0.12 8291.3 7.91 PASS
 8 N N  900  900  900 1.0 2000 2000 1.0 2000 0.17 8407.5 7.90 PASS
 9 N N 1000 1000 1000 1.0 2000 2000 1.0 2000 0.23 8544.3 7.95 PASS
10 N N 1100 1100 1100 1.0 2000 2000 1.0 2000 0.31 8585.8 7.96 PASS
11 N N 1200 1200 1200 1.0 2000 2000 1.0 2000 0.41 8525.7 7.84 PASS
12 N N 1300 1300 1300 1.0 2000 2000 1.0 2000 0.51 8596.0 7.88 PASS
13 N N 1400 1400 1400 1.0 2000 2000 1.0 2000 0.63 8646.6 7.88 PASS
14 N N 1500 1500 1500 1.0 2000 2000 1.0 2000 0.78 8687.8 7.94 PASS
15 N N 1600 1600 1600 1.0 2000 2000 1.0 2000 0.95 8631.0 7.79 PASS
16 N N 1700 1700 1700 1.0 2000 2000 1.0 2000 1.13 8668.8 7.83 PASS
17 N N 1800 1800 1800 1.0 2000 2000 1.0 2000 1.34 8711.3 7.89 PASS
18 N N 1900 1900 1900 1.0 2000 2000 1.0 2000 1.57 8742.5 7.88 PASS
19 N N 2000 2000 2000 1.0 2000 2000 1.0 2000 1.84 8683.8 7.48 PASS
With the gnuplot script
set output "bench15.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", \
     "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", \
     "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", \
     "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", \
     "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", \
     "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", \
     "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", \
     "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", \
     "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", \
     "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", \
     "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm", \
     "demo-sse-all-asm-try-prefetching" using 4:13 with linespoints lt 14, \
     "demo-sse-all-asm-try-prefetching-v2" using 4:13 with linespoints lt 15
we feed gnuplot
$shell> gnuplot bench15.gps
and get the plot bench15.svg.
Third Attempt: Further Improvements
Check out the demo-sse-all-asm-with-prefetching branch:
$shell> git branch -a
  demo-naive-sse-with-intrinsics
  demo-naive-sse-with-intrinsics-unrolled
  demo-pure-c
  demo-sse-all-asm
  demo-sse-all-asm-try-prefetching
* demo-sse-all-asm-try-prefetching-v2
  demo-sse-asm
  demo-sse-asm-unrolled
  demo-sse-asm-unrolled-v2
  demo-sse-asm-unrolled-v3
  demo-sse-intrinsics
  demo-sse-intrinsics-v2
  demo-sse-intrinsics-v3
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-sse-all-asm-with-prefetching remotes/origin/demo-sse-all-asm-with-prefetching
Switched to a new branch 'demo-sse-all-asm-with-prefetching'
Branch demo-sse-all-asm-with-prefetching set up to track remote branch demo-sse-all-asm-with-prefetching from origin.
Then we compile the project
$shell> make
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
make[1]: Nothing to be done for `all'.
make -C test
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Code Modifications
- We replace the assembler instruction movapd with movaps. Both instructions move 128 bits between aligned memory and an SSE register, but the encoding of movaps lacks the 0x66 prefix byte, so each replaced instruction makes the code one byte smaller.
- We also replace addq with subq. Instead of adding a positive constant we subtract its negative value. This saves another byte.
- TODO: More details ...
- TODO: Add link to course material ...
Final Benchmark
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst -N 100 2000 100 > report
and filter out the results for the demo-sse-all-asm-with-prefetching branch:
$shell> grep PASS report > demo-sse-all-asm-with-prefetching
$shell> cat demo-sse-all-asm-with-prefetching
 0 N N  100  100  100 1.0 2000 2000 1.0 2000 0.00 6153.8 3.40 PASS
 1 N N  200  200  200 1.0 2000 2000 1.0 2000 0.00 7659.2 3.90 PASS
 2 N N  300  300  300 1.0 2000 2000 1.0 2000 0.01 8050.1 3.98 PASS
 3 N N  400  400  400 1.0 2000 2000 1.0 2000 0.02 7934.0 3.83 PASS
 4 N N  500  500  500 1.0 2000 2000 1.0 2000 0.03 8213.1 3.97 PASS
 5 N N  600  600  600 1.0 2000 2000 1.0 2000 0.05 8378.4 7.93 PASS
 6 N N  700  700  700 1.0 2000 2000 1.0 2000 0.08 8454.3 8.08 PASS
 7 N N  800  800  800 1.0 2000 2000 1.0 2000 0.12 8392.8 8.04 PASS
 8 N N  900  900  900 1.0 2000 2000 1.0 2000 0.17 8471.5 8.16 PASS
 9 N N 1000 1000 1000 1.0 2000 2000 1.0 2000 0.23 8601.2 8.02 PASS
10 N N 1100 1100 1100 1.0 2000 2000 1.0 2000 0.31 8645.9 7.99 PASS
11 N N 1200 1200 1200 1.0 2000 2000 1.0 2000 0.40 8557.0 7.86 PASS
12 N N 1300 1300 1300 1.0 2000 2000 1.0 2000 0.51 8656.2 7.92 PASS
13 N N 1400 1400 1400 1.0 2000 2000 1.0 2000 0.63 8688.7 7.97 PASS
14 N N 1500 1500 1500 1.0 2000 2000 1.0 2000 0.77 8744.1 8.00 PASS
15 N N 1600 1600 1600 1.0 2000 2000 1.0 2000 0.94 8694.4 7.85 PASS
16 N N 1700 1700 1700 1.0 2000 2000 1.0 2000 1.13 8732.7 7.89 PASS
17 N N 1800 1800 1800 1.0 2000 2000 1.0 2000 1.33 8770.2 7.93 PASS
18 N N 1900 1900 1900 1.0 2000 2000 1.0 2000 1.56 8787.6 7.91 PASS
19 N N 2000 2000 2000 1.0 2000 2000 1.0 2000 1.83 8759.9 7.54 PASS
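To put the three attempts side by side, the relative gain between two MFLOPS rates can be computed as in the following sketch. The helper name `gain_percent` is hypothetical. The rates used in the note below are the N=2000 entries of the three result tables in this section (8471.1, 8683.8 and 8759.9 MFLOPS).

```c
#include <assert.h>

/* Relative gain in percent when the rate goes from `base` to `improved`. */
double
gain_percent(double base, double improved)
{
    assert(base > 0.0);
    return (improved / base - 1.0) * 100.0;
}
```

For N=2000, `gain_percent(8471.1, 8683.8)` is roughly 2.5 and `gain_percent(8471.1, 8759.9)` is roughly 3.4, so the later two attempts improve on the first one by a few percent. Single runs like these are subject to measurement noise, so small differences should be read with care.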
With the gnuplot script
set output "bench16.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", \
     "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", \
     "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", \
     "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", \
     "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", \
     "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", \
     "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", \
     "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", \
     "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", \
     "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", \
     "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm", \
     "demo-sse-all-asm-try-prefetching" using 4:13 with linespoints lt 14, \
     "demo-sse-all-asm-try-prefetching-v2" using 4:13 with linespoints lt 15, \
     "demo-sse-all-asm-with-prefetching" using 4:13 with linespoints lt 16
we feed gnuplot
$shell> gnuplot bench16.gps
and get the plot bench16.svg.