Pure ANSI C Implementation of DGEMM
In the following we present a cache-optimized implementation of the matrix-matrix product. The function dgemm_nn computes operations of the form \(C \leftarrow \beta C + \alpha A B\). All matrices can have arbitrary row and column strides. In particular, each matrix can be stored row-wise or column-wise. It also means that the same function can compute \(C \leftarrow \beta C + \alpha A^T B\), \(C \leftarrow \beta C + \alpha A B^T\) and \(C \leftarrow \beta C + \alpha A^T B^T\): passing a matrix with its row and column strides swapped is equivalent to passing its transpose.
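To make the stride convention concrete, here is a minimal, unoptimized sketch. The function name gemm_ref and its argument order are assumptions made for this illustration; it is not the cache-optimized dgemm_nn. Element \((i,j)\) of a matrix X is assumed to live at X[i*incRowX + j*incColX].

```c
#include <assert.h>

/*
 * Naive reference for C <- beta*C + alpha*A*B with arbitrary strides.
 * Hypothetical helper for illustration only: element (i,j) of X is
 * X[i*incRowX + j*incColX], so swapping incRowX and incColX yields X^T.
 * Unlike a full BLAS routine, C is read even if beta is zero.
 */
static void
gemm_ref(int m, int n, int k, double alpha,
         const double *A, int incRowA, int incColA,
         const double *B, int incRowB, int incColB,
         double beta,
         double *C, int incRowC, int incColC)
{
    int i, j, l;

    assert(m >= 0 && n >= 0 && k >= 0);

    for (i = 0; i < m; ++i) {
        for (j = 0; j < n; ++j) {
            double sum = 0.0;

            /* dot product of row i of A with column j of B */
            for (l = 0; l < k; ++l) {
                sum += A[i*incRowA + l*incColA] * B[l*incRowB + j*incColB];
            }
            C[i*incRowC + j*incColC] = beta*C[i*incRowC + j*incColC]
                                     + alpha*sum;
        }
    }
}
```

For a row-wise stored \(2 \times 3\) matrix A (strides 3 and 1), passing the same pointer with strides 1 and 3 as the second operand computes \(A A^T\) without copying or transposing any data.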
Select the demo-pure-c Branch
Before switching to another branch, run make clean first:
$shell> cd ulmBLAS
$shell> make clean
for dir in src refblas test bench; do make -C $dir clean; done
rm -f auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
rm -f auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
rm -f ../libulmblas.a
rm -f ../libatlulmblas.a
rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
rm -f ../librefblas.a
rm -f dblat1_ref dblat3_ref dblat1_ulm dblat3_ulm *.SUMM
rm -f xdl1blastst libtstatlas.a l1blastst.o ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
Now we switch by checking out the demo-pure-c branch:
$shell> git branch -a
* demo-pure-c
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-pure-c remotes/origin/demo-pure-c
Reset branch 'demo-pure-c'
Branch demo-pure-c set up to track remote branch demo-pure-c from origin.
Your branch is up-to-date with 'origin/demo-pure-c'.
Then we compile the project:
$shell> make
make -C src
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o auxiliary/xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drot.o level1/drot.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_zero.o ATL_zero.c
gfortran -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Building Blocks of dgemm_nn
-
pack_MRxk and pack_A copy row panels from matrix blocks of A into the buffer _A. Details are described in GEMM Packing Matrix A.
-
pack_kxNR and pack_B copy col panels from matrix blocks of B into the buffer _B. Details are described in GEMM Packing Matrix B.
-
dgemm_micro_kernel multiplies a row panel with a col panel. Details are given in GEMM Micro Kernel.
-
dgemm_macro_kernel multiplies the packed blocks in _A and _B by calling the micro kernel for each pair of row panel and col panel. Details are given in GEMM Macro Kernel.
-
dgemm_nn computes \(C \leftarrow \beta C + \alpha A B\) as described in GEMM.
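The packing step can be sketched as follows for full panels. The function names match the list above, but the signatures and the fixed values MR = NR = 4 are assumptions for this illustration, and the zero padding of partial panels done by pack_A and pack_B is left out.

```c
#include <assert.h>

#define MR 4   /* assumed micro block height m_r */
#define NR 4   /* assumed micro block width  n_r */

/*
 * Copy a full MR x k row panel of A into buffer, column by column, so
 * that buffer holds a_0, ..., a_{MR*k-1} in the order the micro kernel
 * reads them. The signature is an assumption for this sketch.
 */
static void
pack_MRxk(int k, const double *A, int incRowA, int incColA, double *buffer)
{
    int i, l;

    assert(k >= 0);

    for (l = 0; l < k; ++l) {
        for (i = 0; i < MR; ++i) {
            buffer[l*MR + i] = A[i*incRowA + l*incColA];
        }
    }
}

/*
 * Copy a full k x NR col panel of B into buffer, row by row.
 * The signature is an assumption for this sketch.
 */
static void
pack_kxNR(int k, const double *B, int incRowB, int incColB, double *buffer)
{
    int j, l;

    assert(k >= 0);

    for (l = 0; l < k; ++l) {
        for (j = 0; j < NR; ++j) {
            buffer[l*NR + j] = B[l*incRowB + j*incColB];
        }
    }
}
```

After packing, both buffers are traversed with unit stride, regardless of how A and B were originally stored.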
The Micro Kernel Algorithm
For the sake of simplicity we assume \(m_r = n_r = 4\) in the description of the algorithm. Note that the pure C implementation works as long as \(m_r\) is a divisor of \(m_c\) and \(n_r\) a divisor of \(n_c\).
Recall that A points to a packed (maybe zero padded) row panel of height four and width \(k_c\). That means we have the column wise stored panel
\[A=\begin{pmatrix}a_0 & a_4 & \dots & a_{4 k_c -4} \\a_1 & a_5 & \dots & a_{4 k_c -3} \\a_2 & a_6 & \dots & a_{4 k_c -2} \\a_3 & a_7 & \dots & a_{4 k_c -1} \\\end{pmatrix}\]
Further B points to a column panel of height \(k_c\) and width four. That can be illustrated as a row wise stored panel
\[B=\begin{pmatrix}b_0 & b_1 & b_2 & b_3 \\b_4 & b_5 & b_6 & b_7 \\\vdots & \vdots & \vdots & \vdots \\b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1} \\\end{pmatrix}\]
For the product \(A \cdot B\) this means
\[\begin{eqnarray*}A \cdot B&=&\begin{pmatrix}a_0 \\a_1 \\a_2 \\a_3 \\\end{pmatrix}\begin{pmatrix}b_0 & b_1 & b_2 & b_3\end{pmatrix}+\begin{pmatrix}a_4 \\a_5 \\a_6 \\a_7 \\\end{pmatrix}\begin{pmatrix}b_4 & b_5 & b_6 & b_7\end{pmatrix}+\dots+\begin{pmatrix}a_{4 k_c-4} \\a_{4 k_c-3} \\a_{4 k_c-2} \\a_{4 k_c-1} \\\end{pmatrix}\begin{pmatrix}b_{4 k_c-4} & b_{4 k_c-3} & b_{4 k_c-2} & b_{4 k_c-1}\end{pmatrix}\\[1cm]&=&\begin{pmatrix}a_0 b_0 & a_0 b_1 & a_0 b_2 & a_0 b_3 \\a_1 b_0 & a_1 b_1 & a_1 b_2 & a_1 b_3 \\a_2 b_0 & a_2 b_1 & a_2 b_2 & a_2 b_3 \\a_3 b_0 & a_3 b_1 & a_3 b_2 & a_3 b_3 \\\end{pmatrix}+\begin{pmatrix}a_4 b_4 & a_4 b_5 & a_4 b_6 & a_4 b_7 \\a_5 b_4 & a_5 b_5 & a_5 b_6 & a_5 b_7 \\a_6 b_4 & a_6 b_5 & a_6 b_6 & a_6 b_7 \\a_7 b_4 & a_7 b_5 & a_7 b_6 & a_7 b_7 \\\end{pmatrix}+\dots+\begin{pmatrix}a_{4 k_c-4} b_{4 k_c-4} & a_{4 k_c-4} b_{4 k_c-3} & a_{4 k_c-4} b_{4 k_c-2} & a_{4 k_c-4} b_{4 k_c-1} \\a_{4 k_c-3} b_{4 k_c-4} & a_{4 k_c-3} b_{4 k_c-3} & a_{4 k_c-3} b_{4 k_c-2} & a_{4 k_c-3} b_{4 k_c-1} \\a_{4 k_c-2} b_{4 k_c-4} & a_{4 k_c-2} b_{4 k_c-3} & a_{4 k_c-2} b_{4 k_c-2} & a_{4 k_c-2} b_{4 k_c-1} \\a_{4 k_c-1} b_{4 k_c-4} & a_{4 k_c-1} b_{4 k_c-3} & a_{4 k_c-1} b_{4 k_c-2} & a_{4 k_c-1} b_{4 k_c-1} \\\end{pmatrix}\end{eqnarray*}\]
We compute \(\mathbf{AB} = A \cdot B\) sequentially:
-
Initialize: \(\mathbf{AB} \leftarrow \mathbf{0}_{4 \times 4}\)
-
For \(l = 0, \dots, k_c-1\) update:
-
\(\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix} \begin{pmatrix} b_{4l} & b_{4l+1} & b_{4l+2} & b_{4l+3}\end{pmatrix} \)
(Note that this loop is the computational hotspot. In principle the overall performance only depends on the efficient update of \(\mathbf{AB}\).)
-
Afterwards we merely update the left hand side micro block \(\tilde{C}\):
-
\(\tilde{C} \leftarrow \beta \tilde{C}\)
-
\(\tilde{C} \leftarrow \tilde{C} + \alpha \mathbf{AB}\)
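The steps above translate almost literally into C. The following is a minimal sketch for \(m_r = n_r = 4\); the name dgemm_micro_kernel_sketch and its signature are assumptions for this illustration and may differ from the dgemm_micro_kernel in the repository.

```c
#include <assert.h>

#define MR 4   /* micro block height, m_r = 4 as in the text */
#define NR 4   /* micro block width,  n_r = 4 as in the text */

/*
 * Sketch of the micro kernel: A is a packed MR x kc row panel (stored
 * column-wise), B a packed kc x NR col panel (stored row-wise), and C
 * the MR x NR micro block with arbitrary strides.
 * Hypothetical name and signature, for illustration only.
 */
static void
dgemm_micro_kernel_sketch(int kc, double alpha,
                          const double *A, const double *B,
                          double beta,
                          double *C, int incRowC, int incColC)
{
    double AB[MR*NR];   /* column-wise: AB(i,j) is AB[i + j*MR] */
    int    i, j, l;

    assert(kc >= 0);

    /* Initialize: AB <- 0 */
    for (l = 0; l < MR*NR; ++l) {
        AB[l] = 0.0;
    }

    /* For l = 0, ..., kc-1: rank-1 update of AB (the hotspot) */
    for (l = 0; l < kc; ++l) {
        for (j = 0; j < NR; ++j) {
            for (i = 0; i < MR; ++i) {
                AB[i + j*MR] += A[i] * B[j];
            }
        }
        A += MR;   /* next column of the row panel */
        B += NR;   /* next row of the col panel    */
    }

    /* C <- beta*C (C is overwritten, not scaled, if beta is zero) */
    if (beta == 0.0) {
        for (j = 0; j < NR; ++j) {
            for (i = 0; i < MR; ++i) {
                C[i*incRowC + j*incColC] = 0.0;
            }
        }
    } else if (beta != 1.0) {
        for (j = 0; j < NR; ++j) {
            for (i = 0; i < MR; ++i) {
                C[i*incRowC + j*incColC] *= beta;
            }
        }
    }

    /* C <- C + alpha*AB */
    for (j = 0; j < NR; ++j) {
        for (i = 0; i < MR; ++i) {
            C[i*incRowC + j*incColC] += alpha * AB[i + j*MR];
        }
    }
}
```

Accumulating into the local array AB rather than directly into \(\tilde{C}\) keeps the innermost loop free of strided accesses, which gives the compiler a chance to hold the sixteen accumulators in registers.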
The dgemm_nn Code (less than 450 lines!)
Benchmark Results
In bench/ we have extracted the benchmark suite from the ATLAS project:
$shell> cd bench
$shell> ./xdl3blastst
./xdl3blastst
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME  MFLOP SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== =====  ===== ==== =====
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 1315.8 1.00 -----
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 2649.0 2.01  PASS
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.01 1961.3 1.00 -----
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.00 3283.4 1.67  PASS
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.03 2019.3 1.00 -----
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.02 3375.6 1.67  PASS
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.06 2066.4 1.00 -----
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.04 3382.2 1.64  PASS
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.12 2064.9 1.00 -----
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.07 3419.3 1.66  PASS
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.34 1270.4 1.00 -----
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.13 3449.5 2.72  PASS
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.65 1060.5 1.00 -----
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.20 3444.5 3.25  PASS
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.97 1050.7 1.00 -----
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.30 3468.3 3.30  PASS
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  1.39 1052.0 1.00 -----
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  0.42 3461.3 3.29  PASS
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  1.72 1164.3 1.00 -----
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  0.57 3494.0 3.00  PASS
10 tests run, 10 passed
Lines with PASS show results for our own implementation. Lines with ----- belong to the reference implementation.
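As a rough guide to reading the MFLOP and SpUp columns, the sketch below assumes the usual operation count of \(2 \cdot M \cdot N \cdot K\) floating point operations for GEMM and takes SpUp to be the ratio of the two MFLOP values of a test. Both helper names are made up for this illustration; the benchmark's own bookkeeping may differ in detail, and the printed TIME is rounded, so recomputed rates only match approximately.

```c
#include <assert.h>

/*
 * MFLOPS rate of an (m x k) times (k x n) product that took the given
 * wall time, assuming 2*m*n*k floating point operations.
 * Hypothetical helper, for illustration only.
 */
static double
gemm_mflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1.0e6;
}

/*
 * Speedup of the tested implementation over the reference, computed
 * from the two MFLOPS rates. Hypothetical helper.
 */
static double
speedup(double mflops_test, double mflops_ref)
{
    assert(mflops_ref > 0.0);
    return mflops_test / mflops_ref;
}
```

For test 9 in the table, speedup(3494.0, 1164.3) gives roughly 3.00, which agrees with the SpUp column.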
We can visualize the benchmarks with gnuplot. First we write the results to a report file and use grep to separate results for our implementation and the reference implementation:
$shell> ./xdl3blastst > report
$shell> grep PASS report > demo-pure-c
$shell> grep "\ \-\-\-\-\-$" report > refBLAS
Then we can use gnuplot to create an SVG file of the benchmark results:
set terminal svg
set output "bench.svg"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set title "Compute C + A*B"
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c"
Feeding gnuplot with this script
$shell> gnuplot bench.gps
creates the plot bench.svg.
Sensitivity to Compilers
Unfortunately, different compilers, compiler versions and flags can have a big influence on the performance. Using SSE intrinsics or inline assembly code in the micro kernel makes the performance far less dependent on the compiler.
Here I demonstrate a significant performance drop when level3/dgemm_nn.c is compiled with clang instead of gcc-4.8. Note that in other cases my clang compiler performs much better optimizations.
$shell> clang --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
$shell> gcc-4.8 --version
gcc-4.8 (Homebrew gcc 4.8.3_1) 4.8.3
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$shell> cd src
$shell> clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
$shell> make
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
$shell> cd ../bench
$shell> make
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
$shell> ./xdl3blastst > report
$shell> cat report
./xdl3blastst
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME  MFLOP SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== =====  ===== ==== =====
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 1798.6 1.00 -----
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 1430.6 0.80  PASS
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.01 1962.0 1.00 -----
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.01 1501.4 0.77  PASS
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.03 2018.6 1.00 -----
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.04 1522.0 0.75  PASS
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.06 2027.7 1.00 -----
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.08 1537.3 0.76  PASS
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.12 2068.5 1.00 -----
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.16 1542.4 0.75  PASS
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.32 1364.4 1.00 -----
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.28 1546.4 1.13  PASS
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.65 1060.4 1.00 -----
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.44 1547.2 1.46  PASS
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.99 1036.1 1.00 -----
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.66 1554.3 1.50  PASS
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  1.38 1056.1 1.00 -----
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  0.94 1553.5 1.47  PASS
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  1.72 1164.4 1.00 -----
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  1.28 1556.4 1.34  PASS
10 tests run, 10 passed
We can visualize this again. First we extract the results for the clang build:
$shell> grep PASS report > demo-pure-c-with-clang
By adapting the gnuplot script
set terminal svg
set output "bench2.svg"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set title "Compute C + A*B"
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", "demo-pure-c-with-clang" using 4:13 with linespoints lt 5 title "demo-pure-c (with clang)"
and feeding gnuplot with this script we obtain the plot bench2.svg:
$shell> gnuplot bench2.gps
Although the performance drops dramatically, we see that it is stable. There is no drop depending on cache sizes as in the refBLAS case. Therefore it still beats the reference implementation if the problem sizes are large enough.