Naive Use of SSE Intrinsics
The implementation presented here uses SSE intrinsics. Naively, I expected the compiler to produce an equivalent implementation at the assembly level when optimizing the demo-pure-c micro kernel. However, no matter which attributes, optimization flags and tricks I tried, the compiler never brought the demo-pure-c micro kernel up to the performance level of this micro kernel.
Select the demo-naive-sse-with-intrinsics Branch
Again, we do a make clean before switching branches:
$shell> cd ulmBLAS
$shell> make clean
for dir in src refblas test bench; do make -C $dir clean; done
rm -f auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
rm -f auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
rm -f ../libulmblas.a
rm -f ../libatlulmblas.a
rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
rm -f ../librefblas.a
rm -f dblat1_ref dblat3_ref dblat1_ulm dblat3_ulm *.SUMM
rm -f xdl1blastst libtstatlas.a l1blastst.o ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
Then we check out the demo-naive-sse-with-intrinsics branch:
$shell> git branch -a
* demo-pure-c
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-naive-sse-with-intrinsics remotes/origin/demo-naive-sse-with-intrinsics
Switched to a new branch 'demo-naive-sse-with-intrinsics'
Branch demo-naive-sse-with-intrinsics set up to track remote branch demo-naive-sse-with-intrinsics from origin.
Then we compile the project:
$shell> make
make -C src
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o auxiliary/xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drot.o level1/drot.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level1/idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
gcc-4.8 -Wall -I. -O2 -msse3 -mfpmath=sse -fomit-frame-pointer -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_zero.o ATL_zero.c
gfortran -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
The Micro Kernel Algorithm
We merely optimize the update step
\[\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix} \begin{pmatrix} b_{4l}, & b_{4l+1}, & b_{4l+2}, & b_{4l+3}\end{pmatrix}\]
by using SSE intrinsics. Looking at the original C code
for (l=0; l<kc; ++l) {
    for (j=0; j<NR; ++j) {
        for (i=0; i<MR; ++i) {
            AB[i+j*MR] += A[i]*B[j];
        }
    }
    A += MR;
    B += NR;
}
we notice that in the innermost loop the value B[j] does not change. The natural idea is to compute this step as
\[\mathbf{AB} \leftarrow \mathbf{AB} + \begin{pmatrix} b_{4l} \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix}, & b_{4l+1} \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix}, & b_{4l+2} \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix}, & b_{4l+3} \begin{pmatrix} a_{4l} \\ a_{4l+1} \\ a_{4l+2} \\ a_{4l+3}\end{pmatrix} \end{pmatrix}\]
Let \(\mathbb{b}_{00}, \mathbb{b}_{11}, \mathbb{b}_{22}, \mathbb{b}_{33}\) and \(\mathbb{a}_{01}, \mathbb{a}_{23}\) denote SSE registers. We use these six registers to store the operands:
\[\begin{array}{llllll}\mathbb{b}_{00} \leftarrow \begin{pmatrix} b_{4l } \\ b_{4l } \end{pmatrix}, &\mathbb{b}_{11} \leftarrow \begin{pmatrix} b_{4l+1} \\ b_{4l+1} \end{pmatrix}, &\mathbb{b}_{22} \leftarrow \begin{pmatrix} b_{4l+2} \\ b_{4l+2} \end{pmatrix}, &\mathbb{b}_{33} \leftarrow \begin{pmatrix} b_{4l+3} \\ b_{4l+3} \end{pmatrix}, &\mathbb{a}_{01} \leftarrow \begin{pmatrix} a_{4l } \\ a_{4l+1} \end{pmatrix}, &\mathbb{a}_{23} \leftarrow \begin{pmatrix} a_{4l+2} \\ a_{4l+3} \end{pmatrix}\end{array}\]
Another eight SSE registers, denoted as \(\mathbb{ab}_{\cdot,\cdot}\), are used to represent \(\mathbf{AB}\):
\[\begin{array}{llll}\mathbb{ab}_{00,10} \leftarrow \begin{pmatrix} ab_{0,0} \\ ab_{1,0} \end{pmatrix}, &\mathbb{ab}_{01,11} \leftarrow \begin{pmatrix} ab_{0,1} \\ ab_{1,1} \end{pmatrix}, &\mathbb{ab}_{02,12} \leftarrow \begin{pmatrix} ab_{0,2} \\ ab_{1,2} \end{pmatrix}, &\mathbb{ab}_{03,13} \leftarrow \begin{pmatrix} ab_{0,3} \\ ab_{1,3} \end{pmatrix} \\[0.5cm]\mathbb{ab}_{20,30} \leftarrow \begin{pmatrix} ab_{2,0} \\ ab_{3,0} \end{pmatrix}, &\mathbb{ab}_{21,31} \leftarrow \begin{pmatrix} ab_{2,1} \\ ab_{3,1} \end{pmatrix}, &\mathbb{ab}_{22,32} \leftarrow \begin{pmatrix} ab_{2,2} \\ ab_{3,2} \end{pmatrix}, &\mathbb{ab}_{23,33} \leftarrow \begin{pmatrix} ab_{2,3} \\ ab_{3,3} \end{pmatrix}\end{array}\]
As our architecture has a total of 16 SSE registers, we have two registers left. We use them for temporary results and denote them as \(\mathbb{tmp}_1\) and \(\mathbb{tmp}_2\).
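The operand layout above can be sketched with SSE2 intrinsics. This is only an illustration, not the code from the branch: the helper name `load_operands` and its signature are hypothetical, and it assumes the packed panel `A` is 16-byte aligned so that the aligned load `_mm_load_pd` may be used. `_mm_load1_pd` reads one double and duplicates it into both halves of a register.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_load1_pd */

/* Hypothetical helper: load one slice (index l) of the packed panels.
   A must be 16-byte aligned because _mm_load_pd is an aligned load. */
static void load_operands(const double *A, const double *B,
                          __m128d *a01, __m128d *a23, __m128d b[4])
{
    *a01 = _mm_load_pd(A);       /* (a_{4l},   a_{4l+1}) */
    *a23 = _mm_load_pd(A + 2);   /* (a_{4l+2}, a_{4l+3}) */

    for (int j = 0; j < 4; ++j) {
        /* b[j] holds (b_{4l+j}, b_{4l+j}); no alignment requirement */
        b[j] = _mm_load1_pd(B + j);
    }
}
```

Together with the eight accumulators, this accounts for 14 of the 16 registers; whether a compiler really keeps all of them in registers is up to its register allocator.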
A single update can now be computed as follows:
- Update the first column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{01}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{23}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{00}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{00}\)
  - \(\mathbb{ab}_{00,10} \leftarrow \mathbb{ab}_{00,10} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{20,30} \leftarrow \mathbb{ab}_{20,30} + \mathbb{tmp}_2\)
- Update the second column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{01}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{23}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{11}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{11}\)
  - \(\mathbb{ab}_{01,11} \leftarrow \mathbb{ab}_{01,11} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{21,31} \leftarrow \mathbb{ab}_{21,31} + \mathbb{tmp}_2\)
- Update the third column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{01}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{23}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{22}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{22}\)
  - \(\mathbb{ab}_{02,12} \leftarrow \mathbb{ab}_{02,12} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{22,32} \leftarrow \mathbb{ab}_{22,32} + \mathbb{tmp}_2\)
- Update the fourth column:
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{a}_{01}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{a}_{23}\)
  - \(\mathbb{tmp}_1 \leftarrow \mathbb{tmp}_1 \odot \mathbb{b}_{33}\)
  - \(\mathbb{tmp}_2 \leftarrow \mathbb{tmp}_2 \odot \mathbb{b}_{33}\)
  - \(\mathbb{ab}_{03,13} \leftarrow \mathbb{ab}_{03,13} + \mathbb{tmp}_1\)
  - \(\mathbb{ab}_{23,33} \leftarrow \mathbb{ab}_{23,33} + \mathbb{tmp}_2\)
Here \(\odot\) denotes the usual component-wise multiplication of SSE registers. We also assume that all the \(\mathbb{ab}\) registers are initialized to zero before the first update step.
Once we have completed the total of \(k_c\) updates we write the result back to memory into \(\mathbf{AB}\).
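A minimal sketch of the whole scheme with SSE2 intrinsics may make the data flow easier to follow. It is not the micro kernel from the branch: the function name `micro_kernel_sse_sketch` and the arrays `ab_top` and `ab_bot` are hypothetical, the column updates are written as a loop over `j` instead of being spelled out four times, and both `A` and `AB` are assumed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics for double precision */

#define MR 4
#define NR 4

/* Sketch only: AB (MR x NR, column major) <- A*B for packed panels
   A (MR x kc) and B (kc x NR). A and AB must be 16-byte aligned. */
static void micro_kernel_sse_sketch(long kc, const double *A,
                                    const double *B, double *AB)
{
    __m128d ab_top[NR];  /* rows 0,1 of column j, e.g. ab_{00,10} */
    __m128d ab_bot[NR];  /* rows 2,3 of column j, e.g. ab_{20,30} */

    /* All accumulators start at zero. */
    for (int j = 0; j < NR; ++j) {
        ab_top[j] = _mm_setzero_pd();
        ab_bot[j] = _mm_setzero_pd();
    }

    for (long l = 0; l < kc; ++l) {
        __m128d a01 = _mm_load_pd(A);      /* (a_{4l},   a_{4l+1}) */
        __m128d a23 = _mm_load_pd(A + 2);  /* (a_{4l+2}, a_{4l+3}) */

        for (int j = 0; j < NR; ++j) {
            __m128d bjj  = _mm_load1_pd(B + j);   /* (b_{4l+j}, b_{4l+j}) */
            __m128d tmp1 = _mm_mul_pd(a01, bjj);  /* component-wise product */
            __m128d tmp2 = _mm_mul_pd(a23, bjj);
            ab_top[j] = _mm_add_pd(ab_top[j], tmp1);
            ab_bot[j] = _mm_add_pd(ab_bot[j], tmp2);
        }
        A += MR;
        B += NR;
    }

    /* After kc updates, write the accumulators back to memory. */
    for (int j = 0; j < NR; ++j) {
        _mm_store_pd(AB + j*MR,     ab_top[j]);
        _mm_store_pd(AB + j*MR + 2, ab_bot[j]);
    }
}
```

Because `MR` is 4, advancing `A` by `MR` doubles moves it by 32 bytes, so every aligned load stays on a 16-byte boundary as long as the panel itself starts on one.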
The dgemm_nn Code
Note that we also added an attribute for 16-byte alignment to the definitions of the local buffers _A, _B, _C and AB. This alignment is required by the aligned load and store intrinsics used in the micro kernel.
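As an illustration of such an alignment attribute, here is a small sketch using the GCC/Clang syntax. The buffer name `AB_buf`, its size and the helper `is_aligned_16` are hypothetical and only serve to show the idea; the declarations in the branch may look different.

```c
#include <stdint.h>  /* uintptr_t */

#define MR 4
#define NR 4

/* GCC/Clang attribute syntax; the name and size are illustrative only. */
static double AB_buf[MR*NR] __attribute__((aligned(16)));

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With 8-byte doubles, every second element of such a buffer lies on a 16-byte boundary, which is why pairs starting at even indices can be accessed with aligned loads and stores.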
Benchmark Results
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst > report
$shell> cat report
./xdl3blastst
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME  MFLOP SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== =====  ===== ==== =====
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 1798.6 1.00 -----
   0 N N  100  100  100   1.0 1000 1000   1.0 1000  0.00 3898.6 2.17  PASS
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.01 1949.6 1.00 -----
   1 N N  200  200  200   1.0 1000 1000   1.0 1000  0.00 4461.8 2.29  PASS
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.03 2018.2 1.00 -----
   2 N N  300  300  300   1.0 1000 1000   1.0 1000  0.01 4634.0 2.30  PASS
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.06 2017.5 1.00 -----
   3 N N  400  400  400   1.0 1000 1000   1.0 1000  0.03 4601.2 2.28  PASS
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.14 1837.7 1.00 -----
   4 N N  500  500  500   1.0 1000 1000   1.0 1000  0.05 4680.6 2.55  PASS
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.32 1337.4 1.00 -----
   5 N N  600  600  600   1.0 1000 1000   1.0 1000  0.09 4740.0 3.54  PASS
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.70  983.0 1.00 -----
   6 N N  700  700  700   1.0 1000 1000   1.0 1000  0.14 4770.6 4.85  PASS
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.98 1045.5 1.00 -----
   7 N N  800  800  800   1.0 1000 1000   1.0 1000  0.22 4752.8 4.55  PASS
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  1.38 1056.5 1.00 -----
   8 N N  900  900  900   1.0 1000 1000   1.0 1000  0.30 4783.0 4.53  PASS
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  1.72 1160.3 1.00 -----
   9 N N 1000 1000 1000   1.0 1000 1000   1.0 1000  0.42 4805.7 4.14  PASS
10 tests run, 10 passed
and filter out the results for the demo-naive-sse-with-intrinsics branch:
$shell> grep PASS report > demo-naive-sse-with-intrinsics
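Each line kept by grep has the whitespace-separated columns of the table header, with M in column 4 and MFLOP in column 13. As a rough sketch of how such a line could be read programmatically, the hypothetical helper `parse_report_line` below extracts M, MFLOP and SpUp with `sscanf`. It assumes the column order shown in the report and is not part of the benchmark suite.

```c
#include <stdio.h>  /* sscanf */

/* Hypothetical helper: parse one data line of the report.
   Returns 1 if all 14 leading fields were read, 0 otherwise. */
static int parse_report_line(const char *line, int *m,
                             double *mflops, double *spup)
{
    int    tst, n, k, lda, ldb, ldc;
    char   transA, transB;
    double alpha, beta, time;

    /* Fields: TST# A B M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp */
    int got = sscanf(line,
                     "%d %c %c %d %d %d %lf %d %d %lf %d %lf %lf %lf",
                     &tst, &transA, &transB, m, &n, &k, &alpha,
                     &lda, &ldb, &beta, &ldc, &time, mflops, spup);
    return got == 14;
}
```

Header and separator lines do not start with an integer, so the helper returns 0 for them.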
With the gnuplot script
set output "bench3.svg"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set title "Compute C + A*B"
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics"
we feed gnuplot
$shell> gnuplot bench3.gps
and get the plot bench3.svg.