Complete Assembler Micro Kernel
The BLIS micro kernel gains another performance boost through prefetching data. In my experiments I was only able to add this feature after converting the whole micro kernel into assembler.
Status Quo: So far only the loop for computing AB was implemented in assembler. The rest of the micro kernel was left untouched and is plain C code. The bridge between the assembler and the C code is the double array AB of length 16. At its end, the assembler code copies its results into this array. The remaining C code then uses AB to compute C <- beta*C + alpha*A*B.
Having the whole micro kernel in assembler removes the need for this internal buffer AB. Note that this alone will not improve performance. But again, it is a prerequisite for adding prefetching effectively.
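To make the role of the buffer concrete, here is a minimal C sketch of the update step that consumes AB. It is not the code from the repository: the name update_from_AB, the stride parameters incRowC and incColC, the 4x4 tile size and the column-wise layout of AB are assumptions (the tile size is merely consistent with the length 16).

```c
#include <assert.h>

enum { MR = 4, NR = 4 };   /* assumed tile size: MR*NR == 16 == length of AB */

/*
 * Hypothetical helper: C <- beta*C + alpha*AB for one MR x NR tile.
 * AB is assumed to be stored column by column; C is addressed through
 * the element strides incRowC and incColC.
 */
static void
update_from_AB(const double AB[MR*NR], double alpha, double beta,
               double *C, long incRowC, long incColC)
{
    assert(C != 0);

    for (long j = 0; j < NR; ++j) {
        for (long i = 0; i < MR; ++i) {
            double *c = &C[i*incRowC + j*incColC];

            /* scale the old entry of C and add the scaled entry of AB */
            *c = beta * (*c) + alpha * AB[i + j*MR];
        }
    }
}
```

For a column-major tile of C with leading dimension 4, a call would look like update_from_AB(AB, alpha, beta, C, 1, 4).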
It is no surprise that performance does not improve. Using a profiler one can see that only a few milliseconds were not spent on computing AB.
Select the demo-sse-all-asm Branch
Again, we do a make clean before switching branches:
$shell> cd ulmBLAS
$shell> make clean
for dir in src refblas test bench; do make -C $dir clean; done
rm -f auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
rm -f auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
rm -f ../libulmblas.a
rm -f ../libatlulmblas.a
rm -f caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
rm -f ../librefblas.a
rm -f dblat1_ref dblat3_ref dblat1_ulm dblat3_ulm *.SUMM
rm -f xdl1blastst libtstatlas.a l1blastst.o ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
Then we check out the demo-sse-all-asm branch:
$shell> git branch -a
  demo-naive-sse-with-intrinsics
  demo-naive-sse-with-intrinsics-unrolled
  demo-pure-c
  demo-sse-asm
  demo-sse-asm-unrolled
  demo-sse-asm-unrolled-v2
* demo-sse-asm-unrolled-v3
  demo-sse-intrinsics
  demo-sse-intrinsics-v2
  demo-sse-intrinsics-v3
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/bench-atlas
  remotes/origin/bench-blis
  remotes/origin/bench-eigen
  remotes/origin/bench-mkl
  remotes/origin/blis-avx-microkernel
  remotes/origin/demo-naive-avx-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics
  remotes/origin/demo-naive-sse-with-intrinsics-unrolled
  remotes/origin/demo-pure-c
  remotes/origin/demo-sse-all-asm
  remotes/origin/demo-sse-all-asm-try-prefetching
  remotes/origin/demo-sse-all-asm-try-prefetching-v2
  remotes/origin/demo-sse-all-asm-with-prefetching
  remotes/origin/demo-sse-asm
  remotes/origin/demo-sse-asm-for-AB-loop
  remotes/origin/demo-sse-asm-unrolled
  remotes/origin/demo-sse-asm-unrolled-v2
  remotes/origin/demo-sse-asm-unrolled-v3
  remotes/origin/demo-sse-asm-unrolled-with-prefetch
  remotes/origin/demo-sse-intrinsics
  remotes/origin/demo-sse-intrinsics-for-AB-loop
  remotes/origin/demo-sse-intrinsics-v2
  remotes/origin/demo-sse-intrinsics-v3
  remotes/origin/demo-with-sse-intrinsics
  remotes/origin/master
  remotes/origin/trsm-assignment
  remotes/origin/trsm-pure-c
$shell> git checkout -B demo-sse-all-asm remotes/origin/demo-sse-all-asm
Switched to a new branch 'demo-sse-all-asm'
Branch demo-sse-all-asm set up to track remote branch demo-sse-all-asm from origin.
Then we compile the project:
$shell> make
make -C src
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o auxiliary/xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level1/idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -c -o level3/stubs.o level3/stubs.c
ar cru ../libulmblas.a auxiliary/xerbla.o level1/dasum.o level1/daxpy.o level1/dcopy.o level1/ddot.o level1/dnrm2.o level1/drot.o level1/drotg.o level1/drotm.o level1/drotmg.o level1/dscal.o level1/dswap.o level1/idamax.o level3/dgemm.o level3/dgemm_nn.o level3/dsymm.o level3/stubs.o
ranlib ../libulmblas.a
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o auxiliary/atl_xerbla.o auxiliary/xerbla.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dasum.o level1/dasum.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_daxpy.o level1/daxpy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dcopy.o level1/dcopy.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_ddot.o level1/ddot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dnrm2.o level1/dnrm2.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drot.o level1/drot.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotg.o level1/drotg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotm.o level1/drotm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_drotmg.o level1/drotmg.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dscal.o level1/dscal.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_dswap.o level1/dswap.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level1/atl_idamax.o level1/idamax.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm.o level3/dgemm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dgemm_nn.o level3/dgemm_nn.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_dsymm.o level3/dsymm.c
clang -Wall -I. -O3 -msse3 -mfpmath=sse -fomit-frame-pointer -DULM_BLOCKED -DFAKE_ATLAS -c -o level3/atl_stubs.o level3/stubs.c
ar cru ../libatlulmblas.a auxiliary/atl_xerbla.o level1/atl_dasum.o level1/atl_daxpy.o level1/atl_dcopy.o level1/atl_ddot.o level1/atl_dnrm2.o level1/atl_drot.o level1/atl_drotg.o level1/atl_drotm.o level1/atl_drotmg.o level1/atl_dscal.o level1/atl_dswap.o level1/atl_idamax.o level3/atl_dgemm.o level3/atl_dgemm_nn.o level3/atl_dsymm.o level3/atl_stubs.o
ranlib ../libatlulmblas.a
make -C refblas
gfortran -fimplicit-none -O3 -c -o caxpy.o caxpy.f
gfortran -fimplicit-none -O3 -c -o ccopy.o ccopy.f
gfortran -fimplicit-none -O3 -c -o cdotc.o cdotc.f
gfortran -fimplicit-none -O3 -c -o cdotu.o cdotu.f
gfortran -fimplicit-none -O3 -c -o cgbmv.o cgbmv.f
gfortran -fimplicit-none -O3 -c -o cgemm.o cgemm.f
gfortran -fimplicit-none -O3 -c -o cgemv.o cgemv.f
gfortran -fimplicit-none -O3 -c -o cgerc.o cgerc.f
gfortran -fimplicit-none -O3 -c -o cgeru.o cgeru.f
gfortran -fimplicit-none -O3 -c -o chbmv.o chbmv.f
gfortran -fimplicit-none -O3 -c -o chemm.o chemm.f
gfortran -fimplicit-none -O3 -c -o chemv.o chemv.f
gfortran -fimplicit-none -O3 -c -o cher.o cher.f
gfortran -fimplicit-none -O3 -c -o cher2.o cher2.f
gfortran -fimplicit-none -O3 -c -o cher2k.o cher2k.f
gfortran -fimplicit-none -O3 -c -o cherk.o cherk.f
gfortran -fimplicit-none -O3 -c -o chpmv.o chpmv.f
gfortran -fimplicit-none -O3 -c -o chpr.o chpr.f
gfortran -fimplicit-none -O3 -c -o chpr2.o chpr2.f
gfortran -fimplicit-none -O3 -c -o crotg.o crotg.f
gfortran -fimplicit-none -O3 -c -o cscal.o cscal.f
gfortran -fimplicit-none -O3 -c -o csrot.o csrot.f
gfortran -fimplicit-none -O3 -c -o csscal.o csscal.f
gfortran -fimplicit-none -O3 -c -o cswap.o cswap.f
gfortran -fimplicit-none -O3 -c -o csymm.o csymm.f
gfortran -fimplicit-none -O3 -c -o csyr2k.o csyr2k.f
gfortran -fimplicit-none -O3 -c -o csyrk.o csyrk.f
gfortran -fimplicit-none -O3 -c -o ctbmv.o ctbmv.f
gfortran -fimplicit-none -O3 -c -o ctbsv.o ctbsv.f
gfortran -fimplicit-none -O3 -c -o ctpmv.o ctpmv.f
gfortran -fimplicit-none -O3 -c -o ctpsv.o ctpsv.f
gfortran -fimplicit-none -O3 -c -o ctrmm.o ctrmm.f
gfortran -fimplicit-none -O3 -c -o ctrmv.o ctrmv.f
gfortran -fimplicit-none -O3 -c -o ctrsm.o ctrsm.f
gfortran -fimplicit-none -O3 -c -o ctrsv.o ctrsv.f
gfortran -fimplicit-none -O3 -c -o dasum.o dasum.f
gfortran -fimplicit-none -O3 -c -o daxpy.o daxpy.f
gfortran -fimplicit-none -O3 -c -o dcabs1.o dcabs1.f
gfortran -fimplicit-none -O3 -c -o dcopy.o dcopy.f
gfortran -fimplicit-none -O3 -c -o ddot.o ddot.f
gfortran -fimplicit-none -O3 -c -o dgbmv.o dgbmv.f
gfortran -fimplicit-none -O3 -c -o dgemm.o dgemm.f
gfortran -fimplicit-none -O3 -c -o dgemv.o dgemv.f
gfortran -fimplicit-none -O3 -c -o dger.o dger.f
gfortran -fimplicit-none -O3 -c -o dnrm2.o dnrm2.f
gfortran -fimplicit-none -O3 -c -o drot.o drot.f
gfortran -fimplicit-none -O3 -c -o drotg.o drotg.f
gfortran -fimplicit-none -O3 -c -o drotm.o drotm.f
gfortran -fimplicit-none -O3 -c -o drotmg.o drotmg.f
gfortran -fimplicit-none -O3 -c -o dsbmv.o dsbmv.f
gfortran -fimplicit-none -O3 -c -o dscal.o dscal.f
gfortran -fimplicit-none -O3 -c -o dsdot.o dsdot.f
gfortran -fimplicit-none -O3 -c -o dspmv.o dspmv.f
gfortran -fimplicit-none -O3 -c -o dspr.o dspr.f
gfortran -fimplicit-none -O3 -c -o dspr2.o dspr2.f
gfortran -fimplicit-none -O3 -c -o dswap.o dswap.f
gfortran -fimplicit-none -O3 -c -o dsymm.o dsymm.f
gfortran -fimplicit-none -O3 -c -o dsymv.o dsymv.f
gfortran -fimplicit-none -O3 -c -o dsyr.o dsyr.f
gfortran -fimplicit-none -O3 -c -o dsyr2.o dsyr2.f
gfortran -fimplicit-none -O3 -c -o dsyr2k.o dsyr2k.f
gfortran -fimplicit-none -O3 -c -o dsyrk.o dsyrk.f
gfortran -fimplicit-none -O3 -c -o dtbmv.o dtbmv.f
gfortran -fimplicit-none -O3 -c -o dtbsv.o dtbsv.f
gfortran -fimplicit-none -O3 -c -o dtpmv.o dtpmv.f
gfortran -fimplicit-none -O3 -c -o dtpsv.o dtpsv.f
gfortran -fimplicit-none -O3 -c -o dtrmm.o dtrmm.f
gfortran -fimplicit-none -O3 -c -o dtrmv.o dtrmv.f
gfortran -fimplicit-none -O3 -c -o dtrsm.o dtrsm.f
gfortran -fimplicit-none -O3 -c -o dtrsv.o dtrsv.f
gfortran -fimplicit-none -O3 -c -o dzasum.o dzasum.f
gfortran -fimplicit-none -O3 -c -o dznrm2.o dznrm2.f
gfortran -fimplicit-none -O3 -c -o icamax.o icamax.f
gfortran -fimplicit-none -O3 -c -o idamax.o idamax.f
gfortran -fimplicit-none -O3 -c -o isamax.o isamax.f
gfortran -fimplicit-none -O3 -c -o izamax.o izamax.f
gfortran -fimplicit-none -O3 -c -o lsame.o lsame.f
gfortran -fimplicit-none -O3 -c -o sasum.o sasum.f
gfortran -fimplicit-none -O3 -c -o saxpy.o saxpy.f
gfortran -fimplicit-none -O3 -c -o scabs1.o scabs1.f
gfortran -fimplicit-none -O3 -c -o scasum.o scasum.f
gfortran -fimplicit-none -O3 -c -o scnrm2.o scnrm2.f
gfortran -fimplicit-none -O3 -c -o scopy.o scopy.f
gfortran -fimplicit-none -O3 -c -o sdot.o sdot.f
gfortran -fimplicit-none -O3 -c -o sdsdot.o sdsdot.f
gfortran -fimplicit-none -O3 -c -o sgbmv.o sgbmv.f
gfortran -fimplicit-none -O3 -c -o sgemm.o sgemm.f
gfortran -fimplicit-none -O3 -c -o sgemv.o sgemv.f
gfortran -fimplicit-none -O3 -c -o sger.o sger.f
gfortran -fimplicit-none -O3 -c -o snrm2.o snrm2.f
gfortran -fimplicit-none -O3 -c -o srot.o srot.f
gfortran -fimplicit-none -O3 -c -o srotg.o srotg.f
gfortran -fimplicit-none -O3 -c -o srotm.o srotm.f
gfortran -fimplicit-none -O3 -c -o srotmg.o srotmg.f
gfortran -fimplicit-none -O3 -c -o ssbmv.o ssbmv.f
gfortran -fimplicit-none -O3 -c -o sscal.o sscal.f
gfortran -fimplicit-none -O3 -c -o sspmv.o sspmv.f
gfortran -fimplicit-none -O3 -c -o sspr.o sspr.f
gfortran -fimplicit-none -O3 -c -o sspr2.o sspr2.f
gfortran -fimplicit-none -O3 -c -o sswap.o sswap.f
gfortran -fimplicit-none -O3 -c -o ssymm.o ssymm.f
gfortran -fimplicit-none -O3 -c -o ssymv.o ssymv.f
gfortran -fimplicit-none -O3 -c -o ssyr.o ssyr.f
gfortran -fimplicit-none -O3 -c -o ssyr2.o ssyr2.f
gfortran -fimplicit-none -O3 -c -o ssyr2k.o ssyr2k.f
gfortran -fimplicit-none -O3 -c -o ssyrk.o ssyrk.f
gfortran -fimplicit-none -O3 -c -o stbmv.o stbmv.f
gfortran -fimplicit-none -O3 -c -o stbsv.o stbsv.f
gfortran -fimplicit-none -O3 -c -o stpmv.o stpmv.f
gfortran -fimplicit-none -O3 -c -o stpsv.o stpsv.f
gfortran -fimplicit-none -O3 -c -o strmm.o strmm.f
gfortran -fimplicit-none -O3 -c -o strmv.o strmv.f
gfortran -fimplicit-none -O3 -c -o strsm.o strsm.f
gfortran -fimplicit-none -O3 -c -o strsv.o strsv.f
gfortran -fimplicit-none -O3 -c -o xerbla.o xerbla.f
gfortran -fimplicit-none -O3 -c -o xerbla_array.o xerbla_array.f
gfortran -fimplicit-none -O3 -c -o zaxpy.o zaxpy.f
gfortran -fimplicit-none -O3 -c -o zcopy.o zcopy.f
gfortran -fimplicit-none -O3 -c -o zdotc.o zdotc.f
gfortran -fimplicit-none -O3 -c -o zdotu.o zdotu.f
gfortran -fimplicit-none -O3 -c -o zdrot.o zdrot.f
gfortran -fimplicit-none -O3 -c -o zdscal.o zdscal.f
gfortran -fimplicit-none -O3 -c -o zgbmv.o zgbmv.f
gfortran -fimplicit-none -O3 -c -o zgemm.o zgemm.f
gfortran -fimplicit-none -O3 -c -o zgemv.o zgemv.f
gfortran -fimplicit-none -O3 -c -o zgerc.o zgerc.f
gfortran -fimplicit-none -O3 -c -o zgeru.o zgeru.f
gfortran -fimplicit-none -O3 -c -o zhbmv.o zhbmv.f
gfortran -fimplicit-none -O3 -c -o zhemm.o zhemm.f
gfortran -fimplicit-none -O3 -c -o zhemv.o zhemv.f
gfortran -fimplicit-none -O3 -c -o zher.o zher.f
gfortran -fimplicit-none -O3 -c -o zher2.o zher2.f
gfortran -fimplicit-none -O3 -c -o zher2k.o zher2k.f
gfortran -fimplicit-none -O3 -c -o zherk.o zherk.f
gfortran -fimplicit-none -O3 -c -o zhpmv.o zhpmv.f
gfortran -fimplicit-none -O3 -c -o zhpr.o zhpr.f
gfortran -fimplicit-none -O3 -c -o zhpr2.o zhpr2.f
gfortran -fimplicit-none -O3 -c -o zrotg.o zrotg.f
gfortran -fimplicit-none -O3 -c -o zscal.o zscal.f
gfortran -fimplicit-none -O3 -c -o zswap.o zswap.f
gfortran -fimplicit-none -O3 -c -o zsymm.o zsymm.f
gfortran -fimplicit-none -O3 -c -o zsyr2k.o zsyr2k.f
gfortran -fimplicit-none -O3 -c -o zsyrk.o zsyrk.f
gfortran -fimplicit-none -O3 -c -o ztbmv.o ztbmv.f
gfortran -fimplicit-none -O3 -c -o ztbsv.o ztbsv.f
gfortran -fimplicit-none -O3 -c -o ztpmv.o ztpmv.f
gfortran -fimplicit-none -O3 -c -o ztpsv.o ztpsv.f
gfortran -fimplicit-none -O3 -c -o ztrmm.o ztrmm.f
gfortran -fimplicit-none -O3 -c -o ztrmv.o ztrmv.f
gfortran -fimplicit-none -O3 -c -o ztrsm.o ztrsm.f
gfortran -fimplicit-none -O3 -c -o ztrsv.o ztrsv.f
ar cru ../librefblas.a caxpy.o ccopy.o cdotc.o cdotu.o cgbmv.o cgemm.o cgemv.o cgerc.o cgeru.o chbmv.o chemm.o chemv.o cher.o cher2.o cher2k.o cherk.o chpmv.o chpr.o chpr2.o crotg.o cscal.o csrot.o csscal.o cswap.o csymm.o csyr2k.o csyrk.o ctbmv.o ctbsv.o ctpmv.o ctpsv.o ctrmm.o ctrmv.o ctrsm.o ctrsv.o dasum.o daxpy.o dcabs1.o dcopy.o ddot.o dgbmv.o dgemm.o dgemv.o dger.o dnrm2.o drot.o drotg.o drotm.o drotmg.o dsbmv.o dscal.o dsdot.o dspmv.o dspr.o dspr2.o dswap.o dsymm.o dsymv.o dsyr.o dsyr2.o dsyr2k.o dsyrk.o dtbmv.o dtbsv.o dtpmv.o dtpsv.o dtrmm.o dtrmv.o dtrsm.o dtrsv.o dzasum.o dznrm2.o icamax.o idamax.o isamax.o izamax.o lsame.o sasum.o saxpy.o scabs1.o scasum.o scnrm2.o scopy.o sdot.o sdsdot.o sgbmv.o sgemm.o sgemv.o sger.o snrm2.o srot.o srotg.o srotm.o srotmg.o ssbmv.o sscal.o sspmv.o sspr.o sspr2.o sswap.o ssymm.o ssymv.o ssyr.o ssyr2.o ssyr2k.o ssyrk.o stbmv.o stbsv.o stpmv.o stpsv.o strmm.o strmv.o strsm.o strsv.o xerbla.o xerbla_array.o zaxpy.o zcopy.o zdotc.o zdotu.o zdrot.o zdscal.o zgbmv.o zgemm.o zgemv.o zgerc.o zgeru.o zhbmv.o zhemm.o zhemv.o zher.o zher2.o zher2k.o zherk.o zhpmv.o zhpr.o zhpr2.o zrotg.o zscal.o zswap.o zsymm.o zsyr2k.o zsyrk.o ztbmv.o ztbsv.o ztpmv.o ztpsv.o ztrmm.o ztrmv.o ztrsm.o ztrsv.o
ranlib ../librefblas.a
make -C test
gfortran dblat1.f -L.. -lrefblas -o dblat1_ref
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lrefblas -o dblat3_ref
gfortran dblat1.f -L.. -lulmblas -o dblat1_ulm
dblat1.f:215.44:
CALL STEST1(DNRM2(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
dblat1.f:219.44:
CALL STEST1(DASUM(N,SX,INCX),STEMP,STEMP,SFAC)
1
Warning: Rank mismatch in argument 'strue1' at (1) (scalar and rank-1)
gfortran dblat3.f -L.. -lulmblas -o dblat3_ulm
make -C bench
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o l1blastst.o l1blastst.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_cputime.o ATL_cputime.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_epsilon.o ATL_epsilon.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77amax.o ATL_f77amax.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77asum.o ATL_f77asum.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77axpy.o ATL_f77axpy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77copy.o ATL_f77copy.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77dot.o ATL_f77dot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77gemm.o ATL_f77gemm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77nrm2.o ATL_f77nrm2.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rot.o ATL_f77rot.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotg.o ATL_f77rotg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotm.o ATL_f77rotm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77rotmg.o ATL_f77rotmg.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77scal.o ATL_f77scal.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77swap.o ATL_f77swap.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77symm.o ATL_f77symm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syr2k.o ATL_f77syr2k.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77syrk.o ATL_f77syrk.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trmm.o ATL_f77trmm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_f77trsm.o ATL_f77trsm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_flushcache.o ATL_flushcache.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gediffnrm1.o ATL_gediffnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_gegen.o ATL_gegen.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_genrm1.o ATL_genrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_infnrm.o ATL_infnrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_rand.o ATL_rand.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_set.o ATL_set.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_synrm.o ATL_synrm.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_trnrm1.o ATL_trnrm1.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_vdiff.o ATL_vdiff.c
gcc-4.8 -c -DL2SIZE=4194304 -DAdd_ -DF77_INTEGER=int -DStringSunStyle -DATL_SSE2 -DDREAL -c -o ATL_zero.o ATL_zero.c
gfortran -c -o ATL_df77wrap.o ATL_df77wrap.f
ar r libtstatlas.a ATL_cputime.o ATL_epsilon.o ATL_f77amax.o ATL_f77asum.o ATL_f77axpy.o ATL_f77copy.o ATL_f77dot.o ATL_f77gemm.o ATL_f77nrm2.o ATL_f77rot.o ATL_f77rotg.o ATL_f77rotm.o ATL_f77rotmg.o ATL_f77scal.o ATL_f77swap.o ATL_f77symm.o ATL_f77syr2k.o ATL_f77syrk.o ATL_f77trmm.o ATL_f77trsm.o ATL_flushcache.o ATL_gediffnrm1.o ATL_gegen.o ATL_genrm1.o ATL_infnrm.o ATL_rand.o ATL_set.o ATL_synrm.o ATL_trnrm1.o ATL_vdiff.o ATL_zero.o ATL_df77wrap.o
ar: creating archive libtstatlas.a
ranlib libtstatlas.a
gfortran -o xdl1blastst l1blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
gfortran -o xdl3blastst l3blastst.o libtstatlas.a ../libatlulmblas.a ../librefblas.a
Outline of the Modification
- In the micro kernel we also switch from int to long. It is just simpler to use only 64 bit registers.
- Registers %xmm8, .., %xmm15 are used to hold the result of A*B.
- Once these registers contain the required values of A*B we begin with the update C <- beta*C + alpha*A*B:
  - %xmm0 is used to hold alpha and %xmm1 to hold beta.
  - The strides of matrix \(C\) must be multiplied by sizeof(double). This gives the stride in bytes.
  - For each element \(C_{i,j}\) we do the following:
    - Load \(C_{i,j}\).
    - Compute \(\left(\alpha A B\right)_{i,j}\).
    - Compute \(\beta C_{i,j}\).
    - Store \(\beta C_{i,j} + \left(\alpha A B\right)_{i,j}\).
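The steps of this outline can be sketched in plain C. This is not the assembler code from the branch; it only mirrors the outline: the strides are converted to bytes once, and each element goes through load, scale by alpha, scale by beta, add and store. The function name update_tile, the 4x4 tile size and the column-wise order of the array ab (which stands in for the registers %xmm8 to %xmm15) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: ab[] stands in for the 16 values held in
 * %xmm8..%xmm15 (8 registers with 2 doubles each). The column-wise
 * order used here is an assumption, not the register layout of the
 * branch.
 */
static void
update_tile(const double ab[16], double alpha, double beta,
            double *C, long incRowC, long incColC)
{
    assert(C != NULL);

    /* strides in elements -> strides in bytes */
    long rowBytes = incRowC * (long)sizeof(double);
    long colBytes = incColC * (long)sizeof(double);

    char *base = (char *)C;

    for (long j = 0; j < 4; ++j) {
        for (long i = 0; i < 4; ++i) {
            /* address of C(i,j) computed with byte strides */
            double *cij = (double *)(base + i*rowBytes + j*colBytes);

            double c     = *cij;                 /* load C(i,j)              */
            double ab_ij = alpha * ab[i + 4*j];  /* compute (alpha*A*B)(i,j) */
            c            = beta * c;             /* compute beta*C(i,j)      */
            *cij         = c + ab_ij;            /* store the sum            */
        }
    }
}
```

Working with byte strides keeps the address computation to additions only, which is what the assembler version needs; in C the compiler would normally do this scaling implicitly.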
The dgemm_nn Code
Benchmark Results
We run the benchmarks
$shell> cd bench
$shell> ./xdl3blastst > report
and filter out the results for the demo-sse-all-asm branch:
$shell> grep PASS report > demo-sse-all-asm
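As a rough cross-check of the numbers in the report, a MFLOPS value for a matrix product can be estimated from the problem size and the run time. The sketch below assumes the common convention of counting 2*M*N*K floating point operations for the product; the benchmark program may count slightly differently, and the function name gemm_mflops is made up for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: estimated MFLOPS of a matrix product with
 * dimensions m, n, k that took `seconds` to run.
 * Assumes a flop count of 2*m*n*k (one multiply and one add per term).
 */
static double
gemm_mflops(long m, long n, long k, double seconds)
{
    assert(seconds > 0.0);

    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1.0e6;
}
```

For example, under this convention a product with M=N=K=1000 that takes one second corresponds to 2000 MFLOPS.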
With the gnuplot script
set output "bench13.svg"
set title "Compute C + A*B"
set xlabel "Matrix dimensions N=M=K"
set ylabel "MFLOPS"
set yrange [0:9600]
set key outside
plot "refBLAS" using 4:13 with linespoints lt 2 title "Netlib RefBLAS", \
     "demo-pure-c" using 4:13 with linespoints lt 4 title "demo-pure-c", \
     "demo-naive-sse-with-intrinsics" using 4:13 with linespoints lt 5 title "demo-naive-sse-with-intrinsics", \
     "demo-naive-sse-with-intrinsics-unrolled" using 4:13 with linespoints lt 6 title "demo-naive-sse-with-intrinsics-unrolled", \
     "demo-sse-intrinsics" using 4:13 with linespoints lt 7 title "demo-sse-intrinsics", \
     "demo-sse-intrinsics-v2" using 4:13 with linespoints lt 8 title "demo-sse-intrinsics-v2", \
     "demo-sse-asm" using 4:13 with linespoints lt 9 title "demo-sse-asm", \
     "demo-sse-asm-unrolled" using 4:13 with linespoints lt 10 title "demo-sse-asm-unrolled", \
     "demo-sse-asm-unrolled-v2" using 4:13 with linespoints lt 11 title "demo-sse-asm-unrolled-v2", \
     "demo-sse-asm-unrolled-v3" using 4:13 with linespoints lt 12 title "demo-sse-asm-unrolled-v3", \
     "demo-sse-all-asm" using 4:13 with linespoints lt 13 title "demo-sse-all-asm"
we feed gnuplot
$shell> gnuplot bench13.gps
and get
Code Size of kb-Loop Body
On Mac OS X you can use otool to disassemble an object file. The dump also shows the address of each instruction, so you can see directly how many bytes each instruction takes. We will use this to determine the code size of the kb-loop in the micro kernel. So we have to look at the label generated from
and the jump
Here is the complete dump of otool:
$shell> cd ../src/level3 $shell> otool -dtV dgemm_nn.o dgemm_nn.o: (__TEXT,__text) section _ULM_dgemm_nn: 0000000000000000 pushq %rbp 0000000000000001 pushq %r15 0000000000000003 pushq %r14 0000000000000005 pushq %r13 0000000000000007 pushq %r12 0000000000000009 pushq %rbx 000000000000000a subq $0x288, %rsp 0000000000000011 movsd %xmm1, 0x208(%rsp) 000000000000001a movl %r9d, 0x44(%rsp) 000000000000001f movsd %xmm0, 0x1d0(%rsp) 0000000000000028 leal 0xfff(%rsi), %eax 000000000000002e movq %rsi, %rbx 0000000000000031 movq %rdx, %rsi 0000000000000034 sarl $0x1f, %eax 0000000000000037 shrl $0x14, %eax 000000000000003a xorps %xmm2, %xmm2 000000000000003d ucomisd %xmm2, %xmm0 0000000000000041 movl 0x2e8(%rsp), %r12d 0000000000000049 movl 0x2e0(%rsp), %r15d 0000000000000051 movq 0x2d8(%rsp), %r14 0000000000000059 jne 0x5d 000000000000005b jnp 0x65 000000000000005d testl %esi, %esi 000000000000005f jne 0x110 0000000000000065 testl %ebx, %ebx 0000000000000067 setg %cl 000000000000006a testl %edi, %edi 000000000000006c setg %al 000000000000006f andb %cl, %al 0000000000000071 ucomisd %xmm2, %xmm1 0000000000000075 jne 0x7b 0000000000000077 jp 0x7b 0000000000000079 jmp 0xc8 000000000000007b testb %al, %al 000000000000007d je .DPOSTACCUMULATE1+1034 0000000000000083 xorl %eax, %eax 0000000000000085 xorl %ecx, %ecx 0000000000000087 nopw _ULM_dgemm_nn(%rax,%rax) 0000000000000090 movl %eax, %edx 0000000000000092 movl %edi, %esi 0000000000000094 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000a0 movslq %edx, %rdx 00000000000000a3 movsd _ULM_dgemm_nn(%r14,%rdx,8), %xmm0 00000000000000a9 mulsd %xmm1, %xmm0 00000000000000ad movsd %xmm0, _ULM_dgemm_nn(%r14,%rdx,8) 00000000000000b3 addl %r15d, %edx 00000000000000b6 decl %esi 00000000000000b8 jne 0xa0 00000000000000ba incl %ecx 00000000000000bc addl %r12d, %eax 00000000000000bf cmpl %ebx, %ecx 00000000000000c1 jne 0x90 00000000000000c3 jmpq .DPOSTACCUMULATE1+1034 00000000000000c8 testb %al, %al 00000000000000ca je .DPOSTACCUMULATE1+1034 
00000000000000d0 xorl %eax, %eax 00000000000000d2 xorl %ecx, %ecx 00000000000000d4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000e0 movl %eax, %edx 00000000000000e2 movl %edi, %esi 00000000000000e4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000000f0 movslq %edx, %rdx 00000000000000f3 movq $_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rdx,8) 00000000000000fb addl %r15d, %edx 00000000000000fe decl %esi 0000000000000100 jne 0xf0 0000000000000102 incl %ecx 0000000000000104 addl %r12d, %eax 0000000000000107 cmpl %ebx, %ecx 0000000000000109 jne 0xe0 000000000000010b jmpq .DPOSTACCUMULATE1+1034 0000000000000110 movq %rcx, 0x130(%rsp) 0000000000000118 testl %ebx, %ebx 000000000000011a jle .DPOSTACCUMULATE1+1034 0000000000000120 movq %rdi, %r9 0000000000000123 movq %r9, 0x90(%rsp) 000000000000012b leal 0x17f(%r9), %ecx 0000000000000132 movslq %ecx, %rcx 0000000000000135 imulq $0x2aaaaaab, %rcx, %r11 000000000000013c movq %r11, %rcx 000000000000013f shrq $0x3f, %rcx 0000000000000143 sarq $0x26, %r11 0000000000000147 addl %ecx, %r11d 000000000000014a movq %r11, 0x118(%rsp) 0000000000000152 movq %rbx, %rdi 0000000000000155 leal 0xfff(%rdi,%rax), %r10d 000000000000015d sarl $0xc, %r10d 0000000000000161 movq %r10, 0x18(%rsp) 0000000000000166 leal 0x17f(%rsi), %eax 000000000000016c cltq 000000000000016e imulq $0x2aaaaaab, %rax, %rbx 0000000000000175 movq %rbx, %rax 0000000000000178 shrq $0x3f, %rax 000000000000017c sarq $0x26, %rbx 0000000000000180 addl %eax, %ebx 0000000000000182 movq %rbx, 0x80(%rsp) 000000000000018a movslq %r9d, %rdx 000000000000018d imulq $0x2aaaaaab, %rdx, %rax 0000000000000194 movq %rax, %rcx 0000000000000197 shrq $0x3f, %rcx 000000000000019b sarq $0x26, %rax 000000000000019f addl %ecx, %eax 00000000000001a1 imull $0x180, %eax, %eax 00000000000001a7 subl %eax, %edx 00000000000001a9 movq %rdx, 0x128(%rsp) 00000000000001b1 movl %edi, %eax 00000000000001b3 sarl $0x1f, %eax 00000000000001b6 shrl $0x14, %eax 00000000000001b9 addl %edi, %eax 00000000000001bb andl 
$0xfffff000, %eax 00000000000001c0 subl %eax, %edi 00000000000001c2 movq %rdi, 0x28(%rsp) 00000000000001c7 movslq %esi, %rdx 00000000000001ca imulq $0x2aaaaaab, %rdx, %rax 00000000000001d1 movq %rax, %rcx 00000000000001d4 shrq $0x3f, %rcx 00000000000001d8 sarq $0x26, %rax 00000000000001dc addl %ecx, %eax 00000000000001de imull $0x180, %eax, %eax 00000000000001e4 subl %eax, %edx 00000000000001e6 movq %rdx, 0xa0(%rsp) 00000000000001ee movl 0x2d0(%rsp), %ebp 00000000000001f5 leal -0x1(%r10), %eax 00000000000001f9 movl %eax, 0x24(%rsp) 00000000000001fd leal -0x1(%rbx), %eax 0000000000000200 movl %eax, 0x9c(%rsp) 0000000000000207 movslq 0x2c8(%rsp), %rdx 000000000000020f movq %rdx, 0xc0(%rsp) 0000000000000217 leal _ULM_dgemm_nn(,%rbp,4), %r10d 000000000000021f leal -0x1(%r11), %ecx 0000000000000223 movl %ecx, 0x124(%rsp) 000000000000022a movslq 0x44(%rsp), %r13 000000000000022f movq %r13, 0x168(%rsp) 0000000000000237 movq %r8, 0x170(%rsp) 000000000000023f leal _ULM_dgemm_nn(,%r8,4), %r9d 0000000000000247 movslq %r15d, %rcx 000000000000024a movq %rcx, 0x1c8(%rsp) 0000000000000252 movslq %r12d, %rcx 0000000000000255 movq %rcx, 0x1c0(%rsp) 000000000000025d movl %ebp, %ecx 000000000000025f imull $0x180, %edx, %eax 0000000000000265 movl %eax, 0x7c(%rsp) 0000000000000269 movslq %r10d, %rax 000000000000026c movq %rax, 0x8(%rsp) 0000000000000271 leaq _ULM_dgemm_nn(,%rax,8), %rax 0000000000000279 movq %rax, 0xf8(%rsp) 0000000000000281 leaq _ULM_dgemm_nn(,%rdx,8), %rdi 0000000000000289 movq %rdi, 0x70(%rsp) 000000000000028e leal _ULM_dgemm_nn(%rbp,%rbp,2), %r11d 0000000000000293 leal _ULM_dgemm_nn(%rbp,%rbp), %edx 0000000000000297 movslq %ebp, %rbx 000000000000029a movq %rbx, 0x68(%rsp) 000000000000029f imull $0x180, %r13d, %eax 00000000000002a6 movl %eax, 0x64(%rsp) 00000000000002aa imull $0x180, %r8d, %eax 00000000000002b1 movl %eax, 0x114(%rsp) 00000000000002b8 movslq %r9d, %rax 00000000000002bb movq %rax, 0xe0(%rsp) 00000000000002c3 leaq _ULM_dgemm_nn(,%rax,8), %rax 
00000000000002cb movq %rax, 0xd8(%rsp)
00000000000002d3 leaq _ULM_dgemm_nn(,%r13,8), %rax
00000000000002db movq %rax, 0x140(%rsp)
00000000000002e3 leal _ULM_dgemm_nn(%r8,%r8,2), %r9d
00000000000002e7 movq %rsi, %rax
00000000000002ea movq %rax, 0x88(%rsp)
00000000000002f2 leal _ULM_dgemm_nn(%r8,%r8), %esi
00000000000002f6 movslq %r8d, %rbp
00000000000002f9 movq %rbp, 0x138(%rsp)
0000000000000301 xorl %ebp, %ebp
0000000000000303 movq %rbp, 0x38(%rsp)
0000000000000308 shll $0xc, %ecx
000000000000030b movl %ecx, 0x14(%rsp)
000000000000030f xorl %ecx, %ecx
0000000000000311 movslq %r11d, %r11
0000000000000314 movq %r11, 0x58(%rsp)
0000000000000319 movslq %edx, %rdx
000000000000031c movq %rdx, 0xf0(%rsp)
0000000000000324 movslq %r9d, %rdx
0000000000000327 movq %rdx, 0x180(%rsp)
000000000000032f movslq %esi, %rdx
0000000000000332 movq %rdx, 0x178(%rsp)
000000000000033a movq %rax, %rsi
000000000000033d nopl _ULM_dgemm_nn(%rax)
0000000000000340 movl %ecx, 0x34(%rsp)
0000000000000344 movq 0x28(%rsp), %rdx
0000000000000349 testl %edx, %edx
000000000000034b sete %al
000000000000034e cmpl 0x24(%rsp), %ecx
0000000000000352 setne %cl
0000000000000355 orb %al, %cl
0000000000000357 movl $0x1000, %eax
000000000000035c cmovnel %eax, %edx
000000000000035f movq %rdx, %rcx
0000000000000362 testl %esi, %esi
0000000000000364 movq %rdi, %rdx
0000000000000367 jle .DPOSTACCUMULATE1+998
000000000000036d movl 0x34(%rsp), %edi
0000000000000371 shll $0xc, %edi
0000000000000374 movl %edi, %esi
0000000000000376 movl 0x2d0(%rsp), %eax
000000000000037d imull %eax, %esi
0000000000000380 movl %esi, 0xa8(%rsp)
0000000000000387 movq %rcx, %rsi
000000000000038a movq %rsi, 0x148(%rsp)
0000000000000392 movl %esi, %eax
0000000000000394 sarl $0x1f, %eax
0000000000000397 shrl $0x1e, %eax
000000000000039a addl %esi, %eax
000000000000039c movl %eax, %ecx
000000000000039e sarl $0x2, %ecx
00000000000003a1 movq %rcx, 0x100(%rsp)
00000000000003a9 andl $-0x4, %eax
00000000000003ac movl %esi, %r8d
00000000000003af subl %eax, %r8d
00000000000003b2 movq %r8, 0x1a0(%rsp)
00000000000003ba negl %eax
00000000000003bc leal -0x1(%rcx), %ecx
00000000000003bf cmpl $0x7, %esi
00000000000003c2 leaq 0x8(,%rcx,8), %rbp
00000000000003ca movl $0x8, %ecx
00000000000003cf cmovleq %rcx, %rbp
00000000000003d3 movq %rbp, 0x50(%rsp)
00000000000003d8 imulq 0x8(%rsp), %rbp
00000000000003de movq %rbp, 0x48(%rsp)
00000000000003e3 movslq %r8d, %rcx
00000000000003e6 movq %rcx, 0xd0(%rsp)
00000000000003ee leal 0x1(%rsi,%rax), %eax
00000000000003f2 cmpl $0x4, %eax
00000000000003f5 movl $0x3, %eax
00000000000003fa cmovgl %r8d, %eax
00000000000003fe subl %ecx, %eax
0000000000000400 leaq 0x8(,%rax,8), %rax
0000000000000408 movq %rax, 0xc8(%rsp)
0000000000000410 imull %r12d, %edi
0000000000000414 movl %edi, 0xec(%rsp)
000000000000041b leal 0x3(%rsi), %eax
000000000000041e sarl $0x1f, %eax
0000000000000421 shrl $0x1e, %eax
0000000000000424 leal 0x3(%rsi,%rax), %eax
0000000000000428 sarl $0x2, %eax
000000000000042b movq %rax, 0x190(%rsp)
0000000000000433 leal -0x1(%rax), %eax
0000000000000436 movl %eax, 0x18c(%rsp)
000000000000043d xorl %eax, %eax
000000000000043f movq %rax, 0xb8(%rsp)
0000000000000447 movq 0x38(%rsp), %rax
000000000000044c movl %eax, %r13d
000000000000044f xorl %edi, %edi
0000000000000451 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000460 movq 0xa0(%rsp), %rsi
0000000000000468 testl %esi, %esi
000000000000046a sete %al
000000000000046d cmpl 0x9c(%rsp), %edi
0000000000000474 setne %cl
0000000000000477 orb %al, %cl
0000000000000479 movl $0x180, %eax
000000000000047e cmovnel %eax, %esi
0000000000000481 movq %rsi, 0x1b0(%rsp)
0000000000000489 testl %edi, %edi
000000000000048b je 0x498
000000000000048d movsd 0x139b(%rip), %xmm0
0000000000000495 movaps %xmm0, %xmm1
0000000000000498 imull $0x180, %edi, %eax
000000000000049e movl %eax, 0x154(%rsp)
00000000000004a5 movq %rdi, 0xb0(%rsp)
00000000000004ad imull 0x2c8(%rsp), %eax
00000000000004b5 addl 0xa8(%rsp), %eax
00000000000004bc movq 0x148(%rsp), %rcx
00000000000004c4 cmpl $0x4, %ecx
00000000000004c7 cltq
00000000000004c9 movq 0x2c0(%rsp), %rcx
00000000000004d1 leaq _ULM_dgemm_nn(%rcx,%rax,8), %r12
00000000000004d5 jl 0x5e0
00000000000004db movq 0x1b0(%rsp), %rcx
00000000000004e3 leal _ULM_dgemm_nn(,%rcx,4), %eax
00000000000004ea cltq
00000000000004ec movq %rax, 0x238(%rsp)
00000000000004f4 testl %ecx, %ecx
00000000000004f6 jle 0x5a0
00000000000004fc movslq %r13d, %rax
00000000000004ff movq 0x2c0(%rsp), %rcx
0000000000000507 leaq _ULM_dgemm_nn(%rcx,%rax,8), %r9
000000000000050b movq 0x238(%rsp), %rax
0000000000000513 leaq _ULM_dgemm_nn(,%rax,8), %r8
000000000000051b xorl %eax, %eax
000000000000051d leaq __B(%rip), %rsi
0000000000000524 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000530 movq %r9, %rdi
0000000000000533 xorl %ebp, %ebp
0000000000000535 movq 0x1b0(%rsp), %rcx
000000000000053d movq 0xf0(%rsp), %r10
0000000000000545 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000550 movsd _ULM_dgemm_nn(%rdi), %xmm0
0000000000000554 movsd %xmm0, _ULM_dgemm_nn(%rsi,%rbp)
0000000000000559 movsd _ULM_dgemm_nn(%rdi,%rbx,8), %xmm0
000000000000055e movsd %xmm0, 0x8(%rsi,%rbp)
0000000000000564 movsd _ULM_dgemm_nn(%rdi,%r10,8), %xmm0
000000000000056a movsd %xmm0, 0x10(%rsi,%rbp)
0000000000000570 movsd _ULM_dgemm_nn(%rdi,%r11,8), %xmm0
0000000000000576 movsd %xmm0, 0x18(%rsi,%rbp)
000000000000057c addq $0x20, %rbp
0000000000000580 addq %rdx, %rdi
0000000000000583 decl %ecx
0000000000000585 jne 0x550
0000000000000587 incl %eax
0000000000000589 addq %r8, %rsi
000000000000058c addq 0xf8(%rsp), %r9
0000000000000594 movq 0x100(%rsp), %rcx
000000000000059c cmpl %ecx, %eax
000000000000059e jl 0x530
00000000000005a0 movl %r13d, 0xac(%rsp)
00000000000005a8 movq 0x238(%rsp), %rax
00000000000005b0 movq %rax, %rcx
00000000000005b3 imulq 0x50(%rsp), %rcx
00000000000005b9 leaq __B(%rip), %rax
00000000000005c0 addq %rax, %rcx
00000000000005c3 movq %rcx, 0x238(%rsp)
00000000000005cb addq 0x48(%rsp), %r12
00000000000005d0 movq 0x1a0(%rsp), %r8
00000000000005d8 jmp 0x5f7
00000000000005da nopw _ULM_dgemm_nn(%rax,%rax)
00000000000005e0 movl %r13d, 0xac(%rsp)
00000000000005e8 leaq __B(%rip), %rax
00000000000005ef movq %rax, 0x238(%rsp)
00000000000005f7 movsd %xmm1, 0x1e8(%rsp)
0000000000000600 testl %r8d, %r8d
0000000000000603 jle 0x6b0
0000000000000609 movq 0x1b0(%rsp), %rax
0000000000000611 testl %eax, %eax
0000000000000613 movl $_ULM_dgemm_nn, %r13d
0000000000000619 movq 0x238(%rsp), %rbp
0000000000000621 jle 0x6b0
0000000000000627 nopw _ULM_dgemm_nn(%rax,%rax)
0000000000000630 movq 0xd0(%rsp), %rax
0000000000000638 leaq _ULM_dgemm_nn(%rax,%r13,4), %rax
000000000000063c movq 0x238(%rsp), %rcx
0000000000000644 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rdi
0000000000000648 xorl %eax, %eax
000000000000064a xorl %ecx, %ecx
000000000000064c movl 0x2d0(%rsp), %edx
0000000000000653 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000660 cltq
0000000000000662 movsd _ULM_dgemm_nn(%r12,%rax,8), %xmm0
0000000000000668 movsd %xmm0, _ULM_dgemm_nn(%rbp,%rcx,8)
000000000000066e incq %rcx
0000000000000671 addl %edx, %eax
0000000000000673 cmpl %ecx, %r8d
0000000000000676 jne 0x660
0000000000000678 movq 0xc8(%rsp), %rsi
0000000000000680 movq %r8, %rbx
0000000000000683 callq ___bzero
0000000000000688 movq %rbx, %r8
000000000000068b addq $0x20, %rbp
000000000000068f movq 0xc0(%rsp), %rax
0000000000000697 leaq _ULM_dgemm_nn(%r12,%rax,8), %r12
000000000000069b incq %r13
000000000000069e movq 0x1b0(%rsp), %rax
00000000000006a6 cmpl %eax, %r13d
00000000000006a9 jne 0x630
00000000000006ab nopl _ULM_dgemm_nn(%rax,%rax)
00000000000006b0 movq 0x90(%rsp), %rax
00000000000006b8 testl %eax, %eax
00000000000006ba movsd 0x208(%rsp), %xmm1
00000000000006c3 movq 0x140(%rsp), %rbp
00000000000006cb movq 0x138(%rsp), %rbx
00000000000006d3 jle .DPOSTACCUMULATE1+897
00000000000006d9 movl 0x154(%rsp), %eax
00000000000006e0 imull 0x44(%rsp), %eax
00000000000006e5 movl %eax, 0x154(%rsp)
00000000000006ec movq 0x1b0(%rsp), %rax
00000000000006f4 leal _ULM_dgemm_nn(,%rax,4), %ecx
00000000000006fb movl %ecx, 0x22c(%rsp)
0000000000000702 movslq %ecx, %rcx
0000000000000705 movq %rcx, 0x108(%rsp)
000000000000070d movslq %eax, %rdx
0000000000000710 movq %rdx, %rax
0000000000000713 sarq $0x3f, %rax
0000000000000717 shrq $0x3e, %rax
000000000000071b addq %rdx, %rax
000000000000071e movq %rax, %rsi
0000000000000721 sarq $0x2, %rsi
0000000000000725 movq %rsi, 0x1d8(%rsp)
000000000000072d andq $-0x4, %rax
0000000000000731 subq %rax, %rdx
0000000000000734 movq %rdx, 0x1e0(%rsp)
000000000000073c leaq _ULM_dgemm_nn(,%rcx,8), %rax
0000000000000744 movq %rax, 0x198(%rsp)
000000000000074c movq 0xb8(%rsp), %rax
0000000000000754 movl %eax, %esi
0000000000000756 xorl %edi, %edi
0000000000000758 nopl _ULM_dgemm_nn(%rax,%rax)
0000000000000760 movq 0x128(%rsp), %rdx
0000000000000768 testl %edx, %edx
000000000000076a sete %al
000000000000076d cmpl 0x124(%rsp), %edi
0000000000000774 setne %cl
0000000000000777 orb %al, %cl
0000000000000779 movl $0x180, %eax
000000000000077e cmovnel %eax, %edx
0000000000000781 imull $0x180, %edi, %r8d
0000000000000788 movl %r8d, %eax
000000000000078b movq 0x170(%rsp), %rcx
0000000000000793 imull %ecx, %eax
0000000000000796 addl 0x154(%rsp), %eax
000000000000079d movl %edx, %ecx
000000000000079f sarl $0x1f, %ecx
00000000000007a2 shrl $0x1e, %ecx
00000000000007a5 addl %edx, %ecx
00000000000007a7 cmpl $0x4, %edx
00000000000007aa cltq
00000000000007ac movq 0x130(%rsp), %r9
00000000000007b4 leaq _ULM_dgemm_nn(%r9,%rax,8), %r13
00000000000007b8 jl 0x8f0
00000000000007be movl %r8d, 0x21c(%rsp)
00000000000007c6 movq %rdi, 0x158(%rsp)
00000000000007ce movslq %esi, %rax
00000000000007d1 movl %esi, 0x164(%rsp)
00000000000007d8 movq %rdx, 0x1a8(%rsp)
00000000000007e0 leaq _ULM_dgemm_nn(%r9,%rax,8), %r12
00000000000007e4 movl %ecx, %r10d
00000000000007e7 movl %ecx, 0x230(%rsp)
00000000000007ee sarl $0x2, %r10d
00000000000007f2 leal -0x1(%r10), %eax
00000000000007f6 cmpl $0x7, %edx
00000000000007f9 leaq 0x8(,%rax,8), %rcx
0000000000000801 movl $0x8, %eax
0000000000000806 cmovleq %rax, %rcx
000000000000080a movq %rcx, %rax
000000000000080d imulq 0xe0(%rsp), %rax
0000000000000816 movq %rax, 0x220(%rsp)
000000000000081e imulq 0x108(%rsp), %rcx
0000000000000827 movq %rcx, 0x238(%rsp)
000000000000082f xorl %edi, %edi
0000000000000831 leaq __A(%rip), %rsi
0000000000000838 movq 0xd8(%rsp), %r11
0000000000000840 movq 0x1b0(%rsp), %rdx
0000000000000848 testl %edx, %edx
000000000000084a movq %r12, %rcx
000000000000084d movl $_ULM_dgemm_nn, %eax
0000000000000852 movq 0x180(%rsp), %r8
000000000000085a movq 0x178(%rsp), %r9
0000000000000862 jle 0x8a7
0000000000000864 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000870 movsd _ULM_dgemm_nn(%rcx), %xmm0
0000000000000874 movsd %xmm0, _ULM_dgemm_nn(%rsi,%rax)
0000000000000879 movsd _ULM_dgemm_nn(%rcx,%rbx,8), %xmm0
000000000000087e movsd %xmm0, 0x8(%rsi,%rax)
0000000000000884 movsd _ULM_dgemm_nn(%rcx,%r9,8), %xmm0
000000000000088a movsd %xmm0, 0x10(%rsi,%rax)
0000000000000890 movsd _ULM_dgemm_nn(%rcx,%r8,8), %xmm0
0000000000000896 movsd %xmm0, 0x18(%rsi,%rax)
000000000000089c addq $0x20, %rax
00000000000008a0 addq %rbp, %rcx
00000000000008a3 decl %edx
00000000000008a5 jne 0x870
00000000000008a7 incl %edi
00000000000008a9 addq 0x198(%rsp), %rsi
00000000000008b1 addq %r11, %r12
00000000000008b4 cmpl %r10d, %edi
00000000000008b7 jl 0x840
00000000000008b9 addq 0x220(%rsp), %r13
00000000000008c1 leaq __A(%rip), %rax
00000000000008c8 addq %rax, 0x238(%rsp)
00000000000008d0 movq 0x1a8(%rsp), %rdx
00000000000008d8 movl 0x230(%rsp), %ecx
00000000000008df jmp 0x916
00000000000008e1 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
00000000000008f0 movl %r8d, 0x21c(%rsp)
00000000000008f8 movq %rdi, 0x158(%rsp)
0000000000000900 movl %esi, 0x164(%rsp)
0000000000000907 leaq __A(%rip), %rax
000000000000090e movq %rax, 0x238(%rsp)
0000000000000916 movq %rdx, 0x1a8(%rsp)
000000000000091e andl $-0x4, %ecx
0000000000000921 subl %ecx, %edx
0000000000000923 movq %rdx, 0x210(%rsp)
000000000000092b testl %edx, %edx
000000000000092d jle 0xa10
0000000000000933 movq 0x1b0(%rsp), %rax
000000000000093b testl %eax, %eax
000000000000093d jle 0xa10
0000000000000943 movq 0x210(%rsp), %rcx
000000000000094b movslq %ecx, %rax
000000000000094e movq %rax, 0x230(%rsp)
0000000000000956 leal 0x1(%rcx), %eax
0000000000000959 cmpl $0x4, %eax
000000000000095c movl $0x3, %eax
0000000000000961 cmovgl %ecx, %eax
0000000000000964 subl %ecx, %eax
0000000000000966 leaq 0x8(,%rax,8), %rax
000000000000096e movq %rax, 0x220(%rsp)
0000000000000976 movl %ecx, %r12d
0000000000000979 xorl %ebp, %ebp
000000000000097b movq 0x238(%rsp), %rbx
0000000000000983 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000990 movq 0x230(%rsp), %rax
0000000000000998 leaq _ULM_dgemm_nn(%rax,%rbp,4), %rax
000000000000099c movq 0x238(%rsp), %rcx
00000000000009a4 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rdi
00000000000009a8 xorl %eax, %eax
00000000000009aa xorl %ecx, %ecx
00000000000009ac movq 0x170(%rsp), %rdx
00000000000009b4 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
00000000000009c0 cltq
00000000000009c2 movsd _ULM_dgemm_nn(%r13,%rax,8), %xmm0
00000000000009c9 movsd %xmm0, _ULM_dgemm_nn(%rbx,%rcx,8)
00000000000009ce incq %rcx
00000000000009d1 addl %edx, %eax
00000000000009d3 cmpl %ecx, %r12d
00000000000009d6 jne 0x9c0
00000000000009d8 movq 0x220(%rsp), %rsi
00000000000009e0 callq ___bzero
00000000000009e5 addq $0x20, %rbx
00000000000009e9 movq 0x168(%rsp), %rax
00000000000009f1 leaq _ULM_dgemm_nn(%r13,%rax,8), %r13
00000000000009f6 incq %rbp
00000000000009f9 movq 0x1b0(%rsp), %rax
0000000000000a01 cmpl %eax, %ebp
0000000000000a03 jne 0x990
0000000000000a05 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000a10 movq 0x1a8(%rsp), %rdx
0000000000000a18 leal 0x3(%rdx), %eax
0000000000000a1b sarl $0x1f, %eax
0000000000000a1e shrl $0x1e, %eax
0000000000000a21 movq 0x148(%rsp), %rcx
0000000000000a29 testl %ecx, %ecx
0000000000000a2b movl 0x2e8(%rsp), %r9d
0000000000000a33 movsd 0x208(%rsp), %xmm1
0000000000000a3c jle .DPOSTACCUMULATE1+840
0000000000000a42 movl 0x21c(%rsp), %ecx
0000000000000a49 imull %r15d, %ecx
0000000000000a4d addl 0xec(%rsp), %ecx
0000000000000a54 leal 0x3(%rdx,%rax), %eax
0000000000000a58 sarl $0x2, %eax
0000000000000a5b movq %rax, 0x200(%rsp)
0000000000000a63 movslq %ecx, %rcx
0000000000000a66 movq %rcx, 0x1f0(%rsp)
0000000000000a6e leal -0x1(%rax), %eax
0000000000000a71 movl %eax, 0x1fc(%rsp)
0000000000000a78 xorl %esi, %esi
0000000000000a7a nopw _ULM_dgemm_nn(%rax,%rax)
0000000000000a80 movq %rsi, 0x1b8(%rsp)
0000000000000a88 movq 0x1a0(%rsp), %rcx
0000000000000a90 testl %ecx, %ecx
0000000000000a92 sete %al
0000000000000a95 cmpl 0x18c(%rsp), %esi
0000000000000a9c setne %bl
0000000000000a9f orb %al, %bl
0000000000000aa1 movb %bl, 0x230(%rsp)
0000000000000aa8 movl %ecx, %r13d
0000000000000aab movl $0x4, %eax
0000000000000ab0 cmovnel %eax, %r13d
0000000000000ab4 testl %edx, %edx
0000000000000ab6 jle .DPOSTACCUMULATE1+805
0000000000000abc movq 0x1b8(%rsp), %rdx
0000000000000ac4 movl %edx, %eax
0000000000000ac6 imull 0x22c(%rsp), %eax
0000000000000ace cltq
0000000000000ad0 leaq __B(%rip), %rcx
0000000000000ad7 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rax
0000000000000adb movq %rax, 0x220(%rsp)
0000000000000ae3 movl %edx, %eax
0000000000000ae5 imull %r9d, %eax
0000000000000ae9 movl %eax, 0x21c(%rsp)
0000000000000af0 xorl %esi, %esi
0000000000000af2 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000000b00 movq 0x210(%rsp), %rdx
0000000000000b08 testl %edx, %edx
0000000000000b0a sete %al
0000000000000b0d cmpl 0x1fc(%rsp), %esi
0000000000000b14 setne %cl
0000000000000b17 orb %al, %cl
0000000000000b19 movl %edx, %r12d
0000000000000b1c movl $0x4, %eax
0000000000000b21 cmovnel %eax, %r12d
0000000000000b25 andb 0x230(%rsp), %cl
0000000000000b2c movl %esi, %eax
0000000000000b2e imull 0x22c(%rsp), %eax
0000000000000b36 cmpb $0x1, %cl
0000000000000b39 cltq
0000000000000b3b leaq __A(%rip), %rcx
0000000000000b42 leaq _ULM_dgemm_nn(%rcx,%rax,8), %rax
0000000000000b46 jne .DPOSTACCUMULATE0+420
0000000000000b4c movl %esi, %ecx
0000000000000b4e movq %rsi, 0x238(%rsp)
0000000000000b56 imull %r15d, %ecx
0000000000000b5a addl 0x21c(%rsp), %ecx
0000000000000b61 shll $0x2, %ecx
0000000000000b64 movslq %ecx, %rcx
0000000000000b67 addq 0x1f0(%rsp), %rcx
0000000000000b6f leaq _ULM_dgemm_nn(%r14,%rcx,8), %rcx
0000000000000b73 movsd 0x1d0(%rsp), %xmm0
0000000000000b7c movsd %xmm0, 0x280(%rsp)
0000000000000b85 movq %rax, 0x278(%rsp)
0000000000000b8d movq 0x220(%rsp), %rax
0000000000000b95 movq %rax, 0x270(%rsp)
0000000000000b9d movsd 0x1e8(%rsp), %xmm0
0000000000000ba6 movsd %xmm0, 0x268(%rsp)
0000000000000baf movq %rcx, 0x260(%rsp)
0000000000000bb7 movq 0x1c8(%rsp), %rax
0000000000000bbf movq %rax, 0x258(%rsp)
0000000000000bc7 movq 0x1c0(%rsp), %rax
0000000000000bcf movq %rax, 0x250(%rsp)
0000000000000bd7 movq 0x1d8(%rsp), %rax
0000000000000bdf movq %rax, 0x248(%rsp)
0000000000000be7 movq 0x1e0(%rsp), %rax
0000000000000bef movq %rax, 0x240(%rsp)
0000000000000bf7 movq 0x248(%rsp), %rsi
0000000000000bff movq 0x240(%rsp), %rdi
0000000000000c07 movq 0x278(%rsp), %rax
0000000000000c0f movq 0x270(%rsp), %rbx
0000000000000c17 movapd _ULM_dgemm_nn(%rax), %xmm0
0000000000000c1b movapd 0x10(%rax), %xmm1
0000000000000c20 movapd _ULM_dgemm_nn(%rbx), %xmm2
0000000000000c24 xorpd %xmm8, %xmm8
0000000000000c29 xorpd %xmm9, %xmm9
0000000000000c2e xorpd %xmm10, %xmm10
0000000000000c33 xorpd %xmm11, %xmm11
0000000000000c38 xorpd %xmm12, %xmm12
0000000000000c3d xorpd %xmm13, %xmm13
0000000000000c42 xorpd %xmm14, %xmm14
0000000000000c47 xorpd %xmm15, %xmm15
0000000000000c4c xorpd %xmm3, %xmm3
0000000000000c50 xorpd %xmm4, %xmm4
0000000000000c54 xorpd %xmm5, %xmm5
0000000000000c58 xorpd %xmm6, %xmm6
0000000000000c5c xorpd %xmm7, %xmm7
0000000000000c60 testq %rdi, %rdi
0000000000000c63 testq %rsi, %rsi
0000000000000c66 je .DCONSIDERLEFT0
.DLOOP0:
0000000000000c6c addpd %xmm3, %xmm12
0000000000000c71 movapd 0x10(%rbx), %xmm3
0000000000000c76 addpd %xmm6, %xmm13
0000000000000c7b movapd %xmm2, %xmm6
0000000000000c7f pshufd $0x4e, %xmm2, %xmm4
0000000000000c84 mulpd %xmm0, %xmm2
0000000000000c88 mulpd %xmm1, %xmm6
0000000000000c8c addpd %xmm5, %xmm14
0000000000000c91 addpd %xmm7, %xmm15
0000000000000c96 movapd %xmm4, %xmm7
0000000000000c9a mulpd %xmm0, %xmm4
0000000000000c9e mulpd %xmm1, %xmm7
0000000000000ca2 addpd %xmm2, %xmm8
0000000000000ca7 movapd 0x20(%rbx), %xmm2
0000000000000cac addpd %xmm6, %xmm9
0000000000000cb1 movapd %xmm3, %xmm6
0000000000000cb5 pshufd $0x4e, %xmm3, %xmm5
0000000000000cba mulpd %xmm0, %xmm3
0000000000000cbe mulpd %xmm1, %xmm6
0000000000000cc2 addpd %xmm4, %xmm10
0000000000000cc7 addpd %xmm7, %xmm11
0000000000000ccc movapd %xmm5, %xmm7
0000000000000cd0 mulpd %xmm0, %xmm5
0000000000000cd4 movapd 0x20(%rax), %xmm0
0000000000000cd9 mulpd %xmm1, %xmm7
0000000000000cdd movapd 0x30(%rax), %xmm1
0000000000000ce2 addpd %xmm3, %xmm12
0000000000000ce7 movapd 0x30(%rbx), %xmm3
0000000000000cec addpd %xmm6, %xmm13
0000000000000cf1 movapd %xmm2, %xmm6
0000000000000cf5 pshufd $0x4e, %xmm2, %xmm4
0000000000000cfa mulpd %xmm0, %xmm2
0000000000000cfe mulpd %xmm1, %xmm6
0000000000000d02 addpd %xmm5, %xmm14
0000000000000d07 addpd %xmm7, %xmm15
0000000000000d0c movapd %xmm4, %xmm7
0000000000000d10 mulpd %xmm0, %xmm4
0000000000000d14 mulpd %xmm1, %xmm7
0000000000000d18 addpd %xmm2, %xmm8
0000000000000d1d movapd 0x40(%rbx), %xmm2
0000000000000d22 addpd %xmm6, %xmm9
0000000000000d27 movapd %xmm3, %xmm6
0000000000000d2b pshufd $0x4e, %xmm3, %xmm5
0000000000000d30 mulpd %xmm0, %xmm3
0000000000000d34 mulpd %xmm1, %xmm6
0000000000000d38 addpd %xmm4, %xmm10
0000000000000d3d addpd %xmm7, %xmm11
0000000000000d42 movapd %xmm5, %xmm7
0000000000000d46 mulpd %xmm0, %xmm5
0000000000000d4a movapd 0x40(%rax), %xmm0
0000000000000d4f mulpd %xmm1, %xmm7
0000000000000d53 movapd 0x50(%rax), %xmm1
0000000000000d58 addpd %xmm3, %xmm12
0000000000000d5d movapd 0x50(%rbx), %xmm3
0000000000000d62 addpd %xmm6, %xmm13
0000000000000d67 movapd %xmm2, %xmm6
0000000000000d6b pshufd $0x4e, %xmm2, %xmm4
0000000000000d70 mulpd %xmm0, %xmm2
0000000000000d74 mulpd %xmm1, %xmm6
0000000000000d78 addpd %xmm5, %xmm14
0000000000000d7d addpd %xmm7, %xmm15
0000000000000d82 movapd %xmm4, %xmm7
0000000000000d86 mulpd %xmm0, %xmm4
0000000000000d8a mulpd %xmm1, %xmm7
0000000000000d8e addpd %xmm2, %xmm8
0000000000000d93 movapd 0x60(%rbx), %xmm2
0000000000000d98 addpd %xmm6, %xmm9
0000000000000d9d movapd %xmm3, %xmm6
0000000000000da1 pshufd $0x4e, %xmm3, %xmm5
0000000000000da6 mulpd %xmm0, %xmm3
0000000000000daa mulpd %xmm1, %xmm6
0000000000000dae addpd %xmm4, %xmm10
0000000000000db3 addpd %xmm7, %xmm11
0000000000000db8 movapd %xmm5, %xmm7
0000000000000dbc mulpd %xmm0, %xmm5
0000000000000dc0 movapd 0x60(%rax), %xmm0
0000000000000dc5 mulpd %xmm1, %xmm7
0000000000000dc9 movapd 0x70(%rax), %xmm1
0000000000000dce addpd %xmm3, %xmm12
0000000000000dd3 movapd 0x70(%rbx), %xmm3
0000000000000dd8 addpd %xmm6, %xmm13
0000000000000ddd movapd %xmm2, %xmm6
0000000000000de1 pshufd $0x4e, %xmm2, %xmm4
0000000000000de6 mulpd %xmm0, %xmm2
0000000000000dea mulpd %xmm1, %xmm6
0000000000000dee addq $0x80, %rax
0000000000000df4 addpd %xmm5, %xmm14
0000000000000df9 addpd %xmm7, %xmm15
0000000000000dfe movapd %xmm4, %xmm7
0000000000000e02 mulpd %xmm0, %xmm4
0000000000000e06 mulpd %xmm1, %xmm7
0000000000000e0a addpd %xmm2, %xmm8
0000000000000e0f movapd 0x80(%rbx), %xmm2
0000000000000e17 addpd %xmm6, %xmm9
0000000000000e1c movapd %xmm3, %xmm6
0000000000000e20 pshufd $0x4e, %xmm3, %xmm5
0000000000000e25 mulpd %xmm0, %xmm3
0000000000000e29 mulpd %xmm1, %xmm6
0000000000000e2d addq $0x80, %rbx
0000000000000e34 addpd %xmm4, %xmm10
0000000000000e39 addpd %xmm7, %xmm11
0000000000000e3e movapd %xmm5, %xmm7
0000000000000e42 mulpd %xmm0, %xmm5
0000000000000e46 movapd _ULM_dgemm_nn(%rax), %xmm0
0000000000000e4a mulpd %xmm1, %xmm7
0000000000000e4e movapd 0x10(%rax), %xmm1
0000000000000e53 decq %rsi
0000000000000e56 jne .DLOOP0
.DCONSIDERLEFT0:
0000000000000e5c testq %rdi, %rdi
0000000000000e5f je .DPOSTACCUMULATE0
.DLOOPLEFT0:
0000000000000e65 addpd %xmm3, %xmm12
0000000000000e6a movapd 0x10(%rbx), %xmm3
0000000000000e6f addpd %xmm6, %xmm13
0000000000000e74 movapd %xmm2, %xmm6
0000000000000e78 pshufd $0x4e, %xmm2, %xmm4
0000000000000e7d mulpd %xmm0, %xmm2
0000000000000e81 mulpd %xmm1, %xmm6
0000000000000e85 addpd %xmm5, %xmm14
0000000000000e8a addpd %xmm7, %xmm15
0000000000000e8f movapd %xmm4, %xmm7
0000000000000e93 mulpd %xmm0, %xmm4
0000000000000e97 mulpd %xmm1, %xmm7
0000000000000e9b addpd %xmm2, %xmm8
0000000000000ea0 movapd 0x20(%rbx), %xmm2
0000000000000ea5 addpd %xmm6, %xmm9
0000000000000eaa movapd %xmm3, %xmm6
0000000000000eae pshufd $0x4e, %xmm3, %xmm5
0000000000000eb3 mulpd %xmm0, %xmm3
0000000000000eb7 mulpd %xmm1, %xmm6
0000000000000ebb addpd %xmm4, %xmm10
0000000000000ec0 addpd %xmm7, %xmm11
0000000000000ec5 movapd %xmm5, %xmm7
0000000000000ec9 mulpd %xmm0, %xmm5
0000000000000ecd movapd 0x20(%rax), %xmm0
0000000000000ed2 mulpd %xmm1, %xmm7
0000000000000ed6 movapd 0x30(%rax), %xmm1
0000000000000edb addq $0x20, %rax
0000000000000edf addq $0x20, %rbx
0000000000000ee3 decq %rdi
0000000000000ee6 jne .DLOOPLEFT0
.DPOSTACCUMULATE0:
0000000000000eec addpd %xmm3, %xmm12
0000000000000ef1 addpd %xmm6, %xmm13
0000000000000ef6 addpd %xmm5, %xmm14
0000000000000efb addpd %xmm7, %xmm15
0000000000000f00 movsd 0x280(%rsp), %xmm0
0000000000000f09 movsd 0x268(%rsp), %xmm1
0000000000000f12 movq 0x260(%rsp), %rcx
0000000000000f1a movq 0x258(%rsp), %r8
0000000000000f22 leaq _ULM_dgemm_nn(,%r8,8), %r8
0000000000000f2a movq 0x250(%rsp), %r9
0000000000000f32 leaq _ULM_dgemm_nn(,%r9,8), %r9
0000000000000f3a leaq _ULM_dgemm_nn(%rcx,%r9), %r10
0000000000000f3e leaq _ULM_dgemm_nn(%rcx,%r8,2), %rdx
0000000000000f42 leaq _ULM_dgemm_nn(%rdx,%r9), %r11
0000000000000f46 unpcklpd %xmm0, %xmm0
0000000000000f4a unpcklpd %xmm1, %xmm1
0000000000000f4e movlpd _ULM_dgemm_nn(%rcx), %xmm3
0000000000000f52 movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3
0000000000000f58 mulpd %xmm0, %xmm8
0000000000000f5d mulpd %xmm1, %xmm3
0000000000000f61 addpd %xmm8, %xmm3
0000000000000f66 movlpd %xmm3, _ULM_dgemm_nn(%rcx)
0000000000000f6a movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8)
0000000000000f70 movlpd _ULM_dgemm_nn(%rdx), %xmm4
0000000000000f74 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4
0000000000000f7a mulpd %xmm0, %xmm9
0000000000000f7f mulpd %xmm1, %xmm4
0000000000000f83 addpd %xmm9, %xmm4
0000000000000f88 movlpd %xmm4, _ULM_dgemm_nn(%rdx)
0000000000000f8c movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8)
0000000000000f92 movlpd _ULM_dgemm_nn(%r10), %xmm3
0000000000000f97 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3
0000000000000f9d mulpd %xmm0, %xmm10
0000000000000fa2 mulpd %xmm1, %xmm3
0000000000000fa6 addpd %xmm10, %xmm3
0000000000000fab movlpd %xmm3, _ULM_dgemm_nn(%r10)
0000000000000fb0 movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8)
0000000000000fb6 movlpd _ULM_dgemm_nn(%r11), %xmm4
0000000000000fbb movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4
0000000000000fc1 mulpd %xmm0, %xmm11
0000000000000fc6 mulpd %xmm1, %xmm4
0000000000000fca addpd %xmm11, %xmm4
0000000000000fcf movlpd %xmm4, _ULM_dgemm_nn(%r11)
0000000000000fd4 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8)
0000000000000fda leaq _ULM_dgemm_nn(%rcx,%r9,2), %rcx
0000000000000fde leaq _ULM_dgemm_nn(%r10,%r9,2), %r10
0000000000000fe2 leaq _ULM_dgemm_nn(%rdx,%r9,2), %rdx
0000000000000fe6 leaq _ULM_dgemm_nn(%r11,%r9,2), %r11
0000000000000fea movlpd _ULM_dgemm_nn(%rcx), %xmm3
0000000000000fee movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3
0000000000000ff4 mulpd %xmm0, %xmm12
0000000000000ff9 mulpd %xmm1, %xmm3
0000000000000ffd addpd %xmm12, %xmm3
0000000000001002 movlpd %xmm3, _ULM_dgemm_nn(%rcx)
0000000000001006 movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8)
000000000000100c movlpd _ULM_dgemm_nn(%rdx), %xmm4
0000000000001010 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4
0000000000001016 mulpd %xmm0, %xmm13
000000000000101b mulpd %xmm1, %xmm4
000000000000101f addpd %xmm13, %xmm4
0000000000001024 movlpd %xmm4, _ULM_dgemm_nn(%rdx)
0000000000001028 movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8)
000000000000102e movlpd _ULM_dgemm_nn(%r10), %xmm3
0000000000001033 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3
0000000000001039 mulpd %xmm0, %xmm14
000000000000103e mulpd %xmm1, %xmm3
0000000000001042 addpd %xmm14, %xmm3
0000000000001047 movlpd %xmm3, _ULM_dgemm_nn(%r10)
000000000000104c movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8)
0000000000001052 movlpd _ULM_dgemm_nn(%r11), %xmm4
0000000000001057 movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4
000000000000105d mulpd %xmm0, %xmm15
0000000000001062 mulpd %xmm1, %xmm4
0000000000001066 addpd %xmm15, %xmm4
000000000000106b movlpd %xmm4, _ULM_dgemm_nn(%r11)
0000000000001070 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8)
0000000000001076 movl 0x2e8(%rsp), %r9d
000000000000107e jmpq .DPOSTACCUMULATE1+769
0000000000001083 nopw %cs:_ULM_dgemm_nn(%rax,%rax)
0000000000001090 testl %r13d, %r13d
0000000000001093 setg 0x238(%rsp)
000000000000109b movsd 0x1d0(%rsp), %xmm0
00000000000010a4 movsd %xmm0, 0x280(%rsp)
00000000000010ad movq %rax, 0x278(%rsp)
00000000000010b5 movq 0x220(%rsp), %rax
00000000000010bd movq %rax, 0x270(%rsp)
00000000000010c5 movq $_ULM_dgemm_nn, 0x268(%rsp)
00000000000010d1 leaq __C(%rip), %rax
00000000000010d8 movq %rax, 0x260(%rsp)
00000000000010e0 movq $0x1, 0x258(%rsp)
00000000000010ec movq $0x4, 0x250(%rsp)
00000000000010f8 movq 0x1d8(%rsp), %rax
0000000000001100 movq %rax, 0x248(%rsp)
0000000000001108 movq 0x1e0(%rsp), %rax
0000000000001110 movq %rax, 0x240(%rsp)
0000000000001118 movq %rsi, %rbp
000000000000111b movq 0x248(%rsp), %rsi
0000000000001123 movq 0x240(%rsp), %rdi
000000000000112b movq 0x278(%rsp), %rax
0000000000001133 movq 0x270(%rsp), %rbx
000000000000113b movapd _ULM_dgemm_nn(%rax), %xmm0
000000000000113f movapd 0x10(%rax), %xmm1
0000000000001144 movapd _ULM_dgemm_nn(%rbx), %xmm2
0000000000001148 xorpd %xmm8, %xmm8
000000000000114d xorpd %xmm9, %xmm9
0000000000001152 xorpd %xmm10, %xmm10
0000000000001157 xorpd %xmm11, %xmm11
000000000000115c xorpd %xmm12, %xmm12
0000000000001161 xorpd %xmm13, %xmm13
0000000000001166 xorpd %xmm14, %xmm14
000000000000116b xorpd %xmm15, %xmm15
0000000000001170 xorpd %xmm3, %xmm3
0000000000001174 xorpd %xmm4, %xmm4
0000000000001178 xorpd %xmm5, %xmm5
000000000000117c xorpd %xmm6, %xmm6
0000000000001180 xorpd %xmm7, %xmm7
0000000000001184 testq %rdi, %rdi
0000000000001187 testq %rsi, %rsi
000000000000118a je .DCONSIDERLEFT1
.DLOOP1:
0000000000001190 addpd %xmm3, %xmm12
0000000000001195 movapd 0x10(%rbx), %xmm3
000000000000119a addpd %xmm6, %xmm13
000000000000119f movapd %xmm2, %xmm6
00000000000011a3 pshufd $0x4e, %xmm2, %xmm4
00000000000011a8 mulpd %xmm0, %xmm2
00000000000011ac mulpd %xmm1, %xmm6
00000000000011b0 addpd %xmm5, %xmm14
00000000000011b5 addpd %xmm7, %xmm15
00000000000011ba movapd %xmm4, %xmm7
00000000000011be mulpd %xmm0, %xmm4
00000000000011c2 mulpd %xmm1, %xmm7
00000000000011c6 addpd %xmm2, %xmm8
00000000000011cb movapd 0x20(%rbx), %xmm2
00000000000011d0 addpd %xmm6, %xmm9
00000000000011d5 movapd %xmm3, %xmm6
00000000000011d9 pshufd $0x4e, %xmm3, %xmm5
00000000000011de mulpd %xmm0, %xmm3
00000000000011e2 mulpd %xmm1, %xmm6
00000000000011e6 addpd %xmm4, %xmm10
00000000000011eb addpd %xmm7, %xmm11
00000000000011f0 movapd %xmm5, %xmm7
00000000000011f4 mulpd %xmm0, %xmm5
00000000000011f8 movapd 0x20(%rax), %xmm0
00000000000011fd mulpd %xmm1, %xmm7
0000000000001201 movapd 0x30(%rax), %xmm1
0000000000001206 addpd %xmm3, %xmm12
000000000000120b movapd 0x30(%rbx), %xmm3
0000000000001210 addpd %xmm6, %xmm13
0000000000001215 movapd %xmm2, %xmm6
0000000000001219 pshufd $0x4e, %xmm2, %xmm4
000000000000121e mulpd %xmm0, %xmm2
0000000000001222 mulpd %xmm1, %xmm6
0000000000001226 addpd %xmm5, %xmm14
000000000000122b addpd %xmm7, %xmm15
0000000000001230 movapd %xmm4, %xmm7
0000000000001234 mulpd %xmm0, %xmm4
0000000000001238 mulpd %xmm1, %xmm7
000000000000123c addpd %xmm2, %xmm8
0000000000001241 movapd 0x40(%rbx), %xmm2
0000000000001246 addpd %xmm6, %xmm9
000000000000124b movapd %xmm3, %xmm6
000000000000124f pshufd $0x4e, %xmm3, %xmm5
0000000000001254 mulpd %xmm0, %xmm3
0000000000001258 mulpd %xmm1, %xmm6
000000000000125c addpd %xmm4, %xmm10
0000000000001261 addpd %xmm7, %xmm11
0000000000001266 movapd %xmm5, %xmm7
000000000000126a mulpd %xmm0, %xmm5
000000000000126e movapd 0x40(%rax), %xmm0
0000000000001273 mulpd %xmm1, %xmm7
0000000000001277 movapd 0x50(%rax), %xmm1
000000000000127c addpd %xmm3, %xmm12
0000000000001281 movapd 0x50(%rbx), %xmm3
0000000000001286 addpd %xmm6, %xmm13
000000000000128b movapd %xmm2, %xmm6
000000000000128f pshufd $0x4e, %xmm2, %xmm4
0000000000001294 mulpd %xmm0, %xmm2
0000000000001298 mulpd %xmm1, %xmm6
000000000000129c addpd %xmm5, %xmm14
00000000000012a1 addpd %xmm7, %xmm15
00000000000012a6 movapd %xmm4, %xmm7
00000000000012aa mulpd %xmm0, %xmm4
00000000000012ae mulpd %xmm1, %xmm7
00000000000012b2 addpd %xmm2, %xmm8
00000000000012b7 movapd 0x60(%rbx), %xmm2
00000000000012bc addpd %xmm6, %xmm9
00000000000012c1 movapd %xmm3, %xmm6
00000000000012c5 pshufd $0x4e, %xmm3, %xmm5
00000000000012ca mulpd %xmm0, %xmm3
00000000000012ce mulpd %xmm1, %xmm6
00000000000012d2 addpd %xmm4, %xmm10
00000000000012d7 addpd %xmm7, %xmm11
00000000000012dc movapd %xmm5, %xmm7
00000000000012e0 mulpd %xmm0, %xmm5
00000000000012e4 movapd 0x60(%rax), %xmm0
00000000000012e9 mulpd %xmm1, %xmm7
00000000000012ed movapd 0x70(%rax), %xmm1
00000000000012f2 addpd %xmm3, %xmm12
00000000000012f7 movapd 0x70(%rbx), %xmm3
00000000000012fc addpd %xmm6, %xmm13
0000000000001301 movapd %xmm2, %xmm6
0000000000001305 pshufd $0x4e, %xmm2, %xmm4
000000000000130a mulpd %xmm0, %xmm2
000000000000130e mulpd %xmm1, %xmm6
0000000000001312 addq $0x80, %rax
0000000000001318 addpd %xmm5, %xmm14
000000000000131d addpd %xmm7, %xmm15
0000000000001322 movapd %xmm4, %xmm7
0000000000001326 mulpd %xmm0, %xmm4
000000000000132a mulpd %xmm1, %xmm7
000000000000132e addpd %xmm2, %xmm8
0000000000001333 movapd 0x80(%rbx), %xmm2
000000000000133b addpd %xmm6, %xmm9
0000000000001340 movapd %xmm3, %xmm6
0000000000001344 pshufd $0x4e, %xmm3, %xmm5
0000000000001349 mulpd %xmm0, %xmm3
000000000000134d mulpd %xmm1, %xmm6
0000000000001351 addq $0x80, %rbx
0000000000001358 addpd %xmm4, %xmm10
000000000000135d addpd %xmm7, %xmm11
0000000000001362 movapd %xmm5, %xmm7
0000000000001366 mulpd %xmm0, %xmm5
000000000000136a movapd _ULM_dgemm_nn(%rax), %xmm0
000000000000136e mulpd %xmm1, %xmm7
0000000000001372 movapd 0x10(%rax), %xmm1
0000000000001377 decq %rsi
000000000000137a jne .DLOOP1
.DCONSIDERLEFT1:
0000000000001380 testq %rdi, %rdi
0000000000001383 je .DPOSTACCUMULATE1
.DLOOPLEFT1:
0000000000001389 addpd %xmm3, %xmm12
000000000000138e movapd 0x10(%rbx), %xmm3
0000000000001393 addpd %xmm6, %xmm13
0000000000001398 movapd %xmm2, %xmm6
000000000000139c pshufd $0x4e, %xmm2, %xmm4
00000000000013a1 mulpd %xmm0, %xmm2
00000000000013a5 mulpd %xmm1, %xmm6
00000000000013a9 addpd %xmm5, %xmm14
00000000000013ae addpd %xmm7, %xmm15
00000000000013b3 movapd %xmm4, %xmm7
00000000000013b7 mulpd %xmm0, %xmm4
00000000000013bb mulpd %xmm1, %xmm7
00000000000013bf addpd %xmm2, %xmm8
00000000000013c4 movapd 0x20(%rbx), %xmm2
00000000000013c9 addpd %xmm6, %xmm9
00000000000013ce movapd %xmm3, %xmm6
00000000000013d2 pshufd $0x4e, %xmm3, %xmm5
00000000000013d7 mulpd %xmm0, %xmm3
00000000000013db mulpd %xmm1, %xmm6
00000000000013df addpd %xmm4, %xmm10
00000000000013e4 addpd %xmm7, %xmm11
00000000000013e9 movapd %xmm5, %xmm7
00000000000013ed mulpd %xmm0, %xmm5
00000000000013f1 movapd 0x20(%rax), %xmm0
00000000000013f6 mulpd %xmm1, %xmm7
00000000000013fa movapd 0x30(%rax), %xmm1
00000000000013ff addq $0x20, %rax
0000000000001403 addq $0x20, %rbx
0000000000001407 decq %rdi
000000000000140a jne .DLOOPLEFT1
.DPOSTACCUMULATE1:
0000000000001410 addpd %xmm3, %xmm12
0000000000001415 addpd %xmm6, %xmm13
000000000000141a addpd %xmm5, %xmm14
000000000000141f addpd %xmm7, %xmm15
0000000000001424 movsd 0x280(%rsp), %xmm0
000000000000142d movsd 0x268(%rsp), %xmm1
0000000000001436 movq 0x260(%rsp), %rcx
000000000000143e movq 0x258(%rsp), %r8 0000000000001446 leaq _ULM_dgemm_nn(,%r8,8), %r8 000000000000144e movq 0x250(%rsp), %r9 0000000000001456 leaq _ULM_dgemm_nn(,%r9,8), %r9 000000000000145e leaq _ULM_dgemm_nn(%rcx,%r9), %r10 0000000000001462 leaq _ULM_dgemm_nn(%rcx,%r8,2), %rdx 0000000000001466 leaq _ULM_dgemm_nn(%rdx,%r9), %r11 000000000000146a unpcklpd %xmm0, %xmm0 000000000000146e unpcklpd %xmm1, %xmm1 0000000000001472 movlpd _ULM_dgemm_nn(%rcx), %xmm3 0000000000001476 movhpd _ULM_dgemm_nn(%r10,%r8), %xmm3 000000000000147c mulpd %xmm0, %xmm8 0000000000001481 mulpd %xmm1, %xmm3 0000000000001485 addpd %xmm8, %xmm3 000000000000148a movlpd %xmm3, _ULM_dgemm_nn(%rcx) 000000000000148e movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 0000000000001494 movlpd _ULM_dgemm_nn(%rdx), %xmm4 0000000000001498 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 000000000000149e mulpd %xmm0, %xmm9 00000000000014a3 mulpd %xmm1, %xmm4 00000000000014a7 addpd %xmm9, %xmm4 00000000000014ac movlpd %xmm4, _ULM_dgemm_nn(%rdx) 00000000000014b0 movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 00000000000014b6 movlpd _ULM_dgemm_nn(%r10), %xmm3 00000000000014bb movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 00000000000014c1 mulpd %xmm0, %xmm10 00000000000014c6 mulpd %xmm1, %xmm3 00000000000014ca addpd %xmm10, %xmm3 00000000000014cf movlpd %xmm3, _ULM_dgemm_nn(%r10) 00000000000014d4 movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 00000000000014da movlpd _ULM_dgemm_nn(%r11), %xmm4 00000000000014df movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 00000000000014e5 mulpd %xmm0, %xmm11 00000000000014ea mulpd %xmm1, %xmm4 00000000000014ee addpd %xmm11, %xmm4 00000000000014f3 movlpd %xmm4, _ULM_dgemm_nn(%r11) 00000000000014f8 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 00000000000014fe leaq _ULM_dgemm_nn(%rcx,%r9,2), %rcx 0000000000001502 leaq _ULM_dgemm_nn(%r10,%r9,2), %r10 0000000000001506 leaq _ULM_dgemm_nn(%rdx,%r9,2), %rdx 000000000000150a leaq _ULM_dgemm_nn(%r11,%r9,2), %r11 000000000000150e movlpd _ULM_dgemm_nn(%rcx), %xmm3 0000000000001512 movhpd 
_ULM_dgemm_nn(%r10,%r8), %xmm3 0000000000001518 mulpd %xmm0, %xmm12 000000000000151d mulpd %xmm1, %xmm3 0000000000001521 addpd %xmm12, %xmm3 0000000000001526 movlpd %xmm3, _ULM_dgemm_nn(%rcx) 000000000000152a movhpd %xmm3, _ULM_dgemm_nn(%r10,%r8) 0000000000001530 movlpd _ULM_dgemm_nn(%rdx), %xmm4 0000000000001534 movhpd _ULM_dgemm_nn(%r11,%r8), %xmm4 000000000000153a mulpd %xmm0, %xmm13 000000000000153f mulpd %xmm1, %xmm4 0000000000001543 addpd %xmm13, %xmm4 0000000000001548 movlpd %xmm4, _ULM_dgemm_nn(%rdx) 000000000000154c movhpd %xmm4, _ULM_dgemm_nn(%r11,%r8) 0000000000001552 movlpd _ULM_dgemm_nn(%r10), %xmm3 0000000000001557 movhpd _ULM_dgemm_nn(%rcx,%r8), %xmm3 000000000000155d mulpd %xmm0, %xmm14 0000000000001562 mulpd %xmm1, %xmm3 0000000000001566 addpd %xmm14, %xmm3 000000000000156b movlpd %xmm3, _ULM_dgemm_nn(%r10) 0000000000001570 movhpd %xmm3, _ULM_dgemm_nn(%rcx,%r8) 0000000000001576 movlpd _ULM_dgemm_nn(%r11), %xmm4 000000000000157b movhpd _ULM_dgemm_nn(%rdx,%r8), %xmm4 0000000000001581 mulpd %xmm0, %xmm15 0000000000001586 mulpd %xmm1, %xmm4 000000000000158a addpd %xmm15, %xmm4 000000000000158f movlpd %xmm4, _ULM_dgemm_nn(%r11) 0000000000001594 movhpd %xmm4, _ULM_dgemm_nn(%rdx,%r8) 000000000000159a movq %rbp, %rdx 000000000000159d movl %edx, %eax 000000000000159f imull %r15d, %eax 00000000000015a3 addl 0x21c(%rsp), %eax 00000000000015aa shll $0x2, %eax 00000000000015ad testl %r12d, %r12d 00000000000015b0 setg %cl 00000000000015b3 andb 0x238(%rsp), %cl 00000000000015ba cltq 00000000000015bc movsd 0x1e8(%rsp), %xmm1 00000000000015c5 ucomisd 0x26b(%rip), %xmm1 00000000000015cd jne 0x15d3 00000000000015cf jp 0x15d3 00000000000015d1 jmp 0x1640 00000000000015d3 movq %rdx, 0x238(%rsp) 00000000000015db testb %cl, %cl 00000000000015dd movl 0x2e8(%rsp), %r9d 00000000000015e5 je 0x16a0 00000000000015eb movq 0x1f0(%rsp), %rcx 00000000000015f3 leaq _ULM_dgemm_nn(%rax,%rcx), %rcx 00000000000015f7 xorl %edx, %edx 00000000000015f9 xorl %esi, %esi 00000000000015fb nopl 
_ULM_dgemm_nn(%rax,%rax) 0000000000001600 movl %edx, %edi 0000000000001602 movl %r12d, %ebp 0000000000001605 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001610 movslq %edi, %rdi 0000000000001613 leaq _ULM_dgemm_nn(%rcx,%rdi), %rbx 0000000000001617 movsd _ULM_dgemm_nn(%r14,%rbx,8), %xmm0 000000000000161d mulsd %xmm1, %xmm0 0000000000001621 movsd %xmm0, _ULM_dgemm_nn(%r14,%rbx,8) 0000000000001627 addl %r15d, %edi 000000000000162a decl %ebp 000000000000162c jne 0x1610 000000000000162e incl %esi 0000000000001630 addl %r9d, %edx 0000000000001633 cmpl %r13d, %esi 0000000000001636 jne 0x1600 0000000000001638 jmp 0x16a0 000000000000163a nopw _ULM_dgemm_nn(%rax,%rax) 0000000000001640 movq %rdx, 0x238(%rsp) 0000000000001648 testb %cl, %cl 000000000000164a movl 0x2e8(%rsp), %r9d 0000000000001652 je 0x16a0 0000000000001654 movq 0x1f0(%rsp), %rcx 000000000000165c leaq _ULM_dgemm_nn(%rax,%rcx), %rcx 0000000000001660 xorl %edx, %edx 0000000000001662 xorl %esi, %esi 0000000000001664 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001670 movl %edx, %edi 0000000000001672 movl %r12d, %ebp 0000000000001675 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 0000000000001680 movslq %edi, %rdi 0000000000001683 leaq _ULM_dgemm_nn(%rcx,%rdi), %rbx 0000000000001687 movq $_ULM_dgemm_nn, _ULM_dgemm_nn(%r14,%rbx,8) 000000000000168f addl %r15d, %edi 0000000000001692 decl %ebp 0000000000001694 jne 0x1680 0000000000001696 incl %esi 0000000000001698 addl %r9d, %edx 000000000000169b cmpl %r13d, %esi 000000000000169e jne 0x1670 00000000000016a0 testl %r13d, %r13d 00000000000016a3 jle 0x1711 00000000000016a5 addq 0x1f0(%rsp), %rax 00000000000016ad xorl %r8d, %r8d 00000000000016b0 xorl %edx, %edx 00000000000016b2 xorl %esi, %esi 00000000000016b4 nopw %cs:_ULM_dgemm_nn(%rax,%rax) 00000000000016c0 testl %r12d, %r12d 00000000000016c3 jle 0x1702 00000000000016c5 movslq %r8d, %rdi 00000000000016c8 leaq __C(%rip), %rcx 00000000000016cf leaq _ULM_dgemm_nn(%rcx,%rdi,8), %rdi 00000000000016d3 movl %edx, %ebx 00000000000016d5 movl 
%r12d, %ebp 00000000000016d8 nopl _ULM_dgemm_nn(%rax,%rax) 00000000000016e0 movsd _ULM_dgemm_nn(%rdi), %xmm0 00000000000016e4 movslq %ebx, %rbx 00000000000016e7 leaq _ULM_dgemm_nn(%rax,%rbx), %rcx 00000000000016eb addsd _ULM_dgemm_nn(%r14,%rcx,8), %xmm0 00000000000016f1 movsd %xmm0, _ULM_dgemm_nn(%r14,%rcx,8) 00000000000016f7 addl %r15d, %ebx 00000000000016fa addq $0x8, %rdi 00000000000016fe decl %ebp 0000000000001700 jne 0x16e0 0000000000001702 incq %rsi 0000000000001705 addl %r9d, %edx 0000000000001708 addl $0x4, %r8d 000000000000170c cmpl %r13d, %esi 000000000000170f jne 0x16c0 0000000000001711 movq 0x238(%rsp), %rsi 0000000000001719 incq %rsi 000000000000171c movq 0x200(%rsp), %rax 0000000000001724 cmpl %eax, %esi 0000000000001726 movsd 0x208(%rsp), %xmm1 000000000000172f jl _ULM_dgemm_nn+2816 0000000000001735 movq 0x1b8(%rsp), %rsi 000000000000173d incq %rsi 0000000000001740 movq 0x190(%rsp), %rax 0000000000001748 cmpl %eax, %esi 000000000000174a movq 0x1a8(%rsp), %rdx 0000000000001752 jl _ULM_dgemm_nn+2688 0000000000001758 movq 0x158(%rsp), %rdi 0000000000001760 incq %rdi 0000000000001763 movl 0x164(%rsp), %esi 000000000000176a addl 0x114(%rsp), %esi 0000000000001771 movq 0x118(%rsp), %rax 0000000000001779 cmpl %eax, %edi 000000000000177b movq 0x140(%rsp), %rbp 0000000000001783 movq 0x138(%rsp), %rbx 000000000000178b jl _ULM_dgemm_nn+1888 0000000000001791 movq 0xb0(%rsp), %rdi 0000000000001799 incq %rdi 000000000000179c movl 0xac(%rsp), %r13d 00000000000017a4 addl 0x7c(%rsp), %r13d 00000000000017a9 movq 0xb8(%rsp), %rax 00000000000017b1 addl 0x64(%rsp), %eax 00000000000017b5 movq %rax, 0xb8(%rsp) 00000000000017bd movq 0x80(%rsp), %rax 00000000000017c5 cmpl %eax, %edi 00000000000017c7 movl 0x2e8(%rsp), %ebx 00000000000017ce movl %ebx, %r12d 00000000000017d1 movq 0x88(%rsp), %rsi 00000000000017d9 movq 0x70(%rsp), %rdx 00000000000017de movq 0x68(%rsp), %rbx 00000000000017e3 movq 0x58(%rsp), %r11 00000000000017e8 movq 0x1a0(%rsp), %r8 00000000000017f0 jl 
_ULM_dgemm_nn+1120 00000000000017f6 movq %rdx, %rdi 00000000000017f9 movl 0x34(%rsp), %ecx 00000000000017fd incl %ecx 00000000000017ff movq 0x38(%rsp), %rax 0000000000001804 addl 0x14(%rsp), %eax 0000000000001808 movq %rax, 0x38(%rsp) 000000000000180d movq 0x18(%rsp), %rax 0000000000001812 cmpl %eax, %ecx 0000000000001814 jl _ULM_dgemm_nn+832 000000000000181a addq $0x288, %rsp 0000000000001821 popq %rbx 0000000000001822 popq %r12 0000000000001824 popq %r13 0000000000001826 popq %r14 0000000000001828 popq %r15 000000000000182a popq %rbp 000000000000182b ret
So the line right before the .DLOOP0 label is
$shell> otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done
0000000000000c66 je .DCONSIDERLEFT0
And the line with jne .DLOOP0 is
$shell> otool -dtV dgemm_nn.o | grep ".DLOOP0$"
0000000000000e56 jne .DLOOP0
So the code in between takes
$shell> BEGIN=`otool -dtV dgemm_nn.o | while read line; do if test "$line" = ".DLOOP0:"; then echo $last; fi; last=$line; done`
$shell> BEGIN=($BEGIN)
$shell> BEGIN=${BEGIN[0]}
$shell> BEGIN=`echo $BEGIN | tr "a-f" "A-F"`
$shell> END=`otool -dtV dgemm_nn.o | grep ".DLOOP0$" | tr "a-f" "A-F"`
$shell> END=($END)
$shell> END=${END[0]}
$shell> END=`echo $END | tr "a-f" "A-F"`
$shell> SIZE=`dc -e "16 i $END $BEGIN - f"`
$shell> echo "Code size of loop body (in bytes): $SIZE"
Code size of loop body (in bytes): 496
bytes.
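The same measurement can be sketched more compactly. The helper below is only an illustration: the name loop_body_size is made up here, and it assumes the listing has the shape shown above (labels on a line of their own, every other line starting with the hexadecimal address). It takes the address of the line preceding the label and the address of the jump back to the label, then lets the shell's arithmetic expansion subtract the two hexadecimal values, so neither the upper-case conversion nor dc is needed.

```shell
# Hypothetical helper (not part of ulmBLAS): read an otool-style listing on
# stdin and print the distance in bytes between the instruction preceding
# "LABEL:" and the jump instruction that branches back to LABEL.
loop_body_size() {
    # awk prints two hex addresses: "<line before label> <jump to label>".
    # "$1" is still the label here; set -- replaces it afterwards.
    set -- $(awk -v label="$1" '
        NF == 1 && $1 == label ":" { begin = prev }  # label line: keep previous address
        $2 ~ /^j/ && $3 == label   { end = $1 }      # e.g. "jne .DLOOP0"
        { prev = $1 }
        END { print begin, end }
    ')
    # Arithmetic expansion understands 0x-prefixed constants;
    # leading zeros after the prefix are harmless.
    echo $(( 0x$2 - 0x$1 ))
}

# Usage with the two addresses found above for .DLOOP0:
printf '%s\n' \
    '0000000000000c66 je .DCONSIDERLEFT0' \
    '.DLOOP0:' \
    '0000000000000e56 jne .DLOOP0' | loop_body_size .DLOOP0
```

With the real object file one would pipe `otool -dtV dgemm_nn.o` into the function instead of the three sample lines. The sketch does no error handling: if the label does not occur in the input, the subtraction fails.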