GEMM: From Pure C to SSE Optimized Micro Kernels
Note: Unfortunately, on NA Digest I posted the https URL of this site. As our server uses only a self-signed SSL certificate, that is inconvenient: some browsers will not display formulas properly even if you trust the certificate. Use the http URL http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html instead. In the meantime I will order a properly signed certificate.
On the next pages we try to discover how BLIS achieves such great performance. For this journey we set up our own BLAS implementation!
In our ulmBLAS project we have implemented a simple matrix-matrix product that follows the ideas described in BLIS: A Framework for Rapidly Instantiating BLAS Functionality.
- Page 2: Pure C implementation.
- Page 3: Naive use of SSE intrinsics.
- Page 4: Applying loop unrolling to the previous implementation.
- Page 5: Another SSE intrinsics approach, based on the BLIS micro kernel for SSE architectures.
- Page 6: Improving pipelining by reordering SSE intrinsics.
- Page 7: Limitations of SSE intrinsics.
- Page 8: We go nuclear and translate the intrinsics to assembler ourselves!
- Page 9: Unrolling the nuke: demo-asm-unrolled.
- Page 10: Fine-tuning the unrolled assembler kernel.
- Page 11: More fine-tuning of the unrolled assembler kernel.
- Page 12: Preparation for adding prefetching: porting the rest of the micro kernel to assembler.
- Page 13: Adding prefetching.
- Page 14: Benchmarking! Comparing the performance with MKL, ATLAS, Eigen, and the original BLIS micro kernel.
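As a point of reference for what the pure C starting point looks like, here is a minimal sketch of an unblocked, column-major GEMM computing C ← βC + αAB. This is only an illustration: the function name `dgemm_ref`, its signature, and the column-major convention are assumptions for this sketch, not the actual ulmBLAS code.

```c
#include <stddef.h>

/* Unblocked reference GEMM, C <- beta*C + alpha*A*B.
 * All matrices are stored column-major: element (i,j) of A
 * lives at A[i + j*ldA], where ldA is the leading dimension.
 * A is m x k, B is k x n, C is m x n.
 * (Hypothetical helper for illustration only.)
 */
void
dgemm_ref(size_t m, size_t n, size_t k,
          double alpha,
          const double *A, size_t ldA,
          const double *B, size_t ldB,
          double beta,
          double *C, size_t ldC)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* Dot product of row i of A with column j of B. */
            double sum = 0.0;
            for (size_t l = 0; l < k; ++l) {
                sum += A[i + l*ldA] * B[l + j*ldB];
            }
            C[i + j*ldC] = beta*C[i + j*ldC] + alpha*sum;
        }
    }
}
```

Everything on the following pages is, in essence, a restructuring of this triple loop so that the innermost work runs on packed, cache-resident blocks at close to peak speed.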
Note that all benchmarks on these pages were generated when doctool transformed the doc files to HTML. All this happened on my MacBook Pro, which has a 2.4 GHz Intel Core 2 Duo (P8600, “Penryn”). The theoretical peak performance of one core is 9.6 GFLOPS: the core can issue one two-wide SSE double-precision addition and one two-wide SSE double-precision multiplication per cycle, i.e. 4 flops/cycle × 2.4 GHz.