
Developer manual

Musen edited this page Feb 6, 2018 · 16 revisions

Source code layout

OpenBLAS/  
├── benchmark                  Benchmark codes for BLAS
├── cmake                      CMake files
├── ctest                      Test codes for CBLAS interfaces
├── driver                     Drivers implemented in C
│   ├── level2
│   ├── level3
│   ├── mapper
│   └── others                 Memory management, threading, etc
├── exports                    Generate shared library
├── interface                  Implement BLAS and CBLAS interfaces (calling driver or kernel)
│   ├── lapack
│   └── netlib
├── kernel                     Optimized assembly kernels for CPU architectures
│   ├── alpha
│   ├── arm
│   ├── arm64
│   ├── generic                Generic kernels written in C
│   ├── ia64
│   ├── mips64
│   ├── power
│   ├── sparc
│   ├── x86
│   └── x86_64
├── lapack                     Optimized LAPACK codes
│   ├── getf2
│   ├── getrf
│   ├── getrs
│   ├── laswp
│   ├── lauu2
│   ├── lauum
│   ├── potf2
│   ├── potrf
│   ├── trti2
│   └── trtri
├── lapack-netlib              LAPACK codes from netlib
├── reference                  BLAS Fortran reference implementation
├── test                       Test codes for BLAS
└── utest                      Regression tests

The call tree for dgemm is as follows.

interface/gemm.c
        │
driver/level3/level3.c
        │
gemm assembly kernels at kernel/

To find the kernel for your architecture, check the kernel/$(ARCH)/KERNEL.$(CPU) file.

Here is an example from kernel/x86_64/KERNEL.HASWELL:

...
DTRMMKERNEL    =  dtrmm_kernel_4x8_haswell.c
DGEMMKERNEL    =  dgemm_kernel_4x8_haswell.S
...

According to the KERNEL.HASWELL excerpt above, the OpenBLAS dgemm kernel for Haswell is dgemm_kernel_4x8_haswell.S.

Optimize GEMM

Read the Goto paper to understand the algorithm.

Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of High-Performance Matrix Multiplication". ACM Transactions on Mathematical Software 34 (3): Article 12

driver/level3/level3.c implements Goto's algorithm. You can also look at kernel/generic/gemmkernel_2x2.c, a naive gemm kernel in C that uses 2x2 register blocking.
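To make the idea of 2x2 register blocking concrete, here is a minimal sketch. It is not the code in kernel/generic/gemmkernel_2x2.c: the function name gemm_2x2_sketch, its signature, and the use of unpacked column-major matrices are assumptions made for illustration, and it only handles even m and n.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C (m x n) += A (m x k) * B (k x n).
   All matrices are column-major; m and n must be even in this sketch. */
void gemm_2x2_sketch(size_t m, size_t n, size_t k,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double *c, size_t ldc)
{
    assert(m % 2 == 0 && n % 2 == 0);

    for (size_t j = 0; j < n; j += 2) {
        for (size_t i = 0; i < m; i += 2) {
            /* Four accumulators form the 2x2 block of C that the
               compiler can keep in registers across the k loop. */
            double c00 = 0.0, c10 = 0.0, c01 = 0.0, c11 = 0.0;

            for (size_t p = 0; p < k; ++p) {
                /* Load two elements of A and two of B once,
                   then reuse each of them twice. */
                double a0 = a[i + p * lda];
                double a1 = a[i + 1 + p * lda];
                double b0 = b[p + j * ldb];
                double b1 = b[p + (j + 1) * ldb];

                c00 += a0 * b0;
                c10 += a1 * b0;
                c01 += a0 * b1;
                c11 += a1 * b1;
            }

            /* Write the finished block back to memory once. */
            c[i + j * ldc] += c00;
            c[i + 1 + j * ldc] += c10;
            c[i + (j + 1) * ldc] += c01;
            c[i + 1 + (j + 1) * ldc] += c11;
        }
    }
}
```

Each iteration of the inner loop performs four multiply-adds from four loads, instead of the two loads per multiply-add of a plain triple loop. Larger register blocks, such as the 4x8 in the Haswell kernel name, push the same idea further.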

Then,

  • Write optimized assembly kernels, taking into account the instruction pipeline, the available registers, and memory/cache access patterns.
  • Tune the cache block sizes Mc, Kc, and Nc.
  • Done.
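The role of the block sizes Mc, Kc, and Nc can be sketched with a cache-blocked loop nest. This is an illustration only, not the code in driver/level3/level3.c: the names gemm_blocked_sketch and block_kernel, the loop order, and the values of MC, KC, and NC are assumptions, and the packing of A and B into contiguous buffers that a real implementation relies on is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative block sizes only; real values are tuned per CPU. */
enum { MC = 64, KC = 128, NC = 256 };

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Inner block product: C[mc x nc] += A[mc x kc] * B[kc x nc].
   A tuned assembly kernel would take the place of this function. */
static void block_kernel(size_t mc, size_t nc, size_t kc,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    for (size_t j = 0; j < nc; ++j) {
        for (size_t p = 0; p < kc; ++p) {
            double bpj = b[p + j * ldb];
            for (size_t i = 0; i < mc; ++i)
                c[i + j * ldc] += a[i + p * lda] * bpj;
        }
    }
}

/* Hypothetical driver: C (m x n) += A (m x k) * B (k x n), column-major. */
void gemm_blocked_sketch(size_t m, size_t n, size_t k,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t jc = 0; jc < n; jc += NC) {          /* columns of B and C */
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {      /* shared dimension */
            size_t kc = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {  /* rows of A and C */
                size_t mc = min_sz(MC, m - ic);
                block_kernel(mc, nc, kc,
                             &a[ic + pc * lda], lda,
                             &b[pc + jc * ldb], ldb,
                             &c[ic + jc * ldc], ldc);
            }
        }
    }
}
```

Tuning then amounts to choosing MC, KC, and NC so that the blocks of A and B being reused fit in the intended cache levels of the target CPU.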

Run OpenBLAS Test

We use the netlib BLAS tests, the CBLAS tests, and the LAPACK tests. We also use BLAS-Tester, a modified test tool from ATLAS.

  • Run test and ctest in OpenBLAS, e.g. make test or make ctest.
  • Run the regression test utest in OpenBLAS.
  • Run the LAPACK test, e.g. make lapack-test.
  • Clone BLAS-Tester, which can compare OpenBLAS results with the netlib reference BLAS.
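Comparing an optimized result against a reference implementation can be sketched as follows. This is not BLAS-Tester's code: ref_dgemm_sketch, max_abs_diff, and results_match are hypothetical helpers, and a plain absolute tolerance is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Naive column-major reference: C (m x n) = A (m x k) * B (k x n).
   Leading dimensions equal the row counts in this sketch. */
void ref_dgemm_sketch(size_t m, size_t n, size_t k,
                      const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i + p * m] * b[p + j * k];
            c[i + j * m] = sum;
        }
    }
}

/* Largest element-wise absolute difference between two arrays. */
double max_abs_diff(const double *x, const double *y, size_t len)
{
    double worst = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Returns 1 if every element of got is within tol of ref, else 0. */
int results_match(const double *got, const double *ref,
                  size_t len, double tol)
{
    assert(tol >= 0.0);
    return max_abs_diff(got, ref, len) <= tol;
}
```

In practice the array passed as got would come from the optimized library and ref from the reference BLAS. Exact equality is usually too strict, because an optimized kernel may sum in a different order than the reference.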

We also set up a buildbot at http://build.openblas.net
