Developer manual
OpenBLAS/
├── benchmark Benchmark codes for BLAS
├── cmake CMakefiles
├── ctest Test codes for CBLAS interfaces
├── driver Implementations in C
│ ├── level2
│ ├── level3
│ ├── mapper
│ └── others Memory management, threading, etc
├── exports Generate shared library
├── interface Implement BLAS and CBLAS interfaces (calling driver or kernel)
│ ├── lapack
│ └── netlib
├── kernel Optimized assembly kernels for CPU architectures
│ ├── alpha
│ ├── arm
│ ├── arm64
│ ├── generic Generic kernels written in C
│ ├── ia64
│ ├── mips64
│ ├── power
│ ├── sparc
│ ├── x86
│ └── x86_64
├── lapack Optimized LAPACK codes
│ ├── getf2
│ ├── getrf
│ ├── getrs
│ ├── laswp
│ ├── lauu2
│ ├── lauum
│ ├── potf2
│ ├── potrf
│ ├── trti2
│ └── trtri
├── lapack-netlib LAPACK codes from netlib
├── reference BLAS Fortran reference implementation
├── test Test codes for BLAS
└── utest Regression test
The call tree for dgemm is as follows:
interface/gemm.c
│
driver/level3/level3.c
│
gemm assembly kernels at kernel/
To find the kernel for your architecture, check the kernel/$(ARCH)/KERNEL.$(CPU) file.
Here is an excerpt from kernel/x86_64/KERNEL.HASWELL:
...
DTRMMKERNEL = dtrmm_kernel_4x8_haswell.c
DGEMMKERNEL = dgemm_kernel_4x8_haswell.S
...
According to the above KERNEL.HASWELL, the OpenBLAS Haswell dgemm kernel file is dgemm_kernel_4x8_haswell.S.
Read the Goto paper to understand the algorithm.
Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of High-Performance Matrix Multiplication". ACM Transactions on Mathematical Software 34 (3): Article 12. (The ACM version is available only to ACM members, but this and many related papers are also available on the pages of van de Geijn's FLAME project, http://www.cs.utexas.edu/~flame/web/publications.html )
driver/level3/level3.c is the implementation of Goto's algorithm. You can also look at kernel/generic/gemmkernel_2x2.c, which is a naive gemm kernel in C with 2x2 register blocking.
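To make the idea of 2x2 register blocking concrete, here is a minimal sketch in C. It is not the code in kernel/generic/gemmkernel_2x2.c: the function name gemm_2x2_sketch and its argument list are hypothetical, it works directly on column-major matrices instead of packed buffers, and it assumes m and n are even.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not OpenBLAS code.
 * Computes C += alpha * A * B for column-major matrices:
 * A is m x k, B is k x n, C is m x n. Assumes m and n are even. */
void gemm_2x2_sketch(size_t m, size_t n, size_t k, double alpha,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double *c, size_t ldc)
{
    assert(m % 2 == 0 && n % 2 == 0);

    for (size_t j = 0; j < n; j += 2) {
        for (size_t i = 0; i < m; i += 2) {
            /* Four accumulators for one 2x2 tile of C. A compiler can
             * keep these in registers for the whole loop over p. */
            double c00 = 0.0, c10 = 0.0, c01 = 0.0, c11 = 0.0;

            for (size_t p = 0; p < k; p++) {
                /* Two loads from A and two from B feed four multiply-adds. */
                double a0 = a[i + p * lda];
                double a1 = a[i + 1 + p * lda];
                double b0 = b[p + j * ldb];
                double b1 = b[p + (j + 1) * ldb];

                c00 += a0 * b0;
                c10 += a1 * b0;
                c01 += a0 * b1;
                c11 += a1 * b1;
            }

            /* Write the tile back to memory once, after the loop over p. */
            c[i + j * ldc]           += alpha * c00;
            c[i + 1 + j * ldc]       += alpha * c10;
            c[i + (j + 1) * ldc]     += alpha * c01;
            c[i + 1 + (j + 1) * ldc] += alpha * c11;
        }
    }
}
```

The point of the tile is reuse: each value loaded from A or B takes part in two multiply-adds, and C is touched only once per tile. Optimized kernels apply the same idea with larger tiles and vector registers.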
Then proceed as follows:
- Write optimized assembly kernels, taking into account the instruction pipeline, the available registers, and memory/cache access.
- Tune the cache block sizes Mc, Kc, and Nc.
- Done.
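The role of the block sizes Mc, Kc, and Nc can be sketched as three loops around an inner kernel. The following C code is only an illustration under stated assumptions: the function name gemm_blocked_sketch is hypothetical, the block sizes are passed as arguments so that different values can be tried, plain loops stand in for the optimized kernel, and the packing of A and B blocks into contiguous buffers that a real implementation performs is left out. The loop order shown is one common choice and is not claimed to match driver/level3/level3.c line by line.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Hypothetical illustration, not OpenBLAS code.
 * Computes C += A * B for column-major matrices, visiting the operands
 * in blocks of at most mc x kc (from A) and kc x nc (from B). */
void gemm_blocked_sketch(size_t m, size_t n, size_t k,
                         size_t mc, size_t kc, size_t nc,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    assert(mc > 0 && kc > 0 && nc > 0);

    for (size_t jc = 0; jc < n; jc += nc) {          /* panel of B and C, width <= nc */
        size_t nb = min_size(nc, n - jc);
        for (size_t pc = 0; pc < k; pc += kc) {      /* slice of depth <= kc */
            size_t kb = min_size(kc, k - pc);
            for (size_t ic = 0; ic < m; ic += mc) {  /* block of A, height <= mc */
                size_t mb = min_size(mc, m - ic);

                /* Stand-in for the optimized kernel: multiply the
                 * mb x kb block of A by the kb x nb block of B. */
                for (size_t j = 0; j < nb; j++) {
                    for (size_t p = 0; p < kb; p++) {
                        double bv = b[(pc + p) + (jc + j) * ldb];
                        for (size_t i = 0; i < mb; i++) {
                            c[(ic + i) + (jc + j) * ldc] +=
                                a[(ic + i) + (pc + p) * lda] * bv;
                        }
                    }
                }
            }
        }
    }
}
```

In a sketch like this, tuning means choosing mc, kc, and nc so that the blocks being reused stay in cache; suitable values depend on the cache sizes of the target CPU and have to be found by measurement.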
We use the netlib BLAS tests, the CBLAS tests, and the LAPACK tests. We also use BLAS-Tester, a modified test tool from ATLAS.
- Run test and ctest at OpenBLAS, e.g. make test or make ctest.
- Run the regression test utest at OpenBLAS.
- Run the LAPACK test, e.g. make lapack-test.
- Clone BLAS-Tester, which can compare the OpenBLAS result with the netlib reference BLAS.
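Comparing an optimized result with a reference result usually comes down to an element-wise error check. The helper below is a minimal sketch of that idea in C. The name max_abs_diff is hypothetical, and this is not how BLAS-Tester itself is implemented; it only shows the kind of check you might write while debugging a new kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of OpenBLAS or BLAS-Tester.
 * Returns the largest absolute element-wise difference between two
 * arrays of length len, e.g. an optimized and a reference result. */
double max_abs_diff(const double *x, const double *y, size_t len)
{
    assert(x != NULL && y != NULL);

    double worst = 0.0;
    for (size_t i = 0; i < len; i++) {
        double d = x[i] - y[i];
        if (d < 0.0) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

Floating-point results from a blocked or vectorized kernel rarely match a reference bit for bit, because the additions happen in a different order. The returned value should therefore be compared against a small tolerance chosen for the problem size, not against zero.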
We also set up a buildbot at http://build.openblas.net