Commit 8d94713 ("docs: add s390x build documentation", #14264)

New file: docs/build-s390x.md (+157 lines)

> [!IMPORTANT]
> This build documentation is specific only to IBM Z & LinuxONE mainframes (s390x). You can find the build documentation for other architectures in [build.md](build.md).

# Build llama.cpp locally (for s390x)

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](../include/llama.h).

The project also includes many example programs and tools that use the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.

**To get the code:**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

## CPU Build with BLAS

Building llama.cpp with BLAS support is highly recommended as it has been shown to provide performance improvements.
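
BLAS support requires an OpenBLAS development package at build time. A rough sketch for common distributions (package names vary by distribution and version):

```bash
# Debian/Ubuntu
sudo apt install libopenblas-dev

# Fedora/RHEL
sudo dnf install openblas-devel
```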

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Release -j $(nproc)
```

**Notes**:
31+
- For faster repeated compilation, install [ccache](https://ccache.dev/)
32+
- By default, VXE/VXE2 is enabled. To disable it (not recommended):
33+
34+
```bash
35+
cmake -S . -B build \
36+
-DCMAKE_BUILD_TYPE=Release \
37+
-DGGML_BLAS=ON \
38+
-DGGML_BLAS_VENDOR=OpenBLAS \
39+
-DGGML_VXE=OFF
40+
41+
cmake --build build --config Release -j $(nproc)
42+
```
43+
44+
- For debug builds:
45+
46+
```bash
47+
cmake -S . -B build \
48+
-DCMAKE_BUILD_TYPE=Debug \
49+
-DGGML_BLAS=ON \
50+
-DGGML_BLAS_VENDOR=OpenBLAS
51+
52+
cmake --build build --config Debug -j $(nproc)
53+
```
54+
55+
- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
56+
57+
```bash
58+
cmake -S . -B build \
59+
-DCMAKE_BUILD_TYPE=Release \
60+
-DGGML_BLAS=ON \
61+
-DGGML_BLAS_VENDOR=OpenBLAS \
62+
-DBUILD_SHARED_LIBS=OFF
63+
64+
cmake --build build --config Release -j $(nproc)
65+
```
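
ccache can be wired into the build through CMake's compiler-launcher variables. A minimal sketch (these are standard CMake cache variables, not flags specific to llama.cpp; adjust the remaining options to taste):

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Release -j $(nproc)
```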

## Getting GGUF Models

All models need to be converted to Big-Endian. You can achieve this in one of three ways:

1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**

    You can find popular models pre-converted and verified at [s390x Ready Models](https://hf.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).

    These models and their respective tokenizers are verified to run correctly on IBM Z & LinuxONE.
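
    For example, a pre-converted model can be pulled with `huggingface-cli`. A sketch only: the repository and file names below are hypothetical, so substitute an actual model from the collection:

    ```bash
    huggingface-cli download taronaeo/granite-3.3-2b-instruct-be-gguf \
        granite-3.3-2b-instruct-be.f16.gguf --local-dir .
    ```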

2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**

    ```bash
    python3 convert_hf_to_gguf.py \
        --outfile model-name-be.f16.gguf \
        --outtype f16 \
        --bigendian \
        model-directory/
    ```

    For example:

    ```bash
    python3 convert_hf_to_gguf.py \
        --outfile granite-3.3-2b-instruct-be.f16.gguf \
        --outtype f16 \
        --bigendian \
        granite-3.3-2b-instruct/
    ```

3. **Convert existing GGUF Little-Endian model to Big-Endian**

    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
    ```

    For example:

    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
    mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
    ```

    The conversion happens in place, so the `mv` afterwards only renames the file to reflect the new byte order.

**Notes:**

- The GGUF endian conversion script may not support all data types at the moment, so it may fail for some models or quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian as described in option 2.
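- To confirm which byte order a GGUF file uses, you can inspect the first 8 bytes of its header. A quick sketch, assuming the standard GGUF header layout (4-byte `GGUF` magic followed by a `uint32` version):

    ```bash
    xxd -l 8 granite-3.3-2b-instruct-be.f16.gguf
    # Little-endian: 4747 5546 0300 0000  ("GGUF" magic, version 3 as 03 00 00 00)
    # Big-endian:    4747 5546 0000 0003  ("GGUF" magic, version 3 as 00 00 00 03)
    ```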

## IBM Accelerators

### 1. SIMD Acceleration

Only available on IBM z15 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14 or z13. On such systems, the APIs can still run, but they will use a scalar implementation.
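
You can check which vector facilities the CPU reports before building. A quick sketch, assuming the usual s390x `/proc/cpuinfo` layout (`vx` appears on z13 and later, `vxe` on z14 and later, `vxe2` on z15 and later):

```bash
# Look for vx/vxe/vxe2 in the reported feature flags.
grep -m1 features /proc/cpuinfo
```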

### 2. zDNN Accelerator

*Only available on IBM z16 or later systems. There is no development direction for this accelerator at the moment.*

### 3. Spyre Accelerator

*There is no development direction for this accelerator at the moment.*

## Performance Tuning

### 1. Virtualization Setup

It is strongly recommended to use only LPAR (Type-1) virtualization to get the most performance.

Note: Type-2 virtualization is not supported at the moment. While you can get it running, the performance will not be the best.

### 2. IFL (Core) Count

It is recommended to allocate a minimum of 8 shared IFLs to the LPAR. Increasing the IFL count past 8 shared IFLs will only improve prompt processing performance, not token generation.

Note: IFL count does not equate to vCPU count.

### 3. SMT vs NOSMT (Simultaneous Multithreading)

It is strongly recommended to disable SMT via the kernel boot parameters, as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.
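
A quick way to see whether SMT is currently active, plus the generic kernel parameter that disables it. A sketch only: the exact bootloader configuration (e.g. editing the parameters line in `/etc/zipl.conf` and re-running `zipl`) varies by distribution:

```bash
# If "Thread(s) per core" is greater than 1, SMT is active.
lscpu | grep -i 'thread(s) per core'

# Disabling SMT persistently means adding "nosmt" to the kernel
# command line and rebooting; consult your distribution's docs.
```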

### 4. BLAS vs NOBLAS

IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
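
To sanity-check that a build actually picked up OpenBLAS, you can inspect the linked libraries. A sketch, assuming the default shared-library build layout:

```bash
# OpenBLAS should show up among the resolved shared libraries.
ldd build/bin/llama-cli | grep -i openblas
```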

## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**

    Please file an issue in llama.cpp and ensure that the title contains "s390x".

2. **Other Questions**

    Please reach out directly to [[email protected]](mailto:[email protected]).
