Commit af7747c

taronaeo, MQ-mengqing, and junchao-loongson authored
ggml-cpu: Support s390x SIMD Instruction Set (#12019)
* ggml: add s390x ARCH_FLAGS for compilation
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add SIMD for s390x using vector intrinsics
  SIMD is activated for:
  * ggml_vec_dot_f32
  * ggml_vec_dot_f16
  * ggml_vec_mad_f32
  * ggml_vec_mad_f16
  * ggml_vec_mad_f32_unroll
  * ggml_vec_scale_f32
  * ggml_vec_scale_f16
  SIMD is NOT activated for:
  * ggml_vec_dot_f16_unroll (pending bugfix)
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix missing escape character in GGML_F32x4_REDUCE
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix s390x GGML_F32x4_REDUCE
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: full SIMD activation for F32,F16 s390x
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add option to disable s390x VXE/VXE2
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: change vecintrin.h include to ggml-cpu-impl
  * add __VXE__ and __VXE2__ macros
  Signed-off-by: Aaron Teo <[email protected]>
* cmake: add s390x target detection for VX/VXE/VXE2
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x Q8_0 SIMD
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: correct documentation for Q8_0
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x reduce code complexity Q8_0
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x bugfix typo Q8_0
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activated for Q4_1
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x inline vec_reve
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q4_0
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add VXE backend feature
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: remove test.py
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for quantize_row_q8_0
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for quantize_row_q8_1
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for iq4_xs
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: bugfix iq4_xs
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for iq4_nl
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add float, double, and long vector data type
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: clean up iq4_xs SIMD
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix improper use of restrict keyword
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: update warning message for ggml_vec_tbl
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: switch to restrict for iq4_nl
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: slight dot product speed improvement for q4_1_q8_1
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for q6_K
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add missing `_t` to ggml_int8x16x4_t
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix more missing `_t`
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add unroll and prefetch to Q8_0
  increase of 3.86% for prompt processing and 32.22% for token generation
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: patch Q8_0 to use proper vector sizes
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: optimise Q8_0 dot prod compute kernel further
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: add unroll and prefetch to Q4_1
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: refactor Q6_K variable naming for readability
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q6_K typos
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q5_K
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix wrong char*x16_t naming
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: Q5_K y0 wrong signness
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q5_K invalid uchar type
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q5_K invalid uchar type
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q4_K
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q4_K invalid vector intrinsics
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: simplify ggml_padd_s16 compute kernel
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: correct ggml-cpu vxe wording
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: change ggml_aligned_malloc alignment to 256
  256 is the cache line size for s390x platforms
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: resolve pr merge via cherry-pick 225bbbf
  Signed-off-by: Aaron Teo <[email protected]>
* ggml : fix LoongArch compile error with 128-bit SIMD (#11701)
* ggml: resolve pr merge via cherry-pick 4571953
  Signed-off-by: Aaron Teo <[email protected]>
* ggml: cmake remove fork when determining s390x machine type
  thank you @ericcurtin
  Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
1 parent a28e0d5 commit af7747c

8 files changed: +826 -1 lines changed

ggml/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -122,6 +122,7 @@ endif()
 option(GGML_LASX "ggml: enable lasx" ON)
 option(GGML_LSX "ggml: enable lsx" ON)
 option(GGML_RVV "ggml: enable rvv" ON)
+option(GGML_VXE "ggml: enable vxe" ON)
 
 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")

ggml/include/ggml-cpu.h

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ extern "C" {
     // other
     GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
     GGML_BACKEND_API int ggml_cpu_has_vsx (void);
+    GGML_BACKEND_API int ggml_cpu_has_vxe (void);
     GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
     GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
 

ggml/src/ggml-cpu/CMakeLists.txt

Lines changed: 21 additions & 0 deletions
@@ -310,6 +310,27 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         if (GGML_RVV)
             list(APPEND ARCH_FLAGS -march=rv64gcv -mabi=lp64d)
         endif()
+    elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "s390x")
+        message(STATUS "s390x detected")
+        file(READ "/proc/cpuinfo" CPUINFO_CONTENTS)
+        string(REGEX REPLACE "machine[ \t\r\n]*=[ \t\r\n]*([0-9]+)" "\\1" S390X_M ${CPUINFO_CONTENTS})
+
+        # TODO: Separation to determine activation of VX/VXE/VXE2
+        if (${S390X_M} MATCHES "8561|8562")
+            message(STATUS "z15 target")
+            list(APPEND ARCH_FLAGS -march=z15 -mtune=z15)
+        elseif (${S390X_M} MATCHES "3931")
+            message(STATUS "z16 target")
+            list(APPEND ARCH_FLAGS -march=z16 -mtune=z16)
+        else()
+            message(STATUS "Unknown target")
+            message(WARNING "Unknown target. If you are compiling for z14 and earlier, you might have to add -DGGML_VXE=OFF.")
+            list(APPEND ARCH_FLAGS -march=native -mtune=native)
+        endif()
+
+        if (GGML_VXE)
+            list(APPEND ARCH_FLAGS -mvx -mzvector)
+        endif()
     else()
         message(STATUS "Unknown architecture")
     endif()
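The machine-type dispatch in this hunk can be restated in C as a rough sketch. The helper names `s390x_machine_type` and `s390x_march` are hypothetical and not part of ggml. The CMake code does a regex substring match over the whole of `/proc/cpuinfo`, whereas this sketch parses the first `machine = <digits>` field it finds, so the two may differ on unusual input.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of ggml): return the number that follows
 * "machine =" in /proc/cpuinfo-style text, or -1 if no such field exists. */
static long s390x_machine_type(const char *cpuinfo) {
    assert(cpuinfo != NULL);
    for (const char *p = strstr(cpuinfo, "machine"); p != NULL;
         p = strstr(p + 1, "machine")) {
        const char *q = p + strlen("machine");
        while (isspace((unsigned char)*q)) q++;   /* optional whitespace   */
        if (*q != '=') continue;                  /* not a "machine =" hit */
        q++;
        while (isspace((unsigned char)*q)) q++;   /* optional whitespace   */
        if (isdigit((unsigned char)*q)) return strtol(q, NULL, 10);
    }
    return -1;
}

/* Map a machine type to the -march/-mtune value selected in the hunk above. */
static const char *s390x_march(long machine) {
    switch (machine) {
        case 8561:
        case 8562: return "z15";
        case 3931: return "z16";
        default:   return "native";   /* the "Unknown target" branch */
    }
}
```

Any machine type other than the three listed falls through to `native`, which matches the warning branch that suggests `-DGGML_VXE=OFF` for z14 and earlier.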

ggml/src/ggml-cpu/ggml-cpu-impl.h

Lines changed: 151 additions & 0 deletions
@@ -59,6 +59,15 @@ struct ggml_compute_params {
 #endif
 #endif
 
+#if defined(__s390x__) && defined(__VEC__)
+#ifndef __VXE__
+#define __VXE__
+#endif
+#ifndef __VXE2__
+#define __VXE2__
+#endif
+#endif
+
 #if defined(__ARM_FEATURE_SVE)
 #include <arm_sve.h>
 #include <sys/prctl.h>
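The guard above turns the compiler's `__s390x__` and `__VEC__` macros into the `__VXE__` and `__VXE2__` feature macros used by the rest of the backend. A minimal sketch of how such a compile-time switch could feed a runtime feature query follows. `EXAMPLE_HAVE_VXE`, `example_cpu_has_vxe`, and `example_describe_vxe` are hypothetical names, and the body of the real `ggml_cpu_has_vxe` is not shown in this diff.

```c
#include <assert.h>
#include <stdio.h>

/* Same condition as the guard in the hunk above, folded into one
 * illustrative macro instead of defining __VXE__ / __VXE2__. */
#if defined(__s390x__) && defined(__VEC__)
#define EXAMPLE_HAVE_VXE 1
#else
#define EXAMPLE_HAVE_VXE 0
#endif

/* Hypothetical stand-in for a query such as ggml_cpu_has_vxe(). */
static int example_cpu_has_vxe(void) {
    return EXAMPLE_HAVE_VXE;
}

/* Format a one-line feature summary into buf; returns snprintf's result. */
static int example_describe_vxe(char *buf, size_t len) {
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "VXE = %d", example_cpu_has_vxe());
}
```

On a non-s390x build, or when the vector facility is disabled, the query returns 0 and the SIMD paths would be compiled out.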
@@ -359,6 +368,148 @@ inline static int32x4_t ggml_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b)
 #endif
 #endif
 
+#if defined(__VXE__) || defined(__VXE2__)
+#include <vecintrin.h>
+
+#define vec_neg(a) (-(a)) // Vector Negate
+#define vec_add(a, b) ((a) + (b)) // Vector Add
+#define vec_sub(a, b) ((a) - (b)) // Vector Subtract
+#define vec_mul(a, b) ((a) * (b)) // Vector Multiply
+#define vec_div(a, b) ((a) / (b)) // Vector Divide
+#define vec_sl(a, b) ((a) << (b)) // Vector Shift Left
+#define vec_sra(a, b) ((a) >> (b)) // Vector Shift Right
+#define vec_sr(a, b) ((a) >> (b)) // Vector Shift Right Algebraic
+#define vec_slo(a, b) vec_slb(a, (b) << 64) // Vector Shift Left by Octet
+#define vec_sro(a, b) vec_srb(a, (b) << 64) // Vector Shift Right by Octet
+
+#ifndef vec_and
+#define vec_and(a, b) ((a) & (b)) // Vector AND
+#endif
+
+#ifndef vec_or
+#define vec_or(a, b) ((a) | (b)) // Vector OR
+#endif
+
+#ifndef vec_xor
+#define vec_xor(a, b) ((a) ^ (b)) // Vector XOR
+#endif
+
+typedef signed char char8x16_t __attribute__((vector_size(16)));
+typedef unsigned char uchar8x16_t __attribute__((vector_size(16)));
+
+typedef int8_t int8x16_t __attribute__((vector_size(16)));
+typedef int16_t int16x8_t __attribute__((vector_size(16)));
+typedef int32_t int32x4_t __attribute__((vector_size(16)));
+
+typedef uint8_t uint8x16_t __attribute__((vector_size(16)));
+typedef uint16_t uint16x8_t __attribute__((vector_size(16)));
+typedef uint32_t uint32x4_t __attribute__((vector_size(16)));
+
+typedef float float32x4_t __attribute__((vector_size(16)));
+typedef double double64x2_t __attribute((vector_size(16)));
+
+typedef signed long long long64x2_t __attribute((vector_size(16)));
+typedef unsigned long long ulong64x2_t __attribute__((vector_size(16)));
+
+typedef struct ggml_uint8x16x2_t {
+    uint8x16_t val[2];
+} ggml_uint8x16x2_t;
+
+inline static ggml_uint8x16x2_t ggml_vec_xl_u8x2(const uint8_t * ptr) {
+    ggml_uint8x16x2_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+
+    return res;
+}
+
+typedef struct ggml_uint8x16x4_t {
+    uint8x16_t val[4];
+} ggml_uint8x16x4_t;
+
+inline static ggml_uint8x16x4_t ggml_vec_xl_u8x4(const uint8_t * ptr) {
+    ggml_uint8x16x4_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+    res.val[2] = vec_xl(32, ptr);
+    res.val[3] = vec_xl(48, ptr);
+
+    return res;
+}
+
+typedef struct ggml_int8x16x4_t {
+    int8x16_t val[4];
+} ggml_int8x16x4_t;
+
+inline static ggml_int8x16x4_t ggml_vec_xl_s8x4(const int8_t * ptr) {
+    ggml_int8x16x4_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+    res.val[2] = vec_xl(32, ptr);
+    res.val[3] = vec_xl(48, ptr);
+
+    return res;
+}
+
+typedef struct ggml_int16x8x2_t {
+    int16x8_t val[2];
+} ggml_int16x8x2_t;
+
+inline static ggml_int16x8x2_t ggml_vec_xl_s16x2(const int16_t * ptr) {
+    ggml_int16x8x2_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+
+    return res;
+}
+
+/*
+    ! WARNING: Very slow. Use vec_perm if possible. Refer to iq4_xs
+    !          or iq4_nl for example implementation.
+*/
+inline static int8x16_t ggml_vec_tbl(int8x16_t a, uint8x16_t b) {
+    int8x16_t res;
+
+    res[ 0] = a[b[ 0]];
+    res[ 1] = a[b[ 1]];
+    res[ 2] = a[b[ 2]];
+    res[ 3] = a[b[ 3]];
+    res[ 4] = a[b[ 4]];
+    res[ 5] = a[b[ 5]];
+    res[ 6] = a[b[ 6]];
+    res[ 7] = a[b[ 7]];
+    res[ 8] = a[b[ 8]];
+    res[ 9] = a[b[ 9]];
+    res[10] = a[b[10]];
+    res[11] = a[b[11]];
+    res[12] = a[b[12]];
+    res[13] = a[b[13]];
+    res[14] = a[b[14]];
+    res[15] = a[b[15]];
+
+    return res;
+}
+
+inline static int16x8_t vec_padd_s16(int16x8_t a, int16x8_t b) {
+    const uchar8x16_t v_maske = { 0, 1, 4, 5, 8, 9, 12, 13,
+                                  16, 17, 20, 21, 24, 25, 28, 29 };
+
+    const int16x8_t v_abo = vec_pack((int32x4_t)a, (int32x4_t)b);
+    const int16x8_t v_abe = vec_perm(a, b, v_maske);
+    return v_abo + v_abe;
+}
+
+inline static int32x4_t ggml_vec_dot(int32x4_t acc, int8x16_t a, int8x16_t b) {
+    const int16x8_t p = vec_mule(a, b) + vec_mulo(a, b);
+    return acc + (vec_unpackh(p) + vec_unpackl(p));
+}
+
+#endif
+
 #if defined(__loongarch_asx)
 /* float type data load instructions */
 static __m128 __lsx_vreplfr2vr_s(const float val) {
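For readers without s390x hardware, the three helpers added in this hunk (`ggml_vec_tbl`, `vec_padd_s16`, `ggml_vec_dot`) can be approximated by plain scalar C. The sketch below is a reading of the intrinsics under the assumption of big-endian element numbering: `vec_mule`/`vec_mulo` take even/odd elements, `vec_unpackh`/`vec_unpackl` take elements 0-3 and 4-7, and `vec_pack` keeps the low half of each 32-bit element. The `ref_*` names are hypothetical and none of this code is part of ggml.

```c
#include <assert.h>
#include <stdint.h>

/* Table lookup as in ggml_vec_tbl: res[i] = a[b[i]]. Indices must be < 16. */
static void ref_vec_tbl(const int8_t a[16], const uint8_t b[16], int8_t res[16]) {
    for (int i = 0; i < 16; i++) {
        assert(b[i] < 16);
        res[i] = a[b[i]];
    }
}

/* Pairwise add as in vec_padd_s16: adjacent lanes of a fill res[0..3],
 * adjacent lanes of b fill res[4..7]. Each sum is narrowed back to 16 bits
 * (implementation-defined if out of range; GCC and Clang wrap modulo 2^16). */
static void ref_padd_s16(const int16_t a[8], const int16_t b[8], int16_t res[8]) {
    for (int i = 0; i < 4; i++) {
        res[i]     = (int16_t)(a[2 * i] + a[2 * i + 1]);
        res[i + 4] = (int16_t)(b[2 * i] + b[2 * i + 1]);
    }
}

/* Dot-product step as in ggml_vec_dot: multiply int8 lanes, sum even/odd
 * neighbours into eight 16-bit partials, then fold the partials into the
 * four 32-bit accumulators as acc[j] += p[j] + p[j + 4]. */
static void ref_vec_dot(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    int16_t p[8];
    for (int i = 0; i < 8; i++) {
        /* Narrowed to 16 bits to mirror the 16-bit vector lanes. */
        p[i] = (int16_t)(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]);
    }
    for (int j = 0; j < 4; j++) {
        acc[j] += p[j] + p[j + 4];
    }
}
```

When no 16-bit partial overflows, the four accumulators of `ref_vec_dot` sum to the ordinary dot product of the sixteen int8 pairs. The loop form of `ref_vec_tbl` also shows why the source warns that the element-by-element lookup is slow compared with a single `vec_perm`.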
