CUDA: revise q8_1 data layout for mul_mat_q #7824
Conversation
ggml-cuda.cu
Outdated
@@ -1347,10 +1347,30 @@ static void ggml_cuda_set_peer_access(const int n_tokens, int main_device) {
     GGML_UNUSED(main_device);
 }

static cudaError_t cudaMemcpy2DPeerAsync(
It's not good to "impersonate" the CUDA API: it can be confusing and it may cause conflicts in future CUDA versions. This should follow the same naming convention as the rest of the CUDA backend functions, e.g. ggml_cuda_xxx.
    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
    const ggml_type type_x, cudaStream_t stream);

void quantize_row_q8_1_cuda(
It's not a big problem at the moment, but it is not great to export symbols with names without any kind of prefix or namespace, since it can easily lead to conflicts with other code. It would probably be easier to just move all the code in the CUDA backend to a namespace, but it's not important right now.
Force-pushed from cd46f06 to 05a5fa0
This PR changes the data layout for the FP32 values quantized to q8_1 in conjunction with MMQ. The blocks of 32 values are consolidated into larger blocks of 128 values in order to make them align to 16 bytes and have the same data layout in global and shared memory. This is relevant for asynchronous data loading (Ampere or newer, not in this PR). The memory layout in shared memory now also allows for more efficient data loading with int8 tensor cores: for optimal performance you need an offset of 16 bytes between columns, and it just so happens that for 128 values this is exactly the size that the q8_1 scales take up.
The q8_1 memory layout in global memory for MMQ first consolidates values into blocks of 128 along ne10. These blocks are then treated as individual elements and transposed.
In `ggml-cuda.cu` I changed the argument for whether or not src1 should be quantized into a function pointer to the specific quantization function that should be used. The data transfer for `--split-mode row` needs to be different. I added a utility function `cudaMemcpy2DPeerAsync` since, for whatever reason, it does not exist in the CUDA toolkit.

Performance on RTX 4090/RTX 3090 stays mostly the same; the changes for P40/RX 6800 are larger. On average, performance increases.
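The CUDA runtime offers `cudaMemcpy2DAsync` and `cudaMemcpyPeerAsync`, but no combined 2D peer copy. One way such a wrapper can be expressed is through `cudaMemcpy3DPeerAsync` with a depth-1 extent; this is a sketch under that assumption (the name follows the reviewer's suggested `ggml_cuda_` prefix and is not necessarily the PR's exact implementation):

```cuda
// Sketch: a 2D peer-to-peer copy built on cudaMemcpy3DPeerAsync.
// Parameters mirror cudaMemcpy2DAsync plus source/destination device ids.
static cudaError_t ggml_cuda_Memcpy2DPeerAsync(
        void * dst, int dstDevice, size_t dpitch,
        const void * src, int srcDevice, size_t spitch,
        size_t width, size_t height, cudaStream_t stream) {
    cudaMemcpy3DPeerParms p = {};
    p.dstDevice = dstDevice;
    p.dstPtr    = make_cudaPitchedPtr(dst, dpitch, dpitch, height);
    p.srcDevice = srcDevice;
    p.srcPtr    = make_cudaPitchedPtr(const_cast<void *>(src), spitch, spitch, height);
    p.extent    = make_cudaExtent(width, height, 1); // depth 1 -> plain 2D copy
    return cudaMemcpy3DPeerAsync(&p, stream);
}
```

Driving the copy through the 3D peer API keeps it a single asynchronous call on the given stream instead of a per-row loop of `cudaMemcpyPeerAsync`.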
Specific numbers