GGUF model inference speed - Why is GGUF model inference fast on my Mac but slow on a cluster? #7717
Unanswered
eltonjohnfanboy asked this question in Q&A
Replies: 1 comment · 4 replies

Hi guys!
I've noticed that GGUF model inference is much faster on my Mac M3 than on my college's cluster, even when I request 8 or 16 cores. Both systems run the same GGUF model version and the same dependencies. Inference on the Mac takes seconds, while on the cluster it can take up to an hour to generate a response.
Are there known issues with GGUF models on certain CPUs? Any help would be greatly appreciated. Thank you!
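For reference, a minimal way to put numbers on that gap is to time a fixed generation on both machines and compare tokens per second. This is only a sketch: it assumes inference goes through the llama-cpp-python binding, and the model path and thread count are placeholders to adjust for your setup.

```python
import time

from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path. Set n_threads to the cores the job was actually
# granted; the auto-detected default on a shared cluster node may not
# match the scheduler's allocation.
llm = Llama(model_path="model.gguf", n_threads=8, verbose=False)

start = time.perf_counter()
out = llm("Write one sentence about llamas.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```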
-
On your Mac, you probably compiled llama.cpp with Metal support, which runs inference on your GPU. On a CPU it's slower, even with AVX/AVX2 instructions.