GGUF model inference speed - Why is GGUF model inference fast on my Mac but slow on a cluster? #7717
Unanswered
eltonjohnfanboy asked this question in Q&A
Replies: 1 comment · 4 replies

Hi guys!
I've noticed that GGUF model inference is much faster on my Mac M3 than on my college's cluster, even when I request 8 or 16 cores. Both systems run the same GGUF model version and the same dependencies. Inference on the Mac takes seconds, while on the cluster it can take up to an hour to generate a response.
Are there known issues with GGUF models on certain CPUs? Any help would be greatly appreciated. Thank you!
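For reference, a minimal way to put numbers on that gap is to time a fixed generation on both machines and compare tokens per second. This is only a sketch: it assumes inference goes through the llama-cpp-python binding, and the model path and thread count are placeholders to adjust for your setup.

```python
import time

from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path. Set n_threads to the cores the job was actually
# granted; the auto-detected default on a shared cluster node may not
# match the scheduler's allocation.
llm = Llama(model_path="model.gguf", n_threads=8, verbose=False)

start = time.perf_counter()
out = llm("Write one sentence about llamas.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```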
-
On your Mac, you probably compiled llama.cpp with Metal support, which runs inference on your GPU. On a CPU it's slower, even with AVX/AVX2 instructions.