For example, Llama 3.2 1B has 16 layers. Can I run the first 8 layers on one machine and then send the KV cache to another machine to run the other 8 layers? This is similar to the idea of pipeline parallelism. If it is not supported now, I wonder whether llama.cpp can run only part of a model's layers. If so, I could implement pipeline parallelism manually; if llama.cpp can only run inference on the model as a whole, then I cannot.
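
For illustration, here is a rough sketch of the kind of split I mean, written as plain NumPy rather than llama.cpp code. The 16-layer count matches Llama 3.2 1B, but the hidden size, the toy "layers", and the serialization step are placeholders I made up; in a real per-layer split, each machine would typically keep the KV cache for its own layers and only the per-token hidden states would cross the wire.

```python
# Conceptual sketch only -- NOT llama.cpp code. It models a 16-layer stack as
# plain matrix multiplications and shows what would have to cross the machine
# boundary: the hidden states produced by the first stage.
import numpy as np

HIDDEN = 64          # toy hidden size (placeholder, not the real model dimension)
N_LAYERS = 16
SPLIT = 8            # first 8 layers on "machine A", last 8 on "machine B"

rng = np.random.default_rng(0)
# Stand-in "layers": one weight matrix each instead of a real transformer block.
weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.05 for _ in range(N_LAYERS)]

def run_layers(hidden, layer_range):
    """Run a contiguous slice of layers on the current machine."""
    for i in layer_range:
        hidden = np.tanh(hidden @ weights[i])
    return hidden

# --- machine A: embed the token(s) and run layers 0..7 ---
hidden = rng.standard_normal((1, HIDDEN))       # pretend embedding output
hidden = run_layers(hidden, range(0, SPLIT))

# --- "network transfer": serialize the activations and ship them over ---
payload = hidden.astype(np.float32).tobytes()   # bytes that cross machines

# --- machine B: deserialize and run layers 8..15 ---
received = np.frombuffer(payload, dtype=np.float32).reshape(1, HIDDEN)
out = run_layers(received, range(SPLIT, N_LAYERS))
print("stage-2 output shape:", out.shape)
```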
Answered by egebeysel (Mar 26, 2025)
I was also looking into this for llama.cpp; I think #6017 added pipeline parallelism support.
Answer selected by chosen-ox