For example, Llama 3.2 1B has 16 layers. Can I run the first 8 layers on one machine and then send the KV cache to another machine to run the other 8 layers? This is similar to the idea of pipeline parallelism. If it is not supported now, I wonder whether llama.cpp can run only part of a model's layers. If so, I could implement pipeline parallelism manually; if llama.cpp can only run inference on the model as a whole, then I cannot.
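
For illustration, here is a rough sketch of the kind of split I mean, written as plain NumPy rather than llama.cpp code. The 16-layer count matches Llama 3.2 1B, but the hidden size, the toy "layers", and the serialization step are placeholders I made up; in a real per-layer split, each machine would typically keep the KV cache for its own layers and only the per-token hidden states would cross the wire.

```python
# Conceptual sketch only -- NOT llama.cpp code. It models a 16-layer stack as
# plain matrix multiplications and shows what would have to cross the machine
# boundary: the hidden states produced by the first stage.
import numpy as np

HIDDEN = 64          # toy hidden size (placeholder, not the real model dimension)
N_LAYERS = 16
SPLIT = 8            # first 8 layers on "machine A", last 8 on "machine B"

rng = np.random.default_rng(0)
# Stand-in "layers": one weight matrix each instead of a real transformer block.
weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.05 for _ in range(N_LAYERS)]

def run_layers(hidden, layer_range):
    """Run a contiguous slice of layers on the current machine."""
    for i in layer_range:
        hidden = np.tanh(hidden @ weights[i])
    return hidden

# --- machine A: embed the token(s) and run layers 0..7 ---
hidden = rng.standard_normal((1, HIDDEN))       # pretend embedding output
hidden = run_layers(hidden, range(0, SPLIT))

# --- "network transfer": serialize the activations and ship them over ---
payload = hidden.astype(np.float32).tobytes()   # bytes that cross machines

# --- machine B: deserialize and run layers 8..15 ---
received = np.frombuffer(payload, dtype=np.float32).reshape(1, HIDDEN)
out = run_layers(received, range(SPLIT, N_LAYERS))
print("stage-2 output shape:", out.shape)
```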
Answered by egebeysel (Mar 26, 2025)
I was also looking into this for llama.cpp; I think #6017 added pipeline parallelism support.
Answer selected by chosen-ox