metal: concurrently dispatch commands #2358
Conversation
When `ggml_metal_graph_compute` is called for the first time, `ggml_metal_graph_find_concurrency` runs and writes the commands that can be issued concurrently into the Metal context's `concur_list` array.
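For illustration, here is a minimal, self-contained sketch of how a compute loop could consume such a list. The struct fields, the fixed array size, and the use of `-1` as a separator between concurrency levels are assumptions made for this example, not necessarily the exact layout the PR uses:

```c
#include <stdio.h>

// Hypothetical, simplified context; the real struct lives in ggml-metal.m
// and differs in detail.
struct metal_ctx_sketch {
    int concur_list[64]; // node indices; -1 assumed here to separate levels
    int concur_list_len;
};

// Nodes between two -1 separators are assumed to have no data dependencies,
// so a real backend could encode them into one concurrently-executing
// command encoder and only synchronize at each separator.
static void dispatch_concurrently(const struct metal_ctx_sketch * ctx) {
    for (int i = 0; i < ctx->concur_list_len; i++) {
        if (ctx->concur_list[i] == -1) {
            printf("-- barrier: wait for previous level --\n");
            continue;
        }
        printf("encode node %d (may run concurrently within its level)\n",
               ctx->concur_list[i]);
    }
}

int main(void) {
    struct metal_ctx_sketch ctx = {
        .concur_list = { 0, 1, -1, 2, -1 }, // two independent nodes, then one
        .concur_list_len = 5,
    };
    dispatch_concurrently(&ctx);
    return 0;
}
```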
ggml-metal.m
Outdated
if (!ctx->concur_list_len) {
    // lazily build the concurrency list the first time a graph is computed
    ggml_metal_graph_find_concurrency(ctx, gf);
}
Seems like this will break when computing graphs of different topologies.
I think for now we can assume that the graph topology won't change during the lifetime of the Metal `ctx`. In the future, when we need to change the graph topology and also have a mechanism to tell the backend that it has changed, we can easily add the code needed to handle the updated topology.
For llama that's true, but there are other users of ggml.
Maybe instead of letting ggml-metal.m automatically call `ggml_metal_graph_find_concurrency`, we let llama.cpp decide whether to call `ggml_metal_graph_find_concurrency` and set `metal_ctx->concur_list`?

The logic in ggml-metal.m will then be less intrusive: if `metal_ctx->concur_list` is set, dispatch ops concurrently; otherwise fall back to the original code path.

This adds backend-specific code to llama.cpp for now, but I imagine that in the future we can have a `ggml_backend_graph_optimize()` interface to unify them.
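As a rough, compilable sketch of that split (the type and function names are illustrative stand-ins for the real ggml/llama.cpp API, not the actual implementation), the caller opts in once and the backend merely branches on whether the list was populated:

```c
#include <stdio.h>

// Hypothetical, simplified stand-ins for the real types and functions.
struct metal_ctx {
    int concur_list[64];
    int concur_list_len; // 0 = caller did not request concurrent dispatch
};

// Stand-in for ggml_metal_graph_find_concurrency(ctx, gf), which in the
// proposal above would be called from llama.cpp, not from the backend.
static void find_concurrency(struct metal_ctx * ctx) {
    ctx->concur_list[0] = 0;
    ctx->concur_list[1] = 1;
    ctx->concur_list_len = 2;
}

// Backend side: the only added Metal logic is a branch on concur_list.
static void graph_compute(struct metal_ctx * ctx, int n_nodes) {
    if (ctx->concur_list_len > 0) {
        for (int i = 0; i < ctx->concur_list_len; i++) {
            printf("concurrent dispatch of node %d\n", ctx->concur_list[i]);
        }
    } else {
        for (int i = 0; i < n_nodes; i++) {
            printf("sequential dispatch of node %d\n", i); // original path
        }
    }
}

int main(void) {
    struct metal_ctx ctx = { .concur_list_len = 0 };
    find_concurrency(&ctx); // llama.cpp decides to opt in
    graph_compute(&ctx, 2);
    return 0;
}
```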
Sounds good to me. Eventually we will need a better solution, but for now that should do. @ggerganov
Fixed.
On M1 Pro with Q4_0 7B the times are:
- master: 28.5 ms/t
- PR: 28.2 ms/t
See #2309 for the discussion.
Tested on M1 Max 32c GPU with:
./main -m model_file -n 256 -c 512 -s 123 -p "I believe the meaning of life is" --ignore-eos -ngl 1 --no-mmap -t 8
Btw, since we already have much faster kernels for both `Q_K` and non-`Q_K` quantizations, is it a good time to do prompt evaluation on Metal? Although the current kernels are optimized for mat-vec multiplications, they still provide much better performance than the CPU. (EDIT: not true for M1 Pro)