
metal: concurrently dispatch commands #2358


Merged: 4 commits into ggml-org:master on Jul 25, 2023

Conversation

lshzh-ww (Contributor) commented on Jul 24, 2023

For discussion, see #2309.

When `ggml_metal_graph_compute` is called for the first time, the function `ggml_metal_graph_find_concurrency` writes the commands that can be issued concurrently into the Metal context's `concur_list` array.
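
For intuition, here is a minimal self-contained sketch of the kind of level scheduling involved (illustrative only; the node layout and names are assumptions, not the actual ggml-metal.m code): a node joins the current level once all of its parents were scheduled in earlier levels, and a `-1` sentinel in the output list marks a level boundary where the encoder would insert a barrier.

```objc
#include <stdbool.h>

#define MAX_NODES 64
#define MAX_SRC    2

typedef struct {
    int src[MAX_SRC]; // indices of parent nodes; -1 if unused
} node_t;

// Writes a schedule into out[] (size >= 2*n_nodes): node indices grouped by
// level, each level terminated by -1. Nodes within one level have no mutual
// dependencies, so their kernels can be encoded for concurrent dispatch.
static int find_concurrency(const node_t * nodes, int n_nodes, int * out) {
    int  level_of[MAX_NODES];
    bool done[MAX_NODES] = { false };
    int  len = 0, n_done = 0, level = 0;

    while (n_done < n_nodes) {
        for (int i = 0; i < n_nodes; i++) {
            if (done[i]) continue;
            bool ready = true;
            for (int s = 0; s < MAX_SRC; s++) {
                const int p = nodes[i].src[s];
                // a parent scheduled in the *current* level is not finished
                // yet, so this node must wait for the next level
                if (p >= 0 && (!done[p] || level_of[p] == level)) {
                    ready = false;
                }
            }
            if (ready) {
                level_of[i] = level;
                done[i]     = true;
                out[len++]  = i;
                n_done++;
            }
        }
        out[len++] = -1; // level boundary: a memory barrier goes here
        level++;
    }
    return len;
}
```

On a diamond graph a → b, a → c, (b, c) → d this produces 0, -1, 1, 2, -1, 3, -1, so b and c share a level and can run concurrently.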

Tested on an M1 Max (32-core GPU) with:

```sh
./main -m model_file -n 256 -c 512 -s 123 -p "I believe the meaning of life is" --ignore-eos -ngl 1 --no-mmap -t 8
```

| model | master | PR | improvement |
| --- | --- | --- | --- |
| 33B Q4_0 | 73.0 ms/tok | 70.1 ms/tok | ~4% |
| 7B Q4_0 | 19.50 ms/tok | 18.26 ms/tok | ~6% |

By the way, since we already have much faster kernels for both Q_K and non-Q_K quantizations, is it a good time to do prompt evaluation on Metal? Although the current kernels are optimized for mat-vec multiplication, they still provide much better performance than the CPU. (EDIT: not true for M1 Pro)

Commit message:

> Function `ggml_metal_graph_find_concurrency` will run and write commands that can be issued concurrently to the Metal context's `concur_list` array when `ggml_metal_graph_compute` is called for the first time.
lshzh-ww requested a review from ggerganov on Jul 24, 2023 at 05:46
ggml-metal.m (outdated), comment on lines 445 to 447:

```objc
if (!ctx->concur_list_len) {
    ggml_metal_graph_find_concurrency(ctx, gf);
}
```
Member:

Seems like this will break when computing graphs of different topologies.

lshzh-ww (Contributor, author):

I think for now we can assume that during the lifetime of the Metal context, the topology of the graph won't change. In the future, when we need to change the graph topology and we also have a mechanism to tell the backend that it has changed, we can easily add the necessary code to handle the updated topology.
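
Such a mechanism could be as small as an invalidation hook (hypothetical, not part of this PR) that clears the cached list so the next compute call re-runs the analysis:

```objc
// hypothetical helper, not in this PR: invalidate the cached schedule when
// the caller knows the graph topology has changed; the next
// ggml_metal_graph_compute call would then re-run the concurrency analysis
void ggml_metal_graph_clear_concurrency(struct ggml_metal_context * ctx) {
    ctx->concur_list_len = 0;
}
```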

Member:

For llama that's true, but there are other users of ggml.

lshzh-ww (Contributor, author):

Maybe instead of letting ggml-metal.m automatically call `ggml_metal_graph_find_concurrency`, we could let llama.cpp decide whether to call `ggml_metal_graph_find_concurrency` and set `metal_ctx->concur_list`?

The logic in ggml-metal.m would then not be intrusive: if `metal_ctx->concur_list` is set, dispatch ops concurrently; otherwise fall back to the original code path.

This will add backend-specific code to llama.cpp for now, but I imagine that in the future we could have a `ggml_backend_graph_optimize()` interface to unify them.
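
A rough sketch of that fallback in `ggml_metal_graph_compute` might look like this (hedged: `command_buffer` and the encoding details are assumptions for illustration, not the exact code of this PR):

```objc
// dispatch concurrently only when the caller has set a concurrency list
if (ctx->concur_list_len) {
    id<MTLComputeCommandEncoder> encoder =
        [command_buffer computeCommandEncoderWithDispatchType:MTLDispatchTypeConcurrent];

    for (int i = 0; i < ctx->concur_list_len; i++) {
        if (ctx->concur_list[i] == -1) {
            // level boundary: later nodes read results of earlier ones
            [encoder memoryBarrierWithScope:MTLBarrierScopeBuffers];
            continue;
        }
        // ... encode the kernel for node ctx->concur_list[i] ...
    }
    [encoder endEncoding];
} else {
    // original path: encode nodes serially, in graph order
}
```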

Member:

Sounds good to me. Eventually we will need a better solution, but for now that should do. @ggerganov

lshzh-ww (Contributor, author):

Fixed.

ggerganov (Member) left a comment:

On M1 Pro with Q4_0 7B the times are:

  • master: 28.5 ms/t
  • PR: 28.2 ms/t

ggerganov merged commit 1aa18ef into ggml-org:master on Jul 25, 2023