Qualcomm AI Engine Direct - Optimization in static llama #6849
Conversation
Summary:
- Fuse RMS norm
- Improve performance of the div op
- Fix 16a8w annotation for the matmul op
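For context on the first item, below is the standard RMS-norm computation that gets fused into a single backend op; it is a generic reference sketch, not the fused Qualcomm kernel added by this PR.

```python
# Generic RMS-norm reference (standard definition, not the fused QNN kernel):
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root-mean-square over the last dimension, then scale.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight
```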
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6849
Note: Links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 1 New Failure: as of commit 1732d06 with merge base 21eecff, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, here are some optimizations to reproduce the performance. Thanks a lot!
Thanks for the PR. Still working on reproducing the numbers...
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Here are my latency numbers without this PR, on main:
With this PR:
Did you test it on Gen 3?
I set the prompt and seq_len, just to be accurate, and the following are the numbers:
Yeah, OnePlus 12 (SM8650, 16 GB RAM).
Oh, do you know which one is wrong? Naveen just sent an email regarding the accuracy issue between the fake-quantized model and the on-device model. Could it be related?
I am not sure whether it is related. I noticed that we annotate linear with MovingAverageMinMaxObserver but annotate other ops with MinMaxObserver in llama. This can make the quant attributes of some ops unreasonable. For example, for a transpose op we should expect identical quant attributes on its input and output, but we get different values due to the observer mismatch.
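A minimal, generic sketch of the mismatch described above (not the actual ExecuTorch Qualcomm quantizer code): the same activation stream yields different scale/zero-point depending on which observer is attached.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver

# One observer per annotation style: linear uses a moving-average observer,
# other ops use a plain min/max observer.
minmax = MinMaxObserver(dtype=torch.quint8)
moving_avg = MovingAverageMinMaxObserver(averaging_constant=0.01, dtype=torch.quint8)

# Feed the same activations to both observers, as would effectively happen when
# a transpose input is annotated via the linear config and its output via the
# default config.
for _ in range(10):
    x = torch.randn(4, 8)
    minmax(x)
    moving_avg(x)

print(minmax.calculate_qparams())      # scale / zero_point from the global min/max
print(moving_avg.calculate_qparams())  # scale / zero_point from a running average
```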
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
Hi @cccclai,
Looks good to me. Thanks!
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
elif node.target == torch.ops.aten.cat.default:
    annotate_cat(node, quantization_config_8a8w)
    node = node.args[0][0]
What pattern is this trying to capture?
The following pattern:

q (16 bits) --------------------------\
                                       matmul op (16 bits)
past k / v (8 bits) --\               /
                       cat op (8 bits)
new k / v (8 bits) ---/
(transpose after k)
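For readers following along, here is a simplified, illustrative version of the walk the snippet above performs; `annotate_matmul_kv_branch`, the `annotate_*` helpers, and the config objects are assumptions for the sketch, not the actual pass in this PR.

```python
# Keep the matmul at 16 bits, annotate the cat that merges past and new k/v
# (and anything reached through its first input) at 8a8w.
import torch

def annotate_matmul_kv_branch(matmul_node, config_16a8w, config_8a8w,
                              annotate_matmul, annotate_cat):
    annotate_matmul(matmul_node, config_16a8w)
    node = matmul_node.args[1]  # follow the k/v input of the matmul
    while node is not None:
        if node.target == torch.ops.aten.cat.default:
            annotate_cat(node, config_8a8w)
            node = node.args[0][0]  # first tensor in the cat's list input
        else:
            break
```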
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Looks like it was added in #6849; maybe it was using the old API for the default 8-bit quantization. Differential Revision: D66219251