Qualcomm AI Engine Direct - Support batch prefill mode for llama3.2 #6983
Conversation
chunit-quic commented Nov 20, 2024
- Enable bert mode
- Change input sequence of static_llama
- Tag bert output as uint8
- Unify both 1b and 3b in 1 runner
- Add hybrid IO memory for llama3_2 runner
- Align timer with llama
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6983
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Cancelled Job as of commit ef2e1e5 with merge base a347665. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hey do you mind sharing the command for AoT and runtime so I can try on my end?
Sure! You can switch between the two modes (kv or bert) by setting the --model_mode argument; see the command below.
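The command, split across lines for readability (a sketch: ${ARCHIVE}, ${HOST}, ${DEVICE}, and ${SOC} are placeholders for your artifact directory, host, device serial, and SoC model, and the Llama3.2-1B-Instruct files are assumed to be local checkpoint downloads — replace them before running):

```bash
# Select the mode with --model_mode: kv (decode with KV cache) or bert
# (batch prefill; renamed to batch_prefill later in this PR).
python examples/qualcomm/oss_scripts/llama3_2/llama.py \
  -a ${ARCHIVE}/ -b build-android -H ${HOST} -s ${DEVICE} -m ${SOC} \
  --checkpoint Llama3.2-1B-Instruct/consolidated.00.pth \
  --params Llama3.2-1B-Instruct/params.json \
  --tokenizer_model Llama3.2-1B-Instruct/tokenizer.model \
  --prompt "<|start_header_id|>" \
  --ptq 16a4w --temperature 0 --model_size 1B --seq_len 16 \
  --model_mode bert
```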
Ah I see - do you mind renaming bert mode to batch_prefill? The context is that bert isn't a common name.
Force-pushed from 5dc7b3f to 0cff7c9
No problem, let me change it.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
There are some errors here.
Thanks for pointing that out. Fixed.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
install_requirements.py (Outdated)
@@ -137,7 +137,7 @@ def python_is_compatible():
     "timm==1.0.7",
     f"torchaudio==2.5.0.{NIGHTLY_VERSION}" if USE_PYTORCH_NIGHTLY else "torchaudio",
     "torchsr==1.0.4",
-    "transformers==4.46.1",
+    "transformers==4.42.4",  # TODO update back to 4.46.1 once the error is fixed
What was the issue with this?
Sorry for the confusion. This is for our internal CI. Removed.
One more thing: in case you want to reproduce the performance profiling results right now, note that we are still checking and working on some related passes. It might be better to wait for our next profiling results to see whether the execution times are aligned. Thanks!
Sounds good, thanks!
Maybe let's bring more CI to OSS, so these errors can be caught when creating the PR.
Still has a lint error...
Force-pushed from ff3ce36 to 08c4742
I'm getting the following error: […]
The repro step is: […]
- Enable bert mode - Change input sequence of static_llama - Tag bert output as uint8 - Unify both 1b and 3b in 1 runner - Add hybrid IO memory for llama3_2 runner - Align timer with llama
- Fix rebase conflict - Change input dtype of calibration function
Force-pushed from 767887d to ef2e1e5
Sure, just rebased.
May I know which QNN version you used? It seems to me that it might relate to PR 6811. I just tested with a smaller one (1 layer) without encountering an error. My QNN SDK version is 2.26.1.
I'm using the version downloaded from https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip. Maybe let me get your latest PR and see if it passes. It seems like both batch_prefill and kv fail.
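For reference, a minimal sketch of unpacking that SDK drop and exposing it to the build, assuming the QNN_SDK_ROOT convention used by the ExecuTorch Qualcomm backend (the extracted directory name can vary between SDK drops, so adjust the path):

```bash
# The zip is the one linked above; downloading it may require a Qualcomm account.
unzip v2.26.0.240828.zip -d qnn_sdk
# Point QNN_SDK_ROOT at the unpacked SDK root; adjust to wherever the archive
# actually extracts on your machine.
export QNN_SDK_ROOT=$(pwd)/qnn_sdk/qairt/2.26.0.240828
```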
Is the PR tested on QNN 2.28?
No, we tested on QNN 2.26 previously. I'm running batch_prefill mode with 16 layers now with a clean build again, and will do kv mode later.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @cccclai, I ran both kv and batch_prefill mode successfully (16 layers, normal size) with the command below, switching modes by changing the --model_mode arg. If you would like to export and execute separately, maybe you could consider using […]. Please let me know if you are still facing the error; I will try to find it out. :)
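That command, split across lines for readability (a sketch: ${bert_test}, ${HOST}, ${DEVICE}, and ${Llama3.2-1B-Instruct} are placeholders for the local output directory, host, device serial, and checkpoint directory — replace them with real paths before running):

```bash
python examples/qualcomm/oss_scripts/llama3_2/llama.py \
  -a ./${bert_test} -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" \
  --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth \
  --params ${Llama3.2-1B-Instruct}/params.json \
  --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model \
  --prompt "<|start_header_id|>" \
  --ptq 16a4w --temperature 0 --model_size 1B --seq_len 16 \
  --model_mode batch_prefill  # or: --model_mode kv
```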
Sort of orthogonal - did we verify the runner cpp correctness? We have a calibrated quantized model which shows reasonable results, but the runner gives purely nonsense results.
Yes, we performed some correctness checks.
We will soon update another PR based on this one, which supports hybrid mode. Maybe we can check correctness using that PR? Any other ideas are appreciated. :D
Ah yes, that would be great. If we enable the stories model, then we can add it to CI easily.
With that being said, should we just aim to merge that PR instead, and keep this PR on hold?
No, I would say we can merge this one first. Then the upcoming PR will have fewer code changes.