Qualcomm AI Engine Direct - Refine Llama3 Tokenizer #4940
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4940
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit f162b71 with merge base 801e1c9.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai,
Please have a look.
Looks great. Thanks for sending the PR!
```diff
@@ -239,7 +239,7 @@ We can test model inferences before deploying it to a device by HTP emulator.
 Let's build `qnn_executor_runner` for a x64 host:
 ```bash
 # assuming the AOT component is built.
-cd $EXECUTORCH_ROOT/cmake-out
+cd $EXECUTORCH_ROOT/build-x86
```
Thanks for fixing this. I reverted the change to use `build-x86` instead, and it seems like some cases are missing.
```diff
@@ -230,7 +220,7 @@ def post_process():
 parser.add_argument(
     "--temperature",
     help="sampling temperature for llama2",
-    default=0.8,
+    default=0.0,
```
Any specific reason we're using 0 temperature?
We changed the default to 0 because the output is more consistent, which is better for testing purposes.
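For context, here is a minimal sketch (not the runner's actual sampling code) of why temperature 0 yields deterministic output:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float) -> int:
    """Pick the next token id from a logits vector."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token, so
        # repeated runs on the same prompt produce identical output.
        return int(np.argmax(logits))
    # Temperature sampling: scale the logits, then draw from the softmax
    # distribution; higher temperature means more randomness.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```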
I'm not sure of the best way to test this flow, but it would be great to have CI set up.
Just for the record, this patch works perfectly for me with QNN 2.24.0.240626. However, the context binaries/PTE files generated with QNN 2.25.0.240728 produce garbage output. Once I switched back to the context binaries/PTE files generated with QNN 2.24.0.240626, things were back to normal.
Hi @a21550,
Thanks!
Hi @winskuo-quic,
Thanks!
Thank you so much for sharing this valuable information!
Summary
A community user reported an issue (Qualcomm AI Engine Direct - Support Llama3 QAIHub #4789 (comment)) that the QAIHub Llama3 runner does not end text generation when it hits EOT. I have followed the model card format to feed the prompt to the model and also enabled an option for users to set a system prompt (see the sketch after the examples). Below are some examples:
Prompt: "What is 2+3?"
Response:
Prompt: "What is 2+3?" System_Prompt: "You are a bad assistant that thinks 2+3=4"
Response:
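For reference, a minimal sketch of the Llama 3 model-card prompt format with an optional system prompt; the special-token strings follow Meta's published template, but the runner's actual implementation may differ:

```python
EOT = "<|eot_id|>"

def build_llama3_prompt(user_prompt: str, system_prompt: str = "") -> str:
    """Assemble a prompt following the Llama 3 chat template."""
    parts = ["<|begin_of_text|>"]
    if system_prompt:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}{EOT}"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt}{EOT}")
    # Leave the assistant header open so the model generates the response;
    # generation should stop once the model emits the EOT token.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```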
The same community user also reported another issue (Qualcomm AI Engine Direct - Support Llama3 QAIHub #4789 (comment)): some documents currently build the path using `cmake-out-android` while others use `build-android`. This PR updates the documents to change `cmake-out-android` to `build-android`.
This PR also addresses the issue suggested in Qualcomm AI Engine Direct - QAIHub's context binary file for Stable Diffusion #4836 (review): boilerplate code is now moved to `utils.py` (a hypothetical sketch of the pattern appears at the end of this summary).
Update `build.sh` from `$BUILD_ROOT/sdk` to `$BUILD_ROOT/devtools`.
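As an illustration of the `utils.py` refactor mentioned above, a hypothetical sketch of the shared-boilerplate pattern (the helper name and arguments here are invented for illustration, not the actual `utils.py` API):

```python
# utils.py (hypothetical sketch of the shared-boilerplate pattern)
import argparse

def build_common_parser() -> argparse.ArgumentParser:
    """Collect CLI arguments shared by the QAIHub example scripts."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", help="serial number of the target device")
    parser.add_argument("--build_folder", help="path to the cmake build output")
    return parser

# In an example script, instead of repeating the argument definitions:
# parser = build_common_parser()
# parser.add_argument("--temperature", default=0.0, type=float)
# args = parser.parse_args()
```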