Qualcomm AI Engine Direct - Refine Llama3 Tokenizer #4940


Merged: 1 commit into pytorch:main on Aug 29, 2024

Conversation

winskuo-quic (Collaborator)

Summary

  • A community user reported an issue (Qualcomm AI Engine Direct - Support Llama3 QAIHub #4789 (comment)) that the QAIHub Llama3 runner does not end text generation when hitting EOT. I have followed the model card format to feed the prompt to the model and also enabled an option for users to set a system prompt (see the sketch after this list). Below are some examples:

    • Prompt: "What is 2+3?"

      • Response:

            <|begin_of_text|><|start_header_id|>user<|end_header_id|>
            
            What is 2+3?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
            
            That's an easy one!
            
            2 + 3 = 5<|eot_id|>
        
    • Prompt: "What is 2+3?" System_Prompt: "You are a bad assistant that thinks 2+3=4"

      • Response:

            <|begin_of_text|><|start_header_id|>system<|end_header_id|>
          
            You are a bad assistant that thinks 2+3=4<|eot_id|>
            <|start_header_id|>user<|end_header_id|>
            
            What is 2+3?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
            
            That's an easy one! 2+3 is... 4!<|eot_id|> 
        
  • The same community user also reported another issue (Qualcomm AI Engine Direct - Support Llama3 QAIHub #4789 (comment)): some documents use cmake-out-android as the build path while others use build-android. This PR updates the documents, changing cmake-out-android to build-android.

  • This PR also addresses the issue raised in Qualcomm AI Engine Direct - QAIHub's context binary file for Stable Diffusion #4836 (review). Boilerplate code has been moved to utils.py.

  • Updates build.sh to use $BUILD_ROOT/devtools instead of $BUILD_ROOT/sdk.
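
For reference, the prompt format in the examples above follows the Llama3 model card. Below is a minimal sketch of how such a prompt string might be assembled; the helper name and structure are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch of the Llama3 model-card prompt format shown above.
# The helper name and defaults are assumptions, not the PR's actual code.

def format_llama3_prompt(user_prompt: str, system_prompt: str = "") -> str:
    prompt = "<|begin_of_text|>"
    if system_prompt:
        prompt += (
            "<|start_header_id|>system<|end_header_id|>\n\n"
            f"{system_prompt}<|eot_id|>"
        )
    prompt += (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    return prompt


print(format_llama3_prompt("What is 2+3?"))
```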

pytorch-bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4940

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit f162b71 with merge base 801e1c9:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Aug 28, 2024
winskuo-quic (Collaborator, Author)

Hi @cccclai,
This PR addresses two main issues:

  1. The QAIHub Llama3 runner could not hit the EOT condition and kept generating text until reaching the max sequence length. I have added a condition to stop text generation when hitting EOT (see the sketch below).
  2. You previously suggested reducing boilerplate code (Qualcomm AI Engine Direct - QAIHub's context binary file for Stable Diffusion #4836 (review)), so I have moved it to utils.py.

Please have a look.
Thanks
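
For context, a minimal sketch of the stopping condition described in point 1, assuming hypothetical tokenizer and model APIs (the actual logic lives in the QAIHub runner):

```python
# Illustrative sketch of stopping generation at EOT; every name here is
# an assumption, not the runner's actual API.
EOT_TOKEN = "<|eot_id|>"

def generate(model, tokenizer, prompt_ids, max_seq_len):
    eot_id = tokenizer.token_to_id(EOT_TOKEN)  # hypothetical tokenizer call
    tokens = list(prompt_ids)
    while len(tokens) < max_seq_len:
        next_id = model.sample_next(tokens)    # hypothetical model call
        if next_id == eot_id:
            break  # stop at EOT instead of running to max_seq_len
        tokens.append(next_id)
    return tokens
```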

cccclai (Contributor) left a comment

Looks great. Thanks for sending the PR!

````diff
@@ -239,7 +239,7 @@ We can test model inferences before deploying it to a device by HTP emulator.
 Let's build `qnn_executor_runner` for a x64 host:
 ```bash
 # assuming the AOT component is built.
-cd $EXECUTORCH_ROOT/cmake-out
+cd $EXECUTORCH_ROOT/build-x86
````
Contributor

Thanks for fixing this. I reverted the change to use build-x86 instead, and it seems like some cases are missing.

```diff
@@ -230,7 +220,7 @@ def post_process():
     parser.add_argument(
         "--temperature",
         help="sampling temperature for llama2",
-        default=0.8,
+        default=0.0,
```
Contributor

Any specific reason we're using 0 temperature?

winskuo-quic (Collaborator, Author)

We changed the default to 0 because the output is more consistent, which is better for testing purposes. A sketch of the effect is below.
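
For illustration, a minimal sketch of why temperature 0 makes output deterministic; this sampling helper is an assumption, not the script's actual code:

```python
import numpy as np

# Illustrative sketch, not the runner's actual sampling code.
def sample_next_token(logits: np.ndarray, temperature: float) -> int:
    if temperature == 0.0:
        # Greedy decoding: always pick the most likely token, so
        # repeated runs on the same prompt produce identical output.
        return int(np.argmax(logits))
    # Otherwise apply temperature scaling and sample stochastically.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```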

cccclai (Contributor) left a comment

I'm not sure of the best way to test this flow, but it would be great to have CI set up.

cccclai merged commit 0c6a77e into pytorch:main on Aug 29, 2024 (43 of 46 checks passed)
a21550 commented Sep 4, 2024

Just for the record, this patch works perfectly for me with QNN 2.24.0.240626. However, the context binaries/PTE files generated with QNN 2.25.0.240728 produce garbage output. Once I switched back to the context binaries/PTE files generated with QNN 2.24.0.240626, things were back to normal.

winskuo-quic (Collaborator, Author)

Hi @a21550,
Thank you so much for sharing the information!
Thank you so much for sharing the information!
For the time being, we recommend ensuring that the QNN version used to generate the context binaries matches the QNN version used to generate the PTE files. I believe QAIHub currently uses QNN 2.24.0 to generate the Llama3 context binaries. A quick way to check is sketched below.
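
As a quick sanity check, one could compare SDK versions before exporting. A minimal sketch, assuming the QNN SDK root contains a sdk.yaml with a version field (the file name and key are assumptions about the SDK layout):

```python
import os
import re

# Sketch: read the QNN SDK version before generating a PTE file.
# Assumes $QNN_SDK_ROOT/sdk.yaml contains a "version:" line; adjust
# the path and pattern if your SDK layout differs.
def qnn_sdk_version(sdk_root: str | None = None) -> str:
    sdk_root = sdk_root or os.environ["QNN_SDK_ROOT"]
    with open(os.path.join(sdk_root, "sdk.yaml")) as f:
        match = re.search(r"version:\s*([\w.]+)", f.read())
    return match.group(1) if match else "unknown"

EXPECTED = "2.24.0"  # version QAIHub reportedly used for Llama3 binaries
found = qnn_sdk_version()
if not found.startswith(EXPECTED):
    print(f"Warning: QNN SDK {found} != {EXPECTED}; output may be garbage.")
```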

winskuo-quic (Collaborator, Author)

Hi @a21550,
To further align with your environment, it would be appreciated if you could share the following information with us:

  1. Could you share the model of your Android device?
  2. For your Android device, are you using root access?

Thanks!

a21550 commented Oct 1, 2024

Hi @winskuo-quic,

  1. My development device is a customized Motorola razr+ 2024 with SM8650 and 16 GB DDR.
  2. Yes, I have root access.

Thanks!

winskuo-quic (Collaborator, Author)

Thank you so much for sharing this valuable information!
This will definitely help us further improve our code.
