Qualcomm AI Engine Direct - support llama3.2 1B/3B with static llama in kv mode #6779


Conversation

shewu-quic
Collaborator

Summary:

  • Add custom_annotate_llama_last_conv_16a8w
  • Add llama.py and runner for llama3.2 1B/3B
  • Support model sharding and sha
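Model sharding here refers to splitting the decoder stack into several contiguous sub-graphs so each piece stays within backend limits. As an illustrative sketch only (the helper name and partitioning scheme below are assumptions, not the PR's actual implementation):

```python
# Illustrative sketch: partition decoder-layer indices into contiguous
# shards, distributing any remainder to the earliest shards. This is a
# generic helper, not an ExecuTorch API.

def shard_layers(num_layers: int, num_shards: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into num_shards contiguous ranges."""
    base, rem = divmod(num_layers, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < rem else 0)  # first `rem` shards take one extra layer
        shards.append(range(start, start + size))
        start += size
    return shards

# e.g. a 16-layer 1B-class model split into 4 shards:
print(shard_layers(16, 4))  # [range(0, 4), range(4, 8), range(8, 12), range(12, 16)]
```

Each range would then be lowered as its own graph, with activations handed off between shards at runtime.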


pytorch-bot bot commented Nov 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6779

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f2a0383 with merge base dc41596:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 12, 2024
@shewu-quic shewu-quic changed the title Qualcomm AI Engine Direct - support llama3.2 1B/3B with static llama Qualcomm AI Engine Direct - support llama3.2 1B/3B with static llama in kv mode Nov 12, 2024
Contributor

@cccclai cccclai left a comment


LGTM! Can you share the command line for both AOT and runtime?

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@shewu-quic
Collaborator Author

shewu-quic commented Nov 13, 2024

LGTM! Can you share the command line for both AOT and runtime?

Sure,

End-to-end flow for 16a4w llama 3.2 1B/3B instruct

python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ./llama3_2_1B_512 -b build-android -H ${ADB_HOST} -s ${ADB_SERIAL_NUM} -m "SM8650" --checkpoint ${checkpoint} --params ${params} --tokenizer_model ${tokenizer.model}  --prompt "What is 1+1?" --temperature 0 --seq_len 512  --model_size ${1B/3B} --ptq 16a4w

Compile only for 16a4w llama 3.2 1B/3B instruct

python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ./llama3_2_1B_512 -b build-android -H ${ADB_HOST} -s ${ADB_SERIAL_NUM} -m "SM8650" --checkpoint ${checkpoint} --params ${params} --tokenizer_model ${tokenizer.model} --prompt "What is 1+1?" --temperature 0 --seq_len 512 --model_size ${1B/3B} --compile_only --ptq 16a4w

Inference only for 16a4w llama 3.2 1B/3B instruct

python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ./llama3_2_1B_512 -b build-android -H ${ADB_HOST} -s ${ADB_SERIAL_NUM} -m "SM8650" --checkpoint ${checkpoint} --params ${params} --tokenizer_model ${tokenizer.model}  --prompt "What is 1+1?" --temperature 0 --seq_len 512 --pre_gen_pte ./llama3_2_1B_512 --model_size ${1B/3B} --ptq 16a4w
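The `${...}` placeholders in the commands above are filled in per environment. As a hedged sketch (the host, serial, and file paths below are illustrative values, not ones mandated by the script), the setup might look like:

```shell
# Hypothetical placeholder values for the variables used in the commands
# above; substitute your own ADB host/serial and artifact paths.
ADB_HOST="127.0.0.1"
ADB_SERIAL_NUM="emulator-5554"
CHECKPOINT="./consolidated.00.pth"
PARAMS="./params.json"
TOKENIZER_MODEL="./tokenizer.model"
MODEL_SIZE="1B"   # or "3B"

# Sanity-check the composed device flags before launching the real run.
echo "-H ${ADB_HOST} -s ${ADB_SERIAL_NUM} --model_size ${MODEL_SIZE}"
```

Note that `--pre_gen_pte` in the inference-only command points at the artifact directory produced by the compile-only run, so the two can be executed on separate machines.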

@facebook-github-bot facebook-github-bot merged commit 31dbfc9 into pytorch:main Nov 13, 2024
40 checks passed