Qualcomm AI Engine Direct - Support Llama3 QAIHub #4789
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4789. ✅ No failures as of commit 9a3f302 with merge base 447dc6c. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Hi @cccclai,
Thanks - this is great! I wonder if we have reference latency/accuracy/RAM usage numbers?
# TODO: QNN seems to have an expected spill-fill size that can be found in the log.
# Find a way to set this value instead of manually going through the log to retrieve it.
custom_spill_fill = 128974848 if args.use_prompt_processor else 3932160
# set up spill-fill buffer to relieve runtime memory usage
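As a side note on the TODO above: one way to avoid the manual step would be to scrape the value from a saved QNN log on the host. A minimal sketch, assuming the log contains a line mentioning the spill-fill size followed by a byte count (the matched log format and the `qnn_htp.log` file name are assumptions, not something this PR defines):

```python
import re
from typing import Optional

def spill_fill_from_log(log_path: str) -> Optional[int]:
    """Scan a QNN delegate log for the reported spill-fill buffer size.

    NOTE: the line format matched here is an assumption; adjust the
    pattern to whatever your QNN version actually prints.
    """
    pattern = re.compile(r"spill[-_ ]?fill\D*(\d+)", re.IGNORECASE)
    with open(log_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                return int(match.group(1))
    return None  # size not found; caller falls back to a default
```

It could then replace the hardcoded assignment, e.g. `custom_spill_fill = spill_fill_from_log("qnn_htp.log") or (128974848 if args.use_prompt_processor else 3932160)`, keeping the current constants only as a fallback.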
Do we know the RAM usage and latency numbers?
Hi @cccclai,
Thanks a lot for reviewing the PR.
- Latency: We are currently testing this on our engineering device, where we get 13 tok/sec in KV Cache mode and 0.8 tok/sec in Bert mode.
- Accuracy: We don't have specific accuracy metrics such as perplexity available yet. We tested with examples such as:
  Prompt: "What is baseball?"
  Response: "It is a game of skill, strategy, and physical ability. It is played by two teams, each with nine players. The game is played on a diamond-shaped field, with a pitcher's mound at one corner. The objective of the game is to score more runs than the opposing team by hitting the ball with a bat and running around the four bases on the field. The team with the most runs at the end of nine innings wins the game. The game of baseball is a classic and timeless game that is enjoyed by people all over the world. It is a game that is easy to learn but difficult to"
- Memory: Memory usage for both KV Cache mode and Bert mode is around 11GB. We have tested on a 16GB engineering device and verified that it works.
Please let me know if you have any other questions.
Thanks!
BTW, you mentioned the memory usage for you was around 11GB. I used both "top" and "dumpsys" to check the physical RAM usage, and both showed about 5GB for me on SM8650, as shown below. Were you using a different way to check RAM usage? I'm assuming the context size was 1024 for your testing as well?
=======================================================================
Tasks: 838 total, 2 running, 836 sleeping, 0 stopped, 0 zombie
Mem: 15267M total, 15129M used, 138M free, 1M buffers
Swap: 9924M total, 756M used, 9167M free, 6231M cached
800%cpu 56%user 0%nice 41%sys 696%idle 0%iow 7%irq 1%sirq 0%host
PID USER PR NI VIRT RES SHR S[%CPU] %MEM TIME+ ARGS
25091 shell 20 0 22G 4.8G 4.7G R 72.0 32.5 1:18.94 qaihub_llama3_8b_runner --sharded_1_path qaihub_llama3_8b_token_0.pte --sharded_2_path qaihub_llama3_8b_token_1.pte --sharded_3_path qaihub+
=======================================================================
a21550@a21550:Works$ adb shell dumpsys meminfo 25091
Applications Memory Usage (in Kilobytes):
Uptime: 102398639 Realtime: 102398639
Pss Private Private Swap Rss Heap Heap Heap
Total Dirty Clean Dirty Total Size Alloc Free
------ ------ ------ ------ ------ ------ ------ ------
Native Heap 97688 97688 0 0 97688 0 0 0
Dalvik Heap 0 0 0 0 0 0 0 0
Stack 392 392 0 0 392
Other dev 4 0 4 0 336
.so mmap 3852 356 3364 0 6720
Other mmap 4984258 76 4984180 0 4984820
Unknown 1456 1456 0 0 1456
TOTAL 5087650 99968 4987548 0 5091412 0 0 0
App Summary
Pss(KB) Rss(KB)
------ ------
Java Heap: 0 0
Native Heap: 97688 97688
Code: 3720 6720
Stack: 392 392
Graphics: 0 0
Private Other: 4985716
System: 134
Unknown: 4986612
TOTAL PSS: 5087650 TOTAL RSS: 5091412 TOTAL SWAP (KB): 0
=======================================================================
Hi @a21550,
I am using top to check memory usage, and I think we are getting similar results. However, the way I check memory consumption is to execute the runner while making sure no other apps are running at the same time, and then check the Mem usage, which went from:
Mem: 15185M total, 5179M used, 10005M free, 14M buffers
Swap: 6143M total, 0M used, 6143M free, 2329M cached
to
Mem: 15185M total, 14901M used, 283M free, 2M buffers
Swap: 6143M total, 622M used, 5521M free, 6230M cached
which is around 11GB of memory usage.
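For anyone who wants to reproduce this before/after comparison, the delta can be automated with a small host-side script. A sketch, assuming a device reachable via adb (it reads /proc/meminfo directly rather than top's summary line; the script itself is not part of this PR):

```python
import re
import subprocess

def device_mem_used_kb() -> int:
    """Used memory on the device (MemTotal - MemAvailable), in kB."""
    out = subprocess.run(
        ["adb", "shell", "cat", "/proc/meminfo"],
        capture_output=True, text=True, check=True,
    ).stdout
    fields = dict(re.findall(r"^(\w+):\s+(\d+) kB", out, re.MULTILINE))
    return int(fields["MemTotal"]) - int(fields["MemAvailable"])

baseline = device_mem_used_kb()
input("Start the runner on the device, then press Enter to sample...")
loaded = device_mem_used_kb()
print(f"Delta: {(loaded - baseline) / 1024 / 1024:.1f} GB")
```

Note that a system-wide delta like this (or top's "Mem: ... used" line) also counts page-cache growth, for example from the mmap'd .pte files, which is one plausible reason it can read much higher than the per-process RSS/PSS that top and dumpsys report.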
Great job fixing those hardcoded values in io_memory.cpp and runner.cpp, which will make it easier to add new models later! On my own branch, I just created a config.h for each individual model and put all the model-specific configs into it.
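The config.h described above lives on the C++ runner side. Purely as a sketch of the same pattern on the Python export side (field names and the shard count below are hypothetical; only the two spill-fill values are taken from this thread):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Per-model constants gathered in one place instead of scattered hardcoding."""
    name: str
    num_sharding: int        # hypothetical field
    prompt_spill_fill: int   # spill-fill size in prompt-processor mode
    kv_spill_fill: int       # spill-fill size in token-generator (KV) mode

QAIHUB_LLAMA3_8B = ModelConfig(
    name="qaihub_llama3_8b",
    num_sharding=4,          # hypothetical value
    prompt_spill_fill=128974848,
    kv_spill_fill=3932160,
)
```

Either way, the idea is the same: one declarative record per model, so adding a new model touches one file instead of constants spread across the codebase.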
Hi @a21550,
This patch worked smoothly for me! Thank you for all the wonderful work! BTW, because of the following patch, the artifact path in README.md may need a tiny update. I noticed that if I passed "--seq_len 1024" to qaihub_llama3_8b.py, the inference output was something like the text below, which might indicate a hidden bug somewhere.
============================================
Based on this assumption, the top 3 composers of all time are:
These three composers - Beethoven, Mozart, and Bach - are generally considered to be among the most important and influential composers in the history of Western classical music. [closed] I hope this helps! Let me know if you have any questions or if there's anything else I can help you with. [closed] I hope this information is helpful. Let me know if you have any questions or if there's anything else I can help you with. [closed] (the same two sentences repeat for most of the remaining output, then degenerate into) ... if you have any questions or if you have any questions or if you have any questions and if you have any questions the following 3 most influential in your 3Who
Thanks for catching this issue. I will create a new PR to address it.
I noticed that Llama 3.2 3B is available now. Do we have plans to add it to ExecuTorch? https://aihub.qualcomm.com/models/llama_v3_2_3b_chat_quantized
Thank you for sharing the information. We are currently evaluating how Llama 3.2 3B can be enabled. I will keep you updated as soon as we have a more definitive timeline.
Summary: