Qualcomm AI Engine Direct - enable loading context binary directly #4163
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4163
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 9878eb8 with merge base 740a0a5. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from b655ce1 to 42551da (Compare)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
feel free to ping me for re-review after updating
p.bwAxisScaleOffsetEncoding.scales = reinterpret_cast<float*>(
    const_cast<uint8_t*>(param->scales()->Data()));
p.bwAxisScaleOffsetEncoding.offsets = reinterpret_cast<int32_t*>(
    const_cast<uint8_t*>(param->offsets()->Data()));
why not just call data() directly instead of reinterpreting? https://github.com/google/flatbuffers/blob/master/include/flatbuffers/array.h#L101
Nice, but the const qualifier is intentionally kept for checking, rather than using mutable_xxx to get the pointers.
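For context, a minimal sketch of the two options being weighed here; the mutable accessor name is illustrative of flatbuffers' generated mutable API and has not been checked against this schema:

// Option taken in the PR: keep the const accessor so reads stay const-checked,
// and cast away constness only at the QNN boundary, which expects non-const
// float* / int32_t* pointers.
p.bwAxisScaleOffsetEncoding.scales = reinterpret_cast<float*>(
    const_cast<uint8_t*>(param->scales()->Data()));

// Reviewer's alternative: a mutable accessor plus data() removes the casts,
// but gives up const-ness on the flatbuffer object itself.
// p.bwAxisScaleOffsetEncoding.scales =
//     reinterpret_cast<float*>(param->mutable_scales()->data());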
std::shared_ptr<TensorWrapper> tensor_wrapper =
    CreateTensorWrapper(output_tensors[i]);
tensor_wrapper->UpdateQnnTensorMeta(output_tensors[i]);
std::string tensor_name = tensor_wrapper->GetName();
why does TensorWrapper::GetName() return a copy of the string instead of a reference? (I was going to correct this to const auto& tensor_name = tensor_wrapper->GetName())
Nice, didn't notice this.
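A minimal sketch of the suggested signature change, assuming TensorWrapper simply stores its name as a member:

class TensorWrapper {
 public:
  // Returning a const reference lets call sites bind without copying:
  //   const auto& tensor_name = tensor_wrapper->GetName();
  const std::string& GetName() const { return name_; }

 private:
  std::string name_;
};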
module,
tuple(
    torch.randn(size=v.shape, dtype=v.dtype)
    for _, v in bundle_program["inputs"].items()
is bundle_program["inputs"] a dict? if so, why not for v in bundle_program["inputs"].values()?
Thanks for pointing that out, will change all dictionary-related comments.
module,
tuple(
    torch.randn(size=v.shape, dtype=v.dtype)
    for _, v in bundle_program["inputs"].items()
same
for (int i = 0; i < 8; ++i) { // layers per shard
  for (int j = 0; j < 2; ++j) { // k_cache + v_cache
    for (int k = 0; k < 32; ++k) { // heads
might as well make some named constants for these and then you don't need the comments here and below
Changed, thanks.
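A sketch of the named-constant version, with constant names invented here and the values copied from the loop bounds above:

constexpr int kLayersPerShard = 8;
constexpr int kCachesPerLayer = 2;  // k_cache + v_cache
constexpr int kNumHeads = 32;

for (int layer = 0; layer < kLayersPerShard; ++layer) {
  for (int cache = 0; cache < kCachesPerLayer; ++cache) {
    for (int head = 0; head < kNumHeads; ++head) {
      // ... loop body unchanged ...
    }
  }
}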
for (int i = 0; i < 8; ++i) { // layers per shard
  for (int j = 0; j < 2; ++j) { // k_cache + v_cache
    for (int k = 0; k < 32; ++k) { // heads
ditto named constants
cv_.wait(lock, [this] { return !jobs_.empty() || quit_; });

if (quit_ && jobs_.empty())
not sure on the semantics of quit_ here; I would have guessed that quit_ would mean to stop even if not all the jobs are done
Changed to stop_.
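A rough sketch of the drain-then-stop semantics the rename clarifies; the mutex, queue, and the rest of the class are assumed here, not copied from the PR:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

class JobQueue {
 public:
  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
  }

  // Worker loop: keep running queued jobs, and exit only once stop_ is set
  // AND the queue has drained, so pending jobs still complete.
  void Worker() {
    for (;;) {
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return !jobs_.empty() || stop_; });
      if (stop_ && jobs_.empty())
        return;
      auto job = std::move(jobs_.front());
      jobs_.pop();
      lock.unlock();
      job();  // run the job outside the lock
    }
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> jobs_;
  bool stop_ = false;
};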
for (int i = 0; i < vocab_size_; i += 4) {
  const uint16_t* in = logits + i;
  float* out = logits_f.data() + i;
  int32x4_t q = {in[0], in[1], in[2], in[3]};
I would recommend issuing a vectorized load and then vcvtq_s32_u16 here because this probably does a bunch of slow fmovs, though I haven't checked generated assembly
Thanks for this, I think there is no vcvtq_u16_s32 intrinsic. Changed the code a little to omit the fmov (verified with Compiler Explorer).
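For reference, a sketch of one fmov-free variant, assuming the only goal is widening the uint16 logits to float; the PR's actual dequantization math is left out:

#include <arm_neon.h>

for (int i = 0; i < vocab_size_; i += 4) {
  uint16x4_t u16 = vld1_u16(logits + i);   // one vectorized load
  uint32x4_t u32 = vmovl_u16(u16);         // widen u16 lanes to u32
  float32x4_t f = vcvtq_f32_u32(u32);      // convert to float in-register
  vst1q_f32(logits_f.data() + i, f);       // store four floats
}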
}

std::vector<Result<MethodMeta>> Runner::get_methods_meta() {
  std::vector<Result<MethodMeta>> methods_meta;
reserve?
Thanks, changed.
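A sketch of the reserved version; method_names_ and module_ are assumed members, shown only to illustrate where the reserved size comes from:

std::vector<Result<MethodMeta>> Runner::get_methods_meta() {
  std::vector<Result<MethodMeta>> methods_meta;
  methods_meta.reserve(method_names_.size());  // avoid reallocations while appending
  for (const auto& name : method_names_) {
    methods_meta.push_back(module_->method_meta(name));
  }
  return methods_meta;
}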
Hi @swolchok, thank you so much for the great comments. I applied all the review suggestions but did not reply to every conversation.
looking good! re:unique_ptr deleter, it looks like a function pointer works fine: https://godbolt.org/z/nbfja83oY
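A minimal sketch of the function-pointer-deleter pattern from the godbolt link; the buffer type and helper names are illustrative:

#include <cstdlib>
#include <memory>

void FreeBuffer(void* ptr) {
  std::free(ptr);
}

using BufferPtr = std::unique_ptr<void, void (*)(void*)>;

BufferPtr MakeBuffer(std::size_t bytes) {
  // The deleter is just a plain function pointer; no custom functor type needed.
  return BufferPtr(std::malloc(bytes), &FreeBuffer);
}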
Mind rebasing? Seems like there is a land race.
Summary:
- add utilities for loading context binary generated from qnn tools
- align env variable naming with qnn
- fix bug in online prepare and extend coverage to support bitwise quantization
- llama7b e2e example from qualcomm ai_hub
- minor fixes for style & typos
Force-pushed from 8cd43ed to 9878eb8 (Compare)
Updated, thank you for the hint.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I noticed that Llama 3 is available on https://huggingface.co/qualcomm/Llama-v3-8B-Chat. Any plans to add Llama 3 support through llama_qaihub.py?
We're working on it. Will reply on this thread once the PR is ready.
Hi @a21550, llama3 8b is ready for trial on #4789. Please run on a mobile device with at least 12GB RAM. Disable the low memory killer if the process gets killed:

# start from a fresh state
adb -s $DEVICE_SERIAL reboot
adb -s $DEVICE_SERIAL root
adb -s $DEVICE_SERIAL shell
# type the following on the device to disable the low memory killer
cd /sys/devices/system/memory
for i in $(ls | grep memory); do echo 0 > $i/online; done
for i in $(ls | grep memory); do echo online_kernel > $i/state; done
Congratulations on making Llama 3 work!!! I will play around with it and let you know!
This PR needs a
Summary: