
Qualcomm AI Engine Direct - enable loading context binary directly #4163


Closed
wants to merge 3 commits

Conversation

haowhsu-quic
Collaborator

Summary:

  • add utilities for loading context binary generated from qnn tools
  • align env variable naming with qnn
  • fix bug in online prepare and extend coverage to support bitwise quantization
  • llama7b e2e example from qualcomm ai_hub
  • minor fixes for style & typos


pytorch-bot bot commented Jul 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4163

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9878eb8 with merge base 740a0a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 8, 2024
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@swolchok swolchok left a comment


feel free to ping me for re-review after updating

Comment on lines 219 to 222
p.bwAxisScaleOffsetEncoding.scales = reinterpret_cast<float*>(
    const_cast<uint8_t*>(param->scales()->Data()));
p.bwAxisScaleOffsetEncoding.offsets = reinterpret_cast<int32_t*>(
    const_cast<uint8_t*>(param->offsets()->Data()));
Contributor

why not just call data() directly instead of reinterpreting? https://github.com/google/flatbuffers/blob/master/include/flatbuffers/array.h#L101

Collaborator Author

Nice, but the const qualifier is intentionally kept for checking, rather than using mutable_xxx to get non-const pointers.

std::shared_ptr<TensorWrapper> tensor_wrapper =
    CreateTensorWrapper(output_tensors[i]);
tensor_wrapper->UpdateQnnTensorMeta(output_tensors[i]);
std::string tensor_name = tensor_wrapper->GetName();
Contributor

why does TensorWrapper::GetName() return a copy of the string instead of a reference? (I was going to correct this to const auto& tensor_name = tensor_wrapper->GetName())

Collaborator Author

Nice, didn't notice this.
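
A minimal sketch of the suggested change, assuming TensorWrapper keeps its name in a std::string member (the member name name_ is illustrative): returning a const reference avoids copying the string at every call site.

#include <string>

class TensorWrapper {
 public:
  // Return a reference instead of a copy; callers that need ownership can
  // still copy explicitly.
  const std::string& GetName() const { return name_; }

 private:
  std::string name_;
};

// Call sites can then bind without copying:
// const std::string& tensor_name = tensor_wrapper->GetName();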

module,
tuple(
    torch.randn(size=v.shape, dtype=v.dtype)
    for _, v in bundle_program["inputs"].items()
Contributor

is bundle_program["inputs"] a dict? if so, why not for v in bundle_program["inputs"].values()?

Collaborator Author

Thanks for pointing that out; will change this for all dictionary-related comments.

module,
tuple(
    torch.randn(size=v.shape, dtype=v.dtype)
    for _, v in bundle_program["inputs"].items()
Contributor

same

Comment on lines 136 to 138
for (int i = 0; i < 8; ++i) { // layers per shard
  for (int j = 0; j < 2; ++j) { // k_cache + v_cache
    for (int k = 0; k < 32; ++k) { // heads
Contributor

might as well make some named constants for these and then you don't need the comments here and below

Collaborator Author

Changed, thanks.
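
A minimal sketch of the named-constant suggestion; the values mirror the comments in the original loops, and the constant names are made up for illustration.

constexpr int kLayersPerShard = 8;
constexpr int kCachesPerLayer = 2;  // k_cache + v_cache
constexpr int kNumHeads = 32;

for (int i = 0; i < kLayersPerShard; ++i) {
  for (int j = 0; j < kCachesPerLayer; ++j) {
    for (int k = 0; k < kNumHeads; ++k) {
      // per-head work goes here
    }
  }
}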

Comment on lines 291 to 293
for (int i = 0; i < 8; ++i) { // layers per shard
  for (int j = 0; j < 2; ++j) { // k_cache + v_cache
    for (int k = 0; k < 32; ++k) { // heads
Contributor

ditto named constants

Comment on lines 437 to 439
cv_.wait(lock, [this] { return !jobs_.empty() || quit_; });

if (quit_ && jobs_.empty())
Contributor

not sure on the semantics of quit_ here; I would have guessed that quit_ would mean to stop even if not all the jobs are done

Collaborator Author

Changed to stop_.
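
For context, a minimal sketch of the worker-loop semantics being discussed, using the renamed stop_ flag; the members (jobs_, mutex_, cv_) follow the excerpt above, but the surrounding class name is illustrative. The thread keeps draining queued jobs and only exits once stop_ is set and the queue is empty.

void ThreadPool::Worker() {
  while (true) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !jobs_.empty() || stop_; });
    // Exit only after all queued jobs have been drained.
    if (stop_ && jobs_.empty()) {
      return;
    }
    auto job = std::move(jobs_.front());
    jobs_.pop();
    lock.unlock();
    job();  // run outside the lock so other workers can dequeue
  }
}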

for (int i = 0; i < vocab_size_; i += 4) {
  const uint16_t* in = logits + i;
  float* out = logits_f.data() + i;
  int32x4_t q = {in[0], in[1], in[2], in[3]};
Contributor

I would recommend issuing a vectorized load and then vcvtq_s32_u16 here because this probably does a bunch of slow fmovs, though I haven't checked generated assembly

Collaborator Author

Thanks for this. I think there is no vcvtq_u16_s32 intrinsic, so I changed the code a little to omit the fmovs (verified with Compiler Explorer).
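
For reference, a minimal sketch of the vectorized-load approach under discussion, assuming the same logits (const uint16_t*) and logits_f (std::vector<float>) buffers; any dequantization with scale/offset is omitted for brevity. A single vld1_u16 load plus lane widening avoids the per-lane fmov moves.

#include <arm_neon.h>

for (int i = 0; i < vocab_size_; i += 4) {
  uint16x4_t u16 = vld1_u16(logits + i);   // one 64-bit vector load
  uint32x4_t u32 = vmovl_u16(u16);         // widen uint16 -> uint32 lanes
  float32x4_t f = vcvtq_f32_u32(u32);      // convert to float lanes
  vst1q_f32(logits_f.data() + i, f);       // store 4 converted values
}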

}

std::vector<Result<MethodMeta>> Runner::get_methods_meta() {
  std::vector<Result<MethodMeta>> methods_meta;
Contributor

reserve?

Collaborator Author

Thanks, changed.
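
A minimal sketch of the reserve suggestion; modules_ is a hypothetical member holding one entry per method, used only to show where the capacity hint would come from.

std::vector<Result<MethodMeta>> methods_meta;
methods_meta.reserve(modules_.size());  // avoid reallocations while appending one result per module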

@haowhsu-quic
Collaborator Author

Hi @swolchok, thank you so much for the great comments. I have applied all the review suggestions but did not reply to every conversation.
Please take a look; thanks for your time.

Contributor

@swolchok swolchok left a comment

looking good! re:unique_ptr deleter, it looks like a function pointer works fine: https://godbolt.org/z/nbfja83oY
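
As a minimal sketch of the function-pointer-deleter pattern from the godbolt link (FILE, std::fclose, and the file name are used purely for illustration, not the actual QNN types):

#include <cstdio>
#include <memory>

// The deleter type is part of the unique_ptr type; no custom functor needed.
std::unique_ptr<FILE, int (*)(FILE*)> file(std::fopen("context.bin", "rb"),
                                           std::fclose);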

@cccclai
Contributor

cccclai commented Jul 16, 2024

Mind rebasing? Seems like there is a land race.

haowhsu-quic and others added 3 commits July 16, 2024 19:29
Summary:
- add utilities for loading context binary generated from qnn tools
- align env variable naming with qnn
- fix bug in online prepare and extend coverage to support bitwise quantization
- llama7b e2e example from qualcomm ai_hub
- minor fixes for style & typos
@haowhsu-quic
Collaborator Author

looking good! re:unique_ptr deleter, it looks like a function pointer works fine: https://godbolt.org/z/nbfja83oY

Updated, thank you for the hint.

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@cccclai merged this pull request in 1b0bf1c.

@a21550

a21550 commented Aug 7, 2024

I noticed that Llama 3 is available on https://huggingface.co/qualcomm/Llama-v3-8B-Chat. Any plan to add Llama 3 support through llama_qaihub.py?

@haowhsu-quic
Collaborator Author

I noticed that Llama 3 is available on https://huggingface.co/qualcomm/Llama-v3-8B-Chat. Any plan to add Llama 3 support through llama_qaihub.py?

We're working on it and will reply on this thread once the PR is ready.
Thank you.

@haowhsu-quic
Collaborator Author

Hi @a21550, Llama 3 8B is ready for trial in #4789. Please run it on a mobile device with at least 12 GB of RAM.

Disable the low-memory killer if the process gets killed:

# start from a fresh state
adb -s $DEVICE_SERIAL reboot
adb -s $DEVICE_SERIAL root
adb -s $DEVICE_SERIAL shell
# run the following on the device to disable the low-memory killer
cd /sys/devices/system/memory
for i in $(ls | grep memory); do echo 0 > $i/online; done
for i in $(ls | grep memory); do echo online_kernel > $i/state; done

@a21550

a21550 commented Aug 20, 2024

Congratulations on making Llama 3 work!

I will play around with it and let you know!


@sxu sxu removed the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 21, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@sxu sxu added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 21, 2025