[MPS] Add support for Int4 groupwise quantization #4623

DenisVieriu97 · 2024-08-09T02:46:08Z

Add support for MPS Int4 per channel group-wise quantization through MPSGraph.
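
For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of what group-wise Int4 weight quantization does numerically. It is not the MPSGraph code from this PR. It assumes a symmetric scheme (no zero point) with one scale per group of 32 weights within each output channel, matching the `-G 32` used in the export command below; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int4_groupwise(
    row: List[float], group_size: int = 32
) -> Tuple[List[int], List[float]]:
    """Quantize one weight row (one output channel) to signed 4-bit values.

    Returns the quantized integers in [-8, 7] and one scale per group.
    Assumption: symmetric quantization, so no zero point is stored.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")

    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map [-max_abs, max_abs] onto [-7, 7]; fall back to 1.0 for an all-zero group.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the signed 4-bit range.
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize_int4_groupwise(
    quantized: List[int], scales: List[float], group_size: int = 32
) -> List[float]:
    """Reconstruct approximate weights by multiplying each value by its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]
```

With a group size of 32, a row of 4096 weights carries 128 scales, so each group adapts to its own dynamic range at the cost of a small amount of extra metadata.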

Testing:
AOT export

python -m examples.models.llama2.export_llama --checkpoint /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/consolidated.00.pth --params /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/params.json -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape -qmode 8da4w -G 32

Runtime (note that macOS 15.0 (Sequoia) or iOS/iPadOS 18 is required for Int4 quantization):

~/tools/buck2_old2 run examples/models/llama2:main -- --model_path=mps_llama2_q.pte --tokenizer_path=tokenizer_llama2.bin --prompt="What is the best place to visit in New York?"  --temperature=0

Answer:

What is the best place to visit in New York?
New York is a city that has something for everyone. Whether you’re looking for a place to relax and enjoy the sights, or you’re looking for a place to party and have a good time, New York has it all.
There are so many different places to visit in New York, it can be hard to decide where to go. But don’t worry, we’ve got you covered. We’ve compiled a list of the best

Note: this depends on #4574, which must be merged first!

cc: @cccclai, @larryliu0820, @kimishpatel

pytorch-bot · 2024-08-09T02:46:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4623

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit 033c562 with merge base 6efc222:

NEW FAILURES - The following jobs have failed:

Apple / test-demo-ios / macos-job (gh)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 65
Apple / upload-frameworks-ios (gh)
Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

trunk / test-models-macos (cmake, add_mul, xnnpack-quantization-delegation, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json)
File doesn't exist
trunk / test-models-macos (cmake, mv2, portable, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json)
File doesn't exist
trunk / test-models-macos (cmake, vit, xnnpack-delegation, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json)
File doesn't exist

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-08-09T18:37:26Z

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

larryliu0820 · 2024-08-09T18:40:45Z

This is awesome! It seems this PR includes all the changes in #4574?

cccclai · 2024-08-09T18:49:07Z

this pr needs to be landed after the 4GB serialization pr.

cccclai · 2024-08-09T19:08:23Z

Thanks for adding the pr. Really glad to have it enable llama models.

A separate question: it looks like we're using the source transform from -qmode 8da4w. If we apply this PR to stories, are we using the GPU or the ANE?

facebook-github-bot · 2024-08-12T19:44:42Z

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-08-13T21:46:41Z