🚀 Optimum-ExecuTorch: Major Out-of-the-box Performance Breakthrough for using HuggingFace Transformers on-device #12124
guangy10 started this conversation in Show and tell
We're thrilled to announce significant performance improvements in Optimum-ExecuTorch that make deploying Hugging Face models on mobile using ExecuTorch more efficient than ever before!
🔥 Outstanding Out-of-the-box Performance
Our latest optimizations deliver significant performance improvements on the CPU backend, including:
- 8da4w quantization while maintaining good generation quality

Note that these optimizations are composable, and the relative performance gain of each is quantitatively measured by the CI in optimum-executorch. Through the combination of all these optimizations, you can run, for example, gemma-3-1b-it at about 25 tokens/s on a Samsung Galaxy S22 Android phone and about 20 tokens/s on a base iPhone 15, out of the box. You can read more in the optimum-executorch repo. Ready to experience these performance gains? Getting started is as simple as following the instructions there.
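To make the getting-started path concrete, here is a minimal sketch of exporting and running a model through the Python API. The names below (ExecuTorchModelForCausalLM, the recipe="xnnpack" argument, and text_generation) follow the optimum-executorch README at the time of writing; treat them as illustrative and check the repo for the current interface.

```python
# A minimal sketch of exporting and running a Hugging Face model with
# optimum-executorch. API names follow the project's README at the time
# of writing; check the repo for the current interface.
from transformers import AutoTokenizer

from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "google/gemma-3-1b-it"  # example model from the announcement

# Load the checkpoint as an ExecuTorch program targeting the XNNPACK
# CPU backend; the export to ExecuTorch runs on the fly if no
# pre-exported program is available.
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text with the exported program.
output = model.text_generation(
    tokenizer=tokenizer,
    prompt="Explain quantization in one sentence.",
    max_seq_len=128,
)
print(output)
```

The repo also documents an optimum-cli export executorch command for ahead-of-time export from the command line, including quantization options such as 8da4w; see the optimum-executorch README for the exact flags.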
🔮 What's Next
We're continuously pushing the boundaries of on-device AI performance. Stay tuned for more.
🛠️ Want to Contribute?
Contributions are welcome in the optimum-executorch repo.
🙏 Acknowledgments
Huge thanks to the incredible Hugging Face teams and ExecuTorch developers who made this breakthrough possible!