enable flash mistral model for HPU device #594

Merged
regisss merged 2 commits into huggingface:main from kaixuanliu:flash-mistral on Apr 21, 2025

Conversation

kaixuanliu (Contributor)

This PR enables op-level optimizations for Mistral-type models. It currently supports the HPU device and improves peak throughput from 124 sentences/s to 133 sentences/s compared with the Optimum Habana modeling code. (We use Salesforce/SFR-Embedding-2_R for the benchmark.)
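For anyone wanting to sanity-check the throughput numbers, below is a minimal client-side benchmark sketch against a running text-embeddings-inference server. The endpoint URL, batch size, batch count, and sample sentence are illustrative assumptions, not values from this PR.

```python
# Minimal sketch of a client-side throughput benchmark against a running
# text-embeddings-inference (TEI) server. URL, batch size, batch count, and
# the sample text below are assumptions for illustration only.
import time

import requests

TEI_URL = "http://localhost:8080/embed"  # assumed local TEI endpoint
BATCH_SIZE = 32                          # assumed batch size
NUM_BATCHES = 50                         # assumed number of timed batches
SENTENCES = ["What is Deep Learning?"] * BATCH_SIZE  # placeholder input

# Warm-up request so first-batch graph compilation on HPU does not skew timing.
requests.post(TEI_URL, json={"inputs": SENTENCES}).raise_for_status()

start = time.perf_counter()
for _ in range(NUM_BATCHES):
    resp = requests.post(TEI_URL, json={"inputs": SENTENCES})
    resp.raise_for_status()
elapsed = time.perf_counter() - start

print(f"Throughput: {NUM_BATCHES * BATCH_SIZE / elapsed:.1f} sentences/s")
```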

kaixuanliu (Contributor, Author)

@regisss @Narsil please help review.

regisss (Collaborator) previously approved these changes on Apr 21, 2025 and left a comment:

LGTM

regisss commented Apr 21, 2025

cc @Narsil

regisss commented Apr 21, 2025

@kaixuanliu It seems there is trailing whitespace in backends/python/server/text_embeddings_server/models/flash_mistral.py, can you remove it please?
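For reference, one quick way to strip trailing whitespace from the file named above is a short standard-library one-off like the sketch below; the path is taken from the comment, everything else is just an illustration.

```python
# One-off sketch: strip trailing whitespace from every line of the file
# mentioned above, rewriting it in place.
from pathlib import Path

path = Path("backends/python/server/text_embeddings_server/models/flash_mistral.py")
lines = path.read_text().splitlines()
path.write_text("\n".join(line.rstrip() for line in lines) + "\n")
```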

kaixuanliu (Contributor, Author)

@regisss Oh sorry, I added unnecessary code by mistake; I have deleted it.

regisss merged commit d8021c3 into huggingface:main on Apr 21, 2025
3 of 13 checks passed
kaixuanliu deleted the flash-mistral branch on April 23, 2025 at 03:08