| Rust Documentation | Python Documentation | Discord | Matrix |
Mistral.rs is a cross-platform, highly multimodal inference engine featuring support for text, vision, image generation, and speech generation models!
Please submit requests for new models here.
- Deploy with our easy-to-use APIs
- Try the web chat app for local in-browser conversation (text, vision, and speech support):
  - Quickstart here
  - Run the server and visit http://localhost:8080 (the default).
Web Chat App

Try our modern in-browser chat with text, vision, and speech support (TTS generation).

Terminal Interactive Mode

Prefer the terminal? Use interactive mode for a classic CLI experience.
After following the installation instructions:
- Run the Dia 1.6b model for highly realistic dialogue generation: documentation
  ./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia
- Run the Llama 3.* and Llama 4 models with long context & vision support: docs (llama 3.2), docs (llama 4)
  Llama 4:
  ./mistralrs-server -i --isq 4 run -m meta-llama/Llama-4-Scout-17B-16E-Instruct
  Llama 3.1/3.2/3.3:
  ./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-3B-Instruct
  Llama 3.2 vision:
  ./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-11B-Vision-Instruct
- Run the Gemma 3 family (1b, 4b, 12b, 27b) with 128k context & vision support: documentation
  ./mistralrs-server -i --isq 8 run -m google/gemma-3-4b-it
- Run the FLUX.1 diffusion model: documentation
  ./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux
- Run the Qwen 3 hybrid-reasoning model with full tool-calling support: documentation
  ./mistralrs-server -i --isq 8 run -m Qwen/Qwen3-8B
mistral.rs is a blazing-fast, cross-platform LLM inference engine with support for text, vision, image generation, and speech.
Key Benefits:
- Ease of Use
  - OpenAI-compatible HTTP server
  - Rust API & Python API
  - Automatic device mapping (multi-GPU, CPU)
  - Chat templates & tokenizer auto-detection
- Performance
  - CPU acceleration (MKL, AVX, NEON, Accelerate)
  - GPU acceleration (CUDA with FlashAttention & cuDNN, Metal)
  - Automatic tensor parallelism for splitting models across multiple devices
  - CUDA-specialized NCCL
  - Heterogeneous, flexible Ring backend
- Quantization
  - In-place quantization (ISQ) of Hugging Face models
  - GGML & GGUF support: 2–8 bit
  - GPTQ, AWQ, AFQ, HQQ, FP8, BNB (int8/fp4/nf4)
  - Auto-select the fastest quant method
- Flexibility
  - LoRA & X-LoRA adapters with weight merging
  - AnyMoE: create MoE models on any base model
  - Sampling & penalty options
  - Prompt chunking for large inputs
  - Integrated tool calling with customizable Python/Rust native tool and search callbacks
- Advanced Features
  - High throughput with PagedAttention & FlashAttention V2/V3
  - Prefix caching (including multimodal)
  - Customizable quantization with topology & UQFF format
  - Speculative decoding across models
  - Agentic web search integration
Rust multithreaded/async API for easy integration into any application.
- Docs
- Examples
- To use, add the following to your Cargo.toml:
  mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
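For instance, a minimal async chat sketch against the Rust API might look like the following. The builder and message types follow the upstream examples; the model id, ISQ level, and exact method names here are assumptions to adapt to your setup and the current API:

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load a text model from the Hugging Face Hub and quantize it in place (ISQ).
    // Requires the `tokio` and `anyhow` crates alongside `mistralrs`.
    let model = TextModelBuilder::new("meta-llama/Llama-3.2-3B-Instruct")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .build()
        .await?;

    // Build a chat request and send it.
    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a helpful assistant.")
        .add_message(TextMessageRole::User, "What is an LLM inference engine?");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```

The builder call chain mirrors the CLI flags (e.g. `.with_isq(...)` plays the role of `--isq`); see the linked docs and examples for the authoritative API.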
Python API for mistral.rs.
OpenAI-compatible API server.
- API Docs
- Launching the server or using the CLI
- Example
- Use or extend the server in other axum projects
| Accelerator | Feature Flag | Additional Flags |
|---|---|---|
| NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn |
| Apple Silicon GPU (Metal) | metal | |
| CPU (Intel) | mkl | |
| CPU (Apple Accelerate) | accelerate | |
| Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default |
To enable one or more features, pass them to Cargo. For example:
cargo build --release --features "cuda flash-attn cudnn"
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
- Install the Python package here.
- The Python package has wheels on PyPI!
- Install required packages:
  - OpenSSL (example on Ubuntu: sudo apt install libssl-dev)
  - Linux only: pkg-config (example on Ubuntu: sudo apt install pkg-config)
- Install Rust: https://rustup.rs/
  Example on Ubuntu:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env
- Optional: set the Hugging Face token (skip if it is already set, your model is not gated, or you want to use the token_source parameters in Python or on the command line):
  huggingface-cli login
  - Note: you can install huggingface-cli as documented here.
- Download the code:
  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs
- Build or install mistralrs-server:
  - Build the mistralrs-server binary, which can be found at target/release/mistralrs-server:
    cargo build --release --features <specify feature(s) here>
  - Or install it with cargo install for easy command-line usage (pass the same values to --features as you would for cargo build):
    cargo install --path mistralrs-server --features <specify feature(s) here>
- (If you used cargo build) The build process will output the binary mistralrs-server at ./target/release/mistralrs-server. Switch to that directory so the binary can be run as ./mistralrs-server:
  cd target/release
- Use our APIs and integrations.

How to get models (Hub, local, GGUF, adapters, etc.):
- Default: downloads from the Hugging Face Hub.
  - For gated models, you can optionally set the token source:
    - CLI: ./mistralrs-server --token-source env:HF_TOKEN ...
    - Python: see examples/python/token_source.py
  - If no token is found, mistral.rs tries ~/.cache/huggingface/token or runs with no token.
- Pass a path to a model downloaded from the Hugging Face Hub:
  - Example: ./mistralrs-server -i run -m path/to/model
- Minimal GGUF example:
  ./mistralrs-server gguf -m author/model-repo -f model-quant.gguf
- Specify a tokenizer if needed (or use the built-in GGUF tokenizer):
  ./mistralrs-server gguf -m author/model-repo -f file.gguf -t author/official-tokenizer
- Adapters: use the correct subcommand (x-lora-*, lora-*) and pass the model, adapter, or quant file as needed. See docs/ADAPTER_MODELS.md for details.
- Chat templates: usually auto-detected; override with --chat-template <file>. See docs/CHAT_TOK.md.
- See Run with the CLI below or the full documentation.
Mistral.rs uses subcommands to control the model type. Run ./mistralrs-server --help to see the subcommands, which categorize the models by kind.

Important: the run subcommand (an alias for plain/vision-plain) only auto-detects and runs text and vision models. It does not support diffusion or speech models.
- To run a diffusion model (e.g. the FLUX series), use the diffusion subcommand:
  mistralrs-server -i diffusion -m <model-id> [options]
- To run a speech model (e.g. Dia), use the speech subcommand:
  mistralrs-server -i speech -m <model-id> [options]

If you attempt to use run with a diffusion or speech model, model loading will fail.
Demo: Llama 3.2 3B running on an M3 Max with 8-bit ISQ.

You can launch interactive mode, a simple chat application running in the terminal, by passing -i:
./mistralrs-server -i plain -m meta-llama/Llama-3.2-3B-Instruct
Vision models work seamlessly:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
Diffusion models can be run too (quantization and adapters are not yet supported):
./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux
And you can run speech generation in your terminal!
./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia
You can launch an HTTP server by replacing -i with --port <port>. For instance:
./mistralrs-server --port 1234 run -m microsoft/Phi-3.5-MoE-instruct
You can find documentation about the server itself here.
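Once the server is up, any OpenAI-style client can talk to it. As a minimal sketch (assuming the server above is listening on port 1234; the reqwest and serde_json crates are used here for illustration, and the model field value is a placeholder):

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // POST an OpenAI-style chat completion request to the local mistralrs-server.
    // Requires reqwest (with the "json" feature), serde_json, and tokio.
    let client = reqwest::Client::new();
    let body = json!({
        "model": "default", // placeholder; see the server docs for the expected value
        "messages": [
            { "role": "user", "content": "Say hello in one sentence." }
        ]
    });

    let resp: serde_json::Value = client
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // Print the assistant's reply from the first choice.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```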
We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.
Example:
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto (to choose based on the device). This is specified in the --dtype/-d parameter after the model architecture (plain). For quantized models (gguf/ggml), you may specify a data type of f32 or bf16 (f16 is not recommended due to its lower precision in quantized inference).
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
Plain architectures:
- mistral
- gemma
- mixtral
- llama
- phi2
- phi3
- phi3.5moe
- qwen2
- gemma2
- starcoder2
- deepseekv2
- deepseekv3
- qwen3
- qwen3moe
Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto (to choose based on the device). This is specified in the --dtype/-d parameter after the model architecture (vision-plain).
Vision architectures:
- phi3v
- idefics2
- llava_next
- llava
- vllama
- qwen2vl
- idefics3
- minicpmo
- phi4mm
- qwen2_5vl
- gemma3
- mistral3
- llama4
Supported GGUF architectures:
Plain:
- llama
- phi2
- phi3
- starcoder2
- qwen2
With adapters:
- llama
- phi3
Please submit more benchmarks by raising an issue!
Quantization support

| Model | GGUF | GGML | ISQ |
|---|---|---|---|
| Mistral | ✅ | | ✅ |
| Gemma | | | ✅ |
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | | ✅ |
| Phi 2 | ✅ | | ✅ |
| Phi 3 | ✅ | | ✅ |
| Phi 3.5 MoE | | | ✅ |
| Qwen 2.5 | | | ✅ |
| Phi 3 Vision | | | ✅ |
| Idefics 2 | | | ✅ |
| Gemma 2 | | | ✅ |
| Starcoder 2 | ✅ | | ✅ |
| LLaVa Next | | | ✅ |
| LLaVa | | | ✅ |
| Llama 3.2 Vision | | | ✅ |
| Qwen2-VL | | | ✅ |
| Idefics 3 | | | ✅ |
| Deepseek V2 | | | ✅ |
| Deepseek V3 | | | ✅ |
| MiniCPM-O 2.6 | | | ✅ |
| Qwen2.5-VL | | | ✅ |
| Gemma 3 | | | ✅ |
| Mistral 3 | | | ✅ |
| Llama 4 | | | ✅ |
| Qwen 3 | | | ✅ |
| Dia 1.6b | | | ✅ |
Device mapping support

| Model category | Supported |
|---|---|
| Plain | ✅ |
| GGUF | ✅ |
| GGML | |
| Vision Plain | ✅ |
X-LoRA and LoRA support

| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | | |
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | | |
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | | | |
| Qwen 2.5 | | | |
| Phi 3 Vision | | | |
| Idefics 2 | | | |
| Gemma 2 | ✅ | | |
| Starcoder 2 | ✅ | | |
| LLaVa Next | | | |
| LLaVa | | | |
| Qwen2-VL | | | |
| Idefics 3 | | | |
| Deepseek V2 | | | |
| Deepseek V3 | | | |
| MiniCPM-O 2.6 | | | |
| Qwen2.5-VL | | | |
| Gemma 3 | | | |
| Mistral 3 | | | |
| Llama 4 | | | |
| Qwen 3 | | | |
AnyMoE support

| Model | AnyMoE |
|---|---|
| Mistral 7B | ✅ |
| Gemma | ✅ |
| Llama | ✅ |
| Mixtral | |
| Phi 2 | ✅ |
| Phi 3 | ✅ |
| Phi 3.5 MoE | |
| Qwen 2.5 | ✅ |
| Phi 3 Vision | |
| Idefics 2 | |
| Gemma 2 | ✅ |
| Starcoder 2 | ✅ |
| LLaVa Next | ✅ |
| LLaVa | ✅ |
| Llama 3.2 Vision | |
| Qwen2-VL | |
| Idefics 3 | ✅ |
| Deepseek V2 | |
| Deepseek V3 | |
| MiniCPM-O 2.6 | |
| Qwen2.5-VL | |
| Gemma 3 | ✅ |
| Mistral 3 | ✅ |
| Llama 4 | |
| Qwen 3 | |
To use a derivative or adapter model (e.g., quantized, LoRA, X-LoRA, vision, etc.), select the correct architecture subcommand and pass the required arguments: typically the model id, plus the quantization filename, tokenizer, or adapter ordering for quantized/adapter models if needed.
- See all options: run ./mistralrs-server <subcommand> --help
- Docs: Adapter models, Chat templates
Arguments by model type

| Model Type | Required Arguments |
|---|---|
| Plain | model id |
| Quantized | model id, quantized filename, tokenizer id |
| X-LoRA | model id, X-LoRA ordering (if not default) |
| X-LoRA quantized | model id, quantized filename, tokenizer id, X-LoRA ordering |
| LoRA | model id, LoRA ordering (if not default) |
| LoRA quantized | model id, quantized filename, tokenizer id, LoRA ordering |
| Vision Plain | model id |
Example: Zephyr GGUF model
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
The chat template and tokenizer are usually auto-detected. If you need to override them, see the chat templates doc.
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-*
architecture, and LoRA support by selecting the lora-*
architecture. Please find docs for adapter models here. Examples may be found here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This works across a wide range of models and ensures accurate chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
- Debugging: setting the environment variable MISTRALRS_DEBUG=1 causes the following:
  - If loading a GGUF or GGML model, a file (mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt) is written containing the names, shapes, and types of each tensor.
  - More logging.
- Setting the CUDA compiler path:
  - Set the NVCC_CCBIN environment variable during the build.
- Error: recompile with -fPIE:
  - Some Linux distributions require compiling with -fPIE.
  - Set the CUDA_NVCC_FLAGS environment variable to -fPIE during the build: CUDA_NVCC_FLAGS=-fPIE
- Error CUDA_ERROR_NOT_FOUND or symbol not found when using a normal or vision model:
  - For non-quantized models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto (to choose based on the device).
- What is the minimum supported CUDA compute cap?
  - The minimum CUDA compute cap is 5.3.
- Metal not found (error: unable to find utility "metal", not a developer tool or in PATH):
  - Install Xcode: xcode-select --install
  - Set the active developer directory: sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.