mistral.rs

Blazingly fast LLM inference.

| Rust Documentation | Python Documentation | Discord | Matrix |

Mistral.rs is a cross-platform, highly multimodal inference engine featuring support for text, vision, image generation, and speech generation models!

Please submit requests for new models here.

Get started fast 🚀

  1. Install

  2. Get models

  3. Deploy with our easy-to-use APIs

  4. Try the web chat app for local in-browser conversation (text, vision, and speech support):


🖥️ Web Chat App
[Web chat UI demo]
Try our modern in-browser chat with text, vision, and speech support (TTS generation).

💻 Terminal Interactive Mode
[Terminal interactive mode demo]
Prefer the terminal? Use interactive mode for a classic CLI experience.

Quick examples

After following the installation instructions:

  • 🔊 Run the Dia 1.6b model for highly realistic dialogue generation: documentation

    Show command
    ./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia
  • 🦙 Run the Llama 3.* and Llama 4 models with long context & vision support: docs (llama 3.2), docs (llama 4)

    Show command

    Llama 4:

    ./mistralrs-server -i --isq 4 run -m meta-llama/Llama-4-Scout-17B-16E-Instruct

    Llama 3.1/3.2/3.3:

    ./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-3B-Instruct
    

    Llama 3.2 vision:

    ./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-11B-Vision-Instruct
    
  • 💎💎💎 Run the Gemma 3 family (1b, 4b, 12b, 27b) with 128k context & vision support: documentation

    Show command
    ./mistralrs-server -i --isq 8 run -m google/gemma-3-4b-it
  • 🌲📷 Run the FLUX.1 diffusion model: documentation

    Show command
    ./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux
  • 🧠 Run the Qwen 3 hybrid-reasoning model with full tool-calling support: documentation

    Show command
    ./mistralrs-server -i --isq 8 run -m Qwen/Qwen3-8B

Description

mistral.rs is a blazing-fast, cross-platform LLM inference engine with support for text, vision, image generation, and speech.

Key Benefits:

  1. Ease of Use

  2. Performance

  3. Quantization

  4. Flexibility

  5. Advanced Features

APIs and Integrations

Rust Crate

Rust multithreaded/async API for easy integration into any application.

  • Docs
  • Examples
  • To use: add mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" } to your Cargo.toml

Python API

Python API for mistral.rs.
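
As a quick illustration, the sketch below shows what loading a model and sending a chat request might look like from Python. It is a minimal, hedged sketch: the names Runner, Which.Plain, and ChatCompletionRequest are assumptions based on the project's Python examples and may differ between versions, so check the Python documentation for the exact API.

# Hedged sketch of the mistral.rs Python API. Class and parameter names
# (Runner, Which.Plain, ChatCompletionRequest, model_id) are assumptions;
# consult the Python docs/examples for the current signatures.
from mistralrs import ChatCompletionRequest, Runner, Which

runner = Runner(
    which=Which.Plain(model_id="meta-llama/Llama-3.2-3B-Instruct"),
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",  # placeholder; the model loaded above is used
        messages=[{"role": "user", "content": "Hello! Who are you?"}],
        max_tokens=256,
        temperature=0.1,
    )
)
print(response.choices[0].message.content)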

HTTP Server

OpenAI API-compatible HTTP server.

Llama Index integration


Supported accelerators

| Accelerator | Feature flag | Additional flags |
|---|---|---|
| NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn |
| Apple Silicon GPU (Metal) | metal | |
| CPU (Intel) | mkl | |
| CPU (Apple Accelerate) | accelerate | |
| Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default |

To enable one or more features, pass them to Cargo. For example:

cargo build --release --features "cuda flash-attn cudnn"

Installation and Build

Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/

  1. Install required packages:

    • OpenSSL (Example on Ubuntu: sudo apt install libssl-dev)
    • Linux only: pkg-config (Example on Ubuntu: sudo apt install pkg-config)
  2. Install Rust: https://rustup.rs/

    Example on Ubuntu:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source $HOME/.cargo/env
  3. Optional: Set the Hugging Face token (skip if it is already set, your model is not gated, or you want to use the token_source parameters in Python or on the command line).

    • Note: you can install huggingface-cli as documented here.
    huggingface-cli login
  4. Download the code:

    git clone https://github.com/EricLBuehler/mistral.rs.git
    cd mistral.rs
  5. Build or install mistralrs-server:

    • Build the mistralrs-server binary, which can be found at target/release/mistralrs-server.

      cargo build --release --features <specify feature(s) here>
    • Install with cargo install for easy command line usage

      Pass the same values to --features as you would for cargo build

      cargo install --path mistralrs-server --features <specify feature(s) here>
  6. (If you used cargo build) The build outputs the binary at ./target/release/mistralrs-server. Switch to that directory so the binary can be run as ./mistralrs-server:

    Example on Ubuntu:

    cd target/release
    
  7. Use our APIs and integrations:

    APIs and integrations list

Getting models

Show: How to get models (Hub, local, GGUF, adapters, etc.)

Getting models from Hugging Face Hub

  • Default: Downloads from Hugging Face Hub.
  • For gated models, you can optionally set token source:
    • CLI: ./mistralrs-server --token-source env:HF_TOKEN ...
    • Python: See examples/python/token_source.py (a hedged sketch follows this list)
    • If no token is found, tries ~/.cache/huggingface/token or runs with no token.
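
For the Python route, the sketch below is only illustrative: the token_source keyword is an assumption that mirrors the CLI's --token-source option, and examples/python/token_source.py is the authoritative reference.

# Assumed keyword mirroring the CLI's --token-source flag; see
# examples/python/token_source.py for the real parameter name and values.
from mistralrs import Runner, Which

runner = Runner(
    which=Which.Plain(model_id="meta-llama/Llama-3.2-3B-Instruct"),
    token_source="env:HF_TOKEN",  # read the Hugging Face token from HF_TOKEN
)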

Loading models from local files

  • Pass a path to a downloaded model from Hugging Face hub:
    • Example:
      ./mistralrs-server -i run -m path/to/model
      

Running GGUF models

  • Minimal example:
    ./mistralrs-server gguf -m author/model-repo -f model-quant.gguf
    
  • Specify tokenizer (if needed):
    ./mistralrs-server gguf -m author/model-repo -f file.gguf -t author/official-tokenizer
    
    (Or use the built-in GGUF tokenizer.)

Adapters, X-LoRA, LoRA, Chat Templates

  • Use the correct subcommand (x-lora-*, lora-*), pass model, adapter, or quant file as needed.
  • See docs/ADAPTER_MODELS.md for details.
  • For chat templates: usually auto-detected, override with --chat-template <file>.
    See docs/CHAT_TOK.md.

More model CLI examples

Using the CLI

Mistral.rs uses subcommands to control the model type. Please run ./mistralrs-server --help to see the subcommands which categorize the models by kind.

🚨 Important: The run subcommand (alias for plain/vision-plain) only auto-detects and runs text and vision models. It does not support diffusion or speech models. To run a diffusion model (e.g. FLUX series), use the diffusion subcommand:

mistralrs-server -i diffusion -m <model-id> [options]

To run a speech model (e.g. Dia), use the speech subcommand:

mistralrs-server -i speech -m <model-id> [options]

If you attempt to use run with diffusion or speech models, model loading will fail.

Interactive mode

Llama 3.2 3B running on an M3 Max with 8-bit ISQ:

[Interactive demo]

You can launch interactive mode, a simple chat application running in the terminal, by passing -i:

./mistralrs-server -i plain -m meta-llama/Llama-3.2-3B-Instruct

Vision models work seamlessly:

./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k

Diffusion models can be run too (quantization and adapters are not yet supported):

./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux

And you can run speech generation in your terminal!

./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia

OpenAI HTTP server

You can launch an HTTP server by replacing -i with --port <port>. For instance:

./mistralrs-server --port 1234 run -m microsoft/Phi-3.5-MoE-instruct

You can find documentation about the server itself here.
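
Because the server speaks the OpenAI API, any OpenAI-compatible client can talk to it once it is listening. Below is a minimal sketch using the official openai Python package pointed at the server started above; the api_key value is a dummy, and the model string is an assumption that it should match the loaded model (adjust it to whatever your server expects).

# Minimal sketch: query the OpenAI-compatible server started with
#   ./mistralrs-server --port 1234 run -m microsoft/Phi-3.5-MoE-instruct
# The api_key is a placeholder and the model string is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="microsoft/Phi-3.5-MoE-instruct",  # assumed to match the loaded model
    messages=[{"role": "user", "content": "Write a haiku about Rust."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)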

Structured selection with a .toml file

We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.

Example:

./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml

Architecture for plain models

Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. It is specified in the --dtype/-d parameter after the model architecture (plain). For quantized models (gguf/ggml), you may specify a data type of f32 or bf16 (f16 is not recommended due to its lower precision in quantized inference).

If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.

Show plain architectures
  • mistral
  • gemma
  • mixtral
  • llama
  • phi2
  • phi3
  • phi3.5moe
  • qwen2
  • gemma2
  • starcoder2
  • deepseekv2
  • deepseekv3
  • qwen3
  • qwen3moe

Architecture for vision models

Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. It is specified in the --dtype/-d parameter after the model architecture (vision-plain).

Show vision architectures
  • phi3v
  • idefics2
  • llava_next
  • llava
  • vllama
  • qwen2vl
  • idefics3
  • minicpmo
  • phi4mm
  • qwen2_5vl
  • gemma3
  • mistral3
  • llama4

Supported GGUF architectures

Show supported GGUF architectures

Plain:

  • llama
  • phi2
  • phi3
  • starcoder2
  • qwen2

With adapters:

  • llama
  • phi3

Please submit more benchmarks by raising an issue!

Supported models

Show quantization support

Quantization support

Model GGUF GGML ISQ
Mistral ✅ ✅
Gemma ✅
Llama ✅ ✅ ✅
Mixtral ✅ ✅
Phi 2 ✅ ✅
Phi 3 ✅ ✅
Phi 3.5 MoE ✅
Qwen 2.5 ✅
Phi 3 Vision ✅
Idefics 2 ✅
Gemma 2 ✅
Starcoder 2 ✅ ✅
LLaVa Next ✅
LLaVa ✅
Llama 3.2 Vision ✅
Qwen2-VL ✅
Idefics 3 ✅
Deepseek V2 ✅
Deepseek V3 ✅
MiniCPM-O 2.6 ✅
Qwen2.5-VL ✅
Gemma 3 ✅
Mistral 3 ✅
Llama 4 ✅
Qwen 3 ✅
Dia 1.6b ✅
Show device mapping support

Device mapping support

Model category Supported
Plain ✅
GGUF ✅
GGML
Vision Plain ✅
Show X-LoRA and LoRA support

X-LoRA and LoRA support

Model X-LoRA X-LoRA+GGUF X-LoRA+GGML
Mistral ✅ ✅
Gemma ✅
Llama ✅ ✅ ✅
Mixtral ✅ ✅
Phi 2 ✅
Phi 3 ✅ ✅
Phi 3.5 MoE
Qwen 2.5
Phi 3 Vision
Idefics 2
Gemma 2 ✅
Starcoder 2 ✅
LLaVa Next
LLaVa
Qwen2-VL
Idefics 3
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3
Mistral 3
Llama 4
Qwen 3
Show AnyMoE support

AnyMoE support

Model AnyMoE
Mistral 7B ✅
Gemma ✅
Llama ✅
Mixtral
Phi 2 ✅
Phi 3 ✅
Phi 3.5 MoE
Qwen 2.5 ✅
Phi 3 Vision
Idefics 2
Gemma 2 ✅
Starcoder 2 ✅
LLaVa Next ✅
LLaVa ✅
Llama 3.2 Vision
Qwen2-VL
Idefics 3 ✅
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3 ✅
Mistral 3 ✅
Llama 4
Qwen 3

Using derivative and adapter models

To use a derivative or adapter model (e.g., quantized, LoRA, X-LoRA, vision, etc.), select the correct architecture subcommand and pass the required arguments: typically the model id, and for quantized/adapter models also the quantization filename, tokenizer, or adapter ordering if needed.

Arguments by model type
| Model type | Required arguments |
|---|---|
| Plain | model id |
| Quantized | model id, quantized filename, tokenizer id |
| X-LoRA | model id, X-LoRA ordering (if not default) |
| X-LoRA quantized | model id, quantized filename, tokenizer id, X-LoRA ordering |
| LoRA | model id, LoRA ordering (if not default) |
| LoRA quantized | model id, quantized filename, tokenizer id, LoRA ordering |
| Vision Plain | model id |
Example: Zephyr GGUF model
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf

Chat template and tokenizer are usually auto-detected.
If you need to override, see the chat templates doc.

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here. Examples may be found here.

Chat Templates and Tokenizer

Mistral.rs will attempt to automatically load a chat template and tokenizer. This works across a wide range of models and ensures accurate chat templating. However, this behavior can be customized. Please find detailed documentation here.

Contributing

Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.

FAQ

  • Setting the environment variable MISTRALRS_DEBUG=1 enables the following:
    • If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
      • mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
    • More logging.
  • Setting the CUDA compiler path:
    • Set the NVCC_CCBIN environment variable during build.
  • Error: recompile with -fPIE:
    • Some Linux distributions require compiling with -fPIE.
    • Set the CUDA_NVCC_FLAGS environment variable to -fPIE during build: CUDA_NVCC_FLAGS=-fPIE
  • Error CUDA_ERROR_NOT_FOUND or symbol not found when using a normal or vision model:
    • For non-quantized models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device.
  • What is the minimum supported CUDA compute cap?
    • The minimum CUDA compute cap is 5.3.
  • Metal not found (error: unable to find utility "metal", not a developer tool or in PATH)
    1. Install Xcode: xcode-select --install
    2. Set the active developer directory: sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

Credits

This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.

⬆️ Back to Top
