
Releases: huggingface/text-embeddings-inference

v0.2.0

18 Oct 11:40

What's Changed

  • Add support for XLM-RoBERTa in #5
  • Get the number of tokenization workers from the number of CPU cores in #8
  • Prefetch the next batch in #10
  • Support loading weights from `.pth` files in #12
  • Add a `--pooling` argument in #14 (see the pooling sketch after this list)
  • Fix CUDA compute capability matching in #21
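
As a rough illustration of what the `--pooling` argument selects between, the sketch below contrasts CLS pooling and mean pooling over per-token embeddings. The function names, the assumed `cls`/`mean` choices, and the plain `Vec<f32>` representation are illustrative only, not the server's implementation.

```rust
// Conceptual sketch of two common pooling strategies an embedding server can
// apply to per-token hidden states; not text-embeddings-inference's actual code.
// `hidden` holds one sequence's token embeddings, shape [tokens][dim].

fn cls_pool(hidden: &[Vec<f32>]) -> Vec<f32> {
    // CLS pooling: take the embedding of the first ([CLS]) token.
    hidden[0].clone()
}

fn mean_pool(hidden: &[Vec<f32>]) -> Vec<f32> {
    // Mean pooling: average the embeddings of all tokens.
    let dim = hidden[0].len();
    let mut out = vec![0.0f32; dim];
    for token in hidden {
        for (o, v) in out.iter_mut().zip(token) {
            *o += *v;
        }
    }
    let n = hidden.len() as f32;
    out.iter_mut().for_each(|v| *v /= n);
    out
}

fn main() {
    let hidden = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];
    println!("cls  -> {:?}", cls_pool(&hidden));  // [1.0, 2.0]
    println!("mean -> {:?}", mean_pool(&hidden)); // [3.0, 4.0]
}
```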

Full Changelog: v0.1.0...v0.2.0

v0.1.0

13 Oct 13:46
  • No compilation step
  • Dynamic shapes
  • Small Docker images and fast boot times. Get ready for true serverless!
  • Token-based dynamic batching (see the batching sketch after this list)
  • Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
  • Safetensors weight loading
  • Production-ready (distributed tracing with OpenTelemetry, Prometheus metrics)
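
To illustrate the token-based dynamic batching idea, here is a minimal Rust sketch of a scheduler that groups queued requests until a token budget is reached, so batch size adapts to sequence length rather than request count. The `Request` struct, the `next_batch` function, and the 512-token budget are assumptions for illustration, not the server's actual scheduler.

```rust
// Conceptual sketch of token-based dynamic batching: pull requests from a queue
// and group them until adding the next one would exceed a token budget.

#[derive(Debug)]
struct Request {
    id: usize,
    num_tokens: usize,
}

fn next_batch(queue: &mut Vec<Request>, max_batch_tokens: usize) -> Vec<Request> {
    let mut batch = Vec::new();
    let mut tokens_in_batch = 0;

    while let Some(req) = queue.first() {
        let cost = req.num_tokens;
        // Stop once the next request would overflow the budget, but always
        // admit at least one request so oversized requests are not starved.
        if !batch.is_empty() && tokens_in_batch + cost > max_batch_tokens {
            break;
        }
        tokens_in_batch += cost;
        batch.push(queue.remove(0));
    }
    batch
}

fn main() {
    let mut queue = vec![
        Request { id: 0, num_tokens: 120 },
        Request { id: 1, num_tokens: 300 },
        Request { id: 2, num_tokens: 200 },
    ];
    // With a 512-token budget the first two requests form one batch (420 tokens)
    // and the third waits for the next batch.
    let batch = next_batch(&mut queue, 512);
    println!("batched: {:?}", batch);
    println!("pending: {:?}", queue);
}
```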