v3.3.0
Notable changes
- Prefill chunking for VLMs.
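Prefill chunking splits a long prompt into smaller pieces that are prefilled incrementally instead of in one large batch, and this release extends it to vision-language models. Below is a minimal client-side sketch against a local TGI server's OpenAI-compatible endpoint; the URL, model alias, and image are placeholders, and chunking itself happens server-side, so no client change is needed.

```python
import requests

# Query a locally running TGI server that serves a VLM; prefill chunking
# is transparent to the client. URL and image are illustrative placeholders.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```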
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in #3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in #3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in #3113
- L4 fixes by @mht-sharma in #3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in #3171 (see the version-check sketch after this list)
- transformers flash llm/vlm enabling in ipex by @sywangyi in #3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in #3170
- Hotfixing gaudi deps. by @Narsil in #3174
- Hotfix gaudi2 with newer transformers. by @Narsil in #3176
- Support flashinfer for Gemma3 prefill by @danieldk in #3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in #2648 (see the trace-propagation sketch after this list)
- Bump `sccache` to 0.10.0 by @alvarobartt in #3179
- Fixing CI by @Narsil in #3184
- Add option to configure prometheus port by @mht-sharma in #3187 (see the metrics-scraping sketch after this list)
- Warmup gaudi backend by @sywangyi in #3172
- Put more wiggle room. by @Narsil in #3189
- Fixing the router + template for Qwen3. by @Narsil in #3200
- Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in #3204 (see the chat-template sketch after this list)
- doc typo by @julien-c in #3206
- Pr 2982 ci branch by @drbh in #3046
- fix: bump snaps for mllama by @drbh in #3202
- Update client SDK snippets by @julien-c in #3207
- Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in #3193 (see the offline-cache sketch after this list)
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in #3144
- forward and tokenize chooser use the same shape by @sywangyi in #3196
- Chunked Prefill VLM by @mht-sharma in #3188
- Prepare for 3.3.0 by @danieldk in #3220
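For the setuptools advisory (#3171), checking an environment is easy to script. A minimal sketch that follows the release note's "<= 70.0 is vulnerable" cutoff:

```python
from importlib.metadata import version

# Fail fast if the installed setuptools falls in the range the release
# notes flag for CVE-2024-6345 (<= 70.0).
installed = version("setuptools")
major_minor = tuple(int(part) for part in installed.split(".")[:2])
if major_minor <= (70, 0):
    raise SystemExit(
        f"setuptools {installed} is vulnerable; upgrade with: pip install -U 'setuptools>70'"
    )
print(f"setuptools {installed} is OK")
```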
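Entry #2648 means the server joins a caller's existing trace instead of minting a new one. A sketch assuming the standard W3C `traceparent` propagation used by OpenTelemetry; the endpoint and header value below are illustrative:

```python
import requests

# The traceparent value follows the W3C format: version-traceid-spanid-flags.
# In a real service this header would come from your tracing middleware.
headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
    headers=headers,
    timeout=60,
)
print(resp.json())
```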
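With #3187 the Prometheus metrics port is configurable rather than fixed. A sketch of scraping the `/metrics` endpoint, assuming the server was deployed with its metrics port set to 9000 (the port value and the exact flag or env name come from your deployment, not from this changelog):

```python
import requests

# Fetch the Prometheus text exposition and print TGI's request counters.
metrics = requests.get("http://localhost:9000/metrics", timeout=10).text
for line in metrics.splitlines():
    if line.startswith("tgi_request"):
        print(line)
```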
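#3204 concerns the `{% generation %}` / `{% endgeneration %}` markers that transformers chat templates use to delimit assistant turns (so `apply_chat_template` can build an assistant-token mask); TGI's template rendering now skips them instead of erroring. A sketch with an illustrative, hand-written toy template and a placeholder model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
# Toy template: assistant turns are wrapped in generation markers.
tok.chat_template = (
    "{% for m in messages %}"
    "{% if m['role'] == 'assistant' %}{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% else %}{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)
out = tok.apply_chat_template(
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
print(out["assistant_masks"])  # 1 where tokens come from the assistant turn
```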
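#3193 makes the Gaudi backend honor `HF_HUB_OFFLINE=1`, which tells `huggingface_hub` to resolve everything from the local cache and never touch the network. A minimal sketch; the model id is a placeholder and must already be cached:

```python
import os

# Must be set before huggingface_hub is imported, since it reads the
# environment at import time.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import snapshot_download

# Resolves purely from the local cache; raises if the snapshot was never downloaded.
path = snapshot_download("gpt2")
print(path)
```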
Full Changelog: v3.2.3...v3.3.0