v3.3.0
Notable changes
- Prefill chunking for VLMs.
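Prefill chunking splits a long prompt into smaller pieces that are prefilled incrementally instead of in one large batch, and this release extends it to vision-language models. Below is a minimal client-side sketch against a local TGI server's OpenAI-compatible endpoint; the URL, model alias, and image are placeholders, and chunking itself happens server-side, so no client change is needed.

```python
import requests

# Query a locally running TGI server that serves a VLM; prefill chunking
# is transparent to the client. URL and image are illustrative placeholders.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```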
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in #3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in #3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in #3113
- L4 fixes by @mht-sharma in #3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in #3171 (see the version-check sketch after this list)
- transformers flash llm/vlm enabling in ipex by @sywangyi in #3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in #3170
- Hotfixing gaudi deps. by @Narsil in #3174
- Hotfix gaudi2 with newer transformers. by @Narsil in #3176
- Support flashinfer for Gemma3 prefill by @danieldk in #3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in #2648 (see the trace-propagation sketch after this list)
- Bump `sccache` to 0.10.0 by @alvarobartt in #3179
- Fixing CI by @Narsil in #3184
- Add option to configure prometheus port by @mht-sharma in #3187 (see the metrics-scraping sketch after this list)
- Warmup gaudi backend by @sywangyi in #3172
- Put more wiggle room. by @Narsil in #3189
- Fixing the router + template for Qwen3. by @Narsil in #3200
- Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in #3204 (see the chat-template sketch after this list)
- doc typo by @julien-c in #3206
- Pr 2982 ci branch by @drbh in #3046
- fix: bump snaps for mllama by @drbh in #3202
- Update client SDK snippets by @julien-c in #3207
- Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in #3193 (see the offline-cache sketch after this list)
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in #3144
- forward and tokenize chooser use the same shape by @sywangyi in #3196
- Chunked Prefill VLM by @mht-sharma in #3188
- Prepare for 3.3.0 by @danieldk in #3220
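For the setuptools advisory (#3171), checking an environment is easy to script. A minimal sketch that follows the release note's "<= 70.0 is vulnerable" cutoff:

```python
from importlib.metadata import version

# Fail fast if the installed setuptools falls in the range the release
# notes flag for CVE-2024-6345 (<= 70.0).
installed = version("setuptools")
major_minor = tuple(int(part) for part in installed.split(".")[:2])
if major_minor <= (70, 0):
    raise SystemExit(
        f"setuptools {installed} is vulnerable; upgrade with: pip install -U 'setuptools>70'"
    )
print(f"setuptools {installed} is OK")
```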
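Entry #2648 means the server joins a caller's existing trace instead of minting a new one. A sketch assuming the standard W3C `traceparent` propagation used by OpenTelemetry; the endpoint and header value below are illustrative:

```python
import requests

# The traceparent value follows the W3C format: version-traceid-spanid-flags.
# In a real service this header would come from your tracing middleware.
headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
    headers=headers,
    timeout=60,
)
print(resp.json())
```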
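With #3187 the Prometheus metrics port is configurable rather than fixed. A sketch of scraping the `/metrics` endpoint, assuming the server was deployed with its metrics port set to 9000 (the port value and the exact flag or env name come from your deployment, not from this changelog):

```python
import requests

# Fetch the Prometheus text exposition and print TGI's request counters.
metrics = requests.get("http://localhost:9000/metrics", timeout=10).text
for line in metrics.splitlines():
    if line.startswith("tgi_request"):
        print(line)
```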
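#3204 concerns the `{% generation %}` / `{% endgeneration %}` markers that transformers chat templates use to delimit assistant turns (so `apply_chat_template` can build an assistant-token mask); TGI's template rendering now skips them instead of erroring. A sketch with an illustrative, hand-written toy template and a placeholder model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
# Toy template: assistant turns are wrapped in generation markers.
tok.chat_template = (
    "{% for m in messages %}"
    "{% if m['role'] == 'assistant' %}{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% else %}{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)
out = tok.apply_chat_template(
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
print(out["assistant_masks"])  # 1 where tokens come from the assistant turn
```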
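#3193 makes the Gaudi backend honor `HF_HUB_OFFLINE=1`, which tells `huggingface_hub` to resolve everything from the local cache and never touch the network. A minimal sketch; the model id is a placeholder and must already be cached:

```python
import os

# Must be set before huggingface_hub is imported, since it reads the
# environment at import time.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import snapshot_download

# Resolves purely from the local cache; raises if the snapshot was never downloaded.
path = snapshot_download("gpt2")
print(path)
```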
Full Changelog: v3.2.3...v3.3.0