Releases · huggingface/text-generation-inference
v2.0.3
Important changes
- Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
- New speculation method: MLPSpeculator by @JRosenkranz in #1865
- PaliGemma modeling by @drbh in #1895
What's Changed
- Fix: "Fixing" double BOS for mistral too. by @Narsil in #1843
- Adding scripts to prepare load data. by @Narsil in #1841
- Remove misleading warning (not that important nowadays anyway). by @Narsil in #1848
- feat: prefer huggingface_hub in docs and show image api by @drbh in #1844
- Updating Phi3 (long context). by @Narsil in #1849
- Add router name to /info endpoint by @Wauplin in #1854
- Upgrading to rust 1.78. by @Narsil in #1851
- update xpu docker image and use public ipex wheel by @sywangyi in #1860
- Refactor layers. by @Narsil in #1866
- Granite support? by @Narsil in #1882
- Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
- MLPSpeculator. by @JRosenkranz in #1865
- Fixing truncation. by @Narsil in #1890
- Correct 'using guidance' link by @brandon-lockaby in #1892
- Add GPT-2 with flash attention by @danieldk in #1889
- Removing accepted ids in the regular info logs, downgrade to debug. by @Narsil in #1898
- feat: add deprecation warning to clients by @drbh in #1855
- [Bug Fix] Update torch import reference in bnb quantization by @DhruvSrikanth in #1902
- Pali gemma modeling by @drbh in #1895
New Contributors
- @Nilabhra made their first contribution in #1886
- @brandon-lockaby made their first contribution in #1892
- @danieldk made their first contribution in #1889
- @DhruvSrikanth made their first contribution in #1902
Full Changelog: v2.0.2...v2.0.3
v2.0.2
TL;DR
- New models (idefics2, phi3)
- Cleaner VLM support in the OpenAI layer (see the sketch after this list)
- Upgraded to pytorch 2.3.0
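The cleaner VLM support flagged above means image inputs can be passed straight through the OpenAI-compatible chat endpoint. A minimal sketch, assuming a local TGI instance serving a vision model (e.g. Idefics2) on port 8080; the image URL is illustrative:

```python
# Minimal sketch: image input through TGI's OpenAI-compatible chat API.
# Assumes a local TGI server running a vision model (e.g. Idefics2) on port 8080.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; the name here is a placeholder
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```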
What's Changed
- Make `--cuda-graphs 0` work as expected (bis) by @fxmarty in #1768
- fix typos in docs and add small clarifications by @MoritzLaurer in #1790
- Add attribute descriptions for `GenerateParameters` by @Wauplin in #1798
- feat: allow null eos and bos tokens in config by @drbh in #1791
- Phi3 support by @Narsil in #1797
- Idefics2. by @Narsil in #1756
- fix: avoid frequency and repetition penalty on padding tokens by @drbh in #1765
- Adding `HF_HUB_OFFLINE` support in the router. by @Narsil in #1789
- feat: improve temperature logic in chat by @drbh in #1749
- Updating the benchmarks so everyone uses openai compat layer. by @Narsil in #1800
- Update guidance docs to reflect grammar support in API by @dr3s in #1775
- Use the generation config. by @Narsil in #1808
- 2nd round of benchmark modifications (tiny adjustments to avoid overloading the host). by @Narsil in #1816
- Adding new env variables for TPU backends. by @Narsil in #1755
- add intel xpu support for TGI by @sywangyi in #1475
- Blunder by @Narsil in #1815
- Fixing qwen2. by @Narsil in #1818
- Dummy CI run. by @Narsil in #1817
- Changing the waiting_served_ratio default (stack more aggressively by default). by @Narsil in #1820
- Better graceful shutdown. by @Narsil in #1827
- Add the missing `tool_prompt` parameter to Python client by @maziyarpanahi in #1825
- Small CI cleanup. by @Narsil in #1801
- Add reference to TPU support by @brandonroyal in #1760
- fix: use get_speculate to the number of layers by @OlivierDehaene in #1737
- feat: add how it works section by @drbh in #1773
- Fixing frequency penalty by @martinigoyanes in #1811
- feat: add vlm docs and simple examples by @drbh in #1812
- Handle images in chat api by @drbh in #1828
- chore: update torch by @OlivierDehaene in #1730
- (chore): torch 2.3.0 by @Narsil in #1833
New Contributors
- @MoritzLaurer made their first contribution in #1790
- @dr3s made their first contribution in #1775
- @maziyarpanahi made their first contribution in #1825
- @brandonroyal made their first contribution in #1760
- @martinigoyanes made their first contribution in #1811
Full Changelog: v2.0.1...v2.0.2
v2.0.1
v2.0.0
TGI is back to Apache 2.0!
Highlights
- License was reverted to Apache 2.0
- CUDA graphs are now used by default. They improve latency substantially on high-end nodes.
- Llava-next was added. It is the second multimodal model available on TGI after Idefics.
- Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
- FP8 support.
- We now share the vocabulary across all Medusa heads, greatly improving latency and memory use.
Try out Command R+ with Medusa heads on 4xA100s with:

```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
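Once the container is up, a quick smoke test from Python (a sketch using `huggingface_hub`'s `InferenceClient`; it assumes port 8080 as mapped above, and the prompt is arbitrary):

```python
# Query the server started above (assumes port 8080 as mapped in the docker run).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
output = client.text_generation(
    "Write a haiku about speculative decoding.",
    max_new_tokens=64,
)
print(output)
```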
What's Changed
- Add cuda graphs sizes and make it default. by @Narsil in #1703
- Pickle conversion now requires `--trust-remote-code`. by @Narsil in #1704
- Push users to streaming in the readme. by @Narsil in #1698
- Fixing cohere tokenizer. by @Narsil in #1697
- Force weights_only (before fully breaking pickle files anyway). by @Narsil in #1710
- Regenerate ld.so.cache by @oOraph in #1708
- Revert license to Apache 2.0 by @OlivierDehaene in #1714
- Automatic quantization config. by @Narsil in #1719
- Adding Llava-Next (Llava 1.6) with full support. by @Narsil in #1709
- fix: fix CohereForAI/c4ai-command-r-plus by @OlivierDehaene in #1707
- Update libraries by @abhishekkrthakur in #1713
- Dev/mask ldconfig output v2 by @oOraph in #1716
- Fp8 Support by @Narsil in #1726
- Upgrade EETQ (Fixes the cuda graphs). by @Narsil in #1729
- fix(router): fix a possible deadlock in next_batch by @OlivierDehaene in #1731
- chore(cargo-toml): apply lto fat and codegen-units of one by @somehowchris in #1651
- Improve the defaults for the launcher by @Narsil in #1727
- feat: medusa shared by @OlivierDehaene in #1734
- Fix typo in guidance.md by @eltociear in #1735
New Contributors
- @somehowchris made their first contribution in #1651
Full Changelog: v1.4.5...v2.0.0
v1.4.5
What's Changed
- fix: adjust logprob response logic by @drbh in #1682
- fix: handle batches with and without grammars by @drbh in #1676
- feat: Add dbrx support by @OlivierDehaene in #1685
Full Changelog: v1.4.4...v1.4.5
v1.4.4
Highlights
- CohereForAI/c4ai-command-r-v01 model support
What's Changed
- Handle concurrent grammar requests by @drbh in #1610
- Fix idefics default. by @Narsil in #1614
- Fix async client timeout by @hugoabonizio in #1617
- accept legacy request format and response by @drbh in #1527
- add missing stop parameter for chat request by @drbh in #1619
- correctly index into mask when applying grammar by @drbh in #1618
- Use a better model for the quick tour by @lewtun in #1639
- Upgrade nix version from 0.27.1 to 0.28.0 by @yuanwu2017 in #1638
- Update peft + transformers + accelerate + bnb + safetensors by @abhishekkrthakur in #1646
- Fix index in ChatCompletionChunk by @Wauplin in #1648
- Fixing minor typo in documentation: supported hardware section by @SachinVarghese in #1632
- bump minijinja and add test for core templates by @drbh in #1626
- support force downcast after FastRMSNorm multiply for Gemma by @drbh in #1658
- prefer spaces url over temp url by @drbh in #1662
- improve tool type, bump pydantic and outlines by @drbh in #1650
- Remove unnecessary cuda graph. by @Narsil in #1664
- Repair idefics integration tests. by @Narsil in #1663
- fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py by @SeongBeomLEE in #1637
- Inline images for multimodal models. by @Narsil in #1666
New Contributors
- @hugoabonizio made their first contribution in #1617
- @yuanwu2017 made their first contribution in #1638
- @abhishekkrthakur made their first contribution in #1646
- @Wauplin made their first contribution in #1648
- @SachinVarghese made their first contribution in #1632
- @SeongBeomLEE made their first contribution in #1637
Full Changelog: v1.4.3...v1.4.4
v1.4.3
Highlights
- Add support for Starcoder 2
- Add support for Qwen2
What's Changed
- fix openapi schema by @OlivierDehaene in #1586
- avoid default message by @drbh in #1579
- Revamp medusa implementation so that every model can benefit. by @Narsil in #1588
- Support tools by @drbh in #1587 (see the sketch after this list)
- Fixing x-compute-time. by @Narsil in #1606
- Fixing guidance docs. by @Narsil in #1607
- starcoder2 by @OlivierDehaene in #1605
- Qwen2 by @Jason-CKY in #1608
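For the new tools support (#1587, flagged above), a minimal sketch using the OpenAI-style function-calling format; the server URL and the `get_weather` tool are illustrative assumptions:

```python
# Sketch: tool calling through the chat completions API. Assumes a local TGI
# server on port 8080; the get_weather tool below is purely illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```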
Full Changelog: v1.4.2...v1.4.3
v1.4.2
Highlights
- Add support for Google Gemma models
What's Changed
- Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). by @Narsil in #1571
- improve endpoint support by @drbh in #1577
- refactor syntax to correctly include structs by @drbh in #1580
- fix openapi and add jsonschema validation by @OlivierDehaene in #1578
- add support for Gemma by @OlivierDehaene in #1583
Full Changelog: v1.4.1...v1.4.2
v1.4.1
Highlights
- Mamba support by @drbh in #1480 and by @Narsil in #1552
- Experimental support for cuda graphs by @OlivierDehaene in #1428
- Outlines guided generation by @drbh in #1539 (see the sketch after this list)
- Added `name` field to OpenAI-compatible API Messages by @amihalik in #1563
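For the Outlines-backed guided generation flagged above, a minimal sketch of the `grammar` parameter on the `/generate` route; the server URL and the JSON schema are illustrative, and the guidance docs have the exact contract:

```python
# Sketch: JSON-constrained generation via the grammar parameter.
# Assumes a local TGI server on port 8080; the schema is illustrative.
import requests

payload = {
    "inputs": "Give me a character description:",
    "parameters": {
        "max_new_tokens": 128,
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
}
r = requests.post("http://localhost:8080/generate", json=payload)
print(r.json()["generated_text"])
```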
What's Changed
- Fixing top_n_tokens. by @Narsil in #1497
- Sending compute type from the environment instead of hardcoded string by @Narsil in #1504
- Create the compute type at launch time (if not provided in the env). by @Narsil in #1505
- Modify default for max_new_tokens in python client by @freitng in #1336
- feat: eetq gemv optimization when batch_size <= 4 by @dtlzhuangz in #1502
- fix: improve messages api docs content and formatting by @drbh in #1506
- GPTNeoX: Use static rotary embedding by @dwyatte in #1498
- Hotfix the `/health` route. by @Narsil in #1515
- fix: tokenizer config should use local model path when possible by @drbh in #1518
- Updating tokenizers. by @Narsil in #1517
- [docs] Fix link to Install CLI by @pcuenca in #1526
- feat: add ie update to message docs by @drbh in #1523
- feat: use existing add_generation_prompt variable from config in temp… by @drbh in #1533
- Update to peft 0.8.2 by @Stillerman in #1537
- feat(server): add frequency penalty by @OlivierDehaene in #1541
- chore: bump ci rust version by @drbh in #1543
- ROCm AWQ support by @IlyasMoutawwakil in #1514
- feat(router): add max_batch_size by @OlivierDehaene in #1542
- feat: add deserialize_with that handles strings or objects with content by @drbh in #1550
- Fixing glibc version in the runtime. by @Narsil in #1556
- Upgrade intermediary layer for nvidia too. by @Narsil in #1557
- Improving mamba runtime by using updates by @Narsil in #1552
- Small cleanup. by @Narsil in #1560
- Bugfix: eos and bos tokens positions are inconsistent by @amihalik in #1567
- chore: add pre-commit by @OlivierDehaene in #1569
- feat: add chat template struct to avoid tuple ordering errors by @OlivierDehaene in #1570
- v1.4.1 by @OlivierDehaene in #1568
New Contributors
- @freitng made their first contribution in #1336
- @dtlzhuangz made their first contribution in #1502
- @dwyatte made their first contribution in #1498
- @pcuenca made their first contribution in #1526
- @Stillerman made their first contribution in #1537
- @IlyasMoutawwakil made their first contribution in #1514
- @amihalik made their first contribution in #1563
Full Changelog: v1.4.0...v1.4.1
v1.4.0
Highlights
- OpenAI-compatible API #1427 (see the sketch after this list)
- exllama v2 Tensor Parallel #1490
- GPTQ support for AMD GPUs #1489
- Phi support #1442
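Since the API is OpenAI-compatible (highlighted above), existing OpenAI client code can simply be pointed at a TGI server. A minimal streaming sketch, assuming a local server on port 8080 (TGI does not validate the API key, but the client requires a non-empty one):

```python
# Sketch: the openai client pointed at TGI's chat completions endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```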
What's Changed
- fix: fix local loading for .bin models by @OlivierDehaene in #1419
- Fix missing make target platform for local install: 'install-flash-attention-v2' by @deepily in #1414
- fix: follow base model for tokenizer in router by @OlivierDehaene in #1424
- Fix local load for Medusa by @PYNing in #1420
- Return prompt vs generated tokens. by @Narsil in #1436
- feat: supports openai chat completions API by @drbh in #1427
- feat: support raise_exception, bos and eos tokens by @drbh in #1450
- chore: bump rust version and annotate/fix all clippy warnings by @drbh in #1455
- feat: conditionally toggle chat on invocations route by @drbh in #1454
- Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in #1470
- Fixing non divisible embeddings. by @Narsil in #1476
- Add messages api compatibility docs by @drbh in #1478
- Add a new `/tokenize` route to get the tokenized input by @Narsil in #1471 (see the sketch after this list)
- feat: adds phi model by @drbh in #1442
- fix: read stderr in download by @OlivierDehaene in #1486
- fix: show warning with tokenizer config parsing error by @drbh in #1488
- fix: launcher doc typos by @Narsil in #1473
- Reinstate exl2 with tp by @Narsil in #1490
- Add sealion mpt support by @Narsil in #1477
- Trying to fix that flaky test. by @Narsil in #1491
- fix: launcher doc typos by @thelinuxkid in #1462
- Update the docs to include newer models. by @Narsil in #1492
- GPTQ support on ROCm by @fxmarty in #1489
- feat: add tokenizer-config-path to launcher args by @drbh in #1495
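For the new `/tokenize` route referenced above, a minimal sketch (local server assumed; the exact response shape follows the server's OpenAPI schema):

```python
# Sketch: inspect how the server tokenizes an input.
# Assumes a local TGI server on port 8080.
import requests

r = requests.post(
    "http://localhost:8080/tokenize",
    json={"inputs": "Hello, world!"},
)
print(r.json())  # expected: a list of tokens with ids and character offsets
```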
New Contributors
- @deepily made their first contribution in #1414
- @PYNing made their first contribution in #1420
- @drbh made their first contribution in #1427
- @EndlessReform made their first contribution in #1470
- @thelinuxkid made their first contribution in #1462
Full Changelog: v1.3.4...v1.4.0