Releases · huggingface/text-generation-inference
v0.7.0
Features
- server: reduce VRAM requirements of continuous batching (contributed by @njhill)
- server: support BLOOMChat-176B (contributed by @njhill)
- server: add watermarking tests (contributed by @ehsanmok)
- router: add response schema for compat_generate (contributed by @gsaivinay; see the example after this list)
- router: use the number of tokens in a batch as input for dynamic batching (co-authored by @njhill)
- server: improve model downloads and reduce RAM requirements when converting weights to safetensors
- server: optimize flash causal lm decode token
- server: shard decode token
- server: use cuda graph in logits warping
- server: support trust_remote_code
- tests: add snapshot testing
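For context, here is a minimal sketch of a `/generate` request against a local instance, illustrating the response schema the router now also documents for `compat_generate`; the host, port, and prompt are placeholders, not values from this release:

```python
import requests

# Placeholder endpoint; point this at your own deployment.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 20},
    },
)
resp.raise_for_status()
# The response body carries the generated text; token-level details are
# included when "details": true is set in the parameters.
print(resp.json()["generated_text"])
```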
Fix
- server: use float16
- server: fix multinomial implementation in Sampling
- server: do not use device_map="auto" on a single GPU
Misc
- docker: use nvidia base image
New Contributors
- @ehsanmok made their first contribution in #248
- @gsaivinay made their first contribution in #292
- @xyang16 made their first contribution in #343
- @oOraph made their first contribution in #359
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Features
- server: flash attention past key values optimization (contributed by @njhill)
- router: remove requests when the client closes the connection (co-authored by @njhill)
- server: support quantization for flash models
- router: add info route (see the example after this list)
- server: optimize token decode
- server: support flash sharded santacoder
- security: image signing with cosign
- security: image analysis with trivy
- docker: improve image size
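A sketch of querying the new info route, assuming an instance running locally on port 8080; the exact set of returned fields can vary across versions:

```python
import requests

# /info reports metadata about the deployed model (model id, sharding,
# server version, ...). Inspect the full payload for your version.
info = requests.get("http://127.0.0.1:8080/info").json()
print(info.get("model_id"), info.get("version"))
```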
Fix
- server: check cuda capability before importing flash attention
- server: fix hf_transfer issue with private repositories
- router: add auth token for private tokenizers
Misc
- rust: update to 1.69
v0.5.0
Features
- server: add flash-attention-based version of Llama
- server: add flash-attention-based version of Santacoder
- server: support OPT models
- router: make router input validation optional
- docker: improve layer caching
Fix
- server: improve streaming token decoding
- server: fix escape characters in stop sequences
- router: fix NCCL desync issues
- router: use buckets for metrics histograms
v0.4.3
Fix
- router: fix OTLP distributed tracing initialization
v0.4.2
Features
- benchmark: TUI-based benchmarking tool
- router: clear cache on error
- server: add mypy-protobuf
- server: fuse MLP and attention into a single op for flash neox
- image: AWS SageMaker-compatible image
Fix
- server: avoid try/except to determine the kind of AutoModel
- server: fix flash neox rotary embedding
v0.4.1
Features
- server: new, faster GPTNeoX implementation based on flash attention
Fix
- server: fix input-length discrepancy between Rust and Python tokenizers
v0.4.0
Features
- router: support best_of sampling
- router: support left truncation
- server: support typical sampling
- launcher: allow local models
- clients: add text-generation Python client (see the sketch after this list)
- launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
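A minimal sketch using the new Python client together with the new best_of sampling option; the server URL and prompt are placeholders:

```python
from text_generation import Client  # pip install text-generation

client = Client("http://127.0.0.1:8080")  # placeholder URL

# best_of generates several candidate sequences server-side and returns
# the best one; it only makes sense with sampling enabled.
response = client.generate(
    "Why is the sky blue?",
    max_new_tokens=20,
    do_sample=True,
    best_of=2,
)
print(response.generated_text)
```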
Fix
- server: do not warp prefill logits
- server: fix formatting issues in generate_stream tokens
- server: fix galactica batch
- server: fix index out of range issue with watermarking
v0.3.2
Features
- router: add support for huggingface api-inference
- server: add logits watermarking from "A Watermark for Large Language Models" (Kirchenbauer et al., 2023; see the example after this list)
- server: pin to a fixed transformers commit
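A sketch of requesting watermarked generation over HTTP, assuming the `watermark` flag is accepted under `parameters` as in later documented versions of the API:

```python
import requests

# Watermarking biases the logits as in Kirchenbauer et al. (2023) so that
# generated text can later be statistically detected.
resp = requests.post(
    "http://127.0.0.1:8080/generate",  # placeholder endpoint
    json={
        "inputs": "Write a short poem about the sea.",
        "parameters": {"max_new_tokens": 50, "watermark": True},
    },
)
print(resp.json()["generated_text"])
```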
Fix
- launcher: add missing parameters to launcher
- server: update to hf_transfer==0.1.2 to fix corrupted files issue
v0.3.1
Features
- server: allocate full attention mask to decrease latency
- server: enable hf-transfer for insane download speeds (see the sketch after this list)
- router: add CORS options
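hf-transfer is toggled through an environment variable read by `huggingface_hub`; a sketch of enabling it for a manual download (the model id is a placeholder):

```python
import os

# Must be set before huggingface_hub resolves its settings.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # requires `pip install hf_transfer`

from huggingface_hub import snapshot_download

# With the flag set, downloads go through the Rust-based hf_transfer
# backend, which parallelizes transfers far beyond plain HTTP.
snapshot_download("bigscience/bloom-560m")  # placeholder model id
```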
Fix
- server: remove position_ids from galactica forward
v0.3.0
Features
- server: support T5 models
- router: add max_total_tokens and empty_input validation
- launcher: add the possibility to disable custom CUDA kernels
- server: add automatic safetensors conversion
- router: add prometheus scrape endpoint (see the sketch after this list)
- server, router: add distributed tracing
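A sketch of scraping the new Prometheus endpoint by hand; in production you would point a Prometheus scrape config at the same path:

```python
import requests

# /metrics returns the plain-text Prometheus exposition format.
metrics = requests.get("http://127.0.0.1:8080/metrics").text  # placeholder host
for line in metrics.splitlines():
    # "tgi_" is the assumed metric prefix; inspect the raw output to confirm.
    if line.startswith("tgi_"):
        print(line)
```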
Fix
- launcher: copy current env vars to subprocesses
- docker: add note around shared memory