
Releases: huggingface/text-generation-inference

v0.7.0

23 May 19:21
d31562f

Features

  • server: reduce VRAM requirements of continuous batching (contributed by @njhill)
  • server: support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: add response schema for compat_generate (contributed by @gsaivinay)
  • router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
  • server: improve download speed and reduce RAM requirements of safetensors conversion
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use CUDA graph in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing
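
The trust_remote_code support is exposed at launch time; a minimal sketch, assuming the standard text-generation-launcher CLI and an illustrative model id:

```shell
# Sketch: opt in to custom modeling code when launching the server.
# The model id is illustrative; --trust-remote-code executes code shipped
# in the model repository, so only enable it for repositories you trust.
text-generation-launcher \
    --model-id my-org/model-with-custom-code \
    --trust-remote-code
```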

Fix

  • server: use float16
  • server: fix multinomial implementation in Sampling
  • server: do not use device_map auto on single GPU

Misc

  • docker: use nvidia base image

New Contributors

Full Changelog: v0.6.0...v0.7.0

v0.6.0

21 Apr 19:02
6ded76a

Features

  • server: flash attention past key values optimization (contributed by @njhill)
  • router: remove requests when client closes the connection (co-authored by @njhill)
  • server: support quantization for flash models
  • router: add info route
  • server: optimize token decode
  • server: support flash sharded santacoder
  • security: image signing with cosign
  • security: image analysis with trivy
  • docker: improve image size
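
Quantization for flash models and the new /info route can be exercised together; a minimal sketch, assuming the launcher's flag names at this version and an illustrative port (the exact form of the quantize flag may vary between releases):

```shell
# Launch with weight quantization enabled, then query the new /info
# route for model metadata. Model id and port are illustrative.
text-generation-launcher --model-id bigscience/bloom-560m \
    --quantize --port 8080 &
curl -s http://localhost:8080/info
```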

Fix

  • server: check cuda capability before importing flash attention
  • server: fix hf_transfer issue with private repositories
  • router: add auth token for private tokenizers

Misc

  • rust: update to 1.69

v0.5.0

11 Apr 18:32
6f0f1d7

Features

  • server: add flash-attention based version of Llama
  • server: add flash-attention based version of Santacoder
  • server: support OPT models
  • router: make router input validation optional
  • docker: improve layer caching

Fix

  • server: improve token streaming decoding
  • server: fix escape characters in stop sequences
  • router: fix NCCL desync issues
  • router: use buckets for metrics histograms

v0.4.3

30 Mar 15:29
fef1a1c

Fix

  • router: fix OTLP distributed tracing initialization

v0.4.2

30 Mar 15:10
84722f3

Features

  • benchmark: TUI-based benchmarking tool
  • router: clear cache on error
  • server: add mypy-protobuf
  • server: fuse MLP and attention into a single op for flash neox
  • image: aws sagemaker compatible image
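
The benchmarking tool ships as its own binary; a minimal sketch, assuming the text-generation-benchmark binary name and an illustrative tokenizer:

```shell
# Run inside the serving container so the tool can reach the model shards.
# The tokenizer name is illustrative; it is used to tokenize benchmark inputs.
text-generation-benchmark --tokenizer-name bigscience/bloom-560m
```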

Fix

  • server: avoid try/except to determine the kind of AutoModel
  • server: fix flash neox rotary embedding

v0.4.1

26 Mar 14:38
ab5fd8c

Features

  • server: new, faster GPTNeoX implementation based on flash attention

Fix

  • server: fix input-length discrepancy between Rust and Python tokenizers

v0.4.0

09 Mar 15:10
411d624

Features

  • router: support best_of sampling
  • router: support left truncation
  • server: support typical sampling
  • launcher: allow local models
  • clients: add text-generation Python client
  • launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
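
The sampling and truncation features above surface as request parameters on the router's /generate route, and the same parameters are exposed by the new Python client. A minimal sketch of such a request body, assuming the public parameter names best_of, typical_p, and truncate; all values are illustrative:

```python
import json

# Illustrative /generate request body exercising best_of sampling,
# typical sampling, and left truncation (all values are examples).
payload = {
    "inputs": "What is deep learning?",
    "parameters": {
        "best_of": 2,          # sample 2 sequences, return the best one
        "do_sample": True,     # best_of requires sampling
        "typical_p": 0.95,     # typical sampling mass cutoff
        "truncate": 1024,      # left-truncate the prompt to 1024 tokens
        "max_new_tokens": 20,
    },
}
body = json.dumps(payload)
```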

Fix

  • server: do not warp prefill logits
  • server: fix formatting issues in generate_stream tokens
  • server: fix galactica batch
  • server: fix index out of range issue with watermarking

v0.3.2

03 Mar 17:42
1c19b09

Features

  • router: add support for huggingface api-inference
  • server: add logits watermark with "A Watermark for Large Language Models"
  • server: use a fixed transformers commit

Fix

  • launcher: add missing parameters to launcher
  • server: update to hf_transfer==0.1.2 to fix corrupted files issue

v0.3.1

24 Feb 12:27
4b1c972

Features

  • server: allocate full attention mask to decrease latency
  • server: enable hf-transfer for insane download speeds
  • router: add CORS options
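
hf-transfer downloads are toggled through huggingface_hub's standard opt-in environment variable; a minimal sketch (model id illustrative):

```shell
# HF_HUB_ENABLE_HF_TRANSFER=1 switches huggingface_hub to the Rust-based
# hf_transfer downloader; the hf_transfer package must be installed.
HF_HUB_ENABLE_HF_TRANSFER=1 \
    text-generation-launcher --model-id bigscience/bloom-560m
```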

Fix

  • server: remove position_ids from galactica forward

v0.3.0

16 Feb 16:33
c720555

Features

  • server: support t5 models
  • router: add max_total_tokens and empty_input validation
  • launcher: add the possibility to disable custom CUDA kernels
  • server: add automatic safetensors conversion
  • router: add prometheus scrape endpoint
  • server, router: add distributed tracing
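
The Prometheus endpoint can be scraped directly; a minimal sketch, assuming the conventional /metrics path and an illustrative port:

```shell
# Fetch the router's metrics in Prometheus exposition format.
curl -s http://localhost:8080/metrics | head
```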

Fix

  • launcher: copy current env vars to subprocesses
  • docker: add note around shared memory