Releases · huggingface/text-generation-inference
v0.7.0
Features
- server: reduce VRAM requirements of continuous batching (contributed by @njhill)
- server: support BLOOMChat-176B (contributed by @njhill)
- server: add watermarking tests (contributed by @ehsanmok)
- router: add response schema for compat_generate (contributed by @gsaivinay; see the example after this list)
- router: use the number of tokens in a batch as input for dynamic batching (co-authored by @njhill)
- server: improve model downloads and reduce RAM requirements when converting weights to safetensors
- server: optimize flash causal lm decode token
- server: shard decode token
- server: use cuda graph in logits warping
- server: support trust_remote_code
- tests: add snapshot testing
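For context, here is a minimal sketch of a `/generate` request against a local instance, illustrating the response schema the router now also documents for `compat_generate`; the host, port, and prompt are placeholders, not values from this release:

```python
import requests

# Placeholder endpoint; point this at your own deployment.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 20},
    },
)
resp.raise_for_status()
# The response body carries the generated text; token-level details are
# included when "details": true is set in the parameters.
print(resp.json()["generated_text"])
```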
Fix
- server: use float16
- server: fix multinomial implementation in Sampling
- server: do not use device_map="auto" on a single GPU
Misc
- docker: use nvidia base image
New Contributors
- @ehsanmok made their first contribution in #248
- @gsaivinay made their first contribution in #292
- @xyang16 made their first contribution in #343
- @oOraph made their first contribution in #359
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Features
- server: flash attention past key values optimization (contributed by @njhill)
- router: remove requests when the client closes the connection (co-authored by @njhill)
- server: support quantization for flash models
- router: add info route (see the example after this list)
- server: optimize token decode
- server: support flash sharded santacoder
- security: image signing with cosign
- security: image analysis with trivy
- docker: improve image size
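A sketch of querying the new info route, assuming an instance running locally on port 8080; the exact set of returned fields can vary across versions:

```python
import requests

# /info reports metadata about the deployed model (model id, sharding,
# server version, ...). Inspect the full payload for your version.
info = requests.get("http://127.0.0.1:8080/info").json()
print(info.get("model_id"), info.get("version"))
```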
Fix
- server: check cuda capability before importing flash attention
- server: fix hf_transfer issue with private repositories
- router: add auth token for private tokenizers
Misc
- rust: update to 1.69
v0.5.0
Features
- server: add flash-attention-based version of Llama
- server: add flash-attention-based version of Santacoder
- server: support OPT models
- router: make router input validation optional
- docker: improve layer caching
Fix
- server: improve streaming token decoding
- server: fix escape characters in stop sequences
- router: fix NCCL desync issues
- router: use buckets for metrics histograms
v0.4.3
Fix
- router: fix OTLP distributed tracing initialization
v0.4.2
Features
- benchmark: TUI-based benchmarking tool
- router: clear cache on error
- server: add mypy-protobuf
- server: fuse MLP and attention into a single op for flash neox
- image: AWS SageMaker-compatible image
Fix
- server: avoid try/except to determine the kind of AutoModel
- server: fix flash neox rotary embedding
v0.4.1
Features
- server: new, faster GPTNeoX implementation based on flash attention
Fix
- server: fix input-length discrepancy between Rust and Python tokenizers
v0.4.0
Features
- router: support best_of sampling
- router: support left truncation
- server: support typical sampling
- launcher: allow local models
- clients: add text-generation Python client (see the sketch after this list)
- launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
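A minimal sketch using the new Python client together with the new best_of sampling option; the server URL and prompt are placeholders:

```python
from text_generation import Client  # pip install text-generation

client = Client("http://127.0.0.1:8080")  # placeholder URL

# best_of generates several candidate sequences server-side and returns
# the best one; it only makes sense with sampling enabled.
response = client.generate(
    "Why is the sky blue?",
    max_new_tokens=20,
    do_sample=True,
    best_of=2,
)
print(response.generated_text)
```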
Fix
- server: do not warp prefill logits
- server: fix formatting issues in generate_stream tokens
- server: fix galactica batch
- server: fix index out of range issue with watermarking
v0.3.2
Features
- router: add support for huggingface api-inference
- server: add logits watermarking from "A Watermark for Large Language Models" (Kirchenbauer et al., 2023; see the example after this list)
- server: pin to a fixed transformers commit
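A sketch of requesting watermarked generation over HTTP, assuming the `watermark` flag is accepted under `parameters` as in later documented versions of the API:

```python
import requests

# Watermarking biases the logits as in Kirchenbauer et al. (2023) so that
# generated text can later be statistically detected.
resp = requests.post(
    "http://127.0.0.1:8080/generate",  # placeholder endpoint
    json={
        "inputs": "Write a short poem about the sea.",
        "parameters": {"max_new_tokens": 50, "watermark": True},
    },
)
print(resp.json()["generated_text"])
```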
Fix
- launcher: add missing parameters to launcher
- server: update to hf_transfer==0.1.2 to fix corrupted files issue
v0.3.1
Features
- server: allocate full attention mask to decrease latency
- server: enable hf-transfer for insane download speeds (see the sketch after this list)
- router: add CORS options
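hf-transfer is toggled through an environment variable read by `huggingface_hub`; a sketch of enabling it for a manual download (the model id is a placeholder):

```python
import os

# Must be set before huggingface_hub resolves its settings.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # requires `pip install hf_transfer`

from huggingface_hub import snapshot_download

# With the flag set, downloads go through the Rust-based hf_transfer
# backend, which parallelizes transfers far beyond plain HTTP.
snapshot_download("bigscience/bloom-560m")  # placeholder model id
```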
Fix
- server: remove position_ids from galactica forward
v0.3.0
Features
- server: support T5 models
- router: add max_total_tokens and empty_input validation
- launcher: add the possibility to disable custom CUDA kernels
- server: add automatic safetensors conversion
- router: add prometheus scrape endpoint (see the sketch after this list)
- server, router: add distributed tracing
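A sketch of scraping the new Prometheus endpoint by hand; in production you would point a Prometheus scrape config at the same path:

```python
import requests

# /metrics returns the plain-text Prometheus exposition format.
metrics = requests.get("http://127.0.0.1:8080/metrics").text  # placeholder host
for line in metrics.splitlines():
    # "tgi_" is the assumed metric prefix; inspect the raw output to confirm.
    if line.startswith("tgi_"):
        print(line)
```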
Fix
- launcher: copy current env vars to subprocesses
- docker: add note around shared memory