
Commit c38a7d7

v2.0.0 (#1736)
1 parent: 275caa0

14 files changed: +124 / -133 lines changed

Cargo.lock

Lines changed: 95 additions & 118 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ members = [
 resolver = "2"
 
 [workspace.package]
-version = "1.4.5"
+version = "2.0.0"
 edition = "2021"
 authors = ["Olivier Dehaene"]
 homepage = "https://github.com/huggingface/text-generation-inference"

README.md

Lines changed: 3 additions & 3 deletions
@@ -76,7 +76,7 @@ For a detailed starting guide, please see the [Quick Tour](https://huggingface.c
 model=HuggingFaceH4/zephyr-7b-beta
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
 ```
 
 And then you can make requests like
@@ -90,7 +90,7 @@ curl 127.0.0.1:8080/generate_stream \
 
 **Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.
 
-**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id $model` instead of the command above.
+**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model` instead of the command above.
 
 To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
 ```
@@ -120,7 +120,7 @@ model=meta-llama/Llama-2-7b-chat-hf
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your cli READ token>
 
-docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
+docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
 ```
 
 ### A note on Shared Memory (shm)

docs/openapi.json

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
       "name": "Apache 2.0",
       "url": "https://www.apache.org/licenses/LICENSE-2.0"
     },
-    "version": "1.4.5"
+    "version": "2.0.0"
   },
   "paths": {
     "/": {

integration-tests/models/__snapshots__/test_tools_llama/test_flash_llama_grammar_no_tools.json

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
   "id": "",
   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
   "object": "text_completion",
-  "system_fingerprint": "1.4.5-native",
+  "system_fingerprint": "2.0.0-native",
   "usage": {
     "completion_tokens": 100,
     "prompt_tokens": 60,

integration-tests/models/__snapshots__/test_tools_llama/test_flash_llama_grammar_tools.json

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
   "id": "",
   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
   "object": "text_completion",
-  "system_fingerprint": "1.4.5-native",
+  "system_fingerprint": "2.0.0-native",
   "usage": {
     "completion_tokens": 29,
     "prompt_tokens": 316,

integration-tests/models/__snapshots__/test_tools_llama/test_flash_llama_grammar_tools_auto.json

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
   "id": "",
   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
   "object": "text_completion",
-  "system_fingerprint": "1.4.5-native",
+  "system_fingerprint": "2.0.0-native",
   "usage": {
     "completion_tokens": 29,
     "prompt_tokens": 316,

integration-tests/models/__snapshots__/test_tools_llama/test_flash_llama_grammar_tools_choice.json

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
   "id": "",
   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
   "object": "text_completion",
-  "system_fingerprint": "1.4.5-native",
+  "system_fingerprint": "2.0.0-native",
   "usage": {
     "completion_tokens": 21,
     "prompt_tokens": 187,

integration-tests/models/__snapshots__/test_tools_llama/test_flash_llama_grammar_tools_stream.json

Lines changed: 1 addition & 1 deletion
@@ -23,5 +23,5 @@
   "id": "",
   "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
   "object": "text_completion",
-  "system_fingerprint": "1.4.5-native"
+  "system_fingerprint": "2.0.0-native"
 }

integration-tests/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "text-generation-integration-tests"
-version = "1.4.5"
+version = "2.0.0"
 description = "Text Generation Inference integration tests"
 authors = ["Nicolas Patry <[email protected]>"]
 

server/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "text-generation-server"
-version = "1.4.5"
+version = "2.0.0"
 description = "Text Generation Inference Python gRPC Server"
 authors = ["Olivier Dehaene <[email protected]>"]
 

server/text_generation_server/interceptor.py

Lines changed: 4 additions & 0 deletions
@@ -23,6 +23,10 @@ async def intercept(
         method_name = method_name.split("/")[-1]
         logger.exception(f"Method {method_name} encountered an error.")
 
+        # Runtime Error cannot be recovered from
+        if isinstance(err, RuntimeError):
+            exit(1)
+
         if torch.cuda.is_available():
             torch.cuda.empty_cache()
 
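Taken together with the `cache_manager.py` change below, running out of KV-cache blocks now surfaces as a `RuntimeError`, which this interceptor treats as unrecoverable: the shard exits instead of continuing in a potentially inconsistent state. A rough standalone sketch of that pattern (hypothetical function shape, not the actual TGI `ExceptionInterceptor` or its gRPC wiring):

```python
# Simplified, hypothetical sketch of the error-handling flow shown above;
# the real interceptor hooks into the gRPC server and uses loguru logging.
import sys
import torch

async def intercept(method, request, context, method_name: str):
    try:
        return await method(request, context)
    except Exception as err:
        short_name = method_name.split("/")[-1]
        print(f"Method {short_name} encountered an error: {err}", file=sys.stderr)

        # Runtime Error cannot be recovered from: exit the shard rather than
        # keep serving from a potentially corrupted state
        if isinstance(err, RuntimeError):
            sys.exit(1)

        # Other exceptions: release cached CUDA memory and let the caller
        # turn the error into a gRPC status
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        raise
```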

server/text_generation_server/models/cache_manager.py

Lines changed: 4 additions & 3 deletions
@@ -55,9 +55,10 @@ def allocate(
     ):
         # Get free blocks indices by finding values in mask that are not set to 0
         free_block_indices = self.free_block_mask.nonzero()
-        assert (
-            len(free_block_indices) >= blocks
-        ), f"Out of available cache blocks: asked {blocks}, only {len(free_block_indices)} free blocks"
+        if blocks > len(free_block_indices):
+            raise RuntimeError(
+                f"Out of available cache blocks: asked {blocks}, only {len(free_block_indices)} free blocks"
+            )
 
         # Slice by the number of required blocks
         block_indices = free_block_indices[:blocks]
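For context, the guard above operates on a mask of free KV-cache blocks: `nonzero()` lists the still-free slots, and asking for more blocks than are free now raises a `RuntimeError` (which, per the interceptor change above, takes the shard down) instead of tripping an assert. A small self-contained sketch of the same check, with a made-up `num_blocks` and a plain function standing in for the actual `CacheManager`:

```python
# Illustrative sketch only; the block count and function shape are assumptions,
# not the actual TGI CacheManager implementation.
import torch

num_blocks = 8
free_block_mask = torch.ones(num_blocks, dtype=torch.int64)

def allocate(blocks: int) -> torch.Tensor:
    # Get free block indices by finding values in the mask that are not 0
    free_block_indices = free_block_mask.nonzero()
    if blocks > len(free_block_indices):
        # Raised as RuntimeError so the gRPC interceptor treats it as fatal
        raise RuntimeError(
            f"Out of available cache blocks: asked {blocks}, "
            f"only {len(free_block_indices)} free blocks"
        )
    # Slice by the number of required blocks and mark them as used
    block_indices = free_block_indices[:blocks]
    free_block_mask[block_indices] = 0
    return block_indices.flatten()

print(allocate(3))   # tensor([0, 1, 2])
allocate(10)         # raises RuntimeError: asked 10, only 5 free blocks
```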

server/text_generation_server/utils/layers.py

Lines changed: 9 additions & 0 deletions
@@ -503,6 +503,10 @@ def forward(
         self, input: torch.Tensor
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
         logits = self.lm_head(input)
+        # If we have too many tokens, we skip speculative logits
+        if input.shape[0] > 128:
+            return logits, None
+
         speculative_logits = self.medusa(input)
         return logits, speculative_logits
 

@@ -549,6 +553,11 @@ def __init__(self, config, prefix, weights):
         self.lm_head = TensorParallelHead.load(config, prefix, weights)
 
     def forward(self, x):
+        # If we have too many tokens, we skip speculative logits
+        if x.shape[0] > 128:
+            logits = self.lm_head(x)
+            return logits, None
+
         size = x.shape[-1]
         block_size = (size + self.world_size - 1) // self.world_size
         start = self.rank * block_size
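Both forward paths above gain the same shape-based gate: once the input carries more than 128 tokens (typically large prefill or heavily batched requests), the Medusa speculative head is skipped and only the regular LM-head logits are returned, presumably because speculation only pays off for small decode batches. A rough illustrative sketch of that gate with made-up module sizes (hypothetical class, not the actual head classes in `layers.py`):

```python
# Hypothetical stand-in modules; only the >128-token gate mirrors the diff above.
from typing import Optional, Tuple

import torch
import torch.nn as nn


class GatedSpeculativeHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_speculative: int = 2):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Stand-in for the Medusa speculative heads
        self.medusa = nn.Linear(hidden_size, n_speculative * vocab_size, bias=False)

    def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        logits = self.lm_head(input)
        # If we have too many tokens, we skip speculative logits
        if input.shape[0] > 128:
            return logits, None
        speculative_logits = self.medusa(input)
        return logits, speculative_logits


head = GatedSpeculativeHead(hidden_size=16, vocab_size=32)
small = torch.randn(4, 16)    # decode-sized batch: speculative logits returned
large = torch.randn(256, 16)  # prefill-sized batch: speculative logits skipped
print(head(small)[1] is not None, head(large)[1] is None)  # True True
```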
