
Commit 6c6cd93

v1.3.0 (#313)
1 parent 3c2f9ba commit 6c6cd93

File tree: 7 files changed (+163 −79 lines changed)


Cargo.lock

Lines changed: 7 additions & 7 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ members = [
 resolver = "2"
 
 [workspace.package]
-version = "1.2.3"
+version = "1.3.0"
 edition = "2021"
 authors = ["Olivier Dehaene"]
 homepage = "https://github.com/huggingface/text-embeddings-inference"

README.md

Lines changed: 89 additions & 43 deletions
@@ -63,36 +63,37 @@ Ember, GTE and E5. TEI implements many features such as:
 
 #### Text Embeddings
 
-You can use any JinaBERT model with Alibi or absolute positions or any BERT, CamemBERT, RoBERTa, or XLM-RoBERTa model
-with absolute positions in `text-embeddings-inference`.
+Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, XLM-RoBERTa models with absolute positions, JinaBERT
+model with Alibi positions and Mistral, Alibaba GTE models with Rope positions.
 
-**Support for other model types will be added in the future.**
+Below are some examples of the currently supported models:
 
-Examples of supported models:
+| MTEB Rank | Model Size     | Model Type  | Model ID                                                                                          |
+|-----------|----------------|-------------|---------------------------------------------------------------------------------------------------|
+| 1         | 7B (Very Slow) | Mistral     | [Salesforce/SFR-Embedding-2_R](https://hf.co/Salesforce/SFR-Embedding-2_R)                       |
+| 15        | 0.4B           | Alibaba GTE | [Alibaba-NLP/gte-large-en-v1.5](https://hf.co/Alibaba-NLP/gte-large-en-v1.5)                     |
+| 20        | 0.3B           | Bert        | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1)                                   |
+| 24        | 0.5B           | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
+| N/A       | 0.1B           | NomicBert   | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1)                       |
+| N/A       | 0.1B           | NomicBert   | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5)                   |
+| N/A       | 0.1B           | JinaBERT    | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en)             |
+| N/A       | 0.1B           | JinaBERT    | [jinaai/jina-embeddings-v2-base-code](https://hf.co/jinaai/jina-embeddings-v2-base-code)         |
 
-| MTEB Rank | Model Type  | Model ID                                                                                          |
-|-----------|-------------|----------------------------------------------------------------------------------------------------|
-| 6         | Bert        | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1)                                    |
-| 10        | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct)  |
-| N/A       | NomicBert   | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1)                        |
-| N/A       | NomicBert   | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5)                    |
-| N/A       | JinaBERT    | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en)              |
-| N/A       | JinaBERT    | [jinaai/jina-embeddings-v2-base-code](https://hf.co/jinaai/jina-embeddings-v2-base-code)          |
 
-You can explore the list of best performing text embeddings
-models [here](https://huggingface.co/spaces/mteb/leaderboard).
+To explore the list of best performing text embeddings models, visit the
+[Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
 
 #### Sequence Classification and Re-Ranking
 
-`text-embeddings-inference` v0.4.0 added support for Bert, CamemBERT, RoBERTa and XLM-RoBERTa Sequence Classification models.
+Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions.
 
-Example of supported sequence classification models:
+Below are some examples of the currently supported models:
 
-| Task               | Model Type  | Model ID                                                                                     |
-|--------------------|-------------|-----------------------------------------------------------------------------------------------|
-| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)                    |
-| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)                      |
-| Sentiment Analysis | RoBERTa     | [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions)  |
+| Task               | Model Type  | Model ID                                                                                     | Revision    |
+|--------------------|-------------|-----------------------------------------------------------------------------------------------|-------------|
+| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)                    | `refs/pr/4` |
+| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)                      | `refs/pr/5` |
+| Sentiment Analysis | RoBERTa     | [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) |             |
 
 ### Docker
 
@@ -101,7 +102,7 @@ model=BAAI/bge-large-en-v1.5
 revision=refs/pr/5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --revision $revision
 ```
 
 And then you can make requests like
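For illustration, a request to the container started above goes to the `/embed` route, along these lines (a sketch, not taken from this diff):

```shell
# Assumes the container above is listening on port 8080.
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```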
@@ -163,9 +164,11 @@ Options:
 [env: POOLING=]
 
 Possible values:
-- cls: Select the CLS token as embedding
-- mean: Apply Mean pooling to the model embeddings
-- splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
+- cls: Select the CLS token as embedding
+- mean: Apply Mean pooling to the model embeddings
+- splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only
+  available if the loaded model is a `ForMaskedLM` Transformer model
+- last-token: Select the last token as embedding
 
 --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
 The maximum amount of concurrent requests for this particular deployment.
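For illustration, the new `last-token` pooling value pairs naturally with the Mistral model listed in the README table above; a hypothetical invocation (model ID and flags are only an example, not part of this diff):

```shell
model=Salesforce/SFR-Embedding-2_R
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

# Hypothetical: select the last token of each sequence as the embedding.
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --pooling last-token
```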
@@ -199,6 +202,37 @@ Options:
 [env: MAX_CLIENT_BATCH_SIZE=]
 [default: 32]
 
+--auto-truncate
+Automatically truncate inputs that are longer than the maximum supported size
+
+Unused for gRPC servers
+
+[env: AUTO_TRUNCATE=]
+
+--default-prompt-name <DEFAULT_PROMPT_NAME>
+The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `Sentence Transformers` configuration `prompts` dictionary.
+
+For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the
+sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because
+the prompt text will be prepended before any text to encode.
+
+The argument '--default-prompt-name <DEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt <DEFAULT_PROMPT>'
+
+[env: DEFAULT_PROMPT_NAME=]
+
+--default-prompt <DEFAULT_PROMPT>
+The prompt that should be used by default for encoding. If not set, no prompt will be applied.
+
+For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be
+encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text
+to encode.
+
+The argument '--default-prompt <DEFAULT_PROMPT>' cannot be used with '--default-prompt-name <DEFAULT_PROMPT_NAME>'
+
+[env: DEFAULT_PROMPT=]
+
 --hf-api-token <HF_API_TOKEN>
 Your HuggingFace hub token
 
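As a sketch of the new prompt options (not taken from this diff; the model ID is only a placeholder), prepending a fixed prompt to every input could look like:

```shell
model=intfloat/multilingual-e5-large-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

# Hypothetical: prepend "query: " to every text before it is encoded.
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --default-prompt "query: "
```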
@@ -224,9 +258,10 @@ Options:
 [default: /tmp/text-embeddings-inference-server]
 
 --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
-The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
+The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk
+for instance
 
-[env: HUGGINGFACE_HUB_CACHE=/data]
+[env: HUGGINGFACE_HUB_CACHE=]
 
 --payload-limit <PAYLOAD_LIMIT>
 Payload size limit in bytes
@@ -239,7 +274,8 @@ Options:
 --api-key <API_KEY>
 Set an api key for request authorization.
 
-By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.
+By default the server responds to every request. With an api key set, the requests must have the Authorization
+header set with the api key as Bearer token.
 
 [env: API_KEY=]
 
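For illustration, when the server is started with `--api-key`, requests then carry the key as a Bearer token (a sketch, assuming the key is exported as `API_KEY`):

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $API_KEY"
```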
@@ -254,12 +290,14 @@ Options:
 [env: OTLP_ENDPOINT=]
 
 --otlp-service-name <OTLP_SERVICE_NAME>
-The service name for opentelemetry.
+The service name for opentelemetry. e.g. `text-embeddings-inference.server`
 
 [env: OTLP_SERVICE_NAME=]
 [default: text-embeddings-inference.server]
 
 --cors-allow-origin <CORS_ALLOW_ORIGIN>
+Unused for gRPC servers
+
 [env: CORS_ALLOW_ORIGIN=]
 ```
 
@@ -269,13 +307,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to
 
 | Architecture                        | Image                                                                    |
 |-------------------------------------|---------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.2                   |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.3                   |
 | Volta                               | NOT SUPPORTED                                                            |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.2 (experimental) |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.2                       |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.2                    |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.2                    |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.2 (experimental) |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.3 (experimental) |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.3                       |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.3                    |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.3                    |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.3 (experimental) |
 
 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
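For illustration, enabling Flash Attention v1 on the experimental Turing image might look like the following sketch, which simply combines the table and warning above (not part of this diff):

```shell
model=BAAI/bge-large-en-v1.5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

# Hypothetical: opt in to Flash Attention v1 on a Turing GPU (off by default due to precision issues).
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.3 --model-id $model
```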
@@ -304,7 +342,7 @@ model=<your private model>
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your cli READ token>
 
-docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model
+docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model
 ```
 
 ### Using Re-rankers models
@@ -322,7 +360,7 @@ model=BAAI/bge-reranker-large
 revision=refs/pr/4
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --revision $revision
 ```
 
 And then you can rank the similarity between a query and a list of texts with:
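For illustration, a `/rerank` request against the container above might look like this sketch (payload values are placeholders, not part of this diff):

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```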
@@ -342,7 +380,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model
 ```
 
 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
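For illustration, a `predict` request might look like this sketch (the input text is only a placeholder):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```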
@@ -362,7 +400,7 @@ You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM archi
 model=naver/efficient-splade-VI-BT-large-query
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --pooling splade
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3 --model-id $model --pooling splade
 ```
 
 Once you have deployed the model you can use the `/embed_sparse` endpoint to get the sparse embedding:
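For illustration, an `/embed_sparse` request might look like this sketch (the input text is only a placeholder):

```shell
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```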
@@ -382,7 +420,8 @@ by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
 ### gRPC
 
 `text-embeddings-inference` offers a gRPC API as an alternative to the default HTTP API for high performance
-deployments. The API protobuf definition can be found [here](https://github.com/huggingface/text-embeddings-inference/blob/main/proto/tei.proto).
+deployments. The API protobuf definition can be
+found [here](https://github.com/huggingface/text-embeddings-inference/blob/main/proto/tei.proto).
 
 You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For example:
 
@@ -391,7 +430,7 @@ model=BAAI/bge-large-en-v1.5
 revision=refs/pr/5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2-grpc --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.3-grpc --model-id $model --revision $revision
 ```
 
 ```shell
@@ -438,7 +477,8 @@ sudo apt-get install libssl-dev gcc -y
 
 GPUs with Cuda compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).
 
-Make sure you have Cuda and the nvidia drivers installed. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
+Make sure you have Cuda and the nvidia drivers installed. NVIDIA drivers on your device need to be compatible with CUDA
+version 12.2 or higher.
 You also need to add the nvidia binaries to your path:
 
 ```shell
@@ -499,12 +539,18 @@ docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_
 ```
 
 ### Apple M1/M2 arm64 architectures
+
 #### DISCLAIMER
-As explained here [MPS-Ready, ARM64 Docker Image](https://github.com/pytorch/pytorch/issues/81224), Metal / MPS is not supported via Docker. As such inference will be CPU bound and most likely pretty slow when using this docker image on an M1/M2 ARM CPU.
+
+As explained here [MPS-Ready, ARM64 Docker Image](https://github.com/pytorch/pytorch/issues/81224), Metal / MPS is not
+supported via Docker. As such inference will be CPU bound and most likely pretty slow when using this docker image on an
+M1/M2 ARM CPU.
+
 ```
 docker build . -f Dockerfile-arm64 --platform=linux/arm64
 ```
 
 ## Examples
+
 - [Set up an Inference Endpoint with TEI](https://huggingface.co/learn/cookbook/automatic_embedding_tei_inference_endpoints)
 - [RAG containers with TEI](https://github.com/plaggy/rag-containers)
