-
Notifications
You must be signed in to change notification settings - Fork 419
feat: add embedding support to sgl backend #1481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Warning Rate limit exceeded@ishandhanani has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 21 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
WalkthroughA new embedding service infrastructure is introduced, including a worker class to serve embedding models, a frontend integration, protocol definitions for embedding requests, a configuration YAML, and a graph linkage file. The changes collectively enable serving embedding models via SGLang and Dynamo, with OpenAI-compatible request and response handling. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Frontend
participant SGLangEmbeddingWorker
participant SGLangEngine
Client->>Frontend: POST /generate (EmbeddingRequest)
Frontend->>SGLangEmbeddingWorker: generate(request)
SGLangEmbeddingWorker->>SGLangEngine: async_encode(input)
SGLangEngine-->>SGLangEmbeddingWorker: Embedding vectors
SGLangEmbeddingWorker-->>Frontend: OpenAI-compatible response
Frontend-->>Client: Embedding result
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (6)
examples/sglang/utils/protocol.py (1)
73-78
: Consider honouringencoding_format
anddimensions
The spec adds
encoding_format
(float | base64) anddimensions
, but no component in this PR uses them.
At minimum, document that they are ignored or raiseNotImplementedError
when the caller sets them; silent ignore leads to subtle bugs.🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 73-73: Too few public methods (0/2)
(R0903)
examples/sglang/components/frontend.py (1)
21-22
: Unused import can be dropped
SGLangEmbeddingWorker
is already referenced throughdepends()
below; the explicit import is unnecessary unless you access the class elsewhere in this module.-from components.embedding_worker import SGLangEmbeddingWorker +from components.embedding_worker import SGLangEmbeddingWorker # noqa: F401 – keep if PyDynamo’s import-time registration relies on itIf registration does not rely on the import, consider removing it.
examples/sglang/graphs/embedding.py (1)
17-20
: File is fine but missing__all__
guardVery small nit: add
__all__ = []
(or similar) to avoid accidental re-exports when this module is imported elsewhere.examples/sglang/components/embedding_worker.py (3)
20-24
: Remove unused imports
asyncio
,random
,socket
,Any
, anddepends
are unused – they trigger Ruff F401 and increase cold-start time.-import asyncio -import random -import socket -from typing import Any - -from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service +from dynamo.sdk import async_on_start, dynamo_context, endpoint, service🧰 Tools
🪛 Ruff (0.11.9)
20-20:
asyncio
imported but unusedRemove unused import:
asyncio
(F401)
22-22:
random
imported but unusedRemove unused import:
random
(F401)
23-23:
socket
imported but unusedRemove unused import:
socket
(F401)
24-24:
typing.Any
imported but unusedRemove unused import:
typing.Any
(F401)
82-97
: Robustness of_transform_response
ret_item["meta_info"]["prompt_tokens"]
will raiseKeyError
if the engine omits stats. Use.get()
with fallback or guard explicitly.- prompt_tokens += ret_item["meta_info"]["prompt_tokens"] + prompt_tokens += ret_item.get("meta_info", {}).get("prompt_tokens", 0)
82-97
: Consider returningencoding_format
-aware embeddingsIf
encoding_format == "base64"
the embedding vectors must be base64-encoded per OpenAI spec. Handle here after converting format.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/sglang/components/embedding_worker.py
(1 hunks)examples/sglang/components/frontend.py
(2 hunks)examples/sglang/configs/embedding.yaml
(1 hunks)examples/sglang/graphs/embedding.py
(1 hunks)examples/sglang/utils/protocol.py
(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/graphs/embedding.py (2)
examples/sglang/components/frontend.py (1)
Frontend
(59-89)examples/sglang/components/embedding_worker.py (1)
SGLangEmbeddingWorker
(43-106)
🪛 YAMLlint (1.37.1)
examples/sglang/configs/embedding.yaml
[error] 11-11: duplication of key "is-embedding" in mapping
(key-duplicates)
🪛 Ruff (0.11.9)
examples/sglang/components/embedding_worker.py
20-20: asyncio
imported but unused
Remove unused import: asyncio
(F401)
22-22: random
imported but unused
Remove unused import: random
(F401)
23-23: socket
imported but unused
Remove unused import: socket
(F401)
24-24: typing.Any
imported but unused
Remove unused import: typing.Any
(F401)
31-31: dynamo.sdk.depends
imported but unused
Remove unused import: dynamo.sdk.depends
(F401)
🪛 Pylint (3.3.7)
examples/sglang/components/embedding_worker.py
[error] 56-56: Class 'SGLangEmbeddingWorker' has no 'dynamo_address' member
(E1101)
[refactor] 70-70: Unnecessary use of a comprehension, use list(request.input) instead.
(R1721)
examples/sglang/utils/protocol.py
[refactor] 73-73: Too few public methods (0/2)
(R0903)
🔇 Additional comments (2)
examples/sglang/components/frontend.py (1)
60-62
: Dependency injected but never referenced
embedding_worker = depends(SGLangEmbeddingWorker)
sets up the wiring, yet theFrontend
class never uses the dependency.
Verify that:
- Dynamo still spins the worker even if the attribute is unused (it usually does, but double-check).
- You expose an HTTP route or RPC that actually forwards embedding requests to this worker, otherwise the frontend cannot serve embeddings.
examples/sglang/components/embedding_worker.py (1)
56-58
: Static analysis flags missingdynamo_address
SGLangEmbeddingWorker.dynamo_address()
is called but not defined in this class.
If the method is injected by Dynamo at runtime the call is safe; otherwise, implement it or import the mix-in that provides it.🧰 Tools
🪛 Pylint (3.3.7)
[error] 56-56: Class 'SGLangEmbeddingWorker' has no 'dynamo_address' member
(E1101)
WalkthroughA new embedding service infrastructure is added, introducing an embedding worker component, its configuration, and protocol definitions. The worker is integrated with the frontend and registered in the service graph. A new YAML config and protocol types for embedding requests are provided, supporting both string and tokenized inputs, and aligning the response format with OpenAI's API. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Frontend
participant SGLangEmbeddingWorker
participant SGLangEngine
participant DynamoRuntime
Client->>Frontend: POST /generate (EmbeddingRequest)
Frontend->>SGLangEmbeddingWorker: generate(request)
SGLangEmbeddingWorker->>SGLangEngine: encode(input)
SGLangEngine-->>SGLangEmbeddingWorker: embedding result
SGLangEmbeddingWorker->>SGLangEmbeddingWorker: _transform_response(result)
SGLangEmbeddingWorker-->>Frontend: OpenAI-compatible embedding response
Frontend-->>Client: Embedding response
Note over SGLangEmbeddingWorker, DynamoRuntime: On startup, SGLangEmbeddingWorker registers the model with DynamoRuntime
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
examples/sglang/components/frontend.py (2)
21-22
: Prefer explicit-relative import to avoid package-resolution surprisesInside
examples/sglang/components/
, importing a sibling module via the absolute namecomponents.embedding_worker
only works ifexamples/sglang
is onPYTHONPATH
at runtime.
Using a relative import (from .embedding_worker import SGLangEmbeddingWorker
) is safer and keeps the component self-contained.-from components.embedding_worker import SGLangEmbeddingWorker +from .embedding_worker import SGLangEmbeddingWorker
60-62
: Unused dependency: wire it or drop it
embedding_worker
is declared as a dependency but never referenced insideFrontend
.
If the intent is only service discovery, add a short comment so future readers know it is intentional; otherwise, remove it to keep the dependency graph minimal.examples/sglang/components/embedding_worker.py (2)
20-25
: Remove unused imports to keep module clean
asyncio
,random
,socket
,Any
, anddepends
are never referenced.-import asyncio -import random -import socket -from typing import Any -... -from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service +from dynamo.sdk import async_on_start, dynamo_context, endpoint, service🧰 Tools
🪛 Ruff (0.11.9)
20-20:
asyncio
imported but unusedRemove unused import:
asyncio
(F401)
22-22:
random
imported but unusedRemove unused import:
random
(F401)
23-23:
socket
imported but unusedRemove unused import:
socket
(F401)
24-24:
typing.Any
imported but unusedRemove unused import:
typing.Any
(F401)
82-106
: Consider returning a plain dict instead of yielding once
@endpoint()
can accept either a return value or an async generator. If only one response is produced,return response
avoids the extra generator wrapper and is easier for clients.- response = self._transform_response(g, request.model) - yield response + response = self._transform_response(g, request.model) + return responseVerify that upstream expectations (e.g., streaming vs. single reply) align with this.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/sglang/components/embedding_worker.py
(1 hunks)examples/sglang/components/frontend.py
(2 hunks)examples/sglang/configs/embedding.yaml
(1 hunks)examples/sglang/graphs/embedding.py
(1 hunks)examples/sglang/utils/protocol.py
(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/graphs/embedding.py (2)
examples/sglang/components/frontend.py (1)
Frontend
(59-89)examples/sglang/components/embedding_worker.py (1)
SGLangEmbeddingWorker
(43-106)
🪛 YAMLlint (1.37.1)
examples/sglang/configs/embedding.yaml
[error] 11-11: duplication of key "is-embedding" in mapping
(key-duplicates)
🪛 Pylint (3.3.7)
examples/sglang/utils/protocol.py
[refactor] 73-73: Too few public methods (0/2)
(R0903)
examples/sglang/components/embedding_worker.py
[error] 56-56: Class 'SGLangEmbeddingWorker' has no 'dynamo_address' member
(E1101)
[refactor] 70-70: Unnecessary use of a comprehension, use list(request.input) instead.
(R1721)
🪛 Ruff (0.11.9)
examples/sglang/components/embedding_worker.py
20-20: asyncio
imported but unused
Remove unused import: asyncio
(F401)
22-22: random
imported but unused
Remove unused import: random
(F401)
23-23: socket
imported but unused
Remove unused import: socket
(F401)
24-24: typing.Any
imported but unused
Remove unused import: typing.Any
(F401)
31-31: dynamo.sdk.depends
imported but unused
Remove unused import: dynamo.sdk.depends
(F401)
🔇 Additional comments (2)
examples/sglang/graphs/embedding.py (1)
17-21
: Graph looks correct – link established
The single-line link cleanly connectsFrontend
toSGLangEmbeddingWorker
; no issues spotted.examples/sglang/utils/protocol.py (1)
73-79
: Minor: zero-method Pydantic model warning can be ignoredPylint flags too-few-public-methods on
EmbeddingRequest
; that’s normal for schema classes. No change needed.🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 73-73: Too few public methods (0/2)
(R0903)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/sglang/utils/protocol.py (1)
66-66
:⚠️ Potential issue
EmbeddingInput
still exposes numeric variants that are not supported downstream
The worker implementation (SGLangEmbeddingWorker.generate
) still only distinguishesstr
vslist
and forwards the items as-is toasync_encode
, which expects text. KeepingList[int]
/List[List[int]]
here will crash at runtime if a client really sends token-ids.-EmbeddingInput = Union[str, List[str], List[int], List[List[int]]] +# NOTE: keep numeric inputs disabled until the worker properly converts them +EmbeddingInput = Union[str, List[str]]If you want to keep the numeric forms, extend the worker to detect integers and
tokenizer.decode()
them before callingasync_encode
.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/sglang/README.md
(1 hunks)examples/sglang/utils/protocol.py
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- examples/sglang/README.md
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/sglang/utils/protocol.py
[refactor] 71-71: Too few public methods (0/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/sglang/components/embedding_worker.py (1)
60-82
:⚠️ Potential issue
encoding_format
still ignored – honour spec or reject unsupported valuesThis block never inspects
request.encoding_format
; callers asking for"base64"
will silently receive raw float vectors, violating the OpenAI contract raised in the previous review.@@ - g = await self.engine.async_encode( - prompt=prompt, - ) - - response = self._transform_response(g, request.model) + g = await self.engine.async_encode(prompt=prompt) + + # honour encoding_format per OpenAI spec + if request.encoding_format is None or request.encoding_format == "float": + transformed = self._transform_response(g, request.model) + elif request.encoding_format == "base64": + import base64, struct + for item in g: + item["embedding"] = base64.b64encode( + struct.pack(f"{len(item['embedding'])}f", *item["embedding"]) + ).decode() + transformed = self._transform_response(g, request.model) + else: + raise ValueError( + f"Unsupported encoding_format '{request.encoding_format}'. " + "Use 'float' (default) or 'base64'." + ) + + response = transformed yield response[follows previous feedback]
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 72-72: Unnecessary use of a comprehension, use list(request.input) instead.
(R1721)
🧹 Nitpick comments (1)
examples/sglang/components/embedding_worker.py (1)
66-74
: Minor refactor: avoid comprehension & validate nested inputsA simple
list(request.input)
avoids the unnecessary comprehension and pylint warning (R1721
).
Additionally, nested lists or non-string items are quietly forwarded to SGLang, which will likely fail deeper in the stack. Consider a stricter validation.- if len(request.input) == 1: - prompt = request.input[0] - else: - prompt = [i for i in request.input] + if len(request.input) == 1: + prompt = request.input[0] + else: + if not all(isinstance(i, str) for i in request.input): + raise ValueError("Embedding input list must contain only strings") + prompt = list(request.input)🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 72-72: Unnecessary use of a comprehension, use list(request.input) instead.
(R1721)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/sglang/components/embedding_worker.py
(1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/sglang/components/embedding_worker.py
[error] 51-51: Class 'SGLangEmbeddingWorker' has no 'dynamo_address' member
(E1101)
[refactor] 72-72: Unnecessary use of a comprehension, use list(request.input) instead.
(R1721)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
examples/sglang/components/embedding_worker.py (2)
48-58
: Possible runtimeAttributeError
– double-checkdynamo_address
helper
SGLangEmbeddingWorker.dynamo_address()
is invoked as a class attribute, yet no such member is defined in this class or its bases. The decorator may inject it at runtime, but static analysis flags it (E1101
) and IDEs will treat it as an error.comp_ns, comp_name = self.dynamo_address() # safer if decorator binds to the *instance*If the helper is not injected, register_llm will raise at start-up.
Please verify the mix-in behaviour or switch to the instance form as shown.🧰 Tools
🪛 Pylint (3.3.7)
[error] 51-51: Class 'SGLangEmbeddingWorker' has no 'dynamo_address' member
(E1101)
80-82
: Confirm whetheryield
is intentionalOpenAI’s embedding endpoint is non-streaming; returning a plain dict would suffice and avoids an extra async generator layer. If Dynamo expects streaming, keep as is; otherwise
return response
is simpler.
if len(request.input) == 1: | ||
prompt = request.input[0] | ||
else: | ||
prompt = [i for i in request.input] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need list comprehension? we've already asserted isinstance(request.input, list)
should we concat the strings or take first string in the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm.
There is a redundant list comprehension logic - we need to check/test this code path
@@ -57,6 +58,7 @@ class FrontendConfig(BaseModel): | |||
) | |||
class Frontend: | |||
worker = depends(SGLangWorker) | |||
embedding_worker = depends(SGLangEmbeddingWorker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need embedding_worker for all deployment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's required to be here for things to work
Summary by CodeRabbit
closes: DYN-532