README.md (10 additions, 11 deletions)
```diff
@@ -182,7 +182,7 @@ python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy an
 [skip default]: end

 ### Server
-This mode exposes a REST API for interacting with a model.
+This mode exposes a REST API for interacting with a model.
 The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

 To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.
```
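For reference, exercising the two-terminal flow described in this hunk might look like the sketch below. This is illustrative, not part of the diff: the server command matches the one used elsewhere in the torchchat README, while the port (5000) and the exact request body are assumptions based on the OpenAI chat completions schema the text cites.

```bash
# Terminal 1: host the server
python3 torchchat.py server llama3.1

# Terminal 2: send an OpenAI-style chat completion request
# (port 5000 is an assumption; check the server's startup output)
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "max_tokens": 200,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
```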
```diff
@@ -255,13 +255,14 @@ Use the "Max Response Tokens" slider to limit the maximum number of tokens gener
 ## Desktop/Server Execution

 ### AOTI (AOT Inductor)
-[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference. The process creates a [DSO](https://en.wikipedia.org/wiki/Shared_library) model (represented by a file with extension `.so`)
+[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference. The process creates a zipped PT2 file containing all the artifacts generated by AOTInductor, and a [.so](https://en.wikipedia.org/wiki/Shared_library) file with the runnable contents
 that is then loaded for inference. This can be done with both Python and C++ environments.

 The following example exports and executes the Llama3.1 8B Instruct
 model. The first command compiles and performs the actual export.
```
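To make the revised wording concrete, the export-then-run sequence the text introduces might look like the sketch below. The flag names (`--output-aoti-package-path`, `--aoti-package-path`) and output path are assumptions based on this change's move to zipped PT2 artifacts; consult `python3 torchchat.py export --help` for the authoritative names.

```bash
# Compile and export Llama3.1 8B Instruct to a zipped PT2 artifact.
# Flag name is an assumption tied to this PR's PT2 packaging change.
python3 torchchat.py export llama3.1 --output-aoti-package-path exported_models/llama3_1.pt2

# Load the exported artifact and run inference with it.
python3 torchchat.py generate llama3.1 --aoti-package-path exported_models/llama3_1.pt2 \
  --prompt "Once upon a time"
```

Per the revised text, the PT2 package bundles the generated `.so` together with the other AOTInductor artifacts, so a single file carries everything needed for inference in both Python and C++ environments.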