Commit 654bb03

mikekgfb and mreso authored
Create distributed.md (#1438)
* Create distributed.md
  Initial documentation for use of distributed inference w/ torchchat. @mreso please review and update as appropriate.
* Add support for extracting distributed inference tests in run-docs
  Add support for extracting distributed inference tests in run-docs
* Update distributed.md
* Update distributed.md
* Update distributed.md
* Update docs/distributed.md
  Co-authored-by: Matthias Reso <[email protected]>
* Update docs/distributed.md
  Co-authored-by: Matthias Reso <[email protected]>
* Update distributed.md
  Uncommenting section about generate subcommand w/ distributed inference after review by @mreso
  Also, Added HF login to make this fully self-contained
* Update distributed.md
  Wording
* Update distributed.md
  Wording and formatting
* Update build_native.sh
  Update to C++11 ABI for AOTI, similar to ET

---------

Co-authored-by: Matthias Reso <[email protected]>
1 parent bd7354e commit 654bb03

2 files changed: +142 -0 lines changed


.ci/scripts/run-docs

Lines changed: 17 additions & 0 deletions
@@ -125,3 +125,20 @@ if [ "$1" == "native" ]; then
   bash -x ./run-native.sh
   echo "::endgroup::"
 fi
+
+if [ "$1" == "distributed" ]; then
+
+  echo "::group::Create script to run distributed"
+  python3 torchchat/utils/scripts/updown.py --file docs/distributed.md > ./run-distributed.sh
+  # for good measure, if something happened to updown processor,
+  # and it did not error out, fail with an exit 1
+  echo "exit 1" >> ./run-distributed.sh
+  echo "::endgroup::"
+
+  echo "::group::Run distributed"
+  echo "*******************************************"
+  cat ./run-distributed.sh
+  echo "*******************************************"
+  bash -x ./run-distributed.sh
+  echo "::endgroup::"
+fi

docs/distributed.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# Distributed Inference with torchchat

torchchat supports distributed inference for large language models (LLMs) across multiple GPUs.
At present, torchchat supports distributed inference using Python only.

## Installation
The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

> [!TIP]
> torchchat uses the latest changes from various PyTorch projects, so it's highly recommended that you use a venv (by using the commands below) or conda.

[skip default]: begin
```bash
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
```
[skip default]: end

[shell default]: ./install/install_requirements.sh

## Login to HF for Downloading Weights
Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account. Create a Hugging Face user access token with the `write` role, as described in the Hugging Face documentation.

Log in to Hugging Face:

[prefix default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}"

```
huggingface-cli login
```

## Enabling Distributed torchchat Inference

To enable distributed inference, use the option `--distributed`. In addition, `--tp <num>` and `--pp <num>`
let you set the degree of each type of parallelism, where tp refers to tensor parallelism and pp to pipeline parallelism.
The examples in this document use `--tp 2 --pp 2` on 4 GPUs.

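The relationship between the two flags and the GPU count can be sketched in a few lines of Python. This is only an illustration: it assumes the common convention that one GPU is needed per (tp, pp) combination, so the total is `tp * pp`, which matches the `--tp 2 --pp 2` on 4 GPUs examples in this document but is not taken from torchchat's source. `required_gpus` and `build_command` are hypothetical helpers, not part of torchchat.

```python
import shlex


def required_gpus(tp: int, pp: int) -> int:
    """Return the GPU count, ASSUMING one rank per (tp, pp) pair (tp * pp)."""
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must be positive integers")
    return tp * pp


def build_command(subcommand: str, model: str, tp: int, pp: int, extra=()) -> list[str]:
    """Hypothetical helper: assemble the argv used in this document's examples."""
    return [
        "python3", "torchchat.py", subcommand, model,
        "--distributed", "--tp", str(tp), "--pp", str(pp),
        *extra,
    ]


if __name__ == "__main__":
    cmd = build_command(
        "generate", "llama3.1", tp=2, pp=2,
        extra=["--prompt", "write me a story about a boy and his bear"],
    )
    print(f"GPUs assumed necessary: {required_gpus(2, 2)}")
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the prompt as one argument, which avoids quoting mistakes if the command is later passed to `subprocess.run`.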
## Generate Output with Distributed torchchat Inference

To generate output using distributed inference with 4 GPUs, you can use:
```
python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 2 --prompt "write me a story about a boy and his bear"
```


## Chat with Distributed torchchat Inference

This mode lets you chat interactively with an LLM using distributed inference. The following example uses 4 GPUs:

[skip default]: begin
```bash
python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2
```
[skip default]: end


## A Server with Distributed torchchat Inference

This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.

In one terminal, start the server to run with 4 GPUs:

[skip default]: begin

```bash
python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2
```
[skip default]: end

<!--
[shell default]: python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 & server_pid=$! ; sleep 180 # wait for server to be ready to accept requests
-->

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See api/api.py for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking GitHub issue](https://github.com/pytorch/torchchat/issues/973).

<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.

**Example Input + Output**

```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
[skip default]: begin
```
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```

[skip default]: end

<!--
[shell default]: kill ${server_pid}
-->

</details>
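For readers who prefer to query the server from Python rather than `curl`, the request above can be approximated with the standard library alone. This is a minimal sketch, not part of torchchat: `build_payload` and `send_chat_request` are hypothetical helper names, the URL and JSON fields are copied from the `curl` example, and the 600-second timeout is an arbitrary assumption chosen because the query may take a few minutes. The response body is returned as raw text because its exact shape depends on which request parameters the server currently implements.

```python
import json
import urllib.request

# Address copied from the curl example; adjust it if your server differs.
SERVER_URL = "http://127.0.0.1:5000/v1/chat/completions"


def build_payload(user_message: str, stream: bool = False, max_tokens: int = 200) -> dict:
    """Build the same JSON body that the curl example sends."""
    payload = {
        "model": "llama3.1",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    if stream:
        # The curl example passes the string "true", not a JSON boolean.
        payload["stream"] = "true"
    return payload


def send_chat_request(payload: dict, url: str = SERVER_URL, timeout: float = 600.0) -> str:
    """POST the payload as JSON and return the raw response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server from the first terminal running, `print(send_chat_request(build_payload("Hello!")))` sends the non-streaming form of the request and prints whatever the server returns.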

[end default]: end
