
Commit 242d6ac

Gasoonjia authored and facebook-github-bot committed
add dynamic export into llm manual (#3202)
Summary: This diff adds dynamic export to the LLM manual, including code and related comments. It also updates other documentation for better understanding. Differential Revision: D56365041
1 parent da77105 commit 242d6ac

File tree

1 file changed (+84, -17 lines)


docs/source/llm/getting-started.md

Lines changed: 84 additions & 17 deletions
@@ -1,5 +1,18 @@
 # Getting Started with LLMs via ExecuTorch
 
+Welcome to the LLM Manual! This manual is designed to provide a practical example of leveraging
+ExecuTorch to onboard your own Large Language Models (LLMs). Our primary goal is to offer
+a clear and concise guideline on how to integrate our system with your own LLMs.
+
+Please note that this project is intended as a demonstration and not as a fully functional
+example with optimal performance. As such, certain components such as the sampler, tokenizer,
+and others are provided in their bare-minimum versions solely for demonstration purposes.
+Consequently, the results produced by the model may vary and might not always be optimal.
+
+We encourage users to use this project as a starting point and adapt it to their specific needs,
+which includes creating your own versions of the tokenizer, sampler, acceleration backends, and
+other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.
+
 ### Table Of Contents
 
 
@@ -141,13 +154,23 @@ model = GPT.from_pretrained('gpt2')
 
 # Create example inputs. This is used in the export process to provide
 # hints on the expected shape of the model input.
-example_inputs = (torch.randint(0, 100, (1, 8), dtype=torch.long), )
+example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )
+
+# Set up the dynamic shape configuration. This allows the sizes of the input tensors
+# at runtime to differ from the sizes of the tensors in `example_inputs`, as long as
+# they follow the rules the dynamic shape configuration specifies. Here we allow the
+# 1st dimension of the 0th model input (the token count) to range up to model.config.block_size.
+# See the ExecuTorch concepts page for details on dynamic shapes and how to customize them:
+# https://pytorch.org/executorch/stable/concepts.html#dynamic-shapes
+dynamic_shape = (
+    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
+)
 
 # Trace the model, converting it to a portable intermediate representation.
 # The torch.no_grad() call tells PyTorch to exclude training-specific logic.
 with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-    m = capture_pre_autograd_graph(model, example_inputs)
-    traced_model = export(m, example_inputs)
+    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
 
 # Convert the model into a runnable ExecuTorch program.
 edge_config = EdgeCompileConfig(_check_ir_validity=False)
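
Read together, the lines added in this hunk amount to a single dynamic-shape export flow. The following is a consolidated sketch, not part of the diff itself: the import paths, the nanoGPT `GPT` class, and the trailing `to_edge`/`to_executorch`/serialization steps are assumptions carried over from the unchanged portions of the manual.

```python
# Consolidated sketch of the dynamic-shape export flow described above.
# Import paths and the nanoGPT `GPT` class are assumed from the rest of the manual.
import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torch._export import capture_pre_autograd_graph
from torch.export import export
from torch.nn.attention import SDPBackend

from model import GPT  # nanoGPT model definition

model = GPT.from_pretrained("gpt2")

# Example input sized to the maximum context length.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long),)

# Allow dimension 1 of input 0 (the token sequence length) to vary at runtime,
# up to model.config.block_size.
dynamic_shape = ({1: torch.export.Dim("token_dim", max=model.config.block_size)},)

# Trace and export with the dynamic shape specification.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Lower to an ExecuTorch program and serialize it to nanogpt.pte.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()

with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)
```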
@@ -204,11 +227,14 @@ output token by token. Each generated token is passed as input for the next run.
 ```cpp
 // main.cpp
 
+#define ENDOFTEXT 50256
+
 std::string generate(
     Module& llm_model,
     std::string& prompt,
     BasicTokenizer& tokenizer,
     BasicSampler& sampler,
+    size_t max_input_length,
     size_t max_output_length) {
 
   // Convert the input text into a list of integers (tokens) that represents
@@ -237,14 +263,23 @@ std::string generate(
 
     // Sample the next token from the logits.
     int64_t next_token = sampler.sample(logits);
+
+    // Break if we reached the end of the text.
+    if (next_token == ENDOFTEXT) {
+      break;
+    }
+
+    // Add the next token to the output.
     output_tokens.push_back(next_token);
 
     std::cout << tokenizer.decode({ next_token });
     std::cout.flush();
 
     // Update next input.
-    input_tokens.erase(input_tokens.begin());
     input_tokens.push_back(next_token);
+    if (input_tokens.size() > max_input_length) {
+      input_tokens.erase(input_tokens.begin());
+    }
   }
 
   std::cout << std::endl;
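
In case the fragmented diff obscures it, the loop added above does two new things: it stops as soon as the model emits GPT-2's end-of-text token, and it keeps the model input within `max_input_length` by dropping the oldest token once the window is full. A minimal Python sketch of that logic follows; `model_forward`, `tokenizer`, and `sampler` are hypothetical stand-ins for the manual's C++ `Module`, `BasicTokenizer`, and `BasicSampler`.

```python
# Sketch of the generation loop added above. `model_forward`, `tokenizer`, and
# `sampler` are hypothetical stand-ins for the manual's C++ components.
ENDOFTEXT = 50256  # GPT-2's end-of-text token id


def generate(model_forward, tokenizer, sampler, prompt: str,
             max_input_length: int, max_output_length: int) -> str:
    input_tokens = tokenizer.encode(prompt)
    output_tokens = []

    for _ in range(max_output_length):
        logits = model_forward(input_tokens)   # run the exported model
        next_token = sampler.sample(logits)    # pick the next token

        # Stop as soon as the model emits the end-of-text token.
        if next_token == ENDOFTEXT:
            break

        output_tokens.append(next_token)

        # Feed the new token back in; once the window is full, drop the oldest
        # token so the input never exceeds max_input_length.
        input_tokens.append(next_token)
        if len(input_tokens) > max_input_length:
            input_tokens.pop(0)

    return tokenizer.decode(output_tokens)
```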
@@ -278,7 +313,9 @@ penalties for repeated tokens, and biases to prioritize or de-prioritize specifi
 
 int main() {
   // Set up the prompt. This provides the seed text for the model to elaborate.
-  std::string prompt = "Once upon a time, there was a";
+  std::cout << "Prompt: ";
+  std::string prompt;
+  std::getline(std::cin, prompt);
 
   // The tokenizer is used to convert between tokens (used by the model) and
   // human-readable strings.
@@ -290,19 +327,19 @@ int main() {
   // Load the exported nanoGPT program, which was generated via the previous steps.
   Module model("nanogpt.pte", torch::executor::Module::MlockConfig::UseMlockIgnoreErrors);
 
+  const auto max_input_tokens = 1024;
   const auto max_output_tokens = 30;
   std::cout << prompt;
-  generate(model, prompt, tokenizer, sampler, max_output_tokens);
+  generate(model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
 }
 ```
 
 Finally, download the following files into the same directory as main.h:
 
-TODO: This is a placeholder.
 ```
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/managed_tensor.h
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_tokenizer.h
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_sampler.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/basic_sampler.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/basic_tokenizer.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/managed_tensor.h
 ```
 
 To learn more, see [Running an ExecuTorch Model in C++](https://pytorch.org/executorch/main/running-a-model-cpp-tutorial.html)
@@ -363,10 +400,19 @@ cmake --build cmake-out -j10
 ./cmake-out/nanogpt_runner
 ```
 
-You should see something like the following:
+You should see a message like the following asking you to enter the initial prompt:
+
+```
+Prompt:
+```
+
+Here we use "Hello world!" as the example prompt. After you enter your prompt and press enter, you should see something like the following:
 
 ```
-Once upon a time, there was a man who was a member of the military...
+Prompt: Hello world!
+Hello world!
+
+I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
 ```
 
 At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for
@@ -423,14 +469,24 @@ model = GPT.from_pretrained('gpt2')
 # Create example inputs. This is used in the export process to provide
 # hints on the expected shape of the model input.
 example_inputs = (
-    torch.randint(0, 100, (1, 8), dtype=torch.long),
+    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
 )
 
+# Set up the dynamic shape configuration. This allows the sizes of the input tensors
+# at runtime to differ from the sizes of the tensors in `example_inputs`, as long as
+# they follow the rules the dynamic shape configuration specifies. Here we allow the
+# 1st dimension of the 0th model input (the token count) to range up to model.config.block_size - 1.
+# See the ExecuTorch concepts page for details on dynamic shapes and how to customize them:
+# https://pytorch.org/executorch/stable/concepts.html#dynamic-shapes
+dynamic_shape = (
+    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
+)
+
 # Trace the model, converting it to a portable intermediate representation.
 # The torch.no_grad() call tells PyTorch to exclude training-specific logic.
 with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-    m = capture_pre_autograd_graph(model, example_inputs)
-    traced_model = export(m, example_inputs)
+    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
 
 # Convert the model into a runnable ExecuTorch program.
 # To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
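
The comment above points at the XNNPACK-specific lowering that the manual performs next. As a rough sketch only (the partitioner and config import paths below are assumptions based on the ExecuTorch XNNPACK backend, not lines from this diff), the traced model is delegated roughly as follows:

```python
# Rough sketch of XNNPACK delegation; the import paths are assumptions based on
# the ExecuTorch XNNPACK backend and may differ from the manual's exact snippet.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import to_edge

# Use the XNNPACK-specific edge compile config instead of the generic one.
edge_config = get_xnnpack_edge_compile_config()
edge_manager = to_edge(traced_model, compile_config=edge_config)

# Delegate the subgraphs that XNNPACK supports to the XNNPACK backend.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())
et_program = edge_manager.to_executorch()

with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)
```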
@@ -512,12 +568,23 @@ cmake --build cmake-out -j10
 ./cmake-out/nanogpt_runner
 ```
 
-You should see something like the following:
+
+You should see a message like the following asking you to enter the initial prompt:
+
+```
+Prompt:
+```
+
+Here we use "Hello world!" as the example prompt. After you enter your prompt and press enter, you should see something like the following:
 
 ```
-Once upon a time, there was a man who was a member of the military...
+Prompt: Hello world!
+Hello world!
+
+I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
 ```
 
+Generation should now be noticeably faster than it was without delegation.
 
 For more information regarding backend delegateion, see the ExecuTorch guides
 for the

0 commit comments
