# Getting Started with LLMs via ExecuTorch

+ Welcome to the LLM Manual! This manual provides a practical example of leveraging
+ ExecuTorch to onboard your own Large Language Models (LLMs). Our primary goal is to offer
+ a clear and concise guide to integrating our system with your own LLMs.
+
+ Please note that this project is intended as a demonstration, not as a fully functional
+ example with optimal performance. As such, certain components, such as the sampler and
+ tokenizer, are provided in bare-minimum versions solely for demonstration purposes.
+ Consequently, the results produced by the model may vary and might not always be optimal.
+
+ We encourage users to treat this project as a starting point and adapt it to their specific
+ needs, including creating their own versions of the tokenizer, sampler, acceleration backends,
+ and other components. We hope this project serves as a useful guide in your journey with LLMs
+ and ExecuTorch.
+

### Table Of Contents

@@ -141,13 +154,23 @@ model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
- example_inputs = (torch.randint(0, 100, (1, 8), dtype=torch.long),)
+ example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long),)
+
+ # Set up a dynamic shape configuration, so that the sizes of the input
+ # tensors at runtime do not need to match the sizes of the tensors in
+ # `example_inputs`, but instead only have to follow the rules the dynamic
+ # shape configuration specifies.
+ # Here we set the range of the 0th model input's 1st dimension to
+ # [0, model.config.block_size].
+ # For details on dynamic shapes and how to customize them, see
+ # [ExecuTorch Concepts](https://pytorch.org/executorch/stable/concepts.html#dynamic-shapes).
+ dynamic_shape = (
+     {1: torch.export.Dim("token_dim", max=model.config.block_size)},
+ )

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-     m = capture_pre_autograd_graph(model, example_inputs)
-     traced_model = export(m, example_inputs)
+     m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+     traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
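
To see what the `dynamic_shapes` specification above buys you, the short standalone sketch below may help. It is not part of the tutorial's files: the toy `TinyEmbedding` module and the 1024-token cap are invented for illustration, and it assumes a PyTorch 2.x `torch.export` API. The point is simply that a dimension declared with `torch.export.Dim` is treated as symbolic, so the exported graph accepts inputs whose length differs from the example input.

```python
import torch

# Toy stand-in for nanoGPT (hypothetical, for illustration only).
class TinyEmbedding(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(100, 16)

    def forward(self, tokens):
        return self.emb(tokens)

model = TinyEmbedding()
example_inputs = (torch.randint(0, 100, (1, 8), dtype=torch.long),)

# Declare dimension 1 of the first input as dynamic, up to 1024 tokens.
dynamic_shape = ({1: torch.export.Dim("token_dim", max=1024)},)
exported = torch.export.export(model, example_inputs, dynamic_shapes=dynamic_shape)

# Sequence lengths other than 8 still work, because the token dimension
# is symbolic in the exported graph.
print(exported.module()(torch.randint(0, 100, (1, 3), dtype=torch.long)).shape)
print(exported.module()(torch.randint(0, 100, (1, 512), dtype=torch.long)).shape)
```
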
@@ -204,11 +227,14 @@ output token by token. Each generated token is passed as input for the next run.
```cpp
// main.cpp

+ #define ENDOFTEXT 50256
+
std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
+   size_t max_input_length,
    size_t max_output_length) {

  // Convert the input text into a list of integers (tokens) that represents
@@ -237,14 +263,23 @@ std::string generate(
    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);
+
+   // Break if we reached the end of the text.
+   if (next_token == ENDOFTEXT) {
+     break;
+   }
+
+   // Add the next token to the output.
    output_tokens.push_back(next_token);

    std::cout << tokenizer.decode({ next_token });
    std::cout.flush();

    // Update next input.
-   input_tokens.erase(input_tokens.begin());
    input_tokens.push_back(next_token);
+   if (input_tokens.size() > max_input_length) {
+     input_tokens.erase(input_tokens.begin());
+   }
  }

  std::cout << std::endl;
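
Because the hunks above only show the changed parts of `generate()`, here is the complete loop condensed into a Python sketch. The `model_forward`, `tokenizer`, and `sampler` arguments are hypothetical stand-ins for the tutorial's `Module`, `BasicTokenizer`, and `BasicSampler`; the control flow mirrors the C++ code: run one forward pass per step, sample the next token, stop on end-of-text, and keep the model input within `max_input_length` tokens.

```python
EOS_TOKEN = 50256  # GPT-2's end-of-text token id (ENDOFTEXT above).

def generate(model_forward, tokenizer, sampler, prompt,
             max_input_length, max_output_length):
    # Encode the prompt into token ids.
    input_tokens = tokenizer.encode(prompt)
    output_tokens = []

    for _ in range(max_output_length):
        # One forward pass produces logits for the next token.
        logits = model_forward(input_tokens)
        next_token = sampler.sample(logits)

        # Stop once the model emits end-of-text.
        if next_token == EOS_TOKEN:
            break
        output_tokens.append(next_token)

        # Feed the new token back in, keeping at most max_input_length
        # tokens of context (a sliding window).
        input_tokens.append(next_token)
        if len(input_tokens) > max_input_length:
            input_tokens.pop(0)

    return tokenizer.decode(output_tokens)
```
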
@@ -278,7 +313,9 @@ penalties for repeated tokens, and biases to prioritize or de-prioritize specifi
int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
- std::string prompt = "Once upon a time, there was a";
+ std::cout << "Prompt: ";
+ std::string prompt;
+ std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
@@ -290,19 +327,19 @@ int main() {
  // Load the exported nanoGPT program, which was generated via the previous steps.
  Module model("nanogpt.pte", torch::executor::Module::MlockConfig::UseMlockIgnoreErrors);

+ const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
- generate(model, prompt, tokenizer, sampler, max_output_tokens);
+ generate(model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}
```

Finally, download the following files into the same directory as main.cpp:

- TODO: This is a placeholder.
```
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/managed_tensor.h
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_tokenizer.h
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_sampler.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/basic_sampler.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/basic_tokenizer.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/release/stable/examples/llm_manual/managed_tensor.h
```

To learn more, see [Running an ExecuTorch Model in C++](https://pytorch.org/executorch/main/running-a-model-cpp-tutorial.html)
@@ -363,10 +400,19 @@ cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

- You should see something like the following:
+ You should see a message like the following, asking you to enter the initial prompt:
+
+ ```
+ Prompt:
+ ```
+
+ Here we use "Hello world!" as the example prompt. After you enter your prompt and press Enter:

```
- Once upon a time, there was a man who was a member of the military...
+ Prompt: Hello world!
+ Hello world!
+
+ I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for
@@ -423,14 +469,24 @@ model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
-     torch.randint(0, 100, (1, 8), dtype=torch.long),
+     torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)

+ # Set up a dynamic shape configuration, so that the sizes of the input
+ # tensors at runtime do not need to match the sizes of the tensors in
+ # `example_inputs`, but instead only have to follow the rules the dynamic
+ # shape configuration specifies.
+ # Here we set the range of the 0th model input's 1st dimension to
+ # [0, model.config.block_size - 1].
+ # For details on dynamic shapes and how to customize them, see
+ # [ExecuTorch Concepts](https://pytorch.org/executorch/stable/concepts.html#dynamic-shapes).
+ dynamic_shape = (
+     {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
+ )
+

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-     m = capture_pre_autograd_graph(model, example_inputs)
-     traced_model = export(m, example_inputs)
+     m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+     traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to the XNNPACK backend, `traced_model` needs an XNNPACK-specific edge compile config.
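
The code that follows that comment is cut off by the next hunk. As a rough sketch of what the XNNPACK lowering step generally looks like: the `xnnpack_edge_config` name below is a placeholder for the XNNPACK-specific edge compile config mentioned in the comment, and the import paths are assumptions based on the ExecuTorch examples, so check them against your installed release.

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

# Convert the traced model to the Edge dialect using the XNNPACK-specific
# config (construction of `xnnpack_edge_config` is not shown in this hunk).
edge_manager = to_edge(traced_model, compile_config=xnnpack_edge_config)

# Delegate the subgraphs that XNNPACK can accelerate to the XNNPACK backend.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())

# Produce the final ExecuTorch program and write it to disk.
et_program = edge_manager.to_executorch()
with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```

At runtime, the delegated subgraphs execute inside XNNPACK, which is what produces the speedup described at the end of this section.
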
@@ -512,12 +568,23 @@ cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

- You should see something like the following:
+
+ You should see a message like the following, asking you to enter the initial prompt:
+
+ ```
+ Prompt:
+ ```
+
+ Here we use "Hello world!" as the example prompt. After you enter your prompt and press Enter:

```
- Once upon a time, there was a man who was a member of the military...
+ Prompt: Hello world!
+ Hello world!
+
+ I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

+ Generation should now be noticeably faster than it was without delegation.

For more information regarding backend delegation, see the ExecuTorch guides
for the