|
3 | 3 | This folder contains the TF training scripts from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.
|
4 | 4 |
|
5 | 5 | ## Basic usage
|
6 |
| -**execute_tensorflow_training.py** uses SageMaker python sdk to start a training job. It takes the following parameters: |
| 6 | +**execute_tensorflow_training.py train** uses the SageMaker Python SDK to start a training job. |
7 | 7 |
|
8 |
| -- role: SageMaker role used for training |
9 |
| -- region: SageMaker region |
10 |
| -- py-versions: py2 or py3 or "py2, py3" |
11 |
| -- instance-types: A list of SageMaker instance types, for example 'ml.p2.xlarge, ml.c4.xlarge'. Use 'local' for local mode training. |
12 |
| -- checkpoint-path: The S3 location where the model checkpoints and tensorboard events are saved after training |
| 8 | +```bash |
| 9 | +./execute_tensorflow_training.py train --help |
| 10 | +Usage: execute_tensorflow_training.py train [OPTIONS] [SCRIPT_ARGS]... |
| 11 | + |
| 12 | +Options: |
| 13 | + --framework-version [1.11.0|1.12.0] |
| 14 | + [required] |
| 15 | + --device [cpu|gpu] [required] |
| 16 | + --py-versions TEXT |
| 17 | + --training-input-mode [File|Pipe] |
| 18 | + --networking-isolation / --no-networking-isolation |
| 19 | + --wait / --no-wait |
| 20 | + --security-groups TEXT |
| 21 | + --subnets TEXT |
| 22 | + --role TEXT |
| 23 | + --instance-counts INTEGER |
| 24 | + --batch-sizes INTEGER |
| 25 | + --instance-types TEXT |
| 26 | + --help Show this message and exit. |
13 | 27 |
|
14 |
| -Any unknown arguments will be passed to the training script as additional arguments. |
| 28 | +``` |
| 29 | +**execute_tensorflow_training.py generate_reports** generates benchmark reports. |
15 | 30 |
|
16 | 31 | ## Examples:
|
17 | 32 |
|
18 | 33 | ```bash
|
19 |
| -./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type local --num_epochs 1 --wait |
20 |
| - |
21 |
| -./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type ml.c4.xlarge, ml.c5.xlarge --model resnet50 |
| 34 | +#!/usr/bin/env bash |
22 | 35 |
|
| 36 | +./execute_tensorflow_training.py train \ |
| 37 | +--framework-version 1.11.0 \ |
| 38 | +--device gpu \ |
| 39 | +\ |
| 40 | +--instance-types ml.p3.2xlarge \ |
| 41 | +--instance-types ml.p3.8xlarge \ |
| 42 | +--instance-types ml.p3.16xlarge \ |
| 43 | +--instance-types ml.p2.xlarge \ |
| 44 | +--instance-types ml.p2.8xlarge \ |
| 45 | +--instance-types ml.p2.16xlarge \ |
| 46 | +\ |
| 47 | +--instance-counts 1 \ |
| 48 | +\ |
| 49 | +--py-versions py3 \ |
| 50 | +--py-versions py2 \ |
| 51 | +\ |
| 52 | +--subnets subnet-125fb674 \ |
| 53 | +\ |
| 54 | +--security-groups sg-ce5dd1b4 \ |
| 55 | +\ |
| 56 | +--batch-sizes 32 \ |
| 57 | +--batch-sizes 64 \ |
| 58 | +--batch-sizes 128 \ |
| 59 | +--batch-sizes 256 \ |
| 60 | +--batch-sizes 512 \ |
| 61 | +\ |
| 62 | +-- --model resnet32 --num_epochs 10 --data_format NHWC --summary_verbosity 1 --save_summaries_steps 10 --data_name cifar10 |
23 | 63 | ```
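
The example above repeats `--instance-types`, `--py-versions`, and `--batch-sizes` several times. The README does not say how the script combines these values. As a rough sketch, assuming it launches one training job per combination (an assumption, not confirmed by the source), the number of jobs can be estimated as follows. The helper name `job_matrix` is hypothetical and is not part of `execute_tensorflow_training.py`.

```python
from itertools import product

# Values copied from the example invocation above.
instance_types = [
    "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
    "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
]
instance_counts = [1]
py_versions = ["py3", "py2"]
batch_sizes = [32, 64, 128, 256, 512]


def job_matrix(instance_types, instance_counts, py_versions, batch_sizes):
    """Hypothetical helper: list every combination of the repeated options.

    ASSUMPTION: the benchmark script runs one job per combination.
    The README does not confirm this behavior.
    """
    return list(product(instance_types, instance_counts, py_versions, batch_sizes))


jobs = job_matrix(instance_types, instance_counts, py_versions, batch_sizes)
print(len(jobs))  # 6 * 1 * 2 * 5 = 60 combinations
```

If that assumption holds, trimming any one of the repeated options is the quickest way to reduce the number of jobs a single invocation starts.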
|
24 | 64 |
|
25 | 65 | ## Using other models, datasets and benchmarks configurations
|
26 | 66 | ```python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help``` shows all the options that the script supports.
|
27 |
| - |
28 |
| - |
29 |
| -## Tensorboard events and checkpoints |
30 |
| - |
31 |
| -Tensorboard events are being saved to the S3 location defined by the hyperparameter checkpoint_path during training. That location can be overwritten by setting the script argument ```checkpoint-path```: |
32 |
| - |
33 |
| -```bash |
34 |
| -python execute_tensorflow_training.py ... --checkpoint-path s3://my/bucket/output/data |
35 |
| -``` |
|