Commit 4f66042

TensorFlow 1.12 and Horovod support (#138)
1 parent a4e6cfa commit 4f66042

52 files changed: +800 -11520 lines changed

benchmarks/README.md

Lines changed: 50 additions & 19 deletions
@@ -3,33 +3,64 @@
 This folder contains the TF training scripts https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.
 
 ## Basic usage
-**execute_tensorflow_training.py** uses SageMaker python sdk to start a training job. It takes the following parameters:
+**execute_tensorflow_training.py train** uses SageMaker python sdk to start a training job.
 
-- role: SageMaker role used for training
-- region: SageMaker region
-- py-versions: py2 or py3 or "py2, py3"
-- instance-types: A list of SageMaker instance types, for example 'ml.p2.xlarge, ml.c4.xlarge'. Use 'local' for local mode training.
-- checkpoint-path: The S3 location where the model checkpoints and tensorboard events are saved after training
+```bash
+./execute_tensorflow_training.py train --help
+Usage: execute_tensorflow_training.py train [OPTIONS] [SCRIPT_ARGS]...
+
+Options:
+--framework-version [1.11.0|1.12.0]
+[required]
+--device [cpu|gpu] [required]
+--py-versions TEXT
+--training-input-mode [File|Pipe]
+--networking-isolation / --no-networking-isolation
+--wait / --no-wait
+--security-groups TEXT
+--subnets TEXT
+--role TEXT
+--instance-counts INTEGER
+--batch-sizes INTEGER
+--instance-types TEXT
+--help Show this message and exit.
 
-Any unknown arguments will be passed to the training script as additional arguments.
+```
+**execute_tensorflow_training.py generate_reports** generate benchmark reports.
 
 ## Examples:
 
 ```bash
-./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type local --num_epochs 1 --wait
-
-./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type ml.c4.xlarge, ml.c5.xlarge --model resnet50
+#!/usr/bin/env bash
 
+./execute_tensorflow_training.py train \
+--framework-version 1.11.0 \
+--device gpu \
+\
+--instance-types ml.p3.2xlarge \
+--instance-types ml.p3.8xlarge \
+--instance-types ml.p3.16xlarge \
+--instance-types ml.p2.xlarge \
+--instance-types ml.p2.8xlarge \
+--instance-types ml.p2.16xlarge \
+\
+--instance-counts 1 \
+\
+--py-versions py3 \
+--py-versions py2 \
+\
+--subnets subnet-125fb674 \
+\
+--security-groups sg-ce5dd1b4 \
+\
+--batch-sizes 32 \
+--batch-sizes 64 \
+--batch-sizes 128 \
+--batch-sizes 256 \
+--batch-sizes 512 \
+\
+-- --model resnet32 --num_epochs 10 --data_format NHWC --summary_verbosity 1 --save_summaries_steps 10 --data_name cifar10
 ```
 
 ## Using other models, datasets and benchmarks configurations
 ```python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help``` shows all the options that the script has.
-
-
-## Tensorboard events and checkpoints
-
-Tensorboard events are being saved to the S3 location defined by the hyperparameter checkpoint_path during training. That location can be overwritten by setting the script argument ```checkpoint-path```:
-
-```bash
-python execute_tensorflow_training.py ... --checkpoint-path s3://my/bucket/output/data
-```
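
Note on the example `train` command in the README diff above: each repeated option (`--instance-types`, `--instance-counts`, `--py-versions`, `--batch-sizes`) contributes a list of values. The README does not say how the script combines these lists. The sketch below assumes, purely for illustration, that one training job is started per combination of values. The function name `job_matrix` and the dictionary keys are hypothetical and are not taken from `execute_tensorflow_training.py`.

```python
from itertools import product

# Values copied from the example command in the README diff.
instance_types = [
    "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
    "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
]
instance_counts = [1]
py_versions = ["py3", "py2"]
batch_sizes = [32, 64, 128, 256, 512]


def job_matrix(instance_types, instance_counts, py_versions, batch_sizes):
    """Return one config dict per combination of option values.

    ASSUMPTION: the real script may combine repeated options differently;
    the Cartesian product is only a plausible reading of the example.
    """
    return [
        {
            "instance_type": itype,
            "instance_count": count,
            "py_version": py,
            "batch_size": batch,
        }
        for itype, count, py, batch in product(
            instance_types, instance_counts, py_versions, batch_sizes
        )
    ]


jobs = job_matrix(instance_types, instance_counts, py_versions, batch_sizes)
print(len(jobs))  # 6 * 1 * 2 * 5 = 60 combinations
```

If the script does behave this way, the example command would correspond to 60 configurations, so a narrower set of values may be a cheaper starting point.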

benchmarks/benchmarks

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Subproject commit ec056be57f189ec96611a58e8dc5562a6d620139
