
TensorFlow 1.12 and Horovod support #138

Merged
merged 32 commits into from
Jan 8, 2019
32 commits
50917c9
Remove model folder
mvsusp Dec 2, 2018
5f677a1
Add benchmarks as submodule
mvsusp Dec 2, 2018
4feea20
Update benchmarking script
mvsusp Dec 3, 2018
8974069
Update benchmark scripts
mvsusp Dec 5, 2018
7893f99
Update benchmarks
mvsusp Dec 5, 2018
01b93e3
Merge branch 'script-mode' into mvs-update-benchmarks
mvsusp Dec 5, 2018
0a63299
Remove psutil
mvsusp Dec 16, 2018
d5b8c72
Merge branch 'mvs-update-benchmarks' of github.com:mvsusp/sagemaker-t…
mvsusp Dec 16, 2018
68a622d
Create Dockerfiles
mvsusp Dec 16, 2018
0d275ff
WIP
mvsusp Dec 17, 2018
1e53d7c
WIP
Dec 18, 2018
432869a
Wip
mvsusp Dec 18, 2018
cdb4a1c
WIP
mvsusp Dec 18, 2018
9e383b6
Wip
mvsusp Dec 18, 2018
46a2cbd
WIP
mvsusp Dec 18, 2018
cf7c187
Updated dockerfiles
Dec 18, 2018
1bc3ec9
Merge branch 'mvs-update-benchmarks' into mvs-hvd
mvsusp Dec 18, 2018
161ba31
WIP
mvsusp Dec 18, 2018
3c02aea
Update sagemaker-containers
mvsusp Dec 20, 2018
dbf0260
Merge branch 'mvs-hvd' of github.com:mvsusp/sagemaker-tensorflow-cont…
mvsusp Dec 20, 2018
10e687b
Integ tests
mvsusp Dec 20, 2018
02a9ee4
Test fix
mvsusp Dec 20, 2018
d11e8bd
Merge branch 'script-mode' into mvs-hvd
mvsusp Dec 21, 2018
9742909
Fix tests
mvsusp Dec 21, 2018
4cede7f
Test fix
mvsusp Dec 21, 2018
4124872
Remove git submodule
mvsusp Dec 21, 2018
f1f5e5f
Merge branch 'script-mode' into mvs-hvd
nadiaya Dec 21, 2018
18651fe
Merge remote-tracking branch 'origin/script-mode' into mvs-hvd
mvsusp Jan 7, 2019
3b5db0e
Fix test
mvsusp Jan 7, 2019
1d869e0
Test only CPU
mvsusp Jan 7, 2019
42cba39
Skip GPU
mvsusp Jan 7, 2019
ff81c2e
Remove unused line
mvsusp Jan 7, 2019
69 changes: 50 additions & 19 deletions benchmarks/README.md
@@ -3,33 +3,64 @@
This folder contains the TF training scripts https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.

## Basic usage
**execute_tensorflow_training.py** uses SageMaker python sdk to start a training job. It takes the following parameters:
**execute_tensorflow_training.py train** uses SageMaker python sdk to start a training job.

- role: SageMaker role used for training
- region: SageMaker region
- py-versions: py2 or py3 or "py2, py3"
- instance-types: A list of SageMaker instance types, for example 'ml.p2.xlarge, ml.c4.xlarge'. Use 'local' for local mode training.
- checkpoint-path: The S3 location where the model checkpoints and tensorboard events are saved after training
```bash
./execute_tensorflow_training.py train --help
Usage: execute_tensorflow_training.py train [OPTIONS] [SCRIPT_ARGS]...

Options:
--framework-version [1.11.0|1.12.0]
[required]
--device [cpu|gpu] [required]
--py-versions TEXT
--training-input-mode [File|Pipe]
--networking-isolation / --no-networking-isolation
--wait / --no-wait
--security-groups TEXT
--subnets TEXT
--role TEXT
--instance-counts INTEGER
--batch-sizes INTEGER
--instance-types TEXT
--help Show this message and exit.

Any unknown arguments will be passed to the training script as additional arguments.
```
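
The pass-through of unknown arguments can be illustrated with a minimal sketch. It uses the standard library's `argparse` rather than whatever CLI library the script actually relies on, and the function name `split_args` is hypothetical; only the two required options from the help text above are declared.

```python
import argparse


def split_args(argv):
    """Split argv into known launcher options and leftover script arguments.

    Hypothetical helper: illustrates the pass-through idea only, it is not
    the implementation used by execute_tensorflow_training.py.
    """
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--framework-version", choices=["1.11.0", "1.12.0"], required=True)
    parser.add_argument("--device", choices=["cpu", "gpu"], required=True)
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    known, script_args = parser.parse_known_args(argv)
    return known, script_args


known, script_args = split_args(
    ["--framework-version", "1.12.0", "--device", "cpu",
     "--model", "resnet32", "--num_epochs", "10"]
)
print(known.framework_version, known.device)
print(script_args)  # arguments that would be forwarded to the training script
```

In this sketch, everything the parser does not recognize ends up in `script_args`, in the order it was given.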
**execute_tensorflow_training.py generate_reports** generates benchmark reports.

## Examples:

```bash
./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type local --num_epochs 1 --wait

./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type ml.c4.xlarge, ml.c5.xlarge --model resnet50
#!/usr/bin/env bash

./execute_tensorflow_training.py train \
--framework-version 1.11.0 \
--device gpu \
\
--instance-types ml.p3.2xlarge \
--instance-types ml.p3.8xlarge \
--instance-types ml.p3.16xlarge \
--instance-types ml.p2.xlarge \
--instance-types ml.p2.8xlarge \
--instance-types ml.p2.16xlarge \
\
--instance-counts 1 \
\
--py-versions py3 \
--py-versions py2 \
\
--subnets subnet-125fb674 \
\
--security-groups sg-ce5dd1b4 \
\
--batch-sizes 32 \
--batch-sizes 64 \
--batch-sizes 128 \
--batch-sizes 256 \
--batch-sizes 512 \
\
-- --model resnet32 --num_epochs 10 --data_format NHWC --summary_verbosity 1 --save_summaries_steps 10 --data_name cifar10
```
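
The example above repeats `--instance-types`, `--py-versions` and `--batch-sizes`. This README does not say how the script combines repeated options; the sketch below assumes one training job per combination of values, which is an assumption and not confirmed behavior. The helper name `expand_grid` and the dictionary keys are hypothetical.

```python
from itertools import product


def expand_grid(instance_types, instance_counts, py_versions, batch_sizes):
    """Return one job configuration per combination of the given values.

    Assumption: the benchmark launcher runs the full cartesian product.
    """
    return [
        {
            "instance_type": instance_type,
            "instance_count": instance_count,
            "py_version": py_version,
            "batch_size": batch_size,
        }
        for instance_type, instance_count, py_version, batch_size in product(
            instance_types, instance_counts, py_versions, batch_sizes
        )
    ]


jobs = expand_grid(
    instance_types=[
        "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
        "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
    ],
    instance_counts=[1],
    py_versions=["py3", "py2"],
    batch_sizes=[32, 64, 128, 256, 512],
)
print(len(jobs))  # 6 * 1 * 2 * 5 = 60 combinations
```

Under that assumption, the example command would describe 60 job configurations, so it is worth trimming the lists before launching a benchmark run on paid instances.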

## Using other models, datasets and benchmarks configurations
```python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help``` lists all the options the script supports.


## Tensorboard events and checkpoints

TensorBoard events are saved during training to the S3 location defined by the hyperparameter checkpoint_path. That location can be overridden by setting the script argument ```checkpoint-path```:

```bash
python execute_tensorflow_training.py ... --checkpoint-path s3://my/bucket/output/data
```
1 change: 1 addition & 0 deletions benchmarks/benchmarks
Submodule benchmarks added at ec056b