Pr 138 fix #144

Closed
wants to merge 59 commits into from
59 commits
50917c9
Remove model folder
mvsusp Dec 2, 2018
5f677a1
Add benchmarks as submodule
mvsusp Dec 2, 2018
4feea20
Update benchmarking script
mvsusp Dec 3, 2018
8974069
Update benchmark scripts
mvsusp Dec 5, 2018
7893f99
Update benchmarks
mvsusp Dec 5, 2018
01b93e3
Merge branch 'script-mode' into mvs-update-benchmarks
mvsusp Dec 5, 2018
0a63299
Remove psutil
mvsusp Dec 16, 2018
d5b8c72
Merge branch 'mvs-update-benchmarks' of github.com:mvsusp/sagemaker-t…
mvsusp Dec 16, 2018
68a622d
Create Dockerfiles
mvsusp Dec 16, 2018
0d275ff
WIP
mvsusp Dec 17, 2018
1e53d7c
WIP
Dec 18, 2018
432869a
Wip
mvsusp Dec 18, 2018
cdb4a1c
WIP
mvsusp Dec 18, 2018
9e383b6
Wip
mvsusp Dec 18, 2018
46a2cbd
WIP
mvsusp Dec 18, 2018
cf7c187
Updated dockerfiles
Dec 18, 2018
1bc3ec9
Merge branch 'mvs-update-benchmarks' into mvs-hvd
mvsusp Dec 18, 2018
161ba31
WIP
mvsusp Dec 18, 2018
3c02aea
Update sagemaker-containers
mvsusp Dec 20, 2018
dbf0260
Merge branch 'mvs-hvd' of github.com:mvsusp/sagemaker-tensorflow-cont…
mvsusp Dec 20, 2018
10e687b
Integ tests
mvsusp Dec 20, 2018
02a9ee4
Test fix
mvsusp Dec 20, 2018
d11e8bd
Merge branch 'script-mode' into mvs-hvd
mvsusp Dec 21, 2018
9742909
Fix tests
mvsusp Dec 21, 2018
4cede7f
Test fix
mvsusp Dec 21, 2018
4124872
Remove git submodule
mvsusp Dec 21, 2018
f1f5e5f
Merge branch 'script-mode' into mvs-hvd
nadiaya Dec 21, 2018
7ff6454
Changing the num of process per host from 3 to 2 as only 2 cpus are a…
uditbhatia Jan 3, 2019
136e112
Removing 5,3 test cases
uditbhatia Jan 3, 2019
527d17a
Creating docker subfolder
uditbhatia Jan 3, 2019
bbbf1f6
Remove model folder
mvsusp Dec 2, 2018
194c875
Add benchmarks as submodule
mvsusp Dec 2, 2018
29c4b1f
Update benchmarking script
mvsusp Dec 3, 2018
57c4f6d
Update benchmark scripts
mvsusp Dec 5, 2018
44be1a0
Update benchmarks
mvsusp Dec 5, 2018
fd27be6
Remove psutil
mvsusp Dec 16, 2018
fe4acdd
Create Dockerfiles
mvsusp Dec 16, 2018
4a43487
WIP
mvsusp Dec 17, 2018
382f109
WIP
Dec 18, 2018
67eb3e8
Wip
mvsusp Dec 18, 2018
dcf6622
WIP
mvsusp Dec 18, 2018
9125237
Wip
mvsusp Dec 18, 2018
ea7a675
WIP
mvsusp Dec 18, 2018
56afdde
Updated dockerfiles
Dec 18, 2018
afb911d
WIP
mvsusp Dec 18, 2018
de6bd2e
Update sagemaker-containers
mvsusp Dec 20, 2018
86cb4e5
Integ tests
mvsusp Dec 20, 2018
01e1523
Test fix
mvsusp Dec 20, 2018
0ef6d1d
Fix tests
mvsusp Dec 21, 2018
aaabb02
Test fix
mvsusp Dec 21, 2018
30508d5
Remove git submodule
mvsusp Dec 21, 2018
e002a12
Changing the num of process per host from 3 to 2 as only 2 cpus are a…
uditbhatia Jan 3, 2019
1f42f99
Removing 5,3 test cases
uditbhatia Jan 3, 2019
561447d
Creating docker subfolder
uditbhatia Jan 3, 2019
71a210e
Merge branch 'pr-138-fix' of github.com:uditbhatia/sagemaker-tensorfl…
uditbhatia Jan 4, 2019
885d370
Adding missing imprt runner
uditbhatia Jan 4, 2019
542f77c
reorganizing im,ports
uditbhatia Jan 4, 2019
94fc94f
Skipping failing horovod integ test
uditbhatia Jan 4, 2019
c8c7f29
organizing imports
uditbhatia Jan 4, 2019
69 changes: 50 additions & 19 deletions benchmarks/README.md
@@ -3,33 +3,64 @@
This folder contains the TensorFlow training scripts from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.

## Basic usage
**execute_tensorflow_training.py** uses the SageMaker Python SDK to start a training job. It takes the following parameters:
**execute_tensorflow_training.py train** uses the SageMaker Python SDK to start a training job.

- role: SageMaker role used for training
- region: SageMaker region
- py-versions: py2 or py3 or "py2, py3"
- instance-types: A list of SageMaker instance types, for example 'ml.p2.xlarge, ml.c4.xlarge'. Use 'local' for local mode training.
- checkpoint-path: The S3 location where the model checkpoints and tensorboard events are saved after training
```bash
./execute_tensorflow_training.py train --help
Usage: execute_tensorflow_training.py train [OPTIONS] [SCRIPT_ARGS]...

Options:
--framework-version [1.11.0|1.12.0]
[required]
--device [cpu|gpu] [required]
--py-versions TEXT
--training-input-mode [File|Pipe]
--networking-isolation / --no-networking-isolation
--wait / --no-wait
--security-groups TEXT
--subnets TEXT
--role TEXT
--instance-counts INTEGER
--batch-sizes INTEGER
--instance-types TEXT
--help Show this message and exit.

Any unknown arguments will be passed to the training script as additional arguments.
```
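
The pass-through behaviour described in the help text can be sketched with the standard library. This is a minimal illustration, not the script's actual implementation: it uses `argparse` as a stand-in (the real command-line library is not shown in this diff), and the helper name `split_args` is hypothetical.

```python
import argparse


def split_args(argv):
    """Parse the options we know and return the rest untouched.

    Hypothetical helper: only two of the documented options are declared
    here, to keep the sketch short.
    """
    parser = argparse.ArgumentParser(prog="execute_tensorflow_training.py train")
    parser.add_argument("--device", choices=["cpu", "gpu"], required=True)
    # A repeatable option: each occurrence appends one value to the list.
    parser.add_argument("--instance-types", action="append", default=[])
    # parse_known_args returns (namespace, list_of_unrecognised_arguments).
    known, script_args = parser.parse_known_args(argv)
    return known, script_args


known, script_args = split_args([
    "--device", "gpu",
    "--instance-types", "ml.p3.2xlarge",
    "--instance-types", "ml.p2.xlarge",
    "--model", "resnet32",
    "--num_epochs", "10",
])
print(known.instance_types)  # options consumed by the launcher
print(script_args)           # arguments left over for the training script
```

The leftover list keeps its original order, so it can be forwarded to the training script unchanged.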
**execute_tensorflow_training.py generate_reports** generates benchmark reports.

## Examples:

```bash
./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type local --num_epochs 1 --wait

./execute_tensorflow_training.py -t local -r SageMakerRole --instance-type ml.c4.xlarge, ml.c5.xlarge --model resnet50
#!/usr/bin/env bash

./execute_tensorflow_training.py train \
--framework-version 1.11.0 \
--device gpu \
\
--instance-types ml.p3.2xlarge \
--instance-types ml.p3.8xlarge \
--instance-types ml.p3.16xlarge \
--instance-types ml.p2.xlarge \
--instance-types ml.p2.8xlarge \
--instance-types ml.p2.16xlarge \
\
--instance-counts 1 \
\
--py-versions py3 \
--py-versions py2 \
\
--subnets subnet-125fb674 \
\
--security-groups sg-ce5dd1b4 \
\
--batch-sizes 32 \
--batch-sizes 64 \
--batch-sizes 128 \
--batch-sizes 256 \
--batch-sizes 512 \
\
-- --model resnet32 --num_epochs 10 --data_format NHWC --summary_verbosity 1 --save_summaries_steps 10 --data_name cifar10
```
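
The example above repeats `--instance-types`, `--py-versions` and `--batch-sizes`. Assuming the script launches one training job per combination of those values (an assumption; the diff does not show how the options are combined), the number of jobs can be estimated with a short sketch. The function name `expand_job_grid` and the dictionary keys are hypothetical.

```python
from itertools import product


def expand_job_grid(instance_types, py_versions, batch_sizes, instance_counts):
    """Return one dict per combination of the repeated option values.

    Assumption: the benchmark launcher runs the full Cartesian product.
    """
    return [
        {
            "instance_type": instance_type,
            "py_version": py_version,
            "batch_size": batch_size,
            "instance_count": instance_count,
        }
        for instance_type, py_version, batch_size, instance_count in product(
            instance_types, py_versions, batch_sizes, instance_counts
        )
    ]


jobs = expand_job_grid(
    instance_types=[
        "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
        "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
    ],
    py_versions=["py3", "py2"],
    batch_sizes=[32, 64, 128, 256, 512],
    instance_counts=[1],
)
print(len(jobs))  # 6 instance types * 2 Python versions * 5 batch sizes * 1 count = 60
```

Under that assumption the example command would start 60 jobs, which is worth checking against account limits before running it.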

## Using other models, datasets and benchmarks configurations
```python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help``` shows all the options that the script has.


## Tensorboard events and checkpoints

During training, TensorBoard events are saved to the S3 location defined by the hyperparameter `checkpoint_path`. That location can be overridden by setting the script argument ```checkpoint-path```:

```bash
python execute_tensorflow_training.py ... --checkpoint-path s3://my/bucket/output/data
```
1 change: 1 addition & 0 deletions benchmarks/benchmarks
Submodule benchmarks added at ec056b