Fir for Pr 138 by Udit Bhatia #142

Closed
51 commits
c93d034
Scriptmode single machine training implementation (#78)
icywang86rui Sep 27, 2018
3763697
Add tox.ini and configure coverage and flake runs (#80)
icywang86rui Oct 2, 2018
99eaf6b
Add integration tests to run training jobs with sagemaker (#81)
icywang86rui Oct 5, 2018
1338820
Add Script Mode example (#83)
mvsusp Oct 9, 2018
a1916a8
Add benchmarking script (#86)
mvsusp Oct 23, 2018
7047101
Edited the tf script mode notebook (#90)
eslesar-aws Oct 27, 2018
032cf60
Add distributed training support (#98)
icywang86rui Nov 6, 2018
177773d
Add CI configuration files (#109)
mvsusp Nov 15, 2018
a897135
Set S3 environment variables (#112)
icywang86rui Nov 16, 2018
5913b17
GPU fix (#117)
mvsusp Nov 19, 2018
1fab499
Update sagemaker containers (#119)
mvsusp Nov 19, 2018
c4abcae
Set parameter process waiting to False (#120)
mvsusp Nov 20, 2018
378add5
Disable GPU for parameter process (#121)
icywang86rui Nov 21, 2018
534ffa7
Unset CUDA_VISIBLE_DEVICES for worker processes (#122)
icywang86rui Nov 21, 2018
e6bf988
Fix broken unit tests (#124)
icywang86rui Nov 23, 2018
49a0547
Add Keras support (#126)
mvsusp Nov 24, 2018
962f15b
Create parameter server in different thread (#129)
icywang86rui Nov 27, 2018
50917c9
Remove model folder
mvsusp Dec 2, 2018
5f677a1
Add benchmarks as submodule
mvsusp Dec 2, 2018
4feea20
Update benchmarking script
mvsusp Dec 3, 2018
8e6c4f2
Fix Keras test (#132)
icywang86rui Dec 4, 2018
8974069
Update benchmark scripts
mvsusp Dec 5, 2018
7893f99
Update benchmarks
mvsusp Dec 5, 2018
01b93e3
Merge branch 'script-mode' into mvs-update-benchmarks
mvsusp Dec 5, 2018
d2f9f48
Skip keras local mode test on gpu and use random port for serving in …
icywang86rui Dec 5, 2018
0a63299
Remove psutil
mvsusp Dec 16, 2018
d5b8c72
Merge branch 'mvs-update-benchmarks' of github.com:mvsusp/sagemaker-t…
mvsusp Dec 16, 2018
68a622d
Create Dockerfiles
mvsusp Dec 16, 2018
0d275ff
WIP
mvsusp Dec 17, 2018
1e53d7c
WIP
Dec 18, 2018
432869a
Wip
mvsusp Dec 18, 2018
cdb4a1c
WIP
mvsusp Dec 18, 2018
9e383b6
Wip
mvsusp Dec 18, 2018
46a2cbd
WIP
mvsusp Dec 18, 2018
cf7c187
Updated dockerfiles
Dec 18, 2018
1bc3ec9
Merge branch 'mvs-update-benchmarks' into mvs-hvd
mvsusp Dec 18, 2018
161ba31
WIP
mvsusp Dec 18, 2018
3c02aea
Update sagemaker-containers
mvsusp Dec 20, 2018
dbf0260
Merge branch 'mvs-hvd' of github.com:mvsusp/sagemaker-tensorflow-cont…
mvsusp Dec 20, 2018
10e687b
Integ tests
mvsusp Dec 20, 2018
02a9ee4
Test fix
mvsusp Dec 20, 2018
80aa735
Update script_mode_train_any_tf_script_in_sage_maker.ipynb (#110)
mvsusp Dec 21, 2018
d11e8bd
Merge branch 'script-mode' into mvs-hvd
mvsusp Dec 21, 2018
9742909
Fix tests
mvsusp Dec 21, 2018
4cede7f
Test fix
mvsusp Dec 21, 2018
4124872
Remove git submodule
mvsusp Dec 21, 2018
441adb0
Add python-dev and build-essential to Dockerfiles (#141)
laurenyu Dec 21, 2018
f1f5e5f
Merge branch 'script-mode' into mvs-hvd
nadiaya Dec 21, 2018
7ff6454
Changing the num of process per host from 3 to 2 as only 2 cpus are a…
uditbhatia Jan 3, 2019
136e112
Removing 5,3 test cases
uditbhatia Jan 3, 2019
527d17a
Creating docker subfolder
uditbhatia Jan 3, 2019
20 changes: 20 additions & 0 deletions .coveragerc_py27
@@ -0,0 +1,20 @@
[run]
branch = True
timid = True

[report]
exclude_lines =
    pragma: no cover
    pragma: py2 no cover
    if six.PY3
    elif six.PY3

partial_branches =
    pragma: no cover
    pragma: py2 no cover
    if six.PY3
    elif six.PY3

show_missing = True

fail_under = 75
20 changes: 20 additions & 0 deletions .coveragerc_py36
@@ -0,0 +1,20 @@
[run]
branch = True
timid = True

[report]
exclude_lines =
    pragma: no cover
    pragma: py3 no cover
    if six.PY2
    elif six.PY2

partial_branches =
    pragma: no cover
    pragma: py3 no cover
    if six.PY3
    elif six.PY3

show_missing = True

fail_under = 90
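
For readers unfamiliar with how coverage.py reads these files: each entry under `exclude_lines` is a regular expression, and any source line containing a match is left out of the report. The sketch below only illustrates that matching rule with the standard library. It is not how coverage.py is implemented, the helper names `load_exclude_patterns` and `is_excluded` are invented for this example, and it assumes the list entries are indented under their key, as an INI continuation requires.

```python
import configparser
import re

# Copy of the [report] exclude_lines entries from .coveragerc_py36 above.
COVERAGERC_PY36 = """
[report]
exclude_lines =
    pragma: no cover
    pragma: py3 no cover
    if six.PY2
    elif six.PY2
"""


def load_exclude_patterns(text):
    """Parse the INI text and compile each exclude_lines entry as a regex."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("report", "exclude_lines")
    return [re.compile(entry.strip()) for entry in raw.splitlines() if entry.strip()]


def is_excluded(source_line, patterns):
    """Return True if any pattern matches somewhere in the source line."""
    return any(pattern.search(source_line) for pattern in patterns)


patterns = load_exclude_patterns(COVERAGERC_PY36)
print(is_excluded("    if six.PY2:", patterns))
print(is_excluded("    if six.PY3:", patterns))
```

Under this rule the Python 3 run ignores `if six.PY2` lines while still counting `if six.PY3` lines. That is presumably why the two files carry different pragmas and different `fail_under` thresholds.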
3 changes: 3 additions & 0 deletions .flake8
@@ -0,0 +1,3 @@
[flake8]
application_import_names = sagemaker_tensorflow_container, test
import-order-style = google
16 changes: 8 additions & 8 deletions README.rst
@@ -57,7 +57,7 @@ The Docker files are grouped based on TensorFlow version and separated
based on Python version and processor type.

The Docker images, used to run training & inference jobs, are built from
-both corresponding base and final Dockerfiles.
+both corresponding "base" and "final" Dockerfiles.

Base Images
~~~~~~~~~~~
@@ -66,10 +66,10 @@ The "base" Dockerfile encompass the installation of the framework and all of the
needed. It is needed before building image for TensorFlow 1.8.0 and before.
Building a base image is not required for images for TensorFlow 1.9.0 and onwards.

-Tagging scheme is based on <tensorflow_version>-<processor>-<python_version>. (e.g. 1.4
+Tagging scheme is based on <tensorflow_version>-<processor>-<python_version>. (e.g. 1.4
.1-cpu-py2)

-All final Dockerfiles build images using base images that use the tagging scheme
+All "final" Dockerfiles build images using base images that use the tagging scheme
above.

If you want to build your "base" Docker image, then use:
@@ -99,15 +99,15 @@ Final Images

The "final" Dockerfiles encompass the installation of the SageMaker specific support code.

-For images of TensorFlow 1.8.0 and before, all final Dockerfiles use `base images for building <https://github
+For images of TensorFlow 1.8.0 and before, all "final" Dockerfiles use `base images for building <https://github
.com/aws/sagemaker-tensorflow-containers/blob/master/docker/1.4.1/final/py2/Dockerfile.cpu#L2>`__.

-These base images are specified with the naming convention of
+These "base" images are specified with the naming convention of
tensorflow-base:<tensorflow_version>-<processor>-<python_version>.

-Before building final images:
+Before building "final" images:

-Build your base image. Make sure it is named and tagged in accordance with your final
+Build your "base" image. Make sure it is named and tagged in accordance with your "final"
Dockerfile. Skip this step if you want to build image of Tensorflow Version 1.9.0 and above.

Then prepare the SageMaker TensorFlow Container python package in the image folder like below:
@@ -118,7 +118,7 @@ Then prepare the SageMaker TensorFlow Container python package in the image fold
cd sagemaker-tensorflow-containers
python setup.py sdist

-#. Copy your Python package to final Dockerfile directory that you are building.
+#. Copy your Python package to "final" Dockerfile directory that you are building.
cp dist/sagemaker_tensorflow_container-<package_version>.tar.gz docker/<tensorflow_version>/final/py2

If you want to build "final" Docker images, for versions 1.6 and above, you will first need to download the appropriate tensorflow pip wheel, then pass in its location as a build argument. These can be obtained from pypi. For example, the files for 1.6.0 are here:
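
The tagging and naming rules quoted in the README hunks above can be sketched as a few small helpers. This is an illustration only: the function names are hypothetical and not part of this repository, and the check that `processor` is either `cpu` or `gpu` is an assumption based on the Dockerfile names and the `--device` option seen elsewhere in this pull request.

```python
def image_tag(tensorflow_version, processor, python_version):
    """Build a tag of the form <tensorflow_version>-<processor>-<python_version>."""
    if processor not in ("cpu", "gpu"):  # assumed set of processor types
        raise ValueError("unexpected processor: {}".format(processor))
    return "{}-{}-{}".format(tensorflow_version, processor, python_version)


def base_image_name(tensorflow_version, processor, python_version):
    """Name a base image as tensorflow-base:<tag>, following the README convention."""
    return "tensorflow-base:" + image_tag(tensorflow_version, processor, python_version)


def needs_base_image(tensorflow_version):
    """Per the README, a base image is needed for TensorFlow 1.8.0 and before only."""
    major_minor = tuple(int(part) for part in tensorflow_version.split(".")[:2])
    return major_minor < (1, 9)


print(base_image_name("1.4.1", "cpu", "py2"))
print(needs_base_image("1.12.0"))
```

Comparing the version as a tuple of integers rather than as a string matters here, because a plain string comparison would sort "1.12.0" before "1.9.0".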
66 changes: 66 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,66 @@
# TensorFlow benchmarking scripts

This folder contains the TensorFlow training scripts from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.

## Basic usage
**execute_tensorflow_training.py train** uses the SageMaker Python SDK to start a training job.

```bash
./execute_tensorflow_training.py train --help
Usage: execute_tensorflow_training.py train [OPTIONS] [SCRIPT_ARGS]...

Options:
  --framework-version [1.11.0|1.12.0]
                                  [required]
  --device [cpu|gpu]              [required]
  --py-versions TEXT
  --training-input-mode [File|Pipe]
  --networking-isolation / --no-networking-isolation
  --wait / --no-wait
  --security-groups TEXT
  --subnets TEXT
  --role TEXT
  --instance-counts INTEGER
  --batch-sizes INTEGER
  --instance-types TEXT
  --help                          Show this message and exit.

```
**execute_tensorflow_training.py generate_reports** generates benchmark reports.

## Examples:

```bash
#!/usr/bin/env bash

./execute_tensorflow_training.py train \
    --framework-version 1.11.0 \
    --device gpu \
    \
    --instance-types ml.p3.2xlarge \
    --instance-types ml.p3.8xlarge \
    --instance-types ml.p3.16xlarge \
    --instance-types ml.p2.xlarge \
    --instance-types ml.p2.8xlarge \
    --instance-types ml.p2.16xlarge \
    \
    --instance-counts 1 \
    \
    --py-versions py3 \
    --py-versions py2 \
    \
    --subnets subnet-125fb674 \
    \
    --security-groups sg-ce5dd1b4 \
    \
    --batch-sizes 32 \
    --batch-sizes 64 \
    --batch-sizes 128 \
    --batch-sizes 256 \
    --batch-sizes 512 \
    \
    -- --model resnet32 --num_epochs 10 --data_format NHWC --summary_verbosity 1 --save_summaries_steps 10 --data_name cifar10
```
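
The example above repeats `--instance-types`, `--py-versions` and `--batch-sizes` several times. This page does not show how `execute_tensorflow_training.py` combines those values. Assuming it launches one training job per combination (a Cartesian product), the size of the resulting job grid can be estimated as follows. The function `expand_jobs` and its dictionary keys are hypothetical and exist only for this illustration.

```python
import itertools


def expand_jobs(instance_types, instance_counts, py_versions, batch_sizes):
    """Return one dict per combination of the repeated option values.

    Assumption: the benchmark script runs every combination; this page
    does not confirm that behaviour.
    """
    keys = ("instance_type", "instance_count", "py_version", "batch_size")
    combos = itertools.product(instance_types, instance_counts, py_versions, batch_sizes)
    return [dict(zip(keys, combo)) for combo in combos]


jobs = expand_jobs(
    instance_types=[
        "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
        "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
    ],
    instance_counts=[1],
    py_versions=["py3", "py2"],
    batch_sizes=[32, 64, 128, 256, 512],
)
print(len(jobs))  # 6 * 1 * 2 * 5 = 60
```

If that assumption holds, the example command would start 60 training jobs, which is worth checking before running it against a billed account.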

## Using other models, datasets and benchmark configurations
Running `python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help` shows all the options that the script accepts.
1 change: 1 addition & 0 deletions benchmarks/benchmarks
Submodule benchmarks added at ec056b