Commit c93d034

Scriptmode single machine training implementation (#78)
* Scriptmode with cpu docker py2 and py3 docker file
* Migrate to sagemaker-containers 2.1
* Remove serving related packages and code from container
* Add py3 container
* Add integ and unit tests for script mode
* Remove non-asci characters from README
* Changes based on pr comments
* Move conftest to test root dir
* Add default values for test args
* add docker-compose to test requirement

1 parent 8acf51d commit c93d034

107 files changed: 383 additions, 7452 deletions

README.rst

Lines changed: 8 additions & 8 deletions
@@ -57,7 +57,7 @@ The Docker files are grouped based on TensorFlow version and separated
 based on Python version and processor type.
 
 The Docker images, used to run training & inference jobs, are built from
-both corresponding base and final Dockerfiles.
+both corresponding "base" and "final" Dockerfiles.
 
 Base Images
 ~~~~~~~~~~~
@@ -66,10 +66,10 @@ The "base" Dockerfile encompass the installation of the framework and all of the
 needed. It is needed before building image for TensorFlow 1.8.0 and before.
 Building a base image is not required for images for TensorFlow 1.9.0 and onwards.
 
-Tagging scheme is based on <tensorflow_version>-<processor>-<python_version>. (e.g. 1.4
+Tagging scheme is based on <tensorflow_version>-<processor>-<python_version>. (e.g. 1.4
 .1-cpu-py2)
 
-All final Dockerfiles build images using base images that use the tagging scheme
+All "final" Dockerfiles build images using base images that use the tagging scheme
 above.
 
 If you want to build your "base" Docker image, then use:
@@ -99,15 +99,15 @@ Final Images
 
 The "final" Dockerfiles encompass the installation of the SageMaker specific support code.
 
-For images of TensorFlow 1.8.0 and before, all final Dockerfiles use `base images for building <https://github
+For images of TensorFlow 1.8.0 and before, all "final" Dockerfiles use `base images for building <https://github
 .com/aws/sagemaker-tensorflow-containers/blob/master/docker/1.4.1/final/py2/Dockerfile.cpu#L2>`__.
 
-These base images are specified with the naming convention of
+These "base" images are specified with the naming convention of
 tensorflow-base:<tensorflow_version>-<processor>-<python_version>.
 
-Before building final images:
+Before building "final" images:
 
-Build your base image. Make sure it is named and tagged in accordance with your final
+Build your "base" image. Make sure it is named and tagged in accordance with your "final"
 Dockerfile. Skip this step if you want to build image of Tensorflow Version 1.9.0 and above.
 
 Then prepare the SageMaker TensorFlow Container python package in the image folder like below:
@@ -118,7 +118,7 @@ Then prepare the SageMaker TensorFlow Container python package in the image fold
 cd sagemaker-tensorflow-containers
 python setup.py sdist
 
-#. Copy your Python package to final Dockerfile directory that you are building.
+#. Copy your Python package to "final" Dockerfile directory that you are building.
 cp dist/sagemaker_tensorflow_container-<package_version>.tar.gz docker/<tensorflow_version>/final/py2
 
 If you want to build "final" Docker images, for versions 1.6 and above, you will first need to download the appropriate tensorflow pip wheel, then pass in its location as a build argument. These can be obtained from pypi. For example, the files for 1.6.0 are here:
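
Note on the README hunks above: the tagging scheme `<tensorflow_version>-<processor>-<python_version>` and the base image naming convention `tensorflow-base:<tag>` can be illustrated with a small shell sketch. The helper names `image_tag` and `base_image` are hypothetical and are not part of the repository; only the two naming patterns and the example `1.4.1-cpu-py2` come from the README text.

```shell
#!/bin/sh
# Hypothetical helpers (not from the repository) illustrating the README's
# naming conventions.

# Compose a tag of the form <tensorflow_version>-<processor>-<python_version>.
image_tag() {
    printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Compose a base image reference of the form tensorflow-base:<tag>.
base_image() {
    printf 'tensorflow-base:%s\n' "$(image_tag "$1" "$2" "$3")"
}

image_tag 1.4.1 cpu py2
base_image 1.4.1 cpu py2
```

Under these assumptions, a "final" Dockerfile for TensorFlow 1.8.0 and before would expect a base image whose name matches the second helper's output for the same version, processor, and Python version.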

docker/1.10.0/Dockerfile.cpu

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+FROM ubuntu:16.04
+
+MAINTAINER Amazon AI
+
+ARG framework_installable
+ARG framework_support_installable=sagemaker_tensorflow_container-2.0.0.tar.gz
+ARG py_version
+
+# Validate that arguments are specified
+RUN test $framework_installable || exit 1 \
+    && test $py_version || exit 1
+
+WORKDIR /root
+
+COPY $framework_installable .
+COPY $framework_support_installable .
+
+RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common \
+    && add-apt-repository ppa:deadsnakes/ppa -y
+
+RUN buildDeps=" \
+        build-essential \
+        curl \
+        git \
+        libcurl3-dev \
+        libfreetype6-dev \
+        libpng12-dev \
+        libzmq3-dev \
+        pkg-config \
+        rsync \
+        unzip \
+        zip \
+        zlib1g-dev \
+        openjdk-8-jdk \
+        openjdk-8-jre-headless \
+        wget \
+        vim \
+        iputils-ping \
+        nginx \
+    " \
+    && apt-get update && apt-get install -y --no-install-recommends $buildDeps \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN if [ $py_version -eq 3 ]; \
+    then apt-get update && apt-get install -y --no-install-recommends python3.6-dev \
+        && ln -s -f /usr/bin/python3.6 /usr/bin/python; \
+    else apt-get update && apt-get install -y --no-install-recommends python-dev; fi
+
+RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
+    python get-pip.py && \
+    rm get-pip.py
+
+RUN pip install --upgrade \
+    pip \
+    setuptools
+
+# Set environment variables for MKL
+# TODO: investigate the right value for OMP_NUM_THREADS
+# For more about MKL with TensorFlow see:
+# https://www.tensorflow.org/performance/performance_guide#tensorflow_with_intel%C2%AE_mkl_dnn
+ENV KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=0
+
+RUN framework_installable_local=$(basename $framework_installable) \
+    && framework_support_installable_local=$(basename $framework_support_installable) \
+    && pip install --no-cache --upgrade $framework_installable_local \
+    && pip install $framework_support_installable_local \
+    && pip install "sagemaker-tensorflow>=1.10,<1.11" \
+    \
+    && rm $framework_installable_local \
+    && rm $framework_support_installable_local
+
+ENV SAGEMAKER_TRAINING_MODULE tf_container.training:main
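
Note on the "Validate that arguments are specified" step in the Dockerfile above: the check relies on `test` returning a non-zero status when an unquoted, empty variable leaves it with no operands. The sketch below reproduces that chain outside Docker so its behavior can be observed. The function name `check_build_args` and the wheel file name are placeholders for illustration, not part of the commit.

```shell
#!/bin/sh
# Reproduces the Dockerfile's validation chain in a subshell, so that
# "exit 1" ends only the subshell and not the calling script.
# check_build_args is a hypothetical name used for this illustration.
check_build_args() {
    framework_installable=$1
    py_version=$2
    (
        # An unquoted empty variable expands to nothing, and `test`
        # with no operands returns a non-zero status.
        test $framework_installable || exit 1 \
            && test $py_version || exit 1
    )
}

if check_build_args "tensorflow-placeholder.whl" "3"; then
    echo "both arguments set: validation passes"
fi
if ! check_build_args "" "3"; then
    echo "framework_installable missing: validation fails"
fi
if ! check_build_args "tensorflow-placeholder.whl" ""; then
    echo "py_version missing: validation fails"
fi
```

Because the variables are unquoted, a value containing spaces would be split into several operands; the sketch assumes simple values such as a file name and a single digit.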

docker/1.10.0/Dockerfile.gpu

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+FROM nvidia/cuda:9.0-base-ubuntu16.04
+
+MAINTAINER Amazon AI
+
+ARG framework_installable
+ARG framework_support_installable=sagemaker_tensorflow_container-2.0.0.tar.gz
+ARG py_version
+
+# Validate that arguments are specified
+RUN test $framework_installable || exit 1 \
+    && test $py_version || exit 1
+
+RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common \
+    && add-apt-repository ppa:deadsnakes/ppa -y
+
+RUN buildDeps=" \
+        build-essential \
+        cuda-command-line-tools-9-0 \
+        cuda-cublas-dev-9-0 \
+        cuda-cudart-dev-9-0 \
+        cuda-cufft-dev-9-0 \
+        cuda-curand-dev-9-0 \
+        cuda-cusolver-dev-9-0 \
+        cuda-cusparse-dev-9-0 \
+        curl \
+        git \
+        libcudnn7=7.1.4.18-1+cuda9.0 \
+        libcudnn7-dev=7.1.4.18-1+cuda9.0 \
+        libcurl3-dev \
+        libfreetype6-dev \
+        libpng12-dev \
+        libzmq3-dev \
+        pkg-config \
+        rsync \
+        unzip \
+        zip \
+        zlib1g-dev \
+        wget \
+        vim \
+        nginx \
+        iputils-ping \
+    " \
+    && apt-get update && apt-get install -y --no-install-recommends $buildDeps \
+    && rm -rf /var/lib/apt/lists/* \
+    && find /usr/local/cuda-9.0/lib64/ -type f -name 'lib*_static.a' -not -name 'libcudart_static.a' -delete \
+    && rm /usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
+
+RUN if [ $py_version -eq 3 ]; \
+    then apt-get update && apt-get install -y --no-install-recommends python3.6-dev \
+        && ln -s -f /usr/bin/python3.6 /usr/bin/python; \
+    else apt-get update && apt-get install -y --no-install-recommends python-dev; fi
+
+RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
+    python get-pip.py && \
+    rm get-pip.py
+
+WORKDIR /root
+
+COPY $framework_installable .
+COPY $framework_support_installable .
+
+RUN framework_installable_local=$(basename $framework_installable) && \
+    framework_support_installable_local=$(basename $framework_support_installable) && \
+    \
+    pip install --no-cache --upgrade $framework_installable_local && \
+    pip install $framework_support_installable_local && \
+    pip install "sagemaker-tensorflow>=1.10,<1.11" &&\
+    \
+    rm $framework_installable_local && \
+    rm $framework_support_installable_local
+
+ENV SAGEMAKER_TRAINING_MODULE tf_container.training:main
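
Note on building from the new Dockerfiles: both declare `framework_installable` and `py_version` as required build arguments, while `framework_support_installable` has a default. The sketch below only prints a plausible `docker build` invocation rather than running it. The helper name `build_cmd`, the wheel file name, and the image name are placeholders assumed for illustration; the commit itself does not show the exact command. Since the Dockerfile uses `COPY` on both installables, the sketch also assumes the files sit inside the build context.

```shell
#!/bin/sh
# Dry-run sketch: prints a docker build command instead of executing it.
# build_cmd, the wheel name, and the image name are hypothetical.
build_cmd() {
    dockerfile=$1    # e.g. docker/1.10.0/Dockerfile.gpu
    wheel=$2         # value for the framework_installable build argument
    py_version=$3    # 2 or 3, compared against 3 inside the Dockerfile
    tag=$4           # image name and tag chosen by the caller

    # The Dockerfile fails the build when either argument is empty,
    # so reject that case before printing anything.
    if [ -z "$wheel" ] || [ -z "$py_version" ]; then
        echo "framework_installable and py_version are required" >&2
        return 1
    fi

    echo "docker build -f $dockerfile -t $tag" \
         "--build-arg framework_installable=$wheel" \
         "--build-arg py_version=$py_version ."
}

build_cmd docker/1.10.0/Dockerfile.gpu tensorflow-gpu-placeholder.whl 3 \
    my-tf-image:1.10.0-gpu-py3
```

Printing the command keeps the sketch safe to run on a machine without Docker; removing the `echo` wrapper would be the natural next step once the placeholder values are replaced with real ones.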

docker/1.10.0/final/py2/Dockerfile.cpu

Lines changed: 0 additions & 79 deletions
This file was deleted.
