-
Notifications
You must be signed in to change notification settings - Fork 162
Upgrade to TensorFlow 1.13.1 #184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Add 1.13.1 dockerfiles * Lower process number for one of the horovod local mode test Too many processes running in parallel causes the test fail sometimes.
docker/1.13.1/Dockerfile.cpu
Outdated
ldconfig && \ | ||
rm -rf /tmp/openmpi | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need two empty newlines
docker/1.13.1/Dockerfile.cpu
Outdated
rm -rf /tmp/openmpi | ||
|
||
|
||
# Create a wrapper for OpenMPI to allow running as root by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space before the pound sign
docker/1.13.1/Dockerfile.cpu
Outdated
python get-pip.py --disable-pip-version-check --no-cache-dir "pip==18.1" && \ | ||
rm get-pip.py | ||
|
||
ARG framework_installable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you group the args together?
docker/1.13.1/Dockerfile.cpu
Outdated
markdown \ | ||
tensorboard | ||
|
||
ENV SAGEMAKER_TRAINING_MODULE sagemaker_tensorflow_container.training:main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a newline to the end of the file
setup.py
Outdated
@@ -53,7 +53,7 @@ def read(fname): | |||
'pandas', 'Pillow', 'h5py'], | |||
extras_require={ | |||
'test': ['tox', 'flake8', 'pytest', 'pytest-cov', 'pytest-xdist', 'mock', | |||
'sagemaker>=1.15.2', 'tensorflow', 'docker-compose'], | |||
'sagemaker>=1.19.1', 'tensorflow', 'docker-compose'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's pin this to reduce the blast radius of potential issues with Python SDK releases
|
||
RUN pip install --no-cache-dir -U \ | ||
keras==2.2.4 \ | ||
mpi4py==3.0.1 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this when you have openmpi ?
|
||
RUN pip install --no-cache-dir -U \ | ||
keras==2.2.4 \ | ||
mpi4py==3.0.1 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
# Install Open MPI | ||
RUN mkdir /tmp/openmpi && \ | ||
cd /tmp/openmpi && \ | ||
curl -fSsL -O https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we install openmpi v4.0?
curl \ | ||
libcudnn7=7.4.1.5-1+cuda10.0 \ | ||
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it | ||
libnccl2 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is the dependencies of OpenMPI, and openmpi is pinned with version, could you please pin this version as well?
libcudnn7=7.4.1.5-1+cuda10.0 \ | ||
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it | ||
libnccl2 \ | ||
libnccl-dev \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same above.
Can you add a This is so that non-master nodes don't immediately exit if they come up before master starts. This is a bigger risk with jobs with many instances. |
* Upgrade to TensorFlow 1.13.1 * Add 1.13.1 dockerfiles * Lower process number for one of the horovod local mode test Too many processes running in parallel causes the test fail sometimes. * Use multiprocessing.Process to start ps
Too many processes running in parallel causes the test fail sometimes.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.