Skip to content

Upgrade to TensorFlow 1.13.1 #184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 8, 2019
Merged

Conversation

icywang86rui
Copy link
Contributor

  • Add 1.13.1 dockerfiles
  • Lower process number for one of the horovod local mode test
    Too many processes running in parallel causes the test fail sometimes.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Add 1.13.1 dockerfiles
* Lower process number for one of the horovod local mode test
    Too many processes running in parallel causes the test fail sometimes.
@icywang86rui icywang86rui changed the title Upgrade to TensorFlow 1.13.1 [DO NOT MERGE - WAITING ON CUDA10] [DO NOT MERGE - WAITING ON CUDA10] Upgrade to TensorFlow 1.13.1 Apr 26, 2019
@icywang86rui icywang86rui reopened this May 6, 2019
@icywang86rui icywang86rui changed the title [DO NOT MERGE - WAITING ON CUDA10] Upgrade to TensorFlow 1.13.1 Upgrade to TensorFlow 1.13.1 May 8, 2019
ldconfig && \
rm -rf /tmp/openmpi


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need two empty newlines

rm -rf /tmp/openmpi


# Create a wrapper for OpenMPI to allow running as root by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space before the pound sign

python get-pip.py --disable-pip-version-check --no-cache-dir "pip==18.1" && \
rm get-pip.py

ARG framework_installable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you group the args together?

markdown \
tensorboard

ENV SAGEMAKER_TRAINING_MODULE sagemaker_tensorflow_container.training:main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a newline to the end of the file

setup.py Outdated
@@ -53,7 +53,7 @@ def read(fname):
'pandas', 'Pillow', 'h5py'],
extras_require={
'test': ['tox', 'flake8', 'pytest', 'pytest-cov', 'pytest-xdist', 'mock',
'sagemaker>=1.15.2', 'tensorflow', 'docker-compose'],
'sagemaker>=1.19.1', 'tensorflow', 'docker-compose'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's pin this to reduce the blast radius of potential issues with Python SDK releases

@icywang86rui icywang86rui merged commit c097ca1 into aws:script-mode May 8, 2019

RUN pip install --no-cache-dir -U \
keras==2.2.4 \
mpi4py==3.0.1 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this when you have openmpi ?


RUN pip install --no-cache-dir -U \
keras==2.2.4 \
mpi4py==3.0.1 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
curl -fSsL -O https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we install openmpi v4.0?

curl \
libcudnn7=7.4.1.5-1+cuda10.0 \
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
libnccl2 \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is the dependencies of OpenMPI, and openmpi is pinned with version, could you please pin this version as well?

libcudnn7=7.4.1.5-1+cuda10.0 \
# TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
libnccl2 \
libnccl-dev \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same above.

@andremoeller
Copy link
Contributor

Can you add a _wait_until_master_is_up function that's called before _wait_until_master_is_down?

This is so that non-master nodes don't immediately exit if they come up before master starts. This is a bigger risk with jobs with many instances.

Elizaaaaa pushed a commit to Elizaaaaa/sagemaker-tensorflow-container that referenced this pull request Nov 4, 2019
* Upgrade to TensorFlow 1.13.1

* Add 1.13.1 dockerfiles
* Lower process number for one of the horovod local mode test
    Too many processes running in parallel causes the test fail sometimes.
* Use multiprocessing.Process to start ps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants