Add mpi4py to pip installs #185

Merged 1 commit into aws:script-mode on Apr 30, 2019

laurenyu (Contributor) commented Apr 30, 2019

corresponds to aws/sagemaker-containers#191

Description of changes:
MPI may fail to detect that a process has failed, which can leave the remaining processes hanging. mpi4py forces all processes to abort if an uncaught exception occurs. See https://docs.chainer.org/en/stable/chainermn/tutorial/tips_faqs.html#mpi-process-hangs-after-an-unhandled-python-exception (the link is from the Chainer docs, but it notes that the problem is not specific to ChainerMN).
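
For context, the abort-on-exception behavior can be sketched as a global exception hook that tears down the whole MPI job, along the lines of the workaround in the linked FAQ. This is a minimal illustration and not the code in this PR: `make_abort_hook` and `install_mpi_abort_hook` are hypothetical names, and the only mpi4py call assumed is `MPI.COMM_WORLD.Abort`.

```python
import sys
import traceback


def make_abort_hook(abort):
    """Build an excepthook that prints the traceback, then calls abort(1).

    `abort` is any callable taking an integer error code; with mpi4py it
    would be MPI.COMM_WORLD.Abort (hypothetical wiring, see below).
    """
    def hook(exc_type, exc_value, exc_tb):
        try:
            # Report the error before tearing the job down.
            traceback.print_exception(exc_type, exc_value, exc_tb)
            sys.stderr.flush()
        finally:
            # Abort even if printing the traceback itself fails.
            abort(1)
    return hook


def install_mpi_abort_hook():
    """Install the hook so an uncaught exception aborts every MPI rank."""
    # Imported lazily so this module still loads where mpi4py is absent.
    from mpi4py import MPI
    sys.excepthook = make_abort_hook(MPI.COMM_WORLD.Abort)
```

Calling `install_mpi_abort_hook()` once at the top of a training script would be enough for this sketch. mpi4py also offers `mpiexec -n <N> python -m mpi4py script.py`, which aborts the job on an unhandled exception without any change to the script.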

I tested locally (so CPU only) with the related SageMaker Containers change to confirm that Horovod still works, including checking the assert statement mentioned in https://github.com/horovod/horovod#mpi4py.

edit: not sure why the CodeBuild run is still showing as a failure; I restarted it (failure was due to a tuning job name collision) and it passed: https://us-west-2.console.aws.amazon.com/codesuite/codebuild/projects/sagemaker-tensorflow-container-integ-p36-gpu/build/sagemaker-tensorflow-container-integ-p36-gpu%3A194011a1-f1c0-4135-8530-58f734484084/log

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

laurenyu merged commit f40f010 into aws:script-mode Apr 30, 2019
laurenyu deleted the mpi4py branch April 30, 2019 20:11
Elizaaaaa pushed a commit to Elizaaaaa/sagemaker-tensorflow-container that referenced this pull request Nov 4, 2019