change: allow smdistributed to be enabled with torch_distributed. #4129
cc: @rahul003
Update: Realized these tests require local
/bot run pr |
tests/unit/test_estimator.py (outdated)
}
DISTRIBUTION_SM_TORCH_DIST_AND_DDP_DISABLED = {
    "smdistributed": {"enabled": True},
    "torch_distributed": {"enabled": True}
ditto
/bot run all |
/bot run pr |
/bot run all
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
/bot run slow-tests, local-mode-tests
…use CUDA 12.1 enabled DLC containers and/or p5 instance with smdistributed enabled and torch-distributed disabled.
…imator. Add additional unsupported image tests. Clean up tests.
/bot run all
/bot run pr
Description of changes:
Update estimator to allow smdistributed to be enabled with torch_distributed for a new smdistributed release that supports using smdistributed with torch_distributed.
Disable launching of jobs with smddp on p5 instances or using the p5 image (2.0.1-gpu-py310-cu121-ubuntu20.04-sagemaker-pr-3303) without torch_distributed enabled.
Testing done:
Ran and passed tests/unit/test_estimator.py, tests/unit/test_pytorch.py and tests/unit/test_fw_utils.py locally.
General
Tests
unique_name_from_base to create resource names in integ tests (if appropriate)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
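For readers who want the rule from the description of changes in concrete form, here is a minimal sketch of the check it describes. It is not the code from this PR: `validate_distribution`, `_enabled`, and both constants are hypothetical names. The `ml.p5` prefix and `cu121` tag matching are assumptions about how a p5 instance or the CUDA 12.1 image might be detected. The nested `smdistributed` / `dataparallel` / `enabled` shape is also assumed for smddp, which differs from the flatter dict in the outdated test snippet.

```python
# Hypothetical sketch only: these names are illustrative, not SageMaker SDK APIs.
P5_INSTANCE_PREFIX = "ml.p5"  # assumed way to recognise a p5 instance type
CUDA_121_TAG = "cu121"        # assumed marker for the CUDA 12.1 image


def _enabled(distribution, *keys):
    """Return True if the nested "enabled" flag under the given keys is truthy."""
    node = distribution or {}
    for key in keys:
        node = node.get(key, {})
        if not isinstance(node, dict):
            return False
    return bool(node.get("enabled", False))


def validate_distribution(distribution, instance_type, image_uri=""):
    """Reject smddp on p5 or the cu121 image unless torch_distributed is enabled."""
    smddp = _enabled(distribution, "smdistributed", "dataparallel")
    torch_dist = _enabled(distribution, "torch_distributed")
    needs_torch_dist = (
        instance_type.startswith(P5_INSTANCE_PREFIX) or CUDA_121_TAG in image_uri
    )
    if smddp and needs_torch_dist and not torch_dist:
        raise ValueError(
            "smddp on p5 instances or the CUDA 12.1 image requires "
            "torch_distributed to be enabled"
        )
    return True


SMDDP_ONLY = {"smdistributed": {"dataparallel": {"enabled": True}}}
SMDDP_WITH_TORCH_DIST = {
    "smdistributed": {"dataparallel": {"enabled": True}},
    "torch_distributed": {"enabled": True},
}
```

Under these assumptions, `validate_distribution(SMDDP_WITH_TORCH_DIST, "ml.p5.48xlarge")` passes, while the same call with `SMDDP_ONLY` raises `ValueError`. On a non-p5 instance with no `cu121` image, `SMDDP_ONLY` is still accepted.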