Skip to content

change: allow smdistributed to be enabled with torch_distributed. #4129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 23, 2023

Conversation

viclzhu
Copy link
Collaborator

@viclzhu viclzhu commented Sep 19, 2023

Description of changes:
Update estimator to allow smdistributed to be enabled with torch_distributed for a new smdistributed release that supports using smdistributed with torch_distributed.
Disable launching of jobs with smddp on p5 instances or using the p5 image (2.0.1-gpu-py310-cu121-ubuntu20.04-sagemaker-pr-3303) without torch_distributed enabled.

Testing done:
Ran and passed tests/unit/test_estimator.py, tests/unit/test_pytorch.py and tests/unit/test_fw_utils.py locally.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@viclzhu viclzhu requested a review from a team as a code owner September 19, 2023 19:09
@viclzhu viclzhu requested review from vpbhargav and removed request for a team September 19, 2023 19:09
@viclzhu viclzhu force-pushed the update-parameter-passing branch from 784510c to 6acd3c3 Compare September 19, 2023 19:10
@viclzhu
Copy link
Collaborator Author

viclzhu commented Sep 19, 2023

cc: @rahul003

@rohithn1
Copy link
Contributor

rohithn1 commented Oct 2, 2023

Note: The following unit tests failed before my and @viclzhu's commits.
======================================================= short test summary info ========================================================
FAILED tests/unit/test_estimator.py::test_local_mode - ValueError: Failed to get the Region information from the default config. Plea...
FAILED tests/unit/test_estimator.py::test_local_mode_file_output_path - ValueError: Failed to get the Region information from the def...
FAILED tests/unit/test_estimator.py::test_insert_invalid_source_code_args - ValueError: Must setup local AWS configuration with a reg...
FAILED tests/unit/test_estimator.py::test_output_path_default_bucket_and_prefix_combinations - ValueError: Failed to get the Region i...
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[s3://output-bucket/output-prefix/output-prefix2-s3://code-bucket/code-prefix/code-prefix2-expected__without_user_input__with_default_bucket_and_default_prefix0-expected__without_user_input__with_default_bucket_only0-expected__with_user_input__with_default_bucket_and_prefix0-expected__with_user_input__with_default_bucket_only0]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[s3://output-bucket/output-prefix/output-prefix2-None-expected__without_user_input__with_default_bucket_and_default_prefix1-expected__without_user_input__with_default_bucket_only1-expected__with_user_input__with_default_bucket_and_prefix1-expected__with_user_input__with_default_bucket_only1]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[None-s3://code-bucket/code-prefix/code-prefix2-expected__without_user_input__with_default_bucket_and_default_prefix2-expected__without_user_input__with_default_bucket_only2-expected__with_user_input__with_default_bucket_and_prefix2-expected__with_user_input__with_default_bucket_only2]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[None-None-expected__without_user_input__with_default_bucket_and_default_prefix3-expected__without_user_input__with_default_bucket_only3-expected__with_user_input__with_default_bucket_and_prefix3-expected__with_user_input__with_default_bucket_only3]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[output_path4-s3://code-bucket/code-prefix/code-prefix2-expected__without_user_input__with_default_bucket_and_default_prefix4-expected__without_user_input__with_default_bucket_only4-expected__with_user_input__with_default_bucket_and_prefix4-expected__with_user_input__with_default_bucket_only4]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[output_path5-None-expected__without_user_input__with_default_bucket_and_default_prefix5-expected__without_user_input__with_default_bucket_only5-expected__with_user_input__with_default_bucket_and_prefix5-expected__with_user_input__with_default_bucket_only5]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[file://output-bucket/output-prefix/output-prefix2-s3://code-bucket/code-prefix/code-prefix2-expected__without_user_input__with_default_bucket_and_default_prefix6-expected__without_user_input__with_default_bucket_only6-expected__with_user_input__with_default_bucket_and_prefix6-expected__with_user_input__with_default_bucket_only6]
FAILED tests/unit/test_estimator.py::test_stage_user_code_in_s3_default_bucket_and_prefix_combinations[file://output-bucket/output-prefix/output-prefix2-None-expected__without_user_input__with_default_bucket_and_default_prefix7-expected__without_user_input__with_default_bucket_only7-expected__with_user_input__with_default_bucket_and_prefix7-expected__with_user_input__with_default_bucket_only7]

Update:

Realized these tests require local .aws configuration and confirmed they pass with local configuration setup.

@viclzhu viclzhu marked this pull request as draft October 4, 2023 17:55
@viclzhu viclzhu force-pushed the update-parameter-passing branch from 4582897 to 7332b6f Compare October 4, 2023 21:34
@viclzhu viclzhu marked this pull request as ready for review October 5, 2023 00:54
@viclzhu viclzhu force-pushed the update-parameter-passing branch 3 times, most recently from 3b081a5 to c936d1c Compare October 5, 2023 07:21
@viclzhu viclzhu force-pushed the update-parameter-passing branch 2 times, most recently from 9098d5a to 8e2f08e Compare October 5, 2023 22:13
@vpbhargav
Copy link
Contributor

/bot run pr

}
DISTRIBUTION_SM_TORCH_DIST_AND_DDP_DISABLED = {
"smdistributed": {"enabled": True},
"torch_distributed": {"enabled": True}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@goelakash goelakash force-pushed the update-parameter-passing branch from ed3c5a0 to 7d67d89 Compare October 9, 2023 21:55
@goelakash
Copy link
Contributor

/bot run all

@goelakash
Copy link
Contributor

/bot run pr

1 similar comment
@goelakash
Copy link
Contributor

/bot run pr

Copy link
Contributor

@knikure knikure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/bot run all

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: c213df5
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@goelakash goelakash force-pushed the update-parameter-passing branch from c213df5 to 861e2ba Compare October 9, 2023 23:49
@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 861e2ba
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 861e2ba
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: db1a1cb
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: db1a1cb
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@knikure knikure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/bot run slow-tests, local-mode-tests

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: db1a1cb
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: db1a1cb
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 594b028
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 594b028
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 594b028
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 594b028
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@knikure knikure self-assigned this Oct 18, 2023
@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 594b028
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@bhupendrasingh bhupendrasingh force-pushed the update-parameter-passing branch from 594b028 to 67e8813 Compare October 23, 2023 16:58
Copy link
Contributor

@bhupendrasingh bhupendrasingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/bot run all

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 67e8813
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 67e8813
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 67e8813
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 67e8813
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 67e8813
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@knikure knikure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/bot run pr

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 67e8813
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@knikure knikure merged commit 4e6f2e0 into aws:master Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.