-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Added torchrun compatibility for distributet training across multiple GPUs in a single node (single instance) #4766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Any update @mohanasudhan? |
@mufaddal-rohawala can you help with the review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding docstrngs is a blocker.
@brunopistone Please review the test failures, thanks |
@brunopistone |
src/sagemaker/remote_function/job.py
Outdated
@@ -951,7 +1001,12 @@ def _get_job_name(job_settings, func): | |||
|
|||
|
|||
def _prepare_and_upload_runtime_scripts( | |||
spark_config: SparkConfig, s3_base_uri: str, s3_kms_key: str, sagemaker_session: Session | |||
spark_config: SparkConfig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much whitespace, looks like two tab instead of 1 maybe
@@ -818,3 +818,25 @@ def test_decorator_auto_capture(sagemaker_session, auto_capture_test_container): | |||
f"--rm {auto_capture_test_container}" | |||
) | |||
subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode("utf-8") | |||
|
|||
def test_decorator_torchrun( | |||
sagemaker_session, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, needs 1 tab instead of 2.
@brunopistone |
@@ -58,7 +58,6 @@ | |||
|
|||
logger = logging_config.get_logger() | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a new line here
@@ -818,3 +818,25 @@ def test_decorator_auto_capture(sagemaker_session, auto_capture_test_container): | |||
f"--rm {auto_capture_test_container}" | |||
) | |||
subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode("utf-8") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and a new line here
src/sagemaker/remote_function/job.py
Outdated
fi | ||
|
||
printf "INFO: Invoking remote function with torchrun inside conda environment: $conda_env.\\n" | ||
$conda_exe run -n $conda_env torchrun --nproc_per_node $NPROC_PER_NODE -m sagemaker.remote_function.invoke_function "$@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line too long
Issue #, if available:
Description of changes:
Added possibility to execute remote function with torchrun command, for parallelizing training across multiple GPUs in a single node (single instance).
The functionality can be enabled in the @Remote function as following:
Testing done:
Integration tests for evaluating back compatibility with original way of working, added integration test for checking compatibility with new functionality added
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
unique_name_from_base
to create resource names in integ tests (if appropriate)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.