
Added torchrun compatibility for distributed training across multiple GPUs in a single node (single instance) #4766

Merged
merged 22 commits into aws:master on Aug 9, 2024

Conversation

brunopistone
Collaborator

Issue #, if available:

Description of changes:

Added the ability to execute a remote function with the torchrun command, parallelizing training across multiple GPUs in a single node (single instance).
The functionality can be enabled on the @remote decorator as follows:

@remote(use_torchrun=True, nproc_per_node=2)
def train(....):
    pass

model = train(....)
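
For context, a minimal runnable sketch of the new usage. use_torchrun and nproc_per_node are the parameters added by this PR; the instance type and the DDP training body are illustrative assumptions, not part of the change:

from sagemaker.remote_function import remote


# use_torchrun and nproc_per_node come from this PR; instance_type is an
# illustrative assumption (any single multi-GPU instance would do).
@remote(use_torchrun=True, nproc_per_node=2, instance_type="ml.g5.12xlarge")
def train(epochs=1):
    import os

    import torch
    import torch.distributed as dist

    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE for each worker it spawns.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()


train(epochs=1)

Calling train() launches one training job on a single instance, and torchrun fans the function body out across the two GPUs.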

Testing done:

Ran integration tests to verify backward compatibility with the existing behavior, and added an integration test covering the new torchrun functionality; a sketch follows.
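
A sketch of what such an integration test might look like; the instance type, function body, and assertion are assumptions for illustration, not the actual test code:

from sagemaker.remote_function import remote


def test_decorator_torchrun(sagemaker_session):
    @remote(
        sagemaker_session=sagemaker_session,
        use_torchrun=True,
        nproc_per_node=2,
        instance_type="ml.g5.12xlarge",
    )
    def assert_two_gpus():
        import torch

        # Each torchrun worker should see both GPUs of the single node.
        assert torch.cuda.device_count() == 2

    # The test passes if the remote torchrun job completes without raising.
    assert_two_gpus()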

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@brunopistone requested a review from a team as a code owner July 2, 2024 18:01
@brunopistone requested a review from mohanasudhan July 2, 2024 18:01
@brunopistone
Collaborator Author

Any update @mohanasudhan?

@mohanasudhan
Contributor

@mufaddal-rohawala can you help with the review?

Member

@mufaddal-rohawala left a comment

Adding docstrings is a blocker.

sage-maker previously approved these changes Aug 7, 2024
@sage-maker
Collaborator

@brunopistone Please review the test failures, thanks

@sage-maker
Collaborator

@brunopistone
Looks like codestyle is still failing

@@ -951,7 +1001,12 @@ def _get_job_name(job_settings, func):


 def _prepare_and_upload_runtime_scripts(
-    spark_config: SparkConfig, s3_base_uri: str, s3_kms_key: str, sagemaker_session: Session
+    spark_config: SparkConfig,
Collaborator

Too much whitespace; looks like two tabs instead of one, maybe.
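
For reference, a sketch of the continuation style the reviewer is asking for: one four-space indent per parameter line, as black formats it (imports added here only to keep the sketch self-contained; any parameters this PR appends to the signature are omitted):

from sagemaker.remote_function.spark_config import SparkConfig
from sagemaker.session import Session


def _prepare_and_upload_runtime_scripts(
    spark_config: SparkConfig,
    s3_base_uri: str,
    s3_kms_key: str,
    sagemaker_session: Session,
):
    ...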

@@ -818,3 +818,25 @@ def test_decorator_auto_capture(sagemaker_session, auto_capture_test_container):
f"--rm {auto_capture_test_container}"
)
subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode("utf-8")

def test_decorator_torchrun(
sagemaker_session,
Collaborator

Same here, needs 1 tab instead of 2.

@sage-maker
Collaborator

@brunopistone
Your codestyle checks still failed; refer to client.py line 60 and test_decorator.py line 822. Please make sure you are running the linters locally, thanks.

@@ -58,7 +58,6 @@

logger = logging_config.get_logger()

Collaborator

Needs a new line here

@@ -818,3 +818,25 @@ def test_decorator_auto_capture(sagemaker_session, auto_capture_test_container):
f"--rm {auto_capture_test_container}"
)
subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode("utf-8")

Collaborator

and a new line here

fi

printf "INFO: Invoking remote function with torchrun inside conda environment: $conda_env.\\n"
$conda_exe run -n $conda_env torchrun --nproc_per_node $NPROC_PER_NODE -m sagemaker.remote_function.invoke_function "$@"
Collaborator

Line too long.

@sage-maker merged commit cbd2ed9 into aws:master Aug 9, 2024
14 checks passed