feature: EMR step runtime role support #3703

thbrooks22 · 2023-03-08T17:44:25Z

Issue #, if available: N/A

Description of changes: Allow runtime_role parameter to EMRStep step in SageMaker Pipelines. Note: I'd like to get this reviewed now, but we cannot merge until the backend changes are available in all new regions. I will keep this PR updated with that info, and engage SDK team as needed with updates.

Testing done: Two new unit tests, and one modification to an existing test.

[New]: Check that the pipeline JSON contains ExecutionRoleArn fields where appropriate if it's included in the EMRStep arguments.
[New]: Check that SDK throws exception if cluster_config and execution_role_arn are both not None.
[Existing]: Check that if ExecutionRoleArn is not None, then it's not serialized to pipelines JSON.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
I used the commit message format described in CONTRIBUTING
I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
I have checked that my tests are not configured for a specific region or account (if appropriate)
I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sagemaker-bot · 2023-03-08T17:54:37Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: acfafd8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T18:40:02Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: acfafd8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T18:41:00Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: acfafd8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T18:58:14Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: acfafd8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

knikure · 2023-03-08T19:44:14Z

@thbrooks22 Can you fix the flake8 and black check issues in your code?
Run tox -e black-format and then ./.githooks/pre-push to run all the checks.

qidewenwhen · 2023-03-08T19:57:26Z

src/sagemaker/workflow/emr_step.py

@@ -155,6 +163,7 @@ def __init__(
        depends_on: Optional[List[Union[str, Step, StepCollection]]] = None,
        cache_config: CacheConfig = None,
        cluster_config: Dict[str, Any] = None,
+        execution_role_arn: str = None,


This class init method is public facing, which means once we release this change, it's not able to update the parameters here until next big version bump up, e.g. v2 -> v3.
So, have we considered why this is a good place to add this execution_role_arn arg? Or in other words, why execution_role_arn is not appropriate to be added in step_config.

Also let's add @aoguo64 to review as well.

step_config is mapped to StepConfig in EMR's AddJobFlowSteps, it's not a single-step configuration but applies to all steps to run in a single AddJobFlowSteps (though Pipelines does only run one job step per EMR step).

@qidewenwhen are you suggesting using kwargs intead?

I see, then step_config is not a good place to go, in terms of execution_role_arn.

I'm just hoping to raise discussion and caution on this public facing interface change as it's not revertible.
And I do have a concern that if EMR side keeps adding new args, our EMRStep interface will keep expanding and eventually it may be hard to use, see example of the obsoleted RegisterModel interface which contains ~30 args: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/workflow/step_collections.py#L57-L94

So **kwargs seems to be able to resolve my concern above but usually it may not be a good practice to include **kwargs.
Thus I'd suggest to have thorough consideration on this step interface and have more eyes on it.

Synced up offline: there's no better approach to avoid the call out issue. We will introduce a step_arg parameter to group all EMRStep argument related parameters in the future when the step interface grows too big.

qidewenwhen · 2023-03-08T20:22:47Z

src/sagemaker/workflow/emr_step.py

@@ -198,9 +208,16 @@ def __init__(
        if cluster_id is not None and cluster_config is not None:
            raise ValueError(ERR_STR_WITH_BOTH_CLUSTER_ID_AND_CLUSTER_CFG.format(step_name=name))

+        if execution_role_arn is not None and cluster_config is not None:


Seems execution_role_arn has to come along with cluster_id and the former is used to allow accessing the cluster_id right?

If it's true, then can we make the logic here more intuitive? e.g. check that execution_role_arn can be not None only if cluster_id is not None.

And it'd be good to improve the error message to make it more informative, e.g. "execution_role_arn should not be set if ... since it's used to ...", so that it can educate the users why the proper usage looks like this.

I think the two ways are equally intuitive here. There are two ways for the customer to make an EMR step: by specifying ID of existing cluster to add steps to, or by specifying config of new cluster to be created with steps. EMR runtime roles are only supported on the former request (where AddJobFlowSteps API is used) and not on the latter (where RunJobFlow API is used).

How much info should the error message include? Should it contain all the above background?

I feel it'd be better to include the following in the error message:

The context - what EMR runtime role is used for, i.e. to add step with the existing cluster.

at least we should include this in the sdk doc string as we don't expect customers to carefully explore all related docs, e.g. API docs

What's wrong - i.e. EMR runtime role should not be set if cluster_config is not None.

Given this context, that's why I feel it'd be better to group execution_role_arn validation with cluster_id. But as long as we put enough info in the doc string or error message to educate users and future developers, either way is fine.

sagemaker-bot · 2023-03-08T20:51:39Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: acfafd8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

qidewenwhen · 2023-03-08T21:49:53Z

As the PR is public facing and is retrievable after merge, let's avoid adding any internal link here:

Issue #, if available: Customer issue tracked here.

sagemaker-bot · 2023-03-08T22:06:46Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: acfafd8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T22:45:31Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: acfafd8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T22:52:45Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: acfafd8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-08T22:58:19Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: acfafd8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-09T00:55:00Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: acfafd8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-09T21:02:23Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: ea28e10
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-09T21:33:44Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: ea28e10
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-09T21:45:31Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: ea28e10
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-09T21:47:29Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: ea28e10
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

codecov-commenter · 2023-03-09T21:47:55Z

Codecov Report

Merging #3703 (a0f11f8) into master (4fb7c01) will decrease coverage by 0.78%.
The diff coverage is 100.00%.

❗ Current head a0f11f8 differs from pull request most recent head 1e0790a. Consider uploading reports for the commit 1e0790a to get more accurate results

@@            Coverage Diff             @@
##           master    #3703      +/-   ##
==========================================
- Coverage   89.84%   89.06%   -0.78%     
==========================================
  Files         984      231     -753     
  Lines       92360    22835   -69525     
==========================================
- Hits        82984    20339   -62645     
+ Misses       9376     2496    -6880

Impacted Files	Coverage Δ
src/sagemaker/workflow/emr_step.py	`100.00% <100.00%> (ø)`

... and 1214 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

sagemaker-bot · 2023-03-09T21:52:26Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: ea28e10
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

nmadan · 2023-03-09T22:10:47Z

src/sagemaker/workflow/emr_step.py

@@ -198,9 +208,16 @@ def __init__(
        if cluster_id is not None and cluster_config is not None:
            raise ValueError(ERR_STR_WITH_BOTH_CLUSTER_ID_AND_CLUSTER_CFG.format(step_name=name))

+        if execution_role_arn is not None and cluster_config is not None:
+            raise ValueError(ERR_STR_WITH_BOTH_EXEC_ROLE_ARN_AND_CLUSTER_CFG.format(step_name=name))


Is this something we need to validate? This seems like a requirement from EMR side right? Should we start the EMR job and let it do this validation instead?

We do this validation in the DSL. I used the same pattern that we are enforcing for cluster_id and cluster_config: for these, we are checking both in the SDK and in the DSL that they are not both set.

Out of curious, do you know is there any reason we need to do this checking in both places?
If the DSL validation already covers these and they happen during creating pipeline time, we don't need to repeat the same validation in SDK.
However, I'm fine to follow the same pattern that we already have for cluster_id and cluster_config in this PR.

sagemaker-bot · 2023-03-14T17:32:22Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: a0f11f8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-14T17:44:33Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: a0f11f8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-14T17:48:35Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: a0f11f8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-14T17:52:33Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: a0f11f8
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-14T19:49:38Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: a0f11f8
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-22T21:10:21Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 1e0790a
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-22T21:15:44Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-22T21:19:07Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-22T21:22:17Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-22T21:22:53Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-23T16:31:35Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2023-03-23T17:08:08Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 1e0790a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Co-authored-by: Thomas Brooks <[email protected]>

thbrooks22 requested a review from a team as a code owner March 8, 2023 17:44

thbrooks22 requested review from knikure and removed request for a team March 8, 2023 17:44

knikure self-assigned this Mar 8, 2023

qidewenwhen reviewed Mar 8, 2023

View reviewed changes

thbrooks22 force-pushed the emr_step_runtime_role_support branch from acfafd8 to ea28e10 Compare March 9, 2023 20:56

nmadan reviewed Mar 9, 2023

View reviewed changes

knikure added the do-not-merge Use when PR needs to be marked as do not merge label Mar 14, 2023

thbrooks22 force-pushed the emr_step_runtime_role_support branch 2 times, most recently from 2caa265 to a0f11f8 Compare March 14, 2023 16:55

aoguo64 approved these changes Mar 15, 2023

View reviewed changes

qidewenwhen approved these changes Mar 20, 2023

View reviewed changes

feature: support runtime role on pipelines EMR step

1e0790a

thbrooks22 force-pushed the emr_step_runtime_role_support branch from a0f11f8 to 1e0790a Compare March 22, 2023 20:27

qidewenwhen approved these changes Mar 22, 2023

View reviewed changes

thbrooks22 removed the do-not-merge Use when PR needs to be marked as do not merge label Mar 23, 2023

knikure approved these changes Mar 23, 2023

View reviewed changes

knikure merged commit 04e3f60 into aws:master Mar 23, 2023

doddaspk-amzn pushed a commit to doddaspk-amzn/sagemaker-python-sdk that referenced this pull request Apr 6, 2023

feature: EMR step runtime role support (aws#3703)

b5cfa86

Co-authored-by: Thomas Brooks <[email protected]>

nmadan pushed a commit to nmadan/sagemaker-python-sdk that referenced this pull request Apr 18, 2023

feature: EMR step runtime role support (aws#3703)

68a12a7

Co-authored-by: Thomas Brooks <[email protected]>

feature: EMR step runtime role support #3703

feature: EMR step runtime role support #3703

Uh oh!

Conversation

thbrooks22 commented Mar 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Checklist

General

Tests

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

knikure commented Mar 8, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qidewenwhen Mar 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

qidewenwhen commented Mar 8, 2023

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 8, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 9, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 9, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 9, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 9, 2023

AWS CodeBuild CI Report

Uh oh!

sagemaker-bot commented Mar 9, 2023

AWS CodeBuild CI Report

Uh oh!

codecov-commenter commented Mar 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

thbrooks22 commented Mar 8, 2023 •

edited

Loading

qidewenwhen Mar 8, 2023 •

edited

Loading

codecov-commenter commented Mar 9, 2023 •

edited

Loading