-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feature: EMR step runtime role support #3703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
@thbrooks22 Can you fix the flake8 and black check issues in your code? |
@@ -155,6 +163,7 @@ def __init__( | |||
depends_on: Optional[List[Union[str, Step, StepCollection]]] = None, | |||
cache_config: CacheConfig = None, | |||
cluster_config: Dict[str, Any] = None, | |||
execution_role_arn: str = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class init method is public facing, which means once we release this change, it's not able to update the parameters here until next big version bump up, e.g. v2 -> v3.
So, have we considered why this is a good place to add this execution_role_arn
arg? Or in other words, why execution_role_arn
is not appropriate to be added in step_config
.
Also let's add @aoguo64 to review as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
step_config
is mapped to StepConfig in EMR's AddJobFlowSteps, it's not a single-step configuration but applies to all steps to run in a single AddJobFlowSteps (though Pipelines does only run one job step per EMR step).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qidewenwhen are you suggesting using kwargs intead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, then step_config
is not a good place to go, in terms of execution_role_arn
.
I'm just hoping to raise discussion and caution on this public facing interface change as it's not revertible.
And I do have a concern that if EMR side keeps adding new args, our EMRStep interface will keep expanding and eventually it may be hard to use, see example of the obsoleted RegisterModel
interface which contains ~30 args: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/workflow/step_collections.py#L57-L94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So **kwargs seems to be able to resolve my concern above but usually it may not be a good practice to include **kwargs.
Thus I'd suggest to have thorough consideration on this step interface and have more eyes on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced up offline: there's no better approach to avoid the call out issue. We will introduce a step_arg
parameter to group all EMRStep argument related parameters in the future when the step interface grows too big.
src/sagemaker/workflow/emr_step.py
Outdated
@@ -198,9 +208,16 @@ def __init__( | |||
if cluster_id is not None and cluster_config is not None: | |||
raise ValueError(ERR_STR_WITH_BOTH_CLUSTER_ID_AND_CLUSTER_CFG.format(step_name=name)) | |||
|
|||
if execution_role_arn is not None and cluster_config is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems execution_role_arn
has to come along with cluster_id
and the former is used to allow accessing the cluster_id
right?
If it's true, then can we make the logic here more intuitive? e.g. check that execution_role_arn can be not None only if cluster_id is not None.
And it'd be good to improve the error message to make it more informative, e.g. "execution_role_arn should not be set if ... since it's used to ...", so that it can educate the users why the proper usage looks like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the two ways are equally intuitive here. There are two ways for the customer to make an EMR step: by specifying ID of existing cluster to add steps to, or by specifying config of new cluster to be created with steps. EMR runtime roles are only supported on the former request (where AddJobFlowSteps
API is used) and not on the latter (where RunJobFlow
API is used).
How much info should the error message include? Should it contain all the above background?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it'd be better to include the following in the error message:
- The context - what EMR runtime role is used for, i.e. to add step with the existing cluster.
- at least we should include this in the sdk doc string as we don't expect customers to carefully explore all related docs, e.g. API docs
- What's wrong - i.e. EMR runtime role should not be set if
cluster_config
is not None.
Given this context, that's why I feel it'd be better to group execution_role_arn
validation with cluster_id
. But as long as we put enough info in the doc string or error message to educate users and future developers, either way is fine.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
As the PR is public facing and is retrievable after merge, let's avoid adding any internal link here:
|
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
acfafd8
to
ea28e10
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Codecov Report
@@ Coverage Diff @@
## master #3703 +/- ##
==========================================
- Coverage 89.84% 89.06% -0.78%
==========================================
Files 984 231 -753
Lines 92360 22835 -69525
==========================================
- Hits 82984 20339 -62645
+ Misses 9376 2496 -6880
... and 1214 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
src/sagemaker/workflow/emr_step.py
Outdated
@@ -198,9 +208,16 @@ def __init__( | |||
if cluster_id is not None and cluster_config is not None: | |||
raise ValueError(ERR_STR_WITH_BOTH_CLUSTER_ID_AND_CLUSTER_CFG.format(step_name=name)) | |||
|
|||
if execution_role_arn is not None and cluster_config is not None: | |||
raise ValueError(ERR_STR_WITH_BOTH_EXEC_ROLE_ARN_AND_CLUSTER_CFG.format(step_name=name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something we need to validate? This seems like a requirement from EMR side right? Should we start the EMR job and let it do this validation instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do this validation in the DSL. I used the same pattern that we are enforcing for cluster_id
and cluster_config
: for these, we are checking both in the SDK and in the DSL that they are not both set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curious, do you know is there any reason we need to do this checking in both places?
If the DSL validation already covers these and they happen during creating pipeline time, we don't need to repeat the same validation in SDK.
However, I'm fine to follow the same pattern that we already have for cluster_id and cluster_config in this PR.
2caa265
to
a0f11f8
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
a0f11f8
to
1e0790a
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Co-authored-by: Thomas Brooks <[email protected]>
Co-authored-by: Thomas Brooks <[email protected]>
Issue #, if available: N/A
Description of changes: Allow
runtime_role
parameter toEMRStep
step in SageMaker Pipelines. Note: I'd like to get this reviewed now, but we cannot merge until the backend changes are available in all new regions. I will keep this PR updated with that info, and engage SDK team as needed with updates.Testing done: Two new unit tests, and one modification to an existing test.
ExecutionRoleArn
fields where appropriate if it's included in theEMRStep
arguments.cluster_config
andexecution_role_arn
are both not None.ExecutionRoleArn
is not None, then it's not serialized to pipelines JSON.Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
unique_name_from_base
to create resource names in integ tests (if appropriate)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.