feature: Add data ingestion only data-wrangler flow recipe generation helper function #2336
Conversation
src/sagemaker/workflow/utilities.py
Outdated
"outputs": [ | ||
{ | ||
"name": "default", | ||
"sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}, |
This field is optional and only used for interactive mode within Studio. We can remove it.
So only "outputs": [{"name": "default"}]?
src/sagemaker/workflow/utilities.py
Outdated
if s3_uri is not None:
    operator = "sagemaker.s3_source_0.1"
    input_definition = {
        "__typename": "S3CreateDatasetDefinitionOutput",
We don't need those __typename fields; it's probably a side effect from the frontend.
gotcha, will remove
src/sagemaker/workflow/utilities.py
Outdated
operator = None

if s3_uri is not None:
    operator = "sagemaker.s3_source_0.1"
I would move the 0.1 part to an optional operator version parameter, since those operators could evolve in the future.
You mean we should make the operator version an optional parameter for this helper function?
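For illustration, a minimal sketch of that idea, assuming a keyword parameter (the name operator_version and its default are placeholders, not part of this PR):

```python
def _versioned_operator(base: str, operator_version: str = "0.1") -> str:
    """Build a versioned operator name, e.g. 'sagemaker.s3_source_0.1'."""
    return f"sagemaker.{base}_{operator_version}"

# Callers get today's default but can pin a newer operator revision later.
assert _versioned_operator("s3_source") == "sagemaker.s3_source_0.1"
assert _versioned_operator("s3_source", operator_version="0.2") == "sagemaker.s3_source_0.2"
```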
src/sagemaker/workflow/utilities.py
Outdated
"""Generate the data ingestion only flow recipe | ||
|
||
Args: | ||
input_name (str): s3 input to recipe source node |
This is not only for S3, right? Also, could you change recipe to flow across this PR?
Yep, the input name is for all source inputs.
Ack.
src/sagemaker/workflow/utilities.py
Outdated
@@ -37,3 +42,105 @@ def list_to_request(entities: Sequence[Union[Entity, StepCollection]]) -> List[R
    elif isinstance(entity, StepCollection):
        request_dicts.extend(entity.request_dicts())
    return request_dicts


def generate_data_ingestion_flow_recipe(
I wonder if we should break it down into separate functions for each input source. Combining them into the same interface seems a bit messy, and users might be confused into thinking they can provide multiple data sources with the current implementation.
Also, given this doesn't use any workflow SDK constructs, I'd suggest we move it to the Data Wrangler module, perhaps sagemaker.wrangler.ingestion.
Good point, sagemaker.wrangler.ingestion looks like a good idea.
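As a sketch of what that split interface could look like (function names, parameters, and the Athena operator string are assumptions based on this thread, not the final API):

```python
# Hypothetical sagemaker.wrangler.ingestion: one helper per input source,
# so a caller can never mix multiple sources in a single call.
from typing import Any, Dict


def _base_flow(operator: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Shared skeleton: a single source node with one default output."""
    return {
        "metadata": {"version": 1},
        "nodes": [
            {
                "type": "SOURCE",
                "operator": operator,
                "parameters": parameters,
                "inputs": [],
                "outputs": [{"name": "default"}],
            }
        ],
    }


def generate_data_ingestion_flow_from_s3_input(
    input_name: str,
    s3_uri: str,
    s3_content_type: str = "csv",
    s3_has_header: bool = False,
) -> Dict[str, Any]:
    """Flow whose single source node reads a dataset from S3."""
    dataset = {
        "name": input_name,
        "s3ExecutionContext": {
            "s3Uri": s3_uri,
            "s3ContentType": s3_content_type,
            "s3HasHeader": s3_has_header,
        },
    }
    return _base_flow("sagemaker.s3_source_0.1", {"dataset_definition": dataset})


def generate_data_ingestion_flow_from_athena_dataset_definition(
    input_name: str, athena_dataset_definition: Dict[str, Any]
) -> Dict[str, Any]:
    """Flow whose single source node reads an Athena query result."""
    dataset = {"name": input_name, **athena_dataset_definition}
    return _base_flow("sagemaker.athena_source_0.1", {"dataset_definition": dataset})
```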
src/sagemaker/workflow/utilities.py
Outdated
"outputs": [{"name": "default"}], | ||
} | ||
|
||
recipe["nodes"].append(type_infer_and_cast_node) |
This changes the data types automatically, so I probably need more context on how you want to leverage the type information (presumably with an existing feature group). Is the expectation that after the type infer & cast node the result will always match the feature group type? What would the user do if it doesn't match?
> Is the expectation that after the type infer & cast node the result will always match the feature group type?

Yes, my thought is that this helper function only helps FG customers generate a bare-minimum data ingestion flow; customers have to ensure the 1:1 mapping themselves. Supporting fancier transformations, including more complicated type casting (basically codifying what the user can do in the visual interface), would take a lot more effort.
The goal of this helper function is only to help FG customers easily ingest already-preprocessed data via a pipeline execution, and I would assume the preprocessed data already matches the FG's expected typing.

> What would the user do if it doesn't match?

The processing job will fail if it does not match.
Yeah, but if the job fails, I don't think the user can do much in the way of corrective action as this stands right now. Let's figure out whether we can suggest something from the UI or fix it in code.
So the type_infer_and_cast operation might lead to an incorrect match? If so, I would think introducing the casting schema for trained_parameters as an optional parameter sounds like a good idea.
Not specifically because of this infer-type operation, but the new data does not necessarily align with the existing feature group. How does the user know what/how to cast? I'm not too comfortable exposing the underlying Wrangler data types in the PySDK, though, since they're only used internally so far. Ideally I think this reconciliation should be driven from the UI.
Another gap is that the user needs to provide types for all the columns right now in order to use the trained_parameters feature, which is not quite possible unless they open the flow from the UI. The full schema information is not stored anywhere.
> I'm not too comfortable exposing the underlying Wrangler data types in the PySDK, though, since they're only used internally so far.

I think the exported notebook (export to feature store) exposes the data types (schema) already?

> Another gap is that the user needs to provide types for all the columns right now in order to use the trained_parameters feature

Users can describe the feature group definition, convert the types accordingly themselves, and supply the result as the optional trained_parameters schema. Otherwise, we could provide such a converter too.
> I think the exported notebook (export to feature store) exposes the data types (schema) already?

Ah right, but it wasn't too obvious.

> Users can describe the feature group definition, convert the types accordingly themselves, and supply the result as the optional trained_parameters schema. Otherwise, we could provide such a converter too.

The feature group definition could be just a subset of all the dataframe columns.
> The feature group definition could be just a subset of all the dataframe columns.

But in our case, this helper only takes care of the 1:1 mapping, right? Otherwise, I assume there must be a transformation operation involved. If we choose to codify everything the user can do via the visual interface, I guess it's a whole different story.
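If the optional-schema route were taken, a converter from feature group definitions to a casting schema could be as small as the sketch below (the Wrangler-side type names are assumptions; only the FeatureDefinitions shape comes from the Feature Store API):

```python
from typing import Dict, List

# Assumed mapping from Feature Store feature types to Wrangler cast types.
_FG_TO_WRANGLER_TYPE = {"String": "string", "Integral": "long", "Fractional": "float"}


def schema_from_feature_definitions(
    feature_definitions: List[Dict[str, str]]
) -> Dict[str, str]:
    """Map [{'FeatureName': ..., 'FeatureType': ...}, ...] to {column: cast_type}."""
    return {
        fd["FeatureName"]: _FG_TO_WRANGLER_TYPE[fd["FeatureType"]]
        for fd in feature_definitions
    }


# Example: describe_feature_group()['FeatureDefinitions'] would feed this directly.
print(schema_from_feature_definitions([{"FeatureName": "price", "FeatureType": "Fractional"}]))
# -> {'price': 'float'}
```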
…rs to sagemaker.wrangler.ingestion
Minor comment: I wonder how you would run the integration & canary tests?
Otherwise LGTM
@@ -24,6 +24,7 @@
    Mock,
    PropertyMock,
)
This file seems like nothing's changed?
A new line was added here.
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Data wrangler helpers for data ingestion."""
nit: Data Wrangler
Issue #, if available:
Description of changes:
Add data ingestion only data-wrangler flow recipe generation helper function
Testing done:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General
Tests
Used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.