Feat: Support logging of MLFlow metrics when network_isolation mode is enabled #4880
Conversation
query = {
    "MetricName": metric["Name"],
    "XAxisType": "Timestamp",
    "MetricStat": "Avg",
Does the metric stat matter here?
It matters in the same way that the Period field (1 min, 5 min, 1 hr) does. MetricStat can be Avg, Min, Max, Count, Sum, etc. Forwarding every combination is possible but would need a different metric key for each combination. We could have defaults and introduce parameters that let the user call the method again with their desired combinations; however, that would run into the collision case if they use the same run, unless we add a prefix/suffix to the key names.
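For illustration, a minimal sketch of how multiple stat/period combinations could be forwarded with suffixed MLflow keys to avoid collisions. This is not what the PR currently does; the Period/ResourceArn field names, the period value format, and the defaults are assumptions to be checked against the BatchGetMetrics docs.

```python
def build_metric_queries(resource_arn, metric_names,
                         stats=("Avg",), periods=("OneMinute",)):
    """Return (query, mlflow_key) pairs, one per (metric, stat, period) combination."""
    pairs = []
    for name in metric_names:
        for stat in stats:
            for period in periods:
                query = {
                    "MetricName": name,
                    "XAxisType": "Timestamp",
                    "MetricStat": stat,
                    "Period": period,             # assumed field name and value format
                    "ResourceArn": resource_arn,  # assumed field name
                }
                # Suffix the MLflow key so different combinations never collide.
                pairs.append((query, f"{name}_{stat}_{period}"))
    return pairs
```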
boto3.exceptions.Boto3Error: If there's an issue with the AWS API call.
"""
sagemaker_metrics_client = boto3.client("sagemaker-metrics")
metric_data = sagemaker_metrics_client.batch_get_metrics(MetricQueries=metric_queries)
I am assuming no pagination is needed here?
There is no pagination support in batch_get_metrics, as far as I can see.
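If the number of queries could ever grow large, a simple workaround is to chunk the request list manually. A sketch only: the chunk size and the MetricQueryResults response key are assumptions to verify against the service documentation.

```python
import boto3


def batch_get_metrics_chunked(metric_queries, chunk_size=100):
    """Call batch_get_metrics in fixed-size chunks, since there is no paginator."""
    client = boto3.client("sagemaker-metrics")
    results = []
    for start in range(0, len(metric_queries), chunk_size):
        chunk = metric_queries[start:start + chunk_size]
        response = client.batch_get_metrics(MetricQueries=chunk)
        results.extend(response.get("MetricQueryResults", []))
    return results
```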
Metric(key=encoded_name, value=value, timestamp=timestamp, step=step)
for step, (timestamp, value) in enumerate(
    zip(result["XAxisValues"], result["MetricValues"])
)
It would probably be easier to read with the for loop as the outer block.
Changed
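For reference, a rough sketch of what the loop-based version might look like, reusing the result, encoded_name, and MLflow Metric names from the snippet above; the metrics accumulator list is hypothetical.

```python
# Sketch only: variable names come from the comprehension shown above.
metrics = []
for step, (timestamp, value) in enumerate(
    zip(result["XAxisValues"], result["MetricValues"])
):
    metrics.append(
        Metric(key=encoded_name, value=value, timestamp=timestamp, step=step)
    )
```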
"metric_name_mapping": str(metric_name_mapping), | ||
"param_name_mapping": str(param_name_mapping), |
Is this part of the requirement? Will the length of metric name mapping or param name mapping exceed the limit for tags?
No, it's not part of the requirement. There is a 5000-character limit for the tag value, so yes, it would exceed the limit once we have over ~250 pairs (assuming the key length is 250), unless the mapping is broken up into separate tags. The goal was to give the user a source of truth for determining the original metric names. I can remove this and rethink it if needed. In my testing, a value of length 1 million did not raise errors, however the tag value itself was truncated to 5000 characters in the UI and when getting the tag.
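One way to stay under the limit would be to split the serialized mapping across several tags. A rough sketch, where the chunked tag names (e.g. metric_name_mapping.0) are hypothetical:

```python
import json

MAX_TAG_VALUE_LEN = 5000  # tag value limit discussed above


def set_mapping_tags(client, run_id, key_prefix, mapping):
    """Split a long mapping across several MLflow tags so no single value is truncated."""
    serialized = json.dumps(mapping)
    chunks = [
        serialized[i:i + MAX_TAG_VALUE_LEN]
        for i in range(0, len(serialized), MAX_TAG_VALUE_LEN)
    ]
    for index, chunk in enumerate(chunks):
        # client is an mlflow.tracking.MlflowClient instance
        client.set_tag(run_id, f"{key_prefix}.{index}", chunk)
```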
""" | ||
client = MlflowClient() | ||
|
||
experiment_name = os.getenv("MLFLOW_EXPERIMENT_NAME") |
See if we can pull the experiment and run name from the ExperimentConfig in the Training Job definition: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html#sagemaker-DescribeTrainingJob-response-ExperimentConfig
This was discussed in our review sessions; we decided not to use the ExperimentConfig for such fields at this point, relying on the environment variables instead.
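For context, the alternative discussed above (not what the PR does) would look roughly like this, reading the optional ExperimentConfig from DescribeTrainingJob; whether RunName is populated depends on how the job was created.

```python
import boto3


def get_experiment_config(training_job_name):
    """Sketch of the ExperimentConfig-based alternative, not the PR's approach."""
    sagemaker_client = boto3.client("sagemaker")
    description = sagemaker_client.describe_training_job(
        TrainingJobName=training_job_name
    )
    # ExperimentConfig is optional and may be absent from the response.
    experiment_config = description.get("ExperimentConfig", {})
    return experiment_config.get("ExperimentName"), experiment_config.get("RunName")
```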
@@ -48,6 +48,7 @@ dependencies = [
    "PyYAML~=6.0",
    "requests",
    "sagemaker-core>=1.0.0,<2.0.0",
    "sagemaker-mlflow",
Can this be an optional install? See sagemaker-python-sdk/hatch_build.py, line 14 in e626647:
def get_optional_dependencies(root):
How will the user know to install these optional dependencies?
This is the sagemaker package, so I think it's fine to install by default; usually we just want to double-check whether a requirement really needs to be installed by default.
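If the dependency were made optional instead, the usual pattern is a guarded import with an actionable error message. A sketch only: the "mlflow" extra name and the sagemaker_mlflow import name are assumptions.

```python
def _require_mlflow():
    """Raise a helpful error if the optional MLflow dependencies are missing."""
    try:
        import mlflow  # noqa: F401
        import sagemaker_mlflow  # noqa: F401  (assumed import name of sagemaker-mlflow)
    except ImportError as err:
        raise ImportError(
            "MLflow metric forwarding requires optional dependencies. "
            "Install them with: pip install 'sagemaker[mlflow]'"  # hypothetical extra
        ) from err
```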
Hey, I just wanted to point out that introducing … This is in Python 3.10.15, comparing sagemaker 2.232.1 and 2.232.2.
Description of changes:
The goal of this feature is to integrate SageMaker Jumpstart (which runs in network isolation mode) with SageMaker MLflow. Specifically, we need to provide the option to forward metrics from a Jumpstart training job to a SageMaker MLflow tracking server (so the metrics can be viewed and tracked within experiments).
log_sagemaker_job_to_mlflow is a callable method requiring only the training_job_name as input. Look into adding documentation so that the user can call this method in case of any failures during fit().
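A rough usage sketch of the manual call path described above; the module path, environment variable names, and ARN are assumptions based on this description, not a confirmed API.

```python
import os

# Hypothetical tracking-server ARN and experiment name.
os.environ["MLFLOW_TRACKING_URI"] = (
    "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/my-server"
)
os.environ["MLFLOW_EXPERIMENT_NAME"] = "my-jumpstart-experiment"

# Assumed module path; only training_job_name is required per the description.
from sagemaker.mlflow.forward_sagemaker_metrics import log_sagemaker_job_to_mlflow

# Call manually if metrics were not forwarded during fit(), e.g. after a failure.
log_sagemaker_job_to_mlflow("my-jumpstart-training-job-name")
```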
Testing done:
Run through modified jumpstart notebook



Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General
Tests
Used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.