Skip to content

Feat: Support logging of MLFlow metrics when network_isolation mode is enabled #4880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 3, 2024

Conversation

ryansteakley
Copy link
Contributor

@ryansteakley ryansteakley commented Oct 1, 2024

Description of changes:
The goal of this feature is to integrate SageMaker Jumpstart (which runs in network isolation mode) with SageMaker MLflow. Specifically, we need to provide the option to forward metrics from a Jumpstart training job to a SageMaker MLflow tracking server (so the metrics can be viewed and tracked within experiments).

log_sagemaker_job_to_mlflow is a callable method only requiring the training_job_name as input.
Look into adding documentation such that user can call this method in-case of any failures during fit()

Testing done:

Run through modified jumpstart notebook
Screenshot 2024-09-30 at 7 49 01 AM
Screenshot 2024-09-30 at 7 48 52 AM
Screenshot 2024-09-30 at 7 49 43 AM

2024-09-30 14:12:00 Completed - Training job completed
Training seconds: 1004
Billable seconds: 1004
MLFLOW_EXPERIMENT_NAME not set. Using Default

2024[/09/30] 14:12:34 INFO mlflow.tracking._tracking_service.client: 🏃 View run unruly-fowl-461 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/0/runs/15ad04e989ab42eb90b3637f528e3ac6.
2024/09/30 14:12:34 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/0.

Logged 34 metric datapoints to MLflow
Logged 12 parameters to MLflow

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ryansteakley ryansteakley requested a review from a team as a code owner October 1, 2024 15:56
query = {
"MetricName": metric["Name"],
"XAxisType": "Timestamp",
"MetricStat": "Avg",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the metric stat matter here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters in the same way that the Period field(1min,5min, 1hr) does. MetricStat can be Avg,Min,Max,Count,Sum etc. Forwarding every combination is possible but would need a different metric key for each combination. Can have defaults and introduce variables that allow user to call the method again with their desired combinations -> however this would run into the collision case if they use the same run unless we add some prefix/suffix to the keys names.

boto3.exceptions.Boto3Error: If there's an issue with the AWS API call.
"""
sagemaker_metrics_client = boto3.client("sagemaker-metrics")
metric_data = sagemaker_metrics_client.batch_get_metrics(MetricQueries=metric_queries)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming no pagination is needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no pagination support in batch_get_metrics as far as I can see.

Comment on lines 189 to 192
Metric(key=encoded_name, value=value, timestamp=timestamp, step=step)
for step, (timestamp, value) in enumerate(
zip(result["XAxisValues"], result["MetricValues"])
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably will be easier to read if having for loop as the outer block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

Comment on lines +314 to +315
"metric_name_mapping": str(metric_name_mapping),
"param_name_mapping": str(param_name_mapping),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part of the requirement? Will the length of metric name mapping or param name mapping exceed the limit for tags?

Copy link
Contributor Author

@ryansteakley ryansteakley Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No its not part of the requirement. There is a 5000 char limit for the Tag Value. So yes it would exceed the limit once we have over ~250ish tag pairs(assuming the key length is 250) unless broken up into separate tags. The goal was to have a source of truth for the user to be able to determine the original metric names. Can remove this and rethink if needed. in my testing the value of length 1milllion did not raise errors however the tag value itself was truncated to only 5000 on ui and when getting the tag.

"""
client = MlflowClient()

experiment_name = os.getenv("MLFLOW_EXPERIMENT_NAME")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if we can pull the experiment and run name from the ExperimentConfig in the Training Job definition: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html#sagemaker-DescribeTrainingJob-response-ExperimentConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was discussed in our review sessions, we decided to not use the ExperimentConfig for such fields at this point. Instead relying on the env variables.

benieric
benieric previously approved these changes Oct 2, 2024
@@ -48,6 +48,7 @@ dependencies = [
"PyYAML~=6.0",
"requests",
"sagemaker-core>=1.0.0,<2.0.0",
"sagemaker-mlflow",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be optional install -

def get_optional_dependencies(root):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the user know to install these optional dependencies?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sagemaker package so i think its good to install by default, usually we just want to double check on if a requirement is required to be installed by default

@@ -48,6 +48,7 @@ dependencies = [
"PyYAML~=6.0",
"requests",
"sagemaker-core>=1.0.0,<2.0.0",
"sagemaker-mlflow",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sagemaker package so i think its good to install by default, usually we just want to double check on if a requirement is required to be installed by default

@benieric benieric merged commit d31e397 into aws:master Oct 3, 2024
14 checks passed
@aaravind100
Copy link

Hey I just wanted to point out that introducing sagemaker-mlflow as a required dependency adds 46 more dependencies. Can we make this an optional dependency? I don't quite know if this would benefit all users.

This is in python 3.10.15, comparing sagemaker 2.232.1 and 2.232.2

  • uv pip install "sagemaker==2.232.1" --dry-run
Resolved 50 packages in 1.71s
Would download 31 packages
Would install 50 packages
 + annotated-types==0.7.0
 + attrs==23.2.0
 + boto3==1.35.33
 + botocore==1.35.33
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + cloudpickle==2.2.1
 + dill==0.3.9
 + docker==7.1.0
 + google-pasta==0.2.0
 + idna==3.10
 + importlib-metadata==6.11.0
 + jmespath==1.0.1
 + jsonschema==4.23.0
 + jsonschema-specifications==2023.12.1
 + markdown-it-py==3.0.0
 + mdurl==0.1.2
 + mock==4.0.3
 + multiprocess==0.70.17
 + numpy==1.26.4
 + packaging==24.1
 + pandas==2.2.3
 + pathos==0.3.3
 + platformdirs==4.3.6
 + pox==0.3.5
 + ppft==1.7.6.9
 + protobuf==4.25.5
 + psutil==6.0.0
 + pydantic==2.9.2
 + pydantic-core==2.23.4
 + pygments==2.18.0
 + python-dateutil==2.9.0.post0
 + pytz==2024.2
 + pyyaml==6.0.2
 + referencing==0.35.1
 + requests==2.32.3
 + rich==13.9.1
 + rpds-py==0.20.0
 + s3transfer==0.10.2
 + sagemaker==2.232.1
 + sagemaker-core==1.0.10
 + schema==0.7.7
 + six==1.16.0
 + smdebug-rulesconfig==1.0.1
 + tblib==3.0.0
 + tqdm==4.66.5
 + typing-extensions==4.12.2
 + tzdata==2024.2
 + urllib3==2.2.3
 + zipp==3.20.2
  • uv pip install "sagemaker==2.232.2" --dry-run
Resolved 97 packages in 1.31s
Would download 73 packages
Would install 97 packages
 + alembic==1.13.3
 + aniso8601==9.0.1
 + annotated-types==0.7.0
 + attrs==23.2.0
 + blinker==1.8.2
 + boto3==1.35.33
 + botocore==1.35.33
 + cachetools==5.5.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + click==8.1.7
 + cloudpickle==2.2.1
 + contourpy==1.3.0
 + cycler==0.12.1
 + databricks-sdk==0.33.0
 + deprecated==1.2.14
 + dill==0.3.9
 + docker==7.1.0
 + flask==3.0.3
 + fonttools==4.54.1
 + gitdb==4.0.11
 + gitpython==3.1.43
 + google-auth==2.35.0
 + google-pasta==0.2.0
 + graphene==3.3
 + graphql-core==3.2.4
 + graphql-relay==3.2.0
 + greenlet==3.1.1
 + gunicorn==23.0.0
 + idna==3.10
 + importlib-metadata==6.11.0
 + itsdangerous==2.2.0
 + jinja2==3.1.4
 + jmespath==1.0.1
 + joblib==1.4.2
 + jsonschema==4.23.0
 + jsonschema-specifications==2023.12.1
 + kiwisolver==1.4.7
 + mako==1.3.5
 + markdown==3.7
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib==3.9.2
 + mdurl==0.1.2
 + mlflow==2.16.2
 + mlflow-skinny==2.16.2
 + mock==4.0.3
 + multiprocess==0.70.17
 + numpy==1.26.4
 + opentelemetry-api==1.27.0
 + opentelemetry-sdk==1.27.0
 + opentelemetry-semantic-conventions==0.48b0
 + packaging==24.1
 + pandas==2.2.3
 + pathos==0.3.3
 + pillow==10.4.0
 + platformdirs==4.3.6
 + pox==0.3.5
 + ppft==1.7.6.9
 + protobuf==4.25.5
 + psutil==6.0.0
 + pyarrow==17.0.0
 + pyasn1==0.6.1
 + pyasn1-modules==0.4.1
 + pydantic==2.9.2
 + pydantic-core==2.23.4
 + pygments==2.18.0
 + pyparsing==3.1.4
 + python-dateutil==2.9.0.post0
 + pytz==2024.2
 + pyyaml==6.0.2
 + referencing==0.35.1
 + requests==2.32.3
 + rich==13.9.1
 + rpds-py==0.20.0
 + rsa==4.9
 + s3transfer==0.10.2
 + sagemaker==2.232.2
 + sagemaker-core==1.0.10
 + sagemaker-mlflow==0.1.0
 + schema==0.7.7
 + scikit-learn==1.5.2
 + scipy==1.14.1
 + six==1.16.0
 + smdebug-rulesconfig==1.0.1
 + smmap==5.0.1
 + sqlalchemy==2.0.35
 + sqlparse==0.5.1
 + tblib==3.0.0
 + threadpoolctl==3.5.0
 + tqdm==4.66.5
 + typing-extensions==4.12.2
 + tzdata==2024.2
 + urllib3==2.2.3
 + werkzeug==3.0.4
 + wrapt==1.16.0
 + zipp==3.20.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants