Feat: Support logging of MLFlow metrics when network_isolation mode is enabled #4880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

benieric merged 2 commits into aws:master from ryansteakley:jumpstart-mlflow-integration

Oct 3, 2024

Contributor

ryansteakley commented Oct 1, 2024 •

edited

Loading

Description of changes:
The goal of this feature is to integrate SageMaker Jumpstart (which runs in network isolation mode) with SageMaker MLflow. Specifically, we need to provide the option to forward metrics from a Jumpstart training job to a SageMaker MLflow tracking server (so the metrics can be viewed and tracked within experiments).

log_sagemaker_job_to_mlflow is a callable method only requiring the training_job_name as input.
Look into adding documentation such that user can call this method in-case of any failures during fit()

Testing done:

Run through modified jumpstart notebook

2024-09-30 14:12:00 Completed - Training job completed
Training seconds: 1004
Billable seconds: 1004
MLFLOW_EXPERIMENT_NAME not set. Using Default

2024[/09/30] 14:12:34 INFO mlflow.tracking._tracking_service.client: 🏃 View run unruly-fowl-461 at: https://us-west-2.experiments.sagemaker.aws/#/experiments/0/runs/15ad04e989ab42eb90b3637f528e3ac6.
2024/09/30 14:12:34 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://us-west-2.experiments.sagemaker.aws/#/experiments/0.

Logged 34 metric datapoints to MLflow
Logged 12 parameters to MLflow

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
I used the commit message format described in CONTRIBUTING
I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
I have checked that my tests are not configured for a specific region or account (if appropriate)
I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


          feature: Support logging of MLFlow metrics when network_isolation mod…

99b7633

…e is enabled

ryansteakley requested a review from a team as a code owner

October 1, 2024 15:56

ryansteakley requested a review from mohanasudhan

October 1, 2024 15:56

ryansteakley temporarily deployed to manual-approval

October 1, 2024 15:57

— with

GitHub Actions Inactive

leoroy reviewed

View reviewed changes

src/sagemaker/mlflow/forward_sagemaker_metrics.py

+                      query = {
+                          "MetricName": metric["Name"],
+                          "XAxisType": "Timestamp",
+                          "MetricStat": "Avg",

Contributor

leoroy Oct 1, 2024

Does the metric stat matter here

Contributor Author

ryansteakley Oct 2, 2024

It matters in the same way that the Period field(1min,5min, 1hr) does. MetricStat can be Avg,Min,Max,Count,Sum etc. Forwarding every combination is possible but would need a different metric key for each combination. Can have defaults and introduce variables that allow user to call the method again with their desired combinations -> however this would run into the collision case if they use the same run unless we add some prefix/suffix to the keys names.

src/sagemaker/mlflow/forward_sagemaker_metrics.py

+                      boto3.exceptions.Boto3Error: If there's an issue with the AWS API call.
+                  """
+                  sagemaker_metrics_client = boto3.client("sagemaker-metrics")
+                  metric_data = sagemaker_metrics_client.batch_get_metrics(MetricQueries=metric_queries)

Contributor

leoroy Oct 1, 2024

I am assuming no pagination is needed here?

Contributor Author

ryansteakley Oct 2, 2024

There is no pagination support in batch_get_metrics as far as I can see.

src/sagemaker/mlflow/forward_sagemaker_metrics.py Outdated

Comment on lines 189 to 192

+                                  Metric(key=encoded_name, value=value, timestamp=timestamp, step=step)
+                                  for step, (timestamp, value) in enumerate(
+                                      zip(result["XAxisValues"], result["MetricValues"])
+                                  )

Contributor

leoroy Oct 1, 2024

Probably will be easier to read if having for loop as the outer block

Contributor Author

ryansteakley Oct 2, 2024

Changed

src/sagemaker/mlflow/forward_sagemaker_metrics.py

Comment on lines +314 to +315

		"metric_name_mapping": str(metric_name_mapping),
		"param_name_mapping": str(param_name_mapping),

Contributor

leoroy Oct 1, 2024

Is this part of the requirement? Will the length of metric name mapping or param name mapping exceed the limit for tags?

Contributor Author

ryansteakley Oct 2, 2024 •

edited

Loading

No its not part of the requirement. There is a 5000 char limit for the Tag Value. So yes it would exceed the limit once we have over ~250ish tag pairs(assuming the key length is 250) unless broken up into separate tags. The goal was to have a source of truth for the user to be able to determine the original metric names. Can remove this and rethink if needed. in my testing the value of length 1milllion did not raise errors however the tag value itself was truncated to only 5000 on ui and when getting the tag.

src/sagemaker/mlflow/forward_sagemaker_metrics.py

+                  """
+                  client = MlflowClient()
+                  experiment_name = os.getenv("MLFLOW_EXPERIMENT_NAME")

Contributor

leoroy Oct 1, 2024

See if we can pull the experiment and run name from the ExperimentConfig in the Training Job definition: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html#sagemaker-DescribeTrainingJob-response-ExperimentConfig

Contributor Author

ryansteakley Oct 2, 2024

This was discussed in our review sessions, we decided to not use the ExperimentConfig for such fields at this point. Instead relying on the env variables.

benieric previously approved these changes

View reviewed changes

pyproject.toml

@@ @@ -48,6 +48,7 @@ dependencies = [ @@
                 "PyYAML~=6.0",
                 "requests",
                 "sagemaker-core>=1.0.0,<2.0.0",
+                "sagemaker-mlflow",

Collaborator

benieric Oct 2, 2024

Can this be optional install -

sagemaker-python-sdk/hatch_build.py

Line 14 in e626647

def get_optional_dependencies(root):

Contributor Author

ryansteakley Oct 2, 2024

How will the user know to install these optional dependencies?

Collaborator

benieric Oct 2, 2024

This is sagemaker package so i think its good to install by default, usually we just want to double check on if a requirement is required to be installed by default


          Fix pylint + doc check

eb2cc11

ryansteakley dismissed benieric’s stale review via

eb2cc11

October 2, 2024 20:00

ryansteakley temporarily deployed to manual-approval

October 2, 2024 20:01

— with

GitHub Actions Inactive

benieric approved these changes

View reviewed changes

pyproject.toml

@@ @@ -48,6 +48,7 @@ dependencies = [ @@
                 "PyYAML~=6.0",
                 "requests",
                 "sagemaker-core>=1.0.0,<2.0.0",
+                "sagemaker-mlflow",

Collaborator

benieric Oct 2, 2024

This is sagemaker package so i think its good to install by default, usually we just want to double check on if a requirement is required to be installed by default

leoroy approved these changes

View reviewed changes

benieric merged commit d31e397 into aws:master

14 checks passed

aaravind100 commented Oct 4, 2024

Hey I just wanted to point out that introducing sagemaker-mlflow as a required dependency adds 46 more dependencies. Can we make this an optional dependency? I don't quite know if this would benefit all users.

This is in python 3.10.15, comparing sagemaker 2.232.1 and 2.232.2

uv pip install "sagemaker==2.232.1" --dry-run

Resolved 50 packages in 1.71s
Would download 31 packages
Would install 50 packages
 + annotated-types==0.7.0
 + attrs==23.2.0
 + boto3==1.35.33
 + botocore==1.35.33
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + cloudpickle==2.2.1
 + dill==0.3.9
 + docker==7.1.0
 + google-pasta==0.2.0
 + idna==3.10
 + importlib-metadata==6.11.0
 + jmespath==1.0.1
 + jsonschema==4.23.0
 + jsonschema-specifications==2023.12.1
 + markdown-it-py==3.0.0
 + mdurl==0.1.2
 + mock==4.0.3
 + multiprocess==0.70.17
 + numpy==1.26.4
 + packaging==24.1
 + pandas==2.2.3
 + pathos==0.3.3
 + platformdirs==4.3.6
 + pox==0.3.5
 + ppft==1.7.6.9
 + protobuf==4.25.5
 + psutil==6.0.0
 + pydantic==2.9.2
 + pydantic-core==2.23.4
 + pygments==2.18.0
 + python-dateutil==2.9.0.post0
 + pytz==2024.2
 + pyyaml==6.0.2
 + referencing==0.35.1
 + requests==2.32.3
 + rich==13.9.1
 + rpds-py==0.20.0
 + s3transfer==0.10.2
 + sagemaker==2.232.1
 + sagemaker-core==1.0.10
 + schema==0.7.7
 + six==1.16.0
 + smdebug-rulesconfig==1.0.1
 + tblib==3.0.0
 + tqdm==4.66.5
 + typing-extensions==4.12.2
 + tzdata==2024.2
 + urllib3==2.2.3
 + zipp==3.20.2

uv pip install "sagemaker==2.232.2" --dry-run

Resolved 97 packages in 1.31s
Would download 73 packages
Would install 97 packages
 + alembic==1.13.3
 + aniso8601==9.0.1
 + annotated-types==0.7.0
 + attrs==23.2.0
 + blinker==1.8.2
 + boto3==1.35.33
 + botocore==1.35.33
 + cachetools==5.5.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + click==8.1.7
 + cloudpickle==2.2.1
 + contourpy==1.3.0
 + cycler==0.12.1
 + databricks-sdk==0.33.0
 + deprecated==1.2.14
 + dill==0.3.9
 + docker==7.1.0
 + flask==3.0.3
 + fonttools==4.54.1
 + gitdb==4.0.11
 + gitpython==3.1.43
 + google-auth==2.35.0
 + google-pasta==0.2.0
 + graphene==3.3
 + graphql-core==3.2.4
 + graphql-relay==3.2.0
 + greenlet==3.1.1
 + gunicorn==23.0.0
 + idna==3.10
 + importlib-metadata==6.11.0
 + itsdangerous==2.2.0
 + jinja2==3.1.4
 + jmespath==1.0.1
 + joblib==1.4.2
 + jsonschema==4.23.0
 + jsonschema-specifications==2023.12.1
 + kiwisolver==1.4.7
 + mako==1.3.5
 + markdown==3.7
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib==3.9.2
 + mdurl==0.1.2
 + mlflow==2.16.2
 + mlflow-skinny==2.16.2
 + mock==4.0.3
 + multiprocess==0.70.17
 + numpy==1.26.4
 + opentelemetry-api==1.27.0
 + opentelemetry-sdk==1.27.0
 + opentelemetry-semantic-conventions==0.48b0
 + packaging==24.1
 + pandas==2.2.3
 + pathos==0.3.3
 + pillow==10.4.0
 + platformdirs==4.3.6
 + pox==0.3.5
 + ppft==1.7.6.9
 + protobuf==4.25.5
 + psutil==6.0.0
 + pyarrow==17.0.0
 + pyasn1==0.6.1
 + pyasn1-modules==0.4.1
 + pydantic==2.9.2
 + pydantic-core==2.23.4
 + pygments==2.18.0
 + pyparsing==3.1.4
 + python-dateutil==2.9.0.post0
 + pytz==2024.2
 + pyyaml==6.0.2
 + referencing==0.35.1
 + requests==2.32.3
 + rich==13.9.1
 + rpds-py==0.20.0
 + rsa==4.9
 + s3transfer==0.10.2
 + sagemaker==2.232.2
 + sagemaker-core==1.0.10
 + sagemaker-mlflow==0.1.0
 + schema==0.7.7
 + scikit-learn==1.5.2
 + scipy==1.14.1
 + six==1.16.0
 + smdebug-rulesconfig==1.0.1
 + smmap==5.0.1
 + sqlalchemy==2.0.35
 + sqlparse==0.5.1
 + tblib==3.0.0
 + threadpoolctl==3.5.0
 + tqdm==4.66.5
 + typing-extensions==4.12.2
 + tzdata==2024.2
 + urllib3==2.2.3
 + werkzeug==3.0.4
 + wrapt==1.16.0
 + zipp==3.20.2

jgangelBLNG mentioned this pull request

Latest update [2.232.2] introduces MLFlow #4885

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet