change: Improve defaults handling in ModelTrainer #5170

Merged
merged 20 commits into aws:master from master-update-default-configs
May 15, 2025

Conversation

benieric
Collaborator

@benieric benieric commented May 12, 2025

Issue #, if available: #5047 #5116

Description of changes:

  • Add a missing dependency for ModelTrainer whose absence caused import errors

  • Improve defaults handling in ModelTrainer

  • Align all training artifacts so they are uploaded to the default S3 path for a training job - s3://<default_bucket>/<default_prefix>/<base_job_name>/<job_name>/

  • Default handling of InputDataConfig - any inputs uploaded by ModelTrainer are placed under the input directory in the artifacts bucket for the training job, as shown below.

 default_bucket
   - default_bucket_prefix
     - base_job_name
       - base_job_name-<timestamp>
         - input
  • Default handling of OutputDataConfig - when given a path like output_data_config.s3_output_path = f"s3://{default_bucket}/{default_bucket_prefix}/{base_job_name}", output artifacts appear in the artifacts bucket for the training job as shown below. The platform automatically appends <training_job_name>/output to the s3_output_path:
 default_bucket
   - default_bucket_prefix
     - base_job_name
       - base_job_name-<timestamp>
         - output
  • Default handling of TensorBoardOutputConfig - when given a path like tensor_board_output_config.s3_output_path = f"s3://{default_bucket}/{default_bucket_prefix}/{base_job_name}", TensorBoard output artifacts appear in the artifacts bucket for the training job as shown below. The platform automatically appends <training_job_name>/tensorboard-output to the s3_output_path:
 default_bucket
   - default_bucket_prefix
     - base_job_name
       - base_job_name-<timestamp>
         - tensorboard-output
  • Default handling of CheckpointConfig - when given a path like checkpoint_config.s3_uri = f"s3://{default_bucket}/{default_bucket_prefix}/{base_job_name}/{base_job_name-<timestamp>}/checkpoints", checkpoints are uploaded to the artifacts bucket as shown below. Note that the platform uses the s3_uri as given and does not append any extra path segments.
 default_bucket
   - default_bucket_prefix
     - base_job_name
       - base_job_name-<timestamp>
         - checkpoints
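
For reviewers who want to sanity-check the layout described above, here is a minimal sketch in plain Python that computes where each artifact type should land. It does not call the SageMaker SDK. The names join_s3, ArtifactLayout, and expected_layout are hypothetical helpers introduced only for illustration, and the job name (base_job_name-<timestamp>) is passed in rather than generated, since the exact timestamp format is not stated here.

```python
from dataclasses import dataclass


def join_s3(*parts: str) -> str:
    """Join path parts into an s3:// URI, skipping empty parts (hypothetical helper)."""
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "s3://" + "/".join(cleaned)


@dataclass(frozen=True)
class ArtifactLayout:
    """Expected final S3 locations for one training job (illustrative only)."""
    input_uri: str
    output_uri: str
    tensorboard_uri: str
    checkpoint_uri: str


def expected_layout(
    default_bucket: str,
    default_bucket_prefix: str,
    base_job_name: str,
    job_name: str,
) -> ArtifactLayout:
    # Base path that would be passed as s3_output_path for the output and
    # TensorBoard configs, per the PR description.
    base = join_s3(default_bucket, default_bucket_prefix, base_job_name)
    job_root = f"{base}/{job_name}"
    return ArtifactLayout(
        # Inputs uploaded by ModelTrainer go under <job_name>/input.
        input_uri=f"{job_root}/input",
        # The platform appends <job_name>/output to s3_output_path.
        output_uri=f"{job_root}/output",
        # The platform appends <job_name>/tensorboard-output to s3_output_path.
        tensorboard_uri=f"{job_root}/tensorboard-output",
        # Checkpoints: the full s3_uri is supplied by the caller; nothing is appended.
        checkpoint_uri=f"{job_root}/checkpoints",
    )


if __name__ == "__main__":
    # Placeholder values, not real resources.
    layout = expected_layout("my-bucket", "my-prefix", "my-job", "my-job-20250512")
    print(layout)
```

The asymmetry is the point of the sketch: for output and TensorBoard data the caller supplies only the base path and the job-specific suffix is appended for them, while for checkpoints the caller must supply the complete job-specific URI.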

Testing done:

  • Unit Tests + Manual testing

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)
  • If adding any dependency in requirements.txt files, I have spell checked and ensured they exist in PyPi

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@benieric benieric requested a review from a team as a code owner May 12, 2025 20:38
@benieric benieric requested a review from nargokul May 12, 2025 20:38
@benieric benieric changed the title from "change: Improve default handling in ModelTrainer" to "change: Improve defaults handling in ModelTrainer" May 12, 2025
@benieric benieric force-pushed the master-update-default-configs branch from 9a89f6a to 6c750be Compare May 13, 2025 00:11
@benieric benieric force-pushed the master-update-default-configs branch from 4d8b871 to 4a189d1 Compare May 13, 2025 00:15
@benieric benieric force-pushed the master-update-default-configs branch from 4a189d1 to 152a50c Compare May 13, 2025 00:16
pintaoz-aws previously approved these changes May 13, 2025
@benieric benieric force-pushed the master-update-default-configs branch from df700bc to f5791ce Compare May 13, 2025 17:44
@benieric benieric force-pushed the master-update-default-configs branch from 5a3041f to a9a68ba Compare May 13, 2025 18:59
@pintaoz-aws pintaoz-aws merged commit c849eae into aws:master May 15, 2025
14 checks passed
2 participants