
Fix model_dir adjustment for hyperparameter tuning jobs #181


Merged — 2 commits into aws:script-mode on Apr 22, 2019

Conversation

laurenyu (Contributor) commented Apr 20, 2019

Description of changes:
This is an improvement upon #179. This PR also relies on a corresponding change to SageMaker Containers (aws/sagemaker-containers#186), but this PR can be merged independently (i.e. the fix won't take effect, but the resulting image will still work as it does today).

A couple caveats:

  • I manually tested with a single-machine job, but haven't had a chance to test the MPI or parameter server code paths. (Will update if/when that changes.)
  • Writing an integ test depends on the SageMaker Containers change being released, so I'll do that in a separate PR.

And a few random notes about the code:

  • I used hardcoded slashes rather than os.path.join for S3 paths because S3 uses slashes regardless of which OS runs the code that generates the path.
  • TrainingEnv.hyperparameters is read-only (code), so I couldn't just overwrite env.hyperparameters['model_dir'] (or, consequently, use env.to_cmd_args()).
  • The hyperparameter specific to hyperparameter tuning jobs is present only in framework.env.read_hyperparameters() - by the time the hyperparameters are parsed into env.hyperparameters, there is no longer a hyperparameter indicating that the training job belongs to a hyperparameter tuning job.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

```diff
@@ -125,7 +124,13 @@ def _wait_until_master_is_down(master):
         return


-def train(env):
+def _cmd_args(env, model_dir):
```
Contributor:
I am not sure how much value this function provides. I would just fold it into _run_worker.

Contributor Author (laurenyu):
It's used twice: once in _run_worker and once in _train.

```python
        model_dir = _model_dir_with_training_job(hyperparameters.get('model_dir'), env.job_name)
        logger.info('Appending the training job name to model_dir: {}'.format(model_dir))
    else:
        model_dir = hyperparameters.get('model_dir')
```
Contributor:
It feels a little awkward for the normal training job case: we get model_dir from hyperparameters and then later set hyperparameters['model_dir'] to the same value. I would just make model_dir optional in _run_worker. YMMV.

Contributor Author (laurenyu):
We would then have to call framework.env.read_hyperparameters() again (see my note in the description) - not sure whether that's particularly expensive?

Contributor:
How about we just pass cmd_args to _run_worker instead of model_dir?
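A minimal sketch of this suggestion, assuming simplified function bodies — only the names `_cmd_args` and `_run_worker` come from the thread; the argument format and the stand-in env are illustrative assumptions. The point is that the args are built once, so `_run_worker` never needs to re-read hyperparameters:

```python
# Sketch only: build command-line args once and hand them to the worker,
# rather than passing model_dir and rebuilding the args inside the worker.
def _cmd_args(env, model_dir):
    # Assumed '--key value' style args; the real builder uses env as well.
    return ['--model_dir', model_dir]

def _run_worker(env, cmd_args):
    # The worker simply consumes the prepared args.
    return ['python', 'train.py'] + cmd_args

class FakeEnv:
    """Stand-in for the real TrainingEnv, for illustration only."""

args = _cmd_args(FakeEnv(), 's3://bucket/model/job-1')
cmd = _run_worker(FakeEnv(), args)
```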

@icywang86rui (Contributor) left a comment:
ship

@laurenyu laurenyu merged commit ce47c76 into aws:script-mode Apr 22, 2019
@laurenyu laurenyu deleted the fix-model-dir branch April 22, 2019 22:34