-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feature: support RetryStrategy for training jobs #2316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
206c807
864e08b
31f1f21
e439f8a
99967f9
462e1ef
111e5b1
de9286d
fa26334
6926ff8
934c312
4d66a7d
34579c5
8d7ace6
a0c1923
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -124,6 +124,7 @@ def __init__( | |
profiler_config=None, | ||
disable_profiler=False, | ||
environment=None, | ||
max_retry_attempts=None, | ||
**kwargs, | ||
): | ||
"""Initialize an ``EstimatorBase`` instance. | ||
|
@@ -269,6 +270,13 @@ def __init__( | |
will be disabled (default: ``False``). | ||
environment (dict[str, str]) : Environment variables to be set for | ||
use during training job (default: ``None``) | ||
max_retry_attempts (int): The number of times to move a job to the STARTING status. | ||
You can specify between 1 and 30 attempts. | ||
If the value of attempts is greater than zero, | ||
the job is retried on InternalServerFailure | ||
the same number of attempts as the value. | ||
You can cap the total duration for your job by setting ``max_wait`` and ``max_run`` | ||
(default: ``None``) | ||
|
||
""" | ||
instance_count = renamed_kwargs( | ||
|
@@ -357,6 +365,8 @@ def __init__( | |
|
||
self.environment = environment | ||
|
||
self.max_retry_attempts = max_retry_attempts | ||
|
||
if not _region_supports_profiler(self.sagemaker_session.boto_region_name): | ||
self.disable_profiler = True | ||
|
||
|
@@ -1114,6 +1124,13 @@ def _prepare_init_params_from_job_description(cls, job_details, model_channel_na | |
if max_wait: | ||
init_params["max_wait"] = max_wait | ||
|
||
if job_details.get("RetryStrategy", False): | ||
init_params["max_retry_attempts"] = job_details.get("RetryStrategy", {}).get( | ||
"MaximumRetryAttempts" | ||
) | ||
max_wait = job_details.get("StoppingCondition", {}).get("MaxWaitTimeInSeconds") | ||
if max_wait: | ||
init_params["max_wait"] = max_wait | ||
return init_params | ||
|
||
def transformer( | ||
|
@@ -1489,6 +1506,11 @@ def _get_train_args(cls, estimator, inputs, experiment_config): | |
if estimator.enable_network_isolation(): | ||
train_args["enable_network_isolation"] = True | ||
|
||
if estimator.max_retry_attempts is not None: | ||
train_args["retry_strategy"] = {"MaximumRetryAttempts": estimator.max_retry_attempts} | ||
else: | ||
train_args["retry_strategy"] = None | ||
|
||
if estimator.encrypt_inter_container_traffic: | ||
train_args["encrypt_inter_container_traffic"] = True | ||
|
||
|
@@ -1666,6 +1688,7 @@ def __init__( | |
profiler_config=None, | ||
disable_profiler=False, | ||
environment=None, | ||
max_retry_attempts=None, | ||
**kwargs, | ||
): | ||
"""Initialize an ``Estimator`` instance. | ||
|
@@ -1816,6 +1839,13 @@ def __init__( | |
will be disabled (default: ``False``). | ||
environment (dict[str, str]) : Environment variables to be set for | ||
use during training job (default: ``None``) | ||
max_retry_attempts (int): The number of times to move a job to the STARTING status. | ||
You can specify between 1 and 30 attempts. | ||
If the value of attempts is greater than zero, | ||
the job is retried on InternalServerFailure | ||
the same number of attempts as the value. | ||
You can cap the total duration for your job by setting ``max_wait`` and ``max_run`` | ||
(default: ``None``) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please explain here that if this is set to None there will be no retries. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i am wondering if we should set the default to 3 or 2. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can set the default to 3 or 2 as we want to customers to explicitly opt-in. However I think I need to re-word it like " If the value of attempts is greater than zero, the job is retried on InternalServerFailure the same number of attempts as the value. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, that is more clear. Let's just do None for the default for now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
""" | ||
self.image_uri = image_uri | ||
self.hyperparam_dict = hyperparameters.copy() if hyperparameters else {} | ||
|
@@ -1850,6 +1880,7 @@ def __init__( | |
profiler_config=profiler_config, | ||
disable_profiler=disable_profiler, | ||
environment=environment, | ||
max_retry_attempts=max_retry_attempts, | ||
**kwargs, | ||
) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary to set this to None? Can we just leave it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per this test, if I have to assert that retry_strategy is None when max_retry_attempts are not set, then yes we have to explicitly set to None https://github.com/aws/sagemaker-python-sdk/blob/master/tests/unit/test_estimator.py#L1060 . given this my understanding is that we have to set it to None explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, k