
Commit c2d6ac8

Merge branch 'master' into feature-pipeline-display-name
2 parents 477f0c1 + fa1b292 commit c2d6ac8


48 files changed: 1882 additions and 86 deletions

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 blank_issues_enabled: false
 contact_links:
   - name: Ask a question
-    url: https://stackoverflow.com/questions/tagged/amazon-sagemaker
-    about: Use Stack Overflow to ask and answer questions
+    url: https://github.com/aws/sagemaker-python-sdk/discussions
+    about: Use GitHub Discussions to ask and answer questions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -1,5 +1,34 @@
 # Changelog

+## v2.41.0 (2021-05-17)
+
+### Features
+
+* add pipeline experiment config
+* add data wrangler processor
+* support RetryStrategy for training jobs
+
+### Bug Fixes and Other Changes
+
+* fix repack pipeline step by putting inference.py in "code" sub dir
+* add data wrangler image uri
+* fix black-check errors
+
+## v2.40.0 (2021-05-11)
+
+### Features
+
+* add xgboost framework version 1.2-2
+
+### Bug Fixes and Other Changes
+
+* fix get_execution_role on Studio
+* [fix] Check py_version existence in RegisterModel step
+
+### Documentation Changes
+
+* SM Distributed EFA Launch
+
 ## v2.39.1 (2021-05-05)

 ### Bug Fixes and Other Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.39.2.dev0
+2.41.1.dev0

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 3 additions & 1 deletion
@@ -443,7 +443,7 @@ TensorFlow API

 *  Supported compression types - ``none``, ``fp16``

-- ``sparse_as_dense:`` Not supported. Raises not supported error.
+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.

 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

@@ -482,6 +482,8 @@ TensorFlow API

 *  Supported compression types - ``none``, ``fp16``

+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.
+
 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

 * Supported ops: ``AVERAGE``

doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_tensorflow.rst

Lines changed: 3 additions & 1 deletion
@@ -456,7 +456,7 @@ TensorFlow API

 *  Supported compression types - ``none``, ``fp16``

-- ``sparse_as_dense:`` Not supported. Raises not supported error.
+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.

 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

@@ -496,6 +496,8 @@ TensorFlow API

 *  Supported compression types - ``none``, ``fp16``

+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.
+
 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

 * Supported ops: ``AVERAGE``

doc/api/training/sdp_versions/v1.1.x/smd_data_parallel_tensorflow.rst

Lines changed: 3 additions & 1 deletion
@@ -459,7 +459,7 @@ library with TensorFlow.

 *  Supported compression types - ``none``, ``fp16``

-- ``sparse_as_dense:`` Not supported. Raises not supported error.
+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.

 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

@@ -499,6 +499,8 @@ library with TensorFlow.

 *  Supported compression types - ``none``, ``fp16``

+- ``sparse_as_dense:`` Treats sparse gradient tensor as dense tensor. Defaults to ``False``.
+
 - ``op (smdistributed.dataparallel.tensorflow.ReduceOp)(optional)``: The reduction operation to combine tensors across different ranks. Defaults to ``Average`` if None is given.

 * Supported ops: ``AVERAGE``

doc/frameworks/sklearn/using_sklearn.rst

Lines changed: 4 additions & 0 deletions
@@ -84,6 +84,10 @@ inadvertently run your training code at the wrong point in execution.

 For more on training environment variables, please visit https://github.com/aws/sagemaker-containers.

+.. important::
+    The sagemaker-containers repository has been deprecated,
+    however it is still used to define Scikit-learn and XGBoost environment variables.
+
 Save the Model
 --------------

doc/frameworks/xgboost/using_xgboost.rst

Lines changed: 4 additions & 0 deletions
@@ -88,6 +88,10 @@ but you can access useful properties about the training environment through vari

 For the exhaustive list of available environment variables, see the `SageMaker Containers documentation <https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers>`__.

+.. important::
+    The sagemaker-containers repository has been deprecated,
+    however it is still used to define Scikit-learn and XGBoost environment variables.
+
 Let's look at the main elements of the script. Starting with the ``__main__`` guard,
 use a parser to read the hyperparameters passed to the estimator when creating the training job.
 These hyperparameters are made available as arguments to our input script.

doc/overview.rst

Lines changed: 2 additions & 2 deletions
@@ -374,7 +374,7 @@ Here are examples of how to use Amazon FSx for Lustre as input for training:

     file_system_input = FileSystemInput(file_system_id='fs-2',
                                         file_system_type='FSxLustre',
-                                        directory_path='/fsx/tensorflow',
+                                        directory_path='/<mount-id>/tensorflow',
                                         file_system_access_mode='ro')

     # Start an Amazon SageMaker training job with FSx using the FileSystemInput class
@@ -394,7 +394,7 @@ Here are examples of how to use Amazon FSx for Lustre as input for training:

     records = FileSystemRecordSet(file_system_id='fs-=2,
                                   file_system_type='FSxLustre',
-                                  directory_path='/fsx/kmeans',
+                                  directory_path='/<mount-id>/kmeans',
                                   num_records=784,
                                   feature_dim=784)

src/sagemaker/estimator.py

Lines changed: 31 additions & 0 deletions
@@ -124,6 +124,7 @@ def __init__(
         profiler_config=None,
         disable_profiler=False,
         environment=None,
+        max_retry_attempts=None,
         **kwargs,
     ):
         """Initialize an ``EstimatorBase`` instance.
@@ -269,6 +270,13 @@ def __init__(
                 will be disabled (default: ``False``).
             environment (dict[str, str]) : Environment variables to be set for
                 use during training job (default: ``None``)
+            max_retry_attempts (int): The number of times to move a job to the STARTING status.
+                You can specify between 1 and 30 attempts.
+                If the value of attempts is greater than zero,
+                the job is retried on InternalServerFailure
+                the same number of attempts as the value.
+                You can cap the total duration for your job by setting ``max_wait`` and ``max_run``
+                (default: ``None``)

         """
         instance_count = renamed_kwargs(
@@ -357,6 +365,8 @@ def __init__(

         self.environment = environment

+        self.max_retry_attempts = max_retry_attempts
+
         if not _region_supports_profiler(self.sagemaker_session.boto_region_name):
             self.disable_profiler = True

@@ -1114,6 +1124,13 @@ def _prepare_init_params_from_job_description(cls, job_details, model_channel_na
         if max_wait:
             init_params["max_wait"] = max_wait

+        if job_details.get("RetryStrategy", False):
+            init_params["max_retry_attempts"] = job_details.get("RetryStrategy", {}).get(
+                "MaximumRetryAttempts"
+            )
+            max_wait = job_details.get("StoppingCondition", {}).get("MaxWaitTimeInSeconds")
+            if max_wait:
+                init_params["max_wait"] = max_wait
         return init_params

     def transformer(
@@ -1489,6 +1506,11 @@ def _get_train_args(cls, estimator, inputs, experiment_config):
         if estimator.enable_network_isolation():
             train_args["enable_network_isolation"] = True

+        if estimator.max_retry_attempts is not None:
+            train_args["retry_strategy"] = {"MaximumRetryAttempts": estimator.max_retry_attempts}
+        else:
+            train_args["retry_strategy"] = None
+
         if estimator.encrypt_inter_container_traffic:
             train_args["encrypt_inter_container_traffic"] = True

@@ -1666,6 +1688,7 @@ def __init__(
         profiler_config=None,
         disable_profiler=False,
         environment=None,
+        max_retry_attempts=None,
         **kwargs,
     ):
         """Initialize an ``Estimator`` instance.
@@ -1816,6 +1839,13 @@ def __init__(
                 will be disabled (default: ``False``).
             environment (dict[str, str]) : Environment variables to be set for
                 use during training job (default: ``None``)
+            max_retry_attempts (int): The number of times to move a job to the STARTING status.
+                You can specify between 1 and 30 attempts.
+                If the value of attempts is greater than zero,
+                the job is retried on InternalServerFailure
+                the same number of attempts as the value.
+                You can cap the total duration for your job by setting ``max_wait`` and ``max_run``
+                (default: ``None``)
         """
         self.image_uri = image_uri
         self.hyperparam_dict = hyperparameters.copy() if hyperparameters else {}
@@ -1850,6 +1880,7 @@ def __init__(
             profiler_config=profiler_config,
             disable_profiler=disable_profiler,
             environment=environment,
+            max_retry_attempts=max_retry_attempts,
             **kwargs,
         )
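The two `max_retry_attempts` code paths in this file are roughly inverse to each other: `_get_train_args` wraps the integer in a `RetryStrategy`-shaped dict, and `_prepare_init_params_from_job_description` unwraps it again. The sketch below reproduces that round trip in plain Python so it can run without the SDK. The helper names `build_retry_strategy` and `retry_attempts_from_job` are hypothetical and do not exist in `sagemaker`; only the dict keys come from the diff.

```python
from typing import Any, Dict, Optional


def build_retry_strategy(max_retry_attempts: Optional[int]) -> Optional[Dict[str, int]]:
    """Hypothetical helper mirroring the branch added to _get_train_args.

    None is passed through unchanged; any integer is wrapped in the
    {"MaximumRetryAttempts": ...} shape shown in the diff.
    """
    if max_retry_attempts is not None:
        return {"MaximumRetryAttempts": max_retry_attempts}
    return None


def retry_attempts_from_job(job_details: Dict[str, Any]) -> Optional[int]:
    """Hypothetical helper mirroring the read-back path added to
    _prepare_init_params_from_job_description."""
    if job_details.get("RetryStrategy", False):
        return job_details.get("RetryStrategy", {}).get("MaximumRetryAttempts")
    return None


# Round trip: estimator argument -> request fragment -> estimator argument.
strategy = build_retry_strategy(3)
round_tripped = retry_attempts_from_job({"RetryStrategy": strategy})
print(strategy, round_tripped)
```

Note that the check is `is not None` rather than truthiness, so a value of `0` would still be wrapped and sent; validating the range stated in the docstring is left to the service in this sketch.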

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+{
+    "processing": {
+        "versions": {
+            "1.x": {
+                "registries": {
+                    "af-south-1": "143210264188",
+                    "ap-east-1": "707077482487",
+                    "ap-northeast-1": "649008135260",
+                    "ap-northeast-2": "131546521161",
+                    "ap-south-1": "089933028263",
+                    "ap-southeast-1": "119527597002",
+                    "ap-southeast-2": "422173101802",
+                    "ca-central-1": "557239378090",
+                    "eu-central-1": "024640144536",
+                    "eu-north-1": "054986407534",
+                    "eu-south-1": "488287956546",
+                    "eu-west-1": "245179582081",
+                    "eu-west-2": "894491911112",
+                    "eu-west-3": "807237891255",
+                    "me-south-1": "376037874950",
+                    "sa-east-1": "424196993095",
+                    "us-east-1": "663277389841",
+                    "us-east-2": "415577184552",
+                    "us-west-1": "926135532090",
+                    "us-west-2": "174368400705",
+                    "cn-north-1": "245909111842",
+                    "cn-northwest-1": "249157047649"
+                },
+                "repository": "sagemaker-data-wrangler-container"
+            }
+        }
+    }
+}
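A registry table like this one is typically combined with the region and repository name to form an Amazon ECR image URI. The following is a minimal sketch of that lookup, not the SDK's own resolver: the function name `data_wrangler_image_uri` is hypothetical, only two of the regions are copied in for brevity, and using the version key `1.x` as the image tag is an assumption.

```python
# Subset of the registry table from the config above (two regions only).
REGISTRIES = {
    "us-west-2": "174368400705",
    "cn-north-1": "245909111842",
}
REPOSITORY = "sagemaker-data-wrangler-container"


def data_wrangler_image_uri(region: str, tag: str = "1.x") -> str:
    """Hypothetical helper: build an ECR image URI from the registry table.

    Assumes the version key doubles as the image tag. China regions use the
    amazonaws.com.cn domain suffix; all others use amazonaws.com.
    """
    account = REGISTRIES[region]  # raises KeyError for regions not in the table
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{domain}/{REPOSITORY}:{tag}"


print(data_wrangler_image_uri("us-west-2"))
print(data_wrangler_image_uri("cn-north-1"))
```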

src/sagemaker/processing.py

Lines changed: 26 additions & 11 deletions
@@ -30,7 +30,6 @@
 from sagemaker.local import LocalSession
 from sagemaker.utils import base_name_from_image, name_from_base
 from sagemaker.session import Session
-from sagemaker.network import NetworkConfig # noqa: F401 # pylint: disable=unused-import
 from sagemaker.workflow.properties import Properties
 from sagemaker.workflow.parameters import Parameter
 from sagemaker.workflow.entities import Expression
@@ -219,14 +218,14 @@ def _normalize_args(
         """
         self._current_job_name = self._generate_current_job_name(job_name=job_name)

-        inputs_with_code = self._include_code_in_inputs(inputs, code)
+        inputs_with_code = self._include_code_in_inputs(inputs, code, kms_key)
         normalized_inputs = self._normalize_inputs(inputs_with_code, kms_key)
         normalized_outputs = self._normalize_outputs(outputs)
         self.arguments = arguments

         return normalized_inputs, normalized_outputs

-    def _include_code_in_inputs(self, inputs, _code):
+    def _include_code_in_inputs(self, inputs, _code, _kms_key):
         """A no op in the base class to include code in the processing job inputs.

         Args:
@@ -235,6 +234,8 @@ def _include_code_in_inputs(self, inputs, _code):
                 :class:`~sagemaker.processing.ProcessingInput` objects.
             _code (str): This can be an S3 URI or a local path to a file with the framework
                 script to run (default: None). A no op in the base class.
+            kms_key (str): The ARN of the KMS key that is used to encrypt the
+                user code file (default: None).

         Returns:
             list[:class:`~sagemaker.processing.ProcessingInput`]: inputs
@@ -528,7 +529,7 @@ def run(
         if wait:
             self.latest_job.wait(logs=logs)

-    def _include_code_in_inputs(self, inputs, code):
+    def _include_code_in_inputs(self, inputs, code, kms_key=None):
         """Converts code to appropriate input and includes in input list.

         Side effects include:
@@ -541,12 +542,14 @@ def _include_code_in_inputs(self, inputs, code):
                 :class:`~sagemaker.processing.ProcessingInput` objects.
             code (str): This can be an S3 URI or a local path to a file with the framework
                 script to run (default: None).
+            kms_key (str): The ARN of the KMS key that is used to encrypt the
+                user code file (default: None).

         Returns:
             list[:class:`~sagemaker.processing.ProcessingInput`]: inputs together with the
                 code as `ProcessingInput`.
         """
-        user_code_s3_uri = self._handle_user_code_url(code)
+        user_code_s3_uri = self._handle_user_code_url(code, kms_key)
         user_script_name = self._get_user_code_name(code)

         inputs_with_code = self._convert_code_and_add_to_inputs(inputs, user_code_s3_uri)
@@ -567,14 +570,16 @@ def _get_user_code_name(self, code):
         code_url = urlparse(code)
         return os.path.basename(code_url.path)

-    def _handle_user_code_url(self, code):
+    def _handle_user_code_url(self, code, kms_key=None):
         """Gets the S3 URL containing the user's code.

         Inspects the scheme the customer passed in ("s3://" for code in S3, "file://" or nothing
         for absolute or local file paths. Uploads the code to S3 if the code is a local file.

         Args:
             code (str): A URL to the customer's code.
+            kms_key (str): The ARN of the KMS key that is used to encrypt the
+                user code file (default: None).

         Returns:
             str: The S3 URL to the customer's code.
@@ -603,7 +608,7 @@ def _handle_user_code_url(self, code):
                         code
                     )
                 )
-            user_code_s3_uri = self._upload_code(code_path)
+            user_code_s3_uri = self._upload_code(code_path, kms_key)
         else:
             raise ValueError(
                 "code {} url scheme {} is not recognized. Please pass a file path or S3 url".format(
@@ -612,11 +617,13 @@ def _handle_user_code_url(self, code):
             )
         return user_code_s3_uri

-    def _upload_code(self, code):
+    def _upload_code(self, code, kms_key=None):
         """Uploads a code file or directory specified as a string and returns the S3 URI.

         Args:
             code (str): A file or directory to be uploaded to S3.
+            kms_key (str): The ARN of the KMS key that is used to encrypt the
+                user code file (default: None).

         Returns:
             str: The S3 URI of the uploaded file or directory.
@@ -630,7 +637,10 @@ def _upload_code(self, code):
             self._CODE_CONTAINER_INPUT_NAME,
         )
         return s3.S3Uploader.upload(
-            local_path=code, desired_s3_uri=desired_s3_uri, sagemaker_session=self.sagemaker_session
+            local_path=code,
+            desired_s3_uri=desired_s3_uri,
+            kms_key=kms_key,
+            sagemaker_session=self.sagemaker_session,
         )

     def _convert_code_and_add_to_inputs(self, inputs, s3_uri):
@@ -666,7 +676,9 @@ def _set_entrypoint(self, command, user_script_name):
         """
         user_script_location = str(
             pathlib.PurePosixPath(
-                self._CODE_CONTAINER_BASE_PATH, self._CODE_CONTAINER_INPUT_NAME, user_script_name
+                self._CODE_CONTAINER_BASE_PATH,
+                self._CODE_CONTAINER_INPUT_NAME,
+                user_script_name,
             )
         )
         self.entrypoint = command + [user_script_location]
@@ -1066,7 +1078,10 @@ def _to_request_dict(self):
         """Generates a request dictionary using the parameters provided to the class."""

         # Create the request dictionary.
-        s3_input_request = {"InputName": self.input_name, "AppManaged": self.app_managed}
+        s3_input_request = {
+            "InputName": self.input_name,
+            "AppManaged": self.app_managed,
+        }

         if self.s3_input:
             # Check the compression type, then add it to the dictionary.
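The functional change in this file is that `kms_key` is now threaded from `_normalize_args` through `_include_code_in_inputs`, `_handle_user_code_url`, and `_upload_code` down to the uploader, so a locally supplied script is encrypted with the same key as the other inputs. The sketch below imitates that call chain with stand-in classes so it runs without AWS access. `RecordingUploader` and `SketchProcessor` are hypothetical names, the S3 prefix and key ARN are made-up placeholders, and the scheme handling is simplified to "S3 URI or local path".

```python
import posixpath
from typing import Dict, List, Optional


class RecordingUploader:
    """Hypothetical stand-in for an S3 uploader: records calls, uploads nothing."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Optional[str]]] = []

    def upload(self, local_path: str, desired_s3_uri: str, kms_key: Optional[str] = None) -> str:
        self.calls.append(
            {"local_path": local_path, "desired_s3_uri": desired_s3_uri, "kms_key": kms_key}
        )
        return desired_s3_uri


class SketchProcessor:
    """Hypothetical, simplified mirror of the kms_key call chain in the diff."""

    def __init__(self, uploader: RecordingUploader, s3_prefix: str) -> None:
        self.uploader = uploader
        self.s3_prefix = s3_prefix

    def _upload_code(self, code: str, kms_key: Optional[str] = None) -> str:
        desired = f"{self.s3_prefix}/input/code/{posixpath.basename(code)}"
        return self.uploader.upload(local_path=code, desired_s3_uri=desired, kms_key=kms_key)

    def _handle_user_code_url(self, code: str, kms_key: Optional[str] = None) -> str:
        # Code already in S3 is used as-is; only local files are uploaded.
        if code.startswith("s3://"):
            return code
        return self._upload_code(code, kms_key)

    def _include_code_in_inputs(
        self, inputs: List[str], code: str, kms_key: Optional[str] = None
    ) -> List[str]:
        return list(inputs) + [self._handle_user_code_url(code, kms_key)]


uploader = RecordingUploader()
processor = SketchProcessor(uploader, "s3://example-bucket/job-1")  # placeholder prefix
key_arn = "arn:aws:kms:us-west-2:111122223333:key/example"  # placeholder ARN
inputs = processor._include_code_in_inputs([], "scripts/preprocess.py", kms_key=key_arn)
print(inputs, uploader.calls[0]["kms_key"])
```

Defaulting `kms_key` to `None` on every method in the chain, as the diff does for the subclass methods, keeps existing callers working while letting new callers opt in.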
