
Commit 1e6b944

Merge branch 'master' into master
2 parents d2a8160 + 44fbcc9 commit 1e6b944


50 files changed: +843 -608 lines changed

CHANGELOG.md

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,68 @@
 # Changelog
 
+## v2.35.0 (2021-04-14)
+
+### Features
+
+* add support for PyTorch 1.8.1
+
+### Bug Fixes and Other Changes
+
+* boto3 client param updated for feature store
+* Updated release notes and API doc for smd model parallel 1.3.1
+
+## v2.34.0 (2021-04-12)
+
+### Features
+
+* Add support for accelerator in Clarify
+
+### Bug Fixes and Other Changes
+
+* add Documentation for how to use
+* enable local mode tests that were skipped
+* add integ test for HuggingFace with TensorFlow
+
+### Documentation Changes
+
+* release notes for smdistributed.dataparallel v1.1.1
+* fixing the SageMaker distributed version references
+
+### Testing and Release Infrastructure
+
+* pin version for ducutils
+
+## v2.33.0 (2021-04-05)
+
+### Features
+
+* Add environment variable support for SageMaker training job
+
+### Bug Fixes and Other Changes
+
+* add version length mismatch validation for HuggingFace
+* Disable debugger when checkpointing is enabled with distributed training
+* map user context is list associations response
+
+### Testing and Release Infrastructure
+
+* disable_profiler on mx-horovod test
+
+## v2.32.1 (2021-04-01)
+
+### Bug Fixes and Other Changes
+
+* disable profiler in some release tests
+* remove outdated notebook from test
+* add compilation option for ml_eia2
+* add short version to smdataparallel supported list
+
+### Documentation Changes
+
+* creating a "latest" version sm distributed docs
+* add docs for Sagemaker Model Parallel 1.3, released with PT 1.8
+* update PyTorch version in doc
+
 ## v2.32.0 (2021-03-26)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.32.1.dev0
+2.35.1.dev0

doc/amazon_sagemaker_featurestore.rst

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ use the SageMaker default bucket and add a custom prefix to it.
 offline_feature_store_bucket = 's3://*{}*/*{}*'.format(default_bucket, prefix)
 
 sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
-featurestore_runtime = boto_session.client(service_name='featurestore-runtime', region_name=region)
+featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
 
 feature_store_session = Session(
     boto_session=boto_session,
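
For orientation, a minimal sketch of how the renamed runtime client plugs into a Feature Store session. The `Session` keyword arguments shown are assumptions drawn from the surrounding doc page, not part of this diff:

```python
import boto3
from sagemaker.session import Session

region = "us-east-1"  # hypothetical region, for illustration only
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name="sagemaker", region_name=region)
# Note the renamed service: "sagemaker-featurestore-runtime" (was "featurestore-runtime").
featurestore_runtime = boto_session.client(
    service_name="sagemaker-featurestore-runtime", region_name=region
)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,  # assumed parameter name
)
```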

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 
-Version 1.1.0 (Latest)
+Version 1.1.1 (Latest)
 ======================
 
 .. toctree::

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

Lines changed: 2 additions & 2 deletions
@@ -153,9 +153,9 @@ you will have for distributed training with the distributed data parallel librar
 PyTorch API
 ===========
 
-**Supported versions:**
+.. rubric:: Supported versions
 
-- PyTorch 1.6.0, 1.8.0
+**PyTorch 1.7.1, 1.8.0**
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()
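
As a usage aside (not part of this diff), a minimal sketch of initializing the PyTorch client whose `is_available()` function is documented on the page above. It assumes the `smdistributed.dataparallel` package that ships in SageMaker training containers:

```python
# Sketch only: assumes a SageMaker training job where smdistributed.dataparallel
# is installed (PyTorch 1.7.1 / 1.8.x per the supported versions above).
import smdistributed.dataparallel.torch.distributed as dist

if dist.is_available():
    dist.init_process_group()  # start the data parallel process group
    print(f"worker {dist.get_rank()} of {dist.get_world_size()}")
```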

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 7 additions & 4 deletions
@@ -16,8 +16,9 @@ The following steps show you how to convert a TensorFlow 2.x training
 script to utilize the distributed data parallel library.
 
 The distributed data parallel library APIs are designed to be close to Horovod APIs.
-See `SageMaker distributed data parallel TensorFlow examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__ for additional details on how to implement the data parallel library
-API offered for TensorFlow.
+See `SageMaker distributed data parallel TensorFlow examples
+<https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__
+for additional details on how to implement the data parallel library.
 
 - First import the distributed data parallel library’s TensorFlow client and initialize it:
 
@@ -156,8 +157,10 @@ TensorFlow API
 
 .. rubric:: Supported versions
 
-- TensorFlow 2.x - 2.3.1
-
+TensorFlow is supported in version 1.0.0 of ``sagemakerdistributed.dataparallel``.
+Reference version 1.0.0 `TensorFlow API documentation
+<https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.html#tensorflow-sdp-api>`_
+for supported TensorFlow versions.
 
 .. function:: smdistributed.dataparallel.tensorflow.init()
 
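
The "import the library's TensorFlow client and initialize it" step referenced in the first hunk looks roughly like the sketch below; it assumes the `smdistributed.dataparallel` package available in SageMaker TensorFlow training containers:

```python
# Minimal sketch of the import-and-initialize step described above.
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()  # initialize the distributed data parallel library
print("worker", sdp.rank(), "of", sdp.size(), "- local rank", sdp.local_rank())
```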

doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_pytorch.rst

Lines changed: 6 additions & 8 deletions
@@ -4,11 +4,10 @@ PyTorch Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents
 
-   - :ref:`pytorch-sdp-modify`
-   - :ref:`pytorch-sdp-api`
+   - :ref:`pytorch-sdp-modify-1.0.0`
+   - :ref:`pytorch-sdp-api-1.0.0`
 
-.. _pytorch-sdp-modify:
-   :noindex:
+.. _pytorch-sdp-modify-1.0.0:
 
 Modify a PyTorch training script to use SageMaker data parallel
 ======================================================================
@@ -149,15 +148,14 @@ you will have for distributed training with the distributed data parallel librar
     main()
 
 
-.. _pytorch-sdp-api:
-   :noindex:
+.. _pytorch-sdp-api-1.0.0:
 
 PyTorch API
 ===========
 
-**Supported versions:**
+.. rubric:: Supported versions
 
-- PyTorch 1.6.0
+**PyTorch 1.6.0, 1.7.1**
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()

doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_tensorflow.rst

Lines changed: 5 additions & 7 deletions
@@ -4,11 +4,10 @@ TensorFlow Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents
 
-   - :ref:`tensorflow-sdp-modify`
-   - :ref:`tensorflow-sdp-api`
+   - :ref:`tensorflow-sdp-modify-1.0.0`
+   - :ref:`tensorflow-sdp-api-1.0.0`
 
-.. _tensorflow-sdp-modify:
-   :noindex:
+.. _tensorflow-sdp-modify-1.0.0:
 
 Modify a TensorFlow 2.x training script to use SageMaker data parallel
 ======================================================================
@@ -150,15 +149,14 @@ script you will have for distributed training with the library.
     checkpoint.save(checkpoint_dir)
 
 
-.. _tensorflow-sdp-api:
-   :noindex:
+.. _tensorflow-sdp-api-1.0.0:
 
 TensorFlow API
 ==============
 
 .. rubric:: Supported versions
 
-- TensorFlow 2.x - 2.3.1
+**TensorFlow 2.3.x - 2.4.1**
 
 
 .. function:: smdistributed.dataparallel.tensorflow.init()

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

Lines changed: 22 additions & 4 deletions
@@ -1,23 +1,41 @@
+# Sagemaker Distributed Data Parallel 1.1.1 Release Notes
+
+* New Features
+* Bug Fixes
+* Known Issues
+
+*New Features:*
+
+* Adds support for PyTorch 1.8.1
+
+*Bug Fixes:*
+
+* Fixes a bug that was causing gradients from one of the worker nodes to be added twice resulting in incorrect `all_reduce` results under some conditions.
+
+*Known Issues:*
+
+* SageMaker distributed data parallel still is not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdistributed.dataparallel`. Use a single node only for experimental runs while preparing your training pipeline.
+
 # Sagemaker Distributed Data Parallel 1.1.0 Release Notes
 
 * New Features
 * Bug Fixes
 * Improvements
 * Known Issues
 
-New Features:
+*New Features:*
 
 * Adds support for PyTorch 1.8.0 with CUDA 11.1 and CUDNN 8
 
-Bug Fixes:
+*Bug Fixes:*
 
 * Fixes crash issue when importing `smdataparallel` before PyTorch
 
-Improvements:
+*Improvements:*
 
 * Update `smdataparallel` name in python packages, descriptions, and log outputs
 
-Known Issues:
+*Known Issues:*
 
 * SageMaker DataParallel is not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdataparallel`. Use a single node only for experimental runs while preparing your training pipeline.

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 30 additions & 0 deletions
@@ -1,3 +1,33 @@
+# Sagemaker Distributed Model Parallel 1.3.1 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### TensorFlow
+
+- Exposes a new decorator ``register_post_partition_hook``. This allows invoking the decorated methods just after model partition but before executing the first step. For example loading a checkpoint. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
+
+## Bug Fixes
+
+### PyTorch
+
+- Improved memory efficiency when using active microbatches by clearing activations at end of each microbatch.
+
+### TensorFlow
+
+- Fixed issue that caused hangs when training some models with XLA enabled.
+
+## Known Issues
+
+### PyTorch
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Till that makes its way to the next release of PyTorch, only call ``optimizer.step()`` on processes which have at least one local parameter. This can be checked like this ``len(list(model.local_parameters())) > 0``.
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
 # Sagemaker Distributed Model Parallel 1.3.0 Release Notes
 
 - New Features
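
The `optimizer.step()` workaround described in the PyTorch known issue above amounts to a small guard; a sketch, assuming `model` is an `smp.DistributedModel` and `optimizer` has already been created:

```python
# Workaround from the known issue: only call optimizer.step() on processes
# that hold at least one local parameter after partitioning (relevant for
# optimizers such as AdaDelta on PyTorch versions before 1.8).
if len(list(model.local_parameters())) > 0:
    optimizer.step()
```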

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 PyTorch API
 ===========
 
-**Supported versions: 1.7.1, 1.8.0**
+**Supported versions: 1.6.0, 1.7.1, 1.8.0**
 
 This API document assumes you use the following import statements in your training scripts.
 

doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst

Lines changed: 17 additions & 9 deletions
@@ -83,7 +83,21 @@ TensorFlow API
     with smp.partition(3):
         z = tf.reduce_sum(y)             # placed in partition 3
 
-
+
+.. function:: register_post_partition_hook(hook)
+
+   Registers a callable ``hook`` to
+   be executed after the model is partitioned. This is useful in situations
+   where an operation needs to be executed after the model partition during
+   the first call to ``smp.step``, but before the actual execution of the
+   first forward pass.
+
+   .. code:: python
+
+      @smp.register_post_partition_hook
+      def test_eager():
+          # All statements here will be executed right after partition but before the first forward pass
+          tf.print("Entered hook through eager context")
 
 
 .. class:: smp.CheckpointManager
@@ -102,13 +116,6 @@ TensorFlow API
                       max_to_keep=None,
                       checkpoint_name="ckpt")
 
-
-**Important:** ``smp.CheckpointManager.restore()`` must be called after
-the first training step. This is because the first call of the
-``smp.step`` function constructs and partitions the model, which must
-take place before the checkpoint restore. Calling it before the first
-``smp.step`` call might result in hangs or unexpected behavior.
-
 **Parameters**
 
 - ``checkpoint``: A `tf.train.Checkpoint
@@ -154,7 +161,8 @@ TensorFlow API
 .. code:: python
 
    for step, inputs in enumerate(train_ds):
-       if step == 1:                    # NOTE: restore occurs on the second step
+       if step == 0:
            ckpt_manager.restore()
        loss = train_step(inputs)
 
+
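
Tying the two changes together, a hypothetical sketch (not taken from the diff) of using the new hook to restore a checkpoint right after partition, the use case called out in the 1.3.1 release notes; `smp` and `ckpt_manager` are assumed to be set up as in the surrounding API doc:

```python
import smdistributed.modelparallel.tensorflow as smp  # import alias assumed from the SMP docs

# Hypothetical hook: runs after the model is partitioned during the first
# smp.step call, but before the first forward pass.
@smp.register_post_partition_hook
def restore_after_partition():
    ckpt_manager.restore()  # assumes an smp.CheckpointManager created earlier
```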

doc/frameworks/huggingface/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ For general information about using the SageMaker Python SDK, see :ref:`overview
    :maxdepth: 2
 
    sagemaker.huggingface
+   Use Hugging Face with the SageMaker Python SDK <https://huggingface.co/transformers/sagemaker.html>

doc/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 sphinx==3.1.1
 sphinx-rtd-theme==0.5.0
+docutils==0.15.2

src/sagemaker/clarify.py

Lines changed: 6 additions & 3 deletions
@@ -123,6 +123,7 @@ def __init__(
         content_type=None,
         content_template=None,
         custom_attributes=None,
+        accelerator_type=None,
     ):
         """Initializes a configuration of a model and the endpoint to be created for it.
 
@@ -151,6 +152,9 @@ def __init__(
                 Section 3.3.6. Field Value Components (
                 https://tools.ietf.org/html/rfc7230#section-3.2.6) of the Hypertext Transfer
                 Protocol (HTTP/1.1).
+            accelerator_type (str): The Elastic Inference accelerator type to deploy to the model
+                endpoint instance for making inferences to the model, see
+                https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html.
         """
         self.predictor_config = {
             "model_name": model_name,
@@ -178,9 +182,8 @@ def __init__(
                 f" Please include a placeholder $features."
             )
             self.predictor_config["content_template"] = content_template
-
-        if custom_attributes is not None:
-            self.predictor_config["custom_attributes"] = custom_attributes
+        _set(custom_attributes, "custom_attributes", self.predictor_config)
+        _set(accelerator_type, "accelerator_type", self.predictor_config)
 
     def get_predictor_config(self):
         """Returns part of the predictor dictionary of the analysis config."""
