Commit 7a91c3f

Merge branch 'master' into fix-disable-profiler-settings
2 parents b3846a9 + f15343c commit 7a91c3f

22 files changed (+173 -169 lines changed)

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -1,5 +1,16 @@
 # Changelog

+## v2.24.5 (2021-02-12)
+
+### Bug Fixes and Other Changes
+
+* test_tag/test_tags method assert fix in association tests
+
+### Documentation Changes
+
+* removing mention of TF 2.4 from SM distributed model parallel docs
+* adding details about mpi options, other small updates
+
 ## v2.24.4 (2021-02-09)

 ### Bug Fixes and Other Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.24.5.dev0
+2.24.6.dev0

buildspec-deploy.yml

Lines changed: 10 additions & 1 deletion
@@ -3,7 +3,16 @@ version: 0.2
 phases:
   build:
     commands:
-      - PACKAGE_FILE="$CODEBUILD_SRC_DIR_ARTIFACT_1/sagemaker-*.tar.gz"
+      # prepare the release (update versions, changelog etc.)
+      - git-release --prepare
+
+      # generate the distribution package
+      - python3 setup.py sdist
+
+      # publish the release to github
+      - git-release --publish
+
+      - PACKAGE_FILE="dist/sagemaker-*.tar.gz"
       - PYPI_USER=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/user --query SecretString --output text)
       - PYPI_PASSWORD=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/password --query SecretString --output text)
       - GPG_PRIVATE_KEY=$(aws secretsmanager get-secret-value --secret-id /codebuild/gpg/private_key --query SecretString --output text)

buildspec-release.yml

Lines changed: 0 additions & 15 deletions
@@ -3,9 +3,6 @@ version: 0.2
 phases:
   build:
     commands:
-      # prepare the release (update versions, changelog etc.)
-      - git-release --prepare
-
       # run linters
       - tox -e flake8,pylint

@@ -22,15 +19,3 @@ phases:

       # run a subset of the integration tests
       - IGNORE_COVERAGE=- tox -e py36 -- tests/integ -m canary_quick -n 64 --boxed --reruns 2
-
-      # generate the distribution package
-      - python3 setup.py sdist
-
-      # publish the release to github
-      - git-release --publish
-
-artifacts:
-  files:
-    - dist/sagemaker-*.tar.gz
-  name: ARTIFACT_1
-  discard-paths: yes

doc/api/training/smd_model_parallel_general.rst

Lines changed: 41 additions & 3 deletions
@@ -5,13 +5,13 @@

 .. _sm-sdk-modelparallel-params:

-SageMaker Python SDK ``modelparallel`` parameters
-=================================================
+Required SageMaker Python SDK parameters
+========================================

 The TensorFlow and PyTorch ``Estimator`` objects contains a ``distribution`` parameter,
 which is used to enable and specify parameters for the
 initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
-so in order to use model parallelism, MPI must be enabled using the ``distribution`` parameter.
+so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.

 The following is an example of how you can launch a new PyTorch training job with the library.

@@ -55,6 +55,9 @@ The following is an example of how you can launch a new PyTorch training job wit

 smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

+``smdistributed`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 You can use the following parameters to initialize the library using the ``parameters``
 in the ``smdistributed`` of ``distribution``.

@@ -302,6 +305,41 @@ table are optional.
 | | | | SageMaker. |
 +-------------------+-------------------------+-----------------+-----------------------------------+

+``mpi`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+For the ``"mpi"`` key, a dict must be passed which contains:
+
+* ``"enabled"``: Set to ``True`` to launch the training job with MPI.
+
+* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
+In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains
+a one-to-one mapping between processes and GPUs across model and data parallelism.
+This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
+If you are using PyTorch, you must restrict each process to its own device using
+``torch.cuda.set_device(smp.local_rank())``. To learn more, see
+`Modify a PyTorch Training Script
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.
+
+.. important::
+``processes_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to
+the number of GPUs per instance.
+
+For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
+then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
+such as an ml.p3.16xlarge.
+
+The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs:
+the model is partitioned across 4 GPUs, and each partition is added to 2 GPUs.
+
+.. image:: smp_versions/model-data-parallel.png
+:width: 650
+:alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs
+
+
+* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need.
+To avoid Docker warnings from contaminating your training logs, we recommend the following flag.
+```--mca btl_vader_single_copy_mechanism none```
+

 .. _ranking-basics:
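
Reviewer note: the ``processes_per_host`` arithmetic described in the new ``mpi`` section can be sketched in plain Python. This is a minimal illustration, not code from the SDK: the helper name ``build_distribution`` and its arguments are hypothetical, and the ``"modelparallel"`` nesting and the ``"partitions"`` parameter are assumptions about the ``distribution`` layout that this hunk does not show in full. Only the three ``"mpi"`` keys come from the documentation text above.

```python
def build_distribution(model_parallel_degree, data_parallel_degree, gpus_per_instance):
    """Build a ``distribution`` dict for a single-instance job (hypothetical helper)."""
    # One process per GPU: model-parallel ways times data-parallel ways.
    processes_per_host = model_parallel_degree * data_parallel_degree
    if processes_per_host > gpus_per_instance:
        raise ValueError(
            f"processes_per_host={processes_per_host} exceeds "
            f"{gpus_per_instance} GPUs per instance"
        )
    return {
        "smdistributed": {
            # Assumed nesting; "partitions" is an assumed parameter name.
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": model_parallel_degree},
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
            # Flag recommended in the docs above to keep Docker warnings out of logs.
            "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
        },
    }


# 4-way model parallelism x 2-way data parallelism on an 8-GPU instance.
distribution = build_distribution(4, 2, gpus_per_instance=8)
print(distribution["mpi"]["processes_per_host"])
```

With a 4-GPU instance the same call raises ``ValueError``, which mirrors the "less than or equal to the number of GPUs per instance" constraint in the ``important`` note.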

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 0 additions & 4 deletions
@@ -17,10 +17,6 @@

 - Adds support for `_register_comm_hook` (PyTorch 1.7 only) which will register the callable as a communication hook for DDP. NOTE: Like in DDP, this is an experimental API and subject to change.

-### Tensorflow
-
-- Adds support for Tensorflow 2.4
-
 ## Bug Fixes

 ### PyTorch

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ The following SageMaker distribute model parallel APIs are common across all fra
 - https://www.tensorflow.org/api_docs/python/tf/function\
 - https://www.tensorflow.org/guide/function\

+Each ``smp.step`` decorated function must have a return value that depends on the
+output of ``smp.DistributedModel``.
+
 **Common parameters**

 - ``non_split_inputs`` (``list``): The list of arguments to the decorated function

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 16 additions & 3 deletions
@@ -31,7 +31,6 @@ This API document assumes you use the following import statements in your traini
 model in the training script can be wrapped with
 ``smp.DistributedModel``.

-
 **Example:**

 .. code:: python
@@ -89,6 +88,17 @@ This API document assumes you use the following import statements in your traini
 the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside
 a ``smp.step``-decorated function.

+**Using DDP**
+
+If DDP is enabled, do not place a PyTorch
+``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
+the ``DistributedModel`` wrapper will also handle data parallelism.
+
+Unlike the original DDP wrapper, when you use ``DistributedModel``,
+model parameters and buffers are not immediately broadcast across
+processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
+``smp.step``-decorated function when the partition is done.
+
 **Parameters**

 - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).
@@ -248,11 +258,14 @@ This API document assumes you use the following import statements in your traini
 .. function:: join( )

 **Available for PyTorch 1.7 only**
+
 A context manager to be used in conjunction with an instance of
-``smp.DistributedModel``to be able to train with uneven inputs across
+``smp.DistributedModel`` to be able to train with uneven inputs across
 participating processes. This is only supported when ``ddp=True`` for
 ``smp.DistributedModel``. This will use the join with the wrapped
-``DistributedDataParallel`` instance. Please see: `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__.
+``DistributedDataParallel`` instance. For more information, see:
+`join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__
+in the PyTorch documentation.


 .. class:: smp.DistributedOptimizer

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.rst

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 TensorFlow API
 ==============

-**Supported version: 2.4, 2.3**
+**Supported version: 2.3**

 **Important**: This API document assumes you use the following import statement in your training scripts.

src/sagemaker/analytics.py

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ class AnalyticsMetricsBase(with_metaclass(ABCMeta, object)):
     """

     def __init__(self):
+        """Initializes ``AnalyticsMetricsBase`` instance."""
         self._dataframe = None

     def export_csv(self, filename):

src/sagemaker/image_uri_config/inferentia-mxnet.json

Lines changed: 22 additions & 1 deletion
@@ -5,8 +5,29 @@
     "1.5.1": {
       "py_versions": ["py3"],
       "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
         "us-east-1": "785573368785",
-        "us-west-2": "301217895009"
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
+        "us-west-2": "301217895009"
       },
       "repository": "sagemaker-neo-mxnet"
     }

src/sagemaker/image_uri_config/inferentia-pytorch.json

Lines changed: 21 additions & 0 deletions
@@ -5,7 +5,28 @@
     "1.5.1": {
       "py_versions": ["py3"],
       "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
         "us-east-1": "785573368785",
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
         "us-west-2": "301217895009"
       },
       "repository": "sagemaker-neo-pytorch"

src/sagemaker/image_uri_config/inferentia-tensorflow.json

Lines changed: 22 additions & 1 deletion
@@ -5,8 +5,29 @@
     "1.15.0": {
       "py_versions": ["py3"],
       "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
         "us-east-1": "785573368785",
-        "us-west-2": "301217895009"
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
+        "us-west-2": "301217895009"
       },
       "repository": "sagemaker-neo-tensorflow"
     }
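
Reviewer note: the three `registries` maps above pair each region with an AWS account ID that hosts the image repository. The sketch below shows roughly how such a map turns into an ECR repository address. It is an illustration under stated assumptions, not the SDK's own lookup code: the function name `ecr_repository_uri` is hypothetical, the `REGISTRIES` dict is only an excerpt of the entries added in this commit, and the image tag is deliberately left out because these hunks do not show the tag format.

```python
# Excerpt of the region -> account ID map added in this commit.
REGISTRIES = {
    "cn-north-1": "472730292857",
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-2": "301217895009",
}


def ecr_repository_uri(region, repository, registries=REGISTRIES):
    """Compose an ECR repository address for a region (hypothetical helper)."""
    try:
        account = registries[region]
    except KeyError:
        raise ValueError(f"no registry configured for region {region!r}") from None
    # China regions use the amazonaws.com.cn domain suffix.
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{domain}/{repository}"


print(ecr_repository_uri("us-east-2", "sagemaker-neo-pytorch"))
# 007439368137.dkr.ecr.us-east-2.amazonaws.com/sagemaker-neo-pytorch
```

Before this commit only `us-east-1` and `us-west-2` were present, so a lookup like the one above would have failed for every other region.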

tests/integ/sagemaker/lineage/test_action.py

Lines changed: 4 additions & 4 deletions
@@ -90,8 +90,8 @@ def test_tag(action_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert actual_tags[0] == tag

@@ -106,7 +106,7 @@ def test_tags(action_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert [actual_tags[-1]] == tags

tests/integ/sagemaker/lineage/test_artifact.py

Lines changed: 4 additions & 4 deletions
@@ -121,8 +121,8 @@ def test_tag(artifact_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert actual_tags[0] == tag

@@ -137,7 +137,7 @@ def test_tags(artifact_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert [actual_tags[-1]] == tags

tests/integ/sagemaker/lineage/test_association.py

Lines changed: 7 additions & 3 deletions
@@ -66,7 +66,9 @@ def test_set_tag(association_obj, sagemaker_session):
         if actual_tags:
             break
         time.sleep(1)
-    assert len(actual_tags) == 1
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
+    assert len(actual_tags) > 0
     assert actual_tags[0] == tag


@@ -81,5 +83,7 @@ def test_tags(association_obj, sagemaker_session):
         if actual_tags:
             break
         time.sleep(1)
-    assert len(actual_tags) == 1
-    assert actual_tags == tags
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
+    assert len(actual_tags) > 0
+    assert [actual_tags[-1]] == tags

tests/integ/sagemaker/lineage/test_context.py

Lines changed: 4 additions & 4 deletions
@@ -88,8 +88,8 @@ def test_tag(context_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert actual_tags[0] == tag

@@ -104,7 +104,7 @@ def test_tags(context_obj, sagemaker_session):
         )["Tags"]
         if actual_tags:
             break
-    # When sagemaker-client-config endpoint-url is passed as argument to hit beta,
-    # length of actual tags will be 2
+    # When sagemaker-client-config endpoint-url is passed as argument to hit some endpoints,
+    # length of actual tags will be greater than 1
     assert len(actual_tags) > 0
     assert [actual_tags[-1]] == tags
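
Reviewer note: the assertion change in the lineage tests is easier to judge with a small standalone comparison. The sketch below contrasts the old strict check from `test_association.py` with the relaxed one used after this commit. The function names and the `extra` tag are hypothetical, and the sketch assumes that any extra tag returned by an endpoint comes before the tag the test set, which is what checking `actual_tags[-1]` relies on.

```python
def strict_check(actual_tags, tags):
    """Old association-test behaviour: exactly one tag, equal to the expected list."""
    return len(actual_tags) == 1 and actual_tags == tags


def relaxed_check(actual_tags, tags):
    """New behaviour: at least one tag, and the last one matches the expected tag."""
    return len(actual_tags) > 0 and [actual_tags[-1]] == tags


tag = {"Key": "foo", "Value": "bar"}
# Hypothetical tag that some endpoint might attach on its own.
extra = {"Key": "added-by-endpoint", "Value": "x"}

print(strict_check([tag], [tag]), relaxed_check([tag], [tag]))
print(strict_check([extra, tag], [tag]), relaxed_check([extra, tag], [tag]))
```

Note that `test_tag` and `test_set_tag` compare `actual_tags[0]` while `test_tags` compares `actual_tags[-1]`, so the two checks make opposite assumptions about where an extra tag would appear.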
