
Commit b764c09

Merge branch 'master' into master
2 parents: 44907ce + 442227b

28 files changed, +321 -261 lines changed

doc/api/prep_data/feature_store.rst

Lines changed: 4 additions & 0 deletions
@@ -73,6 +73,10 @@ Inputs
    :members:
    :show-inheritance:
 
+.. autoclass:: sagemaker.feature_store.inputs.TableFormatEnum
+   :members:
+   :show-inheritance:
+
 
 Dataset Builder
 ***************
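The new ``TableFormatEnum`` documents the table formats available for a feature group's offline store. As a hedged illustration of where the enum plugs in (the feature group name, bucket, role ARN, and the assumption that feature definitions were loaded beforehand are all placeholders, not part of this commit):

.. code:: python

  # Illustrative sketch only: selecting an offline-store table format with
  # TableFormatEnum. All names and ARNs below are placeholders, and it assumes
  # feature definitions were loaded first, e.g. via
  # feature_group.load_feature_definitions(data_frame=df).
  import sagemaker
  from sagemaker.feature_store.feature_group import FeatureGroup
  from sagemaker.feature_store.inputs import TableFormatEnum

  session = sagemaker.Session()
  role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

  feature_group = FeatureGroup(name="my-feature-group", sagemaker_session=session)
  feature_group.create(
      s3_uri="s3://my-bucket/offline-store",  # placeholder offline store location
      record_identifier_name="record_id",
      event_time_feature_name="event_time",
      role_arn=role,
      enable_online_store=True,
      table_format=TableFormatEnum.ICEBERG,  # or TableFormatEnum.GLUE
  )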

doc/api/training/sdp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -26,8 +26,8 @@ depending on the version of the library you use.
    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
    for more information.
 
-Version 1.4.0, 1.4.1, 1.5.0 (Latest)
-====================================
+Version 1.4.0, 1.4.1, 1.5.0, 1.6.0 (Latest)
+===========================================
 
 .. toctree::
    :maxdepth: 1

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 43 additions & 7 deletions
@@ -7,9 +7,51 @@ Release Notes
 New features, bug fixes, and improvements are regularly made to the SageMaker
 distributed data parallel library.
 
-SageMaker Distributed Data Parallel 1.5.0 Release Notes
+SageMaker Distributed Data Parallel 1.6.0 Release Notes
 =======================================================
 
+*Date: Dec. 15. 2022*
+
+**New Features**
+
+* New optimized SMDDP AllGather collective to complement the sharded data parallelism technique
+  in the SageMaker model parallelism library. For more information, see `Sharded data parallelism with SMDDP Collectives
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for Amazon EC2 ``ml.p4de.24xlarge`` instances. You can run data parallel training jobs
+  on ``ml.p4de.24xlarge`` instances with the SageMaker data parallelism library’s AllReduce collective.
+
+**Improvements**
+
+* General performance improvements of the SMDDP AllReduce collective communication operation.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v1.12.1
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
+
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-bring-your-own-container>`_ users:
+
+.. code::
+
+  https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl
+
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Data Parallel 1.5.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 *Date: Jul. 26. 2022*
 
 **Currency Updates**
@@ -38,12 +80,6 @@ Binary file of this version of the library for `custom container
 
   https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl
 
-
-----
-
-Release History
-===============
-
 SageMaker Distributed Data Parallel 1.4.1 Release Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
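As a hedged sketch of how a training job might target this release through the SageMaker Python SDK (the entry point, role, and S3 paths are placeholder assumptions; the ``distribution`` argument is the SDK's documented switch for the data parallel library):

.. code:: python

  # Illustrative sketch: an SMDDP AllReduce job on the newly supported
  # ml.p4de.24xlarge instances, using the PyTorch 1.12.1 DLC listed above.
  # entry_point, role, and the S3 paths are placeholders.
  from sagemaker.pytorch import PyTorch

  estimator = PyTorch(
      entry_point="train.py",  # placeholder training script
      role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
      framework_version="1.12.1",
      py_version="py38",
      instance_type="ml.p4de.24xlarge",
      instance_count=2,
      distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
  )
  estimator.fit("s3://my-bucket/training-data")  # placeholder data channel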
4985

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 53 additions & 7 deletions
@@ -6,9 +6,60 @@ New features, bug fixes, and improvements are regularly made to the SageMaker
 distributed model parallel library.
 
 
-SageMaker Distributed Model Parallel 1.11.0 Release Notes
+SageMaker Distributed Model Parallel 1.13.0 Release Notes
 =========================================================
 
+*Date: Dec. 15. 2022*
+
+**New Features**
+
+* Sharded data parallelism now supports a new backend for collectives called *SMDDP Collectives*.
+  For supported scenarios, SMDDP Collectives are on by default for the AllGather operation.
+  For more information, see
+  `Sharded data parallelism with SMDDP Collectives
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Introduced FlashAttention for DistributedTransformer to improve memory usage and computational
+  performance of models such as GPT2, GPTNeo, GPTJ, GPTNeoX, BERT, and RoBERTa.
+
+**Bug Fixes**
+
+* Fixed initialization of ``lm_head`` in DistributedTransformer to use a provided range
+  for initialization, when weights are not tied with the embeddings.
+
+**Improvements**
+
+* When a module has no parameters, we have introduced an optimization to execute
+  such a module on the same rank as its parent during pipeline parallelism.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v1.12.1
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
+
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch 1.12.0
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.11.0 Release Notes
+---------------------------------------------------------
+
 *Date: August. 17. 2022*
 
 **New Features**
@@ -41,12 +92,7 @@ Binary file of this version of the library for `custom container
 
 .. code::
 
-  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl
-
-----
-
-Release History
-===============
+  https://sagemaker-distribu
 
 SageMaker Distributed Model Parallel 1.10.1 Release Notes
 ---------------------------------------------------------
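As a hedged sketch of launching a job that uses sharded data parallelism, which the new SMDDP Collectives backend accelerates by default in supported scenarios: the script, role, degree, and process counts below are illustrative assumptions; see the linked developer guide for the supported configurations.

.. code:: python

  # Illustrative sketch only: sharded data parallelism via the model parallel
  # library. The degree, entry_point, and role are placeholder assumptions.
  from sagemaker.pytorch import PyTorch

  smp_options = {
      "enabled": True,
      "parameters": {
          "ddp": True,
          "sharded_data_parallel_degree": 8,  # shard states across 8 workers
      },
  }

  estimator = PyTorch(
      entry_point="train_gpt.py",  # placeholder training script
      role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
      framework_version="1.12.1",
      py_version="py38",
      instance_type="ml.p4d.24xlarge",
      instance_count=1,
      distribution={
          "smdistributed": {"modelparallel": smp_options},
          "mpi": {"enabled": True, "processes_per_host": 8},
      },
  )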

doc/api/training/smp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.
 
-Version 1.11.0 (Latest)
-===========================================
+Version 1.11.0, 1.13.0 (Latest)
+===============================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.

src/sagemaker/debugger/profiler_config.py

Lines changed: 4 additions & 0 deletions
@@ -32,6 +32,7 @@ def __init__(
         s3_output_path: Optional[Union[str, PipelineVariable]] = None,
         system_monitor_interval_millis: Optional[Union[int, PipelineVariable]] = None,
         framework_profile_params: Optional[FrameworkProfile] = None,
+        disable_profiler: Optional[Union[str, PipelineVariable]] = False,
     ):
         """Initialize a ``ProfilerConfig`` instance.
@@ -78,6 +79,7 @@ class and SageMaker Framework estimators.
         self.s3_output_path = s3_output_path
         self.system_monitor_interval_millis = system_monitor_interval_millis
         self.framework_profile_params = framework_profile_params
+        self.disable_profiler = disable_profiler
 
     def _to_request_dict(self):
         """Generate a request dictionary using the parameters provided when initializing the object.
@@ -91,6 +93,8 @@ def _to_request_dict(self):
         if self.s3_output_path is not None:
             profiler_config_request["S3OutputPath"] = self.s3_output_path
 
+        profiler_config_request["DisableProfiler"] = self.disable_profiler
+
         if self.system_monitor_interval_millis is not None:
             profiler_config_request[
                 "ProfilingIntervalInMilliseconds"
src/sagemaker/estimator.py

Lines changed: 13 additions & 10 deletions
@@ -938,26 +938,29 @@ def _prepare_collection_configs(self):
     def _prepare_profiler_for_training(self):
         """Set necessary values and do basic validations in profiler config and profiler rules.
 
-        When user explicitly set rules to an empty list, default profiler rule won't be enabled.
-        Default profiler rule will be enabled in supported regions when either:
-        1. user doesn't specify any rules, i.e., rules=None; or
-        2. user only specify debugger rules, i.e., rules=[Rule.sagemaker(...)]
+        No default profiler rule will be used. The user needs to specify rules explicitly
         """
         if self.disable_profiler:
-            if self.profiler_config:
-                raise RuntimeError("profiler_config cannot be set when disable_profiler is True.")
+            if self.profiler_config and not self.profiler_config.disable_profiler:
+                raise RuntimeError(
+                    "profiler_config.disable_profiler cannot be False"
+                    + " when disable_profiler is True."
+                )
             if self.profiler_rules:
                 raise RuntimeError("ProfilerRule cannot be set when disable_profiler is True.")
         elif _region_supports_profiler(self.sagemaker_session.boto_region_name):
             if self.profiler_config is None:
                 self.profiler_config = ProfilerConfig(s3_output_path=self.output_path)
             if self.rules is None or (self.rules and not self.profiler_rules):
-                self.profiler_rules = [get_default_profiler_rule()]
+                self.profiler_rules = []
 
         if self.profiler_config and not self.profiler_config.s3_output_path:
             self.profiler_config.s3_output_path = self.output_path
 
         self.profiler_rule_configs = self._prepare_profiler_rules()
+        # if profiler_config is still None, it means the job has profiler disabled
+        if self.profiler_config is None:
+            self.profiler_config = ProfilerConfig(disable_profiler=True)
 
     def _prepare_profiler_rules(self):
         """Set any necessary values in profiler rules, if they are provided."""
@@ -1048,7 +1051,7 @@ def latest_job_profiler_artifacts_path(self):
             error_message="""Cannot get the profiling output artifacts path.
         The Estimator is not associated with a training job."""
         )
-        if self.profiler_config is not None:
+        if self.profiler_config is not None and not self.profiler_config.disable_profiler:
             return os.path.join(
                 self.profiler_config.s3_output_path,
                 self.latest_training_job.name,
@@ -1895,8 +1898,8 @@ def enable_default_profiling(self):
         else:
             self.profiler_config = ProfilerConfig(s3_output_path=self.output_path)
 
-        self.profiler_rules = [get_default_profiler_rule()]
-        self.profiler_rule_configs = self._prepare_profiler_rules()
+        self.profiler_rules = []
+        self.profiler_rule_configs = []
 
         _TrainingJob.update(
             self, self.profiler_rule_configs, self.profiler_config._to_request_dict()
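At the estimator level, the net effect is that a job with profiling disabled now sends an explicit ``ProfilerConfig(disable_profiler=True)`` instead of omitting the config, and no default profiler rule is attached. A hedged sketch (image URI, role, and instance settings are placeholders):

.. code:: python

  # Illustrative sketch of the estimator-level behavior after this change.
  # image_uri and role are placeholders.
  from sagemaker.estimator import Estimator

  estimator = Estimator(
      image_uri="my-training-image",  # placeholder
      role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
      instance_count=1,
      instance_type="ml.m5.xlarge",
      disable_profiler=True,  # job is created with ProfilerConfig(disable_profiler=True)
  )

  # Profiling can still be re-enabled on the latest training job; note that
  # enable_default_profiling() no longer attaches a default profiler rule.
  # estimator.enable_default_profiling()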

src/sagemaker/fw_utils.py

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@
     "ml.p3.16xlarge",
     "ml.p3dn.24xlarge",
     "ml.p4d.24xlarge",
+    "ml.p4de.24xlarge",
     "local_gpu",
 )
 SM_DATAPARALLEL_SUPPORTED_FRAMEWORK_VERSIONS = {

src/sagemaker/session.py

Lines changed: 5 additions & 0 deletions
@@ -3336,6 +3336,11 @@ def create_endpoint_config_from_existing(
         if request_data_capture_config_dict is not None:
             request["DataCaptureConfig"] = request_data_capture_config_dict
 
+        if existing_endpoint_config_desc.get("AsyncInferenceConfig") is not None:
+            request["AsyncInferenceConfig"] = existing_endpoint_config_desc.get(
+                "AsyncInferenceConfig", None
+            )
+
         self.sagemaker_client.create_endpoint_config(**request)
 
     def create_endpoint(self, endpoint_name, config_name, tags=None, wait=True):
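A hedged sketch of the call this change affects: when the source endpoint config carries an ``AsyncInferenceConfig``, the copy now inherits it. Both config names are placeholders.

.. code:: python

  # Illustrative sketch: cloning an endpoint config. With the change above,
  # an AsyncInferenceConfig on the source config is carried over to the copy.
  # Both config names are placeholders.
  import sagemaker

  session = sagemaker.Session()
  session.create_endpoint_config_from_existing(
      existing_config_name="my-async-endpoint-config",
      new_config_name="my-async-endpoint-config-copy",
  )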

tests/integ/sagemaker/workflow/test_workflow.py

Lines changed: 0 additions & 4 deletions
@@ -1269,8 +1269,6 @@ def test_caching_behavior(
         # create pipeline
         pipeline.create(role)
         definition = json.loads(pipeline.definition())
-        # delete profiler config for assertions as it will contain a timestamp
-        del definition["Steps"][1]["Arguments"]["ProfilerRuleConfigurations"]
 
         # verify input path
         expected_abalone_input_path = f"{pipeline_name}/{step_process.name}" f"/input/abalone_data"
@@ -1295,7 +1293,6 @@ def test_caching_behavior(
 
         # verify no changes
         definition2 = json.loads(pipeline.definition())
-        del definition2["Steps"][1]["Arguments"]["ProfilerRuleConfigurations"]
         assert definition == definition2
 
         # add dummy file to source_dir
@@ -1306,7 +1303,6 @@ def test_caching_behavior(
 
         # verify changes
         definition3 = json.loads(pipeline.definition())
-        del definition3["Steps"][1]["Arguments"]["ProfilerRuleConfigurations"]
         assert definition != definition3
 
     finally:
