Commit 7a2c466

mchoi8739 authored and ChoiByungWook committed
documentation: update documentation for the new SageMaker Debugger APIs (#477)
1 parent 1f26aee commit 7a2c466

10 files changed

+678
-293
lines changed

doc/api/training/debugger.rst

Lines changed: 75 additions & 3 deletions
@@ -1,7 +1,79 @@
11
Debugger
22
--------
33

4-
.. automodule:: sagemaker.debugger
5-
:members:
6-
:undoc-members:
4+
Amazon SageMaker Debugger provides full visibility
5+
into training jobs of state-of-the-art machine learning models.
6+
This SageMaker Debugger module provides high-level methods
7+
to set up Debugger configurations to
8+
monitor, profile, and debug your training job.
9+
Configure the Debugger-specific parameters when constructing
10+
a SageMaker estimator to gain visibility and insights
11+
into your training job.
12+
13+
.. currentmodule:: sagemaker.debugger
14+
15+
.. autoclass:: get_rule_container_image_uri
16+
:show-inheritance:
17+
18+
.. autoclass:: get_default_profiler_rule
19+
:show-inheritance:
20+
21+
.. class:: sagemaker.debugger.rule_configs
22+
23+
A helper module to configure the SageMaker Debugger built-in rules with
24+
the :class:`~sagemaker.debugger.Rule` classmethods and
25+
the :class:`~sagemaker.debugger.ProfilerRule` classmethods.
26+
27+
For a full list of built-in rules, see
28+
`List of Debugger Built-in Rules
29+
<https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html>`_.
30+
31+
This module is imported from the Debugger client library for rule configuration.
32+
For more information, see
33+
`Amazon SageMaker Debugger RulesConfig
34+
<https://github.com/awslabs/sagemaker-debugger-rulesconfig>`_.
35+
36+
.. autoclass:: RuleBase
37+
:show-inheritance:
38+
39+
.. autoclass:: Rule
40+
:show-inheritance:
41+
:inherited-members:
42+
43+
.. autoclass:: ProfilerRule
44+
:show-inheritance:
45+
:inherited-members:
46+
47+
.. autoclass:: CollectionConfig
48+
:show-inheritance:
49+
50+
.. autoclass:: DebuggerHookConfig
751
:show-inheritance:
52+
53+
.. autoclass:: TensorBoardOutputConfig
54+
:show-inheritance:
55+
56+
.. autoclass:: ProfilerConfig
57+
:show-inheritance:
58+
59+
.. autoclass:: FrameworkProfile
60+
:show-inheritance:
61+
62+
.. autoclass:: DetailedProfilingConfig
63+
:show-inheritance:
64+
65+
.. autoclass:: DataloaderProfilingConfig
66+
:show-inheritance:
67+
68+
.. autoclass:: PythonProfilingConfig
69+
:show-inheritance:
70+
71+
.. autoclass:: PythonProfiler
72+
:show-inheritance:
73+
74+
.. autoclass:: cProfileTimer
75+
:show-inheritance:
76+
77+
.. automodule:: sagemaker.debugger.metrics_config
78+
:members: StepRange, TimeRange
79+
:undoc-members:

src/sagemaker/debugger/debugger.py

Lines changed: 154 additions & 86 deletions
Large diffs are not rendered by default.

src/sagemaker/debugger/framework_profile.py

Lines changed: 116 additions & 37 deletions
@@ -38,9 +38,87 @@
3838

3939

4040
class FrameworkProfile:
41-
"""Configuration for the collection of framework metrics in the profiler.
41+
"""
42+
Sets up the profiling configuration for framework metrics.
43+
44+
Validates user inputs and fills in default values if no input is provided.
45+
There are three main profiling options to choose from:
46+
:class:`~sagemaker.debugger.metrics_config.DetailedProfilingConfig`,
47+
:class:`~sagemaker.debugger.metrics_config.DataloaderProfilingConfig`, and
48+
:class:`~sagemaker.debugger.metrics_config.PythonProfilingConfig`.
49+
50+
The following list shows available scenarios of configuring the profiling options.
51+
52+
1. None of the profiling configuration, step range, or time range is specified.
53+
SageMaker Debugger activates framework profiling based on the default settings
54+
of each profiling option.
55+
56+
.. code-block:: python
57+
58+
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
59+
60+
profiler_config=ProfilerConfig(
61+
framework_profile_params=FrameworkProfile()
62+
)
63+
64+
2. Target step or time range is specified to
65+
this :class:`~sagemaker.debugger.FrameworkProfile` class.
66+
The requested target step or time range setting propagates to all of
67+
the framework profiling options.
68+
For example, if you configure this class as follows, all of the profiling options
69+
profile the 6th step:
70+
71+
.. code-block:: python
72+
73+
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
74+
75+
profiler_config=ProfilerConfig(
76+
framework_profile_params=FrameworkProfile(start_step=6, num_steps=1)
77+
)
78+
79+
3. Individual profiling configurations are specified through
80+
the ``*_profiling_config`` parameters.
81+
SageMaker Debugger profiles framework metrics only for the specified profiling configurations.
82+
For example, if the :class:`~sagemaker.debugger.metrics_config.DetailedProfilingConfig` class
83+
is configured but not the other profiling options, Debugger only profiles based on the settings
84+
specified to the
85+
:class:`~sagemaker.debugger.metrics_config.DetailedProfilingConfig` class.
86+
The following example shows a profiling configuration that performs
87+
detailed profiling at step 10, data loader profiling at steps 9 and 10,
88+
and Python profiling at step 12.
89+
90+
.. code-block:: python
91+
92+
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig
93+
94+
profiler_config=ProfilerConfig(
95+
framework_profile_params=FrameworkProfile(
96+
detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
97+
dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
98+
python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1),
99+
)
100+
)
101+
102+
If the individual profiling configurations are specified in addition to
103+
the step or time range,
104+
SageMaker Debugger prioritizes the individual profiling configurations and ignores
105+
the step or time range. For example, in the following code,
106+
the ``start_step=1`` and ``num_steps=10`` will be ignored.
107+
108+
.. code-block:: python
109+
110+
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig
111+
112+
profiler_config=ProfilerConfig(
113+
framework_profile_params=FrameworkProfile(
114+
start_step=1,
115+
num_steps=10,
116+
detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
117+
dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
118+
python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1)
119+
)
120+
)
42121
43-
Validates user input and fills in default values wherever necessary.
44122
"""
45123

46124
def __init__(
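Note on the class docstring above: the three scenarios and the precedence rule can be modeled in plain Python. This is only an illustrative sketch under stated assumptions, not the SDK's implementation. The names `profiled_steps` and `resolve_step_ranges` and the `(start_step, num_steps)` tuples are hypothetical stand-ins for the real config classes. The step arithmetic assumes a range covers `num_steps` consecutive steps beginning at `start_step`, which is what the docstring's own examples suggest.

```python
def profiled_steps(start_step, num_steps):
    """Steps covered by a (start_step, num_steps) pair (assumed semantics)."""
    return list(range(start_step, start_step + num_steps))


def resolve_step_ranges(start_step=None, num_steps=None,
                        detailed=None, dataloader=None, python=None):
    """Hypothetical model of the precedence rule described in the docstring.

    Each of detailed/dataloader/python is either None (option not configured)
    or a (start_step, num_steps) tuple standing in for a *ProfilingConfig.
    """
    individual = {"detailed": detailed, "dataloader": dataloader, "python": python}
    configured = {name: rng for name, rng in individual.items() if rng is not None}

    if configured:
        # Scenario 3: only the configured options are profiled, and the
        # class-level start_step/num_steps are ignored.
        return {name: profiled_steps(*rng) for name, rng in configured.items()}

    if start_step is not None and num_steps is not None:
        # Scenario 2: the class-level range propagates to every option.
        return {name: profiled_steps(start_step, num_steps) for name in individual}

    # Scenario 1: nothing specified. Each option uses its own defaults,
    # which this sketch does not attempt to model.
    return {name: "default" for name in individual}


# Mirrors the last docstring example: start_step=1 and num_steps=10 are ignored.
print(resolve_step_ranges(start_step=1, num_steps=10,
                          detailed=(10, 1), dataloader=(9, 2), python=(12, 1)))
# {'detailed': [10], 'dataloader': [9, 10], 'python': [12]}
```

Keeping the individual-config check first makes the "individual configurations win" rule explicit in the control flow, which is the behavior the docstring promises.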
@@ -59,41 +137,34 @@ def __init__(
59137
start_unix_time=None,
60138
duration=None,
61139
):
62-
"""Set up the profiling configuration for framework metrics based on user input.
63-
64-
There are three main options for the user to choose from.
65-
1. No custom metrics configs or step range or time range specified. Default profiling is
66-
done for each set of framework metrics.
67-
2. Custom metrics configs are specified. Do profiling for the metrics whose configs are
68-
specified and no profiling for the rest of the metrics.
69-
3. Custom step range or time range is specified. Profiling for all of the metrics will
70-
occur with the provided step/time range. Configs with additional parameters beyond
71-
step/time range will use defaults for those additional parameters.
72-
73-
If custom metrics configs are specified in addition to step or time range being specified,
74-
then we ignore the step/time range and default to using custom metrics configs.
140+
"""Initialize the FrameworkProfile class object.
75141
76142
Args:
77-
local_path (str): The path where profiler events have to be saved.
78-
file_max_size (int): Max size a trace file can be, before being rotated.
79-
file_close_interval (float): Interval in seconds from the last close, before being
80-
rotated.
81-
file_open_fail_threshold (int): Number of times to attempt to open a trace fail before
82-
marking the writer as unhealthy.
83143
detailed_profiling_config (DetailedProfilingConfig): The configuration for detailed
84-
profiling done by the framework.
85-
dataloader_profiling_config (DataloaderProfilingConfig): The configuration for metrics
86-
collected in the data loader.
144+
profiling. Configure it using the
145+
:class:`~sagemaker.debugger.metrics_config.DetailedProfilingConfig` class.
146+
Pass ``DetailedProfilingConfig()`` to use the default configuration.
147+
dataloader_profiling_config (DataloaderProfilingConfig): The configuration for
148+
dataloader metrics profiling. Configure it using the
149+
:class:`~sagemaker.debugger.metrics_config.DataloaderProfilingConfig` class.
150+
Pass ``DataloaderProfilingConfig()`` to use the default configuration.
87151
python_profiling_config (PythonProfilingConfig): The configuration for stats
88152
collected by the Python profiler (cProfile or Pyinstrument).
89-
horovod_profiling_config (HorovodProfilingConfig): The configuration for metrics
90-
collected by horovod when using horovod for distributed training.
91-
smdataparallel_profiling_config (SMDataParallelProfilingConfig): The configuration for
92-
metrics collected by SageMaker Distributed training.
153+
Configure it using the
154+
:class:`~sagemaker.debugger.metrics_config.PythonProfilingConfig` class.
155+
Pass ``PythonProfilingConfig()`` to use the default configuration.
93156
start_step (int): The step at which to start profiling.
94157
num_steps (int): The number of steps to profile.
95-
start_unix_time (int): The UNIX time at which to start profiling.
96-
duration (float): The duration in seconds to profile for.
158+
start_unix_time (int): The Unix time at which to start profiling.
159+
duration (float): The duration in seconds to profile.
160+
161+
.. tip::
162+
Available profiling range parameter pairs are
163+
(**start_step** and **num_steps**) and (**start_unix_time** and **duration**).
164+
The two parameter pairs are mutually exclusive, and this class validates
165+
that only one of the two pairs is used. If both pairs are specified, a
166+
conflict error occurs.
167+
97168
"""
98169
self.profiling_parameters = {}
99170
self._use_default_metrics_configs = False
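Note on the tip in the `__init__` docstring: the mutual exclusivity of the two range pairs can be sketched as a small standalone check. The function name `select_range_kind` and the use of `ValueError` are assumptions made for illustration. The diff only says that "a conflict error occurs" and does not show which exception the SDK raises.

```python
def select_range_kind(start_step=None, num_steps=None,
                      start_unix_time=None, duration=None):
    """Hypothetical check: report which range pair is in use, if any."""
    uses_steps = start_step is not None or num_steps is not None
    uses_time = start_unix_time is not None or duration is not None

    if uses_steps and uses_time:
        # The two pairs are mutually exclusive (exception type is an assumption).
        raise ValueError("Specify either a step range or a time range, not both.")
    if uses_steps:
        return "step"
    if uses_time:
        return "time"
    return None  # No custom range requested.


print(select_range_kind(start_step=6, num_steps=1))              # step
print(select_range_kind(start_unix_time=1600000000, duration=5))  # time
```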
@@ -132,6 +203,7 @@ def _process_trace_file_parameters(
132203
rotated.
133204
file_open_fail_threshold (int): Number of times to attempt to open a trace file before
134205
marking the writer as unhealthy.
206+
135207
"""
136208
assert isinstance(local_path, str), ErrorMessages.INVALID_LOCAL_PATH.value
137209
assert (
@@ -152,13 +224,17 @@ def _process_trace_file_parameters(
152224
def _process_metrics_configs(self, *metrics_configs):
153225
"""Helper function to validate and set the provided metrics_configs.
154226
155-
In this case, the user specifies configs for the metrics they want profiled.
156-
Profiling does not occur for metrics if configs are not specified for them.
227+
In this case,
228+
the user specifies configurations for the metrics they want to profile.
229+
Profiling does not occur
230+
for metrics if the configurations are not specified for them.
157231
158232
Args:
159233
metrics_configs: The list of metrics configs specified by the user.
234+
160235
Returns:
161-
bool: Whether custom metrics configs will be used for profiling.
236+
bool: Indicates whether custom metrics configs will be used for profiling.
237+
162238
"""
163239
metrics_configs = [config for config in metrics_configs if config is not None]
164240
if len(metrics_configs) == 0:
@@ -173,16 +249,19 @@ def _process_metrics_configs(self, *metrics_configs):
173249
def _process_range_fields(self, start_step, num_steps, start_unix_time, duration):
174250
"""Helper function to validate and set the provided range fields.
175251
176-
Profiling will occur for all of the metrics using these fields as the specified
177-
range and default parameters for the rest of the config fields (if necessary).
252+
Profiling occurs
253+
for all of the metrics using these fields as the specified range and default parameters
254+
for the rest of the configuration fields (if necessary).
178255
179256
Args:
180257
start_step (int): The step at which to start profiling.
181258
num_steps (int): The number of steps to profile.
182259
start_unix_time (int): The UNIX time at which to start profiling.
183-
duration (float): The duration in seconds to profile for.
260+
duration (float): The duration in seconds to profile.
261+
184262
Returns:
185-
bool: Whether custom step or time range will be used for profiling.
263+
bool: Indicates whether a custom step or time range will be used for profiling.
264+
186265
"""
187266
if start_step is num_steps is start_unix_time is duration is None:
188267
return False
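Note on the chained identity test in `_process_range_fields`: Python evaluates `a is b is c is d is None` as `(a is b) and (b is c) and (c is d) and (d is None)`, so the condition is true only when all four fields are `None`. The standalone sketch below uses hypothetical helper names to compare that form with a more explicit `all(...)` spelling over every combination of set and unset fields.

```python
import itertools


def all_unset_chained(start_step, num_steps, start_unix_time, duration):
    # Same chained comparison as in the diff.
    return start_step is num_steps is start_unix_time is duration is None


def all_unset_explicit(*fields):
    # Explicit spelling of the same intent.
    return all(field is None for field in fields)


# Check every combination of "unset" (None) and "set" (an arbitrary value).
agree = all(
    all_unset_chained(*combo) == all_unset_explicit(*combo)
    for combo in itertools.product([None, 1], repeat=4)
)
print(agree)  # True
```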
