Commit 438c147: Merging the current draft of docs (aws#56)
1 parent edddaa2

Squashed commit messages:

* WIP docs
* WIP
* Finished common_api.md
* WIP
* MXNet first pass
* WIP docs
* Address some comments, consolidate summary and glossary into README
* Address comments
* Address comments
* Address some comments
* WIP default collections
* More docs
* Docs
* WIP
* Ready for first merge
* Details about JSON file
* Address some of Rahul's comments, format markdown with python
* Highlights section
* Sagemaker first
* Remove json spec
* Typo
* docs
* SageMaker ZCC front and center
* explain zcc
* Docs for Trial, Tensor, Rule (aws#45)
* Remove sagemaker docs, Update parts of Rules readme with trial info
* Trial, Tensor, Rules APIs
* Undo code change in this PR
* Update tensors method doc
* Try to fix anchor links
* Fix anchor links
* How to fix indentation in markdown?
* Update links markdown
* Change typing for method doc
* change name of dict
* Reduce size of TOC header
* move file
* update tensor_names method

16 files changed: +1388 / -2233 lines

docs/rules/README.md (0 additions, 483 deletions): this file was deleted.
documentation/API.md (271 additions, 0 deletions):

# Common API

These objects exist across all frameworks.

- [SageMaker Zero-Code-Change vs. Python API](#sagemaker-zero-code-change-vs-python-api)
- [Creating a Hook](#creating-a-hook)
  - [Hook from SageMaker](#hook-from-sagemaker)
  - [Hook from Python](#hook-from-python)
- [Modes](#modes)
- [Collection](#collection)
- [SaveConfig](#saveconfig)
- [ReductionConfig](#reductionconfig)

---

## SageMaker Zero-Code-Change vs. Python API

There are two ways to use sagemaker-debugger: SageMaker Zero-Code-Change or the Python API.

SageMaker Zero-Code-Change uses a custom framework fork to automatically instantiate the hook, register tensors, and create collections.
All you need to do is decide which built-in rules to use. Further documentation is available on [AWS Docs](https://link.com).
```python
import sagemaker
from sagemaker.debugger import rule_configs, Rule, CollectionConfig, DebuggerHookConfig, TensorBoardOutputConfig

hook_config = DebuggerHookConfig(
    s3_output_path=args.s3_path,
    container_local_path=args.local_path,
    hook_parameters={
        "save_steps": "0,20,40,60,80"
    },
    collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="biases"),
    ],
)

rule = Rule.sagemaker(
    rule_configs.exploding_tensor(),
    rule_parameters={
        "tensor_regex": ".*"
    },
    collections_to_save=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="losses"),
    ],
)

sagemaker_simple_estimator = sagemaker.tensorflow.TensorFlow(
    entry_point="script.py",
    role=sagemaker.get_execution_role(),
    framework_version="1.15",
    py_version="py3",
    rules=[rule],
    debugger_hook_config=hook_config,
)

sagemaker_simple_estimator.fit()
```

The Python API requires more configuration but is also more flexible. You must write your own custom rules
instead of using SageMaker's built-in rules, but you can use it with a custom container in SageMaker or in your own
environment. It is described further below.

---

## Creating a Hook

### Hook from SageMaker
If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API
as described in [AWS Docs](https://link.com),
a JSON file will be automatically written. You can create a hook from this file by calling
```python
hook = smd.{hook_class}.create_from_json_file()
```
with no arguments and then use the hook Python API in your script. `hook_class` will be `Hook` for PyTorch, MXNet, and XGBoost. For TensorFlow, it will be one of `KerasHook`, `SessionHook`, or `EstimatorHook`.

### Hook from Python
See the framework-specific pages for more details.
* [TensorFlow](https://link.com)
* [PyTorch](https://link.com)
* [MXNet](https://link.com)
* [XGBoost](https://link.com)

---

## Modes
Modes signify which part of training you're in, similar to Keras modes. `GLOBAL` mode is used as
the default. Choose from
```python
smd.modes.TRAIN
smd.modes.EVAL
smd.modes.PREDICT
smd.modes.GLOBAL
```

---

## Collection

The Collection object groups tensors such as "losses", "weights", "biases", or "gradients".
A collection has its own list of tensors, include/exclude regex patterns, reduction config, and save config.
This allows you to set different save and reduction configs for different tensors.
These collections are then also available during analysis.

You can choose which of these built-in collections (or define your own) to save via the hook's `include_collections` parameter. By default, only a few collections are saved.

| Framework | include_collections (default) |
|---|---|
| `TensorFlow` | METRICS, LOSSES, SEARCHABLE_SCALARS |
| `PyTorch` | LOSSES, SCALARS |
| `MXNet` | LOSSES, SCALARS |
| `XGBoost` | METRICS |

Each framework has pre-defined settings for certain collections. For example, TensorFlow's KerasHook
will automatically place weights into the `smd.CollectionKeys.WEIGHTS` collection. PyTorch uses the regex
`"^(?!gradient).*weight"` to automatically place tensors in the weights collection.

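That weights regex can be sanity-checked with the standard `re` module. The tensor names below are hypothetical, chosen only to illustrate the match behavior:

```python
import re

# PyTorch's default pattern for the weights collection (quoted above):
# a negative lookahead rejects names starting with "gradient",
# and the rest of the name must contain "weight".
weights_pattern = re.compile(r"^(?!gradient).*weight")

# Hypothetical tensor names, for illustration only.
names = ["fc1.weight", "fc1.bias", "gradient/fc1.weight"]

matched = [name for name in names if weights_pattern.match(name)]
print(matched)  # ['fc1.weight']
```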
| CollectionKey | Frameworks | Description |
|---|---|---|
| `ALL` | all | Saves all tensors. |
| `DEFAULT` | all | ??? |
| `WEIGHTS` | TensorFlow, PyTorch, MXNet | Matches all weight tensors. |
| `BIASES` | TensorFlow, PyTorch, MXNet | Matches all bias tensors. |
| `GRADIENTS` | TensorFlow, PyTorch, MXNet | Matches all gradient tensors. In TensorFlow (non-DLC), you must use `hook.wrap_optimizer()`. |
| `LOSSES` | TensorFlow, PyTorch, MXNet | Matches all loss tensors. |
| `SCALARS` | TensorFlow, PyTorch, MXNet | Matches all scalar tensors, such as loss or accuracy. |
| `METRICS` | TensorFlow, XGBoost | ??? |
| `INPUTS` | TensorFlow | Matches all inputs to a layer (outputs of the previous layer). |
| `OUTPUTS` | TensorFlow | Matches all outputs of a layer (inputs of the following layer). |
| `SEARCHABLE_SCALARS` | TensorFlow | Scalars that will go to SageMaker Metrics. |
| `OPTIMIZER_VARIABLES` | TensorFlow | Matches all optimizer variables. |
| `HYPERPARAMETERS` | XGBoost | ... |
| `PREDICTIONS` | XGBoost | ... |
| `LABELS` | XGBoost | ... |
| `FEATURE_IMPORTANCE` | XGBoost | ... |
| `AVERAGE_SHAP` | XGBoost | ... |
| `FULL_SHAP` | XGBoost | ... |
| `TREES` | XGBoost | ... |

```python
coll = smd.Collection(
    name,
    include_regex=None,
    tensor_names=None,
    reduction_config=None,
    save_config=None,
    save_histogram=True,
)
```
`name` (str): Used to identify the collection.\
`include_regex` (list[str]): The regexes used to match tensor names for the collection.\
`tensor_names` (list[str]): A list of tensor names to include.\
`reduction_config` (ReductionConfig object): Which reductions to store in the collection.\
`save_config` (SaveConfig object): Settings for how often to save the collection.\
`save_histogram` (bool): Whether to save histogram data for the collection. Only used if TensorBoard support is enabled. Not computed for scalar collections such as losses.

### Accessing a Collection

| Function | Behavior |
|---|---|
| `hook.get_collection(collection_name)` | Returns the collection with the given name. Creates the collection with default settings if it doesn't already exist. |
| `hook.get_collections()` | Returns all collections as a dictionary keyed by collection name. |
| `hook.add_to_collection(collection_name, args)` | Equivalent to calling `coll.add(args)` on the collection named `collection_name`. |
### Properties of a Collection
| Property | Description |
|---|---|
| `tensor_names` | Get or set the list of tensor names as strings. |
| `include_regex` | Get or set the list of regexes to include. |
| `reduction_config` | Get or set the ReductionConfig object. |
| `save_config` | Get or set the SaveConfig object. |

### Methods on a Collection

| Method | Behavior |
|---|---|
| `coll.include(regex)` | Takes a regex string, or a list of regex strings, matching tensors to include in the collection. |
| `coll.add(tensor)` | **(TensorFlow only)** Takes an instance, list, or set of tf.Tensor/tf.Variable/tf.MirroredVariable/tf.Operation to add to the collection. |
| `coll.add_keras_layer(layer, inputs=False, outputs=True)` | **(tf.keras only)** Takes an instance of a tf.keras layer and logs input/output tensors for that layer. By default, only outputs are saved. |
| `coll.add_module_tensors(module, inputs=False, outputs=True)` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input/output tensors for that module. By default, only outputs are saved. |
| `coll.add_block_tensors(block, inputs=False, outputs=True)` | **(MXNet only)** Takes an instance of a Gluon block and logs input/output tensors for that block. By default, only outputs are saved. |

---

## SaveConfig
The SaveConfig class customizes the frequency of saving tensors.
The hook takes a SaveConfig object, which is applied as the default to all included tensors.
A collection can also have its own SaveConfig object, which is applied to that collection's tensors.

SaveConfig also allows you to save tensors when certain tensors become NaN.
The list of tensors to watch is given as a list of strings naming those tensors.

```python
save_config = smd.SaveConfig(
    mode_save_configs=None,
    save_interval=100,
    start_step=0,
    end_step=None,
    save_steps=None,
)
```
`mode_save_configs` (dict): Used for advanced cases; see details below.\
`save_interval` (int): How often, in steps, to save tensors. Defaults to 100.\
`start_step` (int): The step at which to start saving tensors.\
`end_step` (int): The step at which to stop saving tensors, exclusive.\
`save_steps` (list[int]): Specific steps at which to save tensors. Union with all other parameters.

For example,

`SaveConfig()` will save at steps [0, 100, ...].\
`SaveConfig(save_interval=1)` will save at steps [0, 1, ...].\
`SaveConfig(save_interval=100, end_step=200)` will save at steps [0, 100].\
`SaveConfig(save_interval=100, end_step=201)` will save at steps [0, 100, 200].\
`SaveConfig(save_interval=100, start_step=150)` will save at steps [200, 300, ...].\
`SaveConfig(save_steps=[3, 7])` will save at steps [3, 7].

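The rules above can be modeled as a small step-selection function. This is a simplified sketch, not the library's actual implementation; in particular, it treats `save_steps`, when given, as overriding the interval, which is consistent with the last example:

```python
def is_step_saved(step, save_interval=100, start_step=0, end_step=None, save_steps=None):
    """Simplified model of SaveConfig's step selection (not the library code)."""
    if save_steps is not None:
        # Explicit steps override the interval-based schedule.
        return step in save_steps
    if step < start_step:
        return False
    if end_step is not None and step >= end_step:  # end_step is exclusive
        return False
    return step % save_interval == 0

# Reproduce a few of the examples from the text:
print([s for s in range(300) if is_step_saved(s, save_interval=100, end_step=201)])
# [0, 100, 200]
print([s for s in range(400) if is_step_saved(s, save_interval=100, start_step=150)])
# [200, 300]
```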
There is also a more advanced use case, where you specify a different SaveConfig for each mode.
It is best understood through an example:
```python
SaveConfig(mode_save_configs={
    smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1),
    smd.modes.EVAL: smd.SaveConfigMode(save_interval=2),
    smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3),
    smd.modes.GLOBAL: smd.SaveConfigMode(save_interval=4),
})
```
Essentially, create a dictionary mapping modes to SaveConfigMode objects. The SaveConfigMode objects
take the same four parameters (save_interval, start_step, end_step, save_steps) as the main object.
Any mode not specified will use the default configuration. If a mode is provided but not all
parameters are specified, the default values are used for the unspecified parameters.
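The fallback behavior for unspecified modes can be sketched as a plain dictionary lookup. The mode names and default interval below are illustrative only, not smdebug internals:

```python
# Hypothetical per-mode save intervals, mirroring the example above.
mode_save_intervals = {"TRAIN": 1, "EVAL": 2, "PREDICT": 3}
DEFAULT_SAVE_INTERVAL = 100  # the default configuration's interval

def save_interval_for(mode):
    # Any mode missing from the dict falls back to the default configuration.
    return mode_save_intervals.get(mode, DEFAULT_SAVE_INTERVAL)

print(save_interval_for("EVAL"))    # 2
print(save_interval_for("GLOBAL"))  # 100 (not configured, default applies)
```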

---

## ReductionConfig
ReductionConfig allows you to save certain reductions of tensors instead
of the full tensor. The motivation is to reduce the amount of data
saved and to increase speed in cases where you don't need the full
tensor. The reduction operations are computed during the training process
and then saved.

During analysis, these are available as reductions of the original tensor.
Please note that using a reduction config means you will not have
the full tensor available during analysis, which can restrict what you can do with the saved tensor.
The hook takes a ReductionConfig object, which is applied as the default to all included tensors.
A collection can also have its own ReductionConfig object, which is applied
to the tensors belonging to that collection.

```python
reduction_config = smd.ReductionConfig(
    reductions=None,
    abs_reductions=None,
    norms=None,
    abs_norms=None,
    save_raw_tensor=False,
)
```
`reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod".\
`abs_reductions` (list[str]): Same as reductions, except each reduction is computed on the absolute value of the tensor.\
`norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2".\
`abs_norms` (list[str]): Same as norms, except each norm is computed on the absolute value of the tensor.\
`save_raw_tensor` (bool): Saves the tensor directly, in addition to the other requested reductions.

For example,

`ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`

will return the standard deviation and variance, the mean of the absolute value, and the l1 norm.
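Those four values can be reproduced on a tiny example tensor with the standard library. This is only an illustration of what each reduction means; population variance is assumed here, and the library's exact convention may differ:

```python
import statistics

tensor = [-3.0, 1.0, 2.0]

reductions = {
    "std": statistics.pstdev(tensor),                     # standard deviation
    "variance": statistics.pvariance(tensor),             # variance
    "abs_mean": statistics.mean(abs(x) for x in tensor),  # mean of absolute values
    "l1": sum(abs(x) for x in tensor),                    # l1 norm
}

print(reductions["abs_mean"])  # 2.0
print(reductions["l1"])        # 6.0
```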
