Commit e0b9ddf

TF notebook (aws#163)
* Changing link of latest binaries for 0.3 (aws#122)
* change link to binary and introduce latest
* make container scripts working again
* remove -U
* fix path to ts binary in docker
* log when single process is to stdout
* uploaded sagemaker docs update analysis docs remove sagemaker docs update TF doc add sagemaker docs update api docs change link for rules binary add files from s3 bucket
* refactor positions
* minor changes
* fix links in old examples
* fix paths in integration tests
* Update test_training_end.py
* Update test_training_end.py
* Update integration_testing_rules.py
* bring back examples section in analysis readme
* create sagemaker-notebooks directory
* fix links
* updated notebook for tf
* fix name of rule
* Delete README.md
* remove rules scripts
* Update tensorflow-simple.ipynb
* Update tensorflow-simple.ipynb
* add sagemaker args
* add model dir to resnet
* remove action style args in script and reindent
* update resnet example
* make num epochs take priority over num_batches
* change name of tf notebook
* Add updated sagemaker tf notebook
* change scripts to include all scripts in tf examples
* change names of estimators
* update files
1 parent a251a1e commit e0b9ddf

File tree

12 files changed: +1272 -1073 lines


bin/sagemaker-containers/tag_as_latest.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+set -ex
+
 if [ -z "$1" ]; then echo "Pass the tag which should be made the latest tag" && exit 1; fi

 for region in us-east-1 us-east-2 us-west-1 us-west-2 ap-south-1 ap-northeast-2 ap-southeast-1 ap-southeast-2 ap-northeast-1 ca-central-1 eu-central-1 eu-west-1 eu-west-2 eu-west-3 eu-north-1 sa-east-1

bin/upload_for_sagemaker.sh

Lines changed: 3 additions & 2 deletions
@@ -6,6 +6,7 @@ export AWS_PROFILE=removethissoitdoesntcrash
 # API DOCS
 aws s3 cp docs/mxnet/api.md s3://tornasole-external-preview-use1/frameworks/mxnet/
 aws s3 cp docs/tensorflow/api.md s3://tornasole-external-preview-use1/frameworks/tensorflow/
+aws s3 cp docs/tensorflow/examples/sm_resnet50.md s3://tornasole-external-preview-use1/frameworks/tensorflow/
 aws s3 cp docs/pytorch/api.md s3://tornasole-external-preview-use1/frameworks/pytorch/

 # DEV GUIDES
@@ -16,11 +17,11 @@ aws s3 cp sagemaker-docs/DeveloperGuide_Rules.md s3://tornasole-external-preview

 # MXNET EXAMPLES
 aws s3 sync examples/mxnet/sagemaker-notebooks s3://tornasole-external-preview-use1/frameworks/mxnet/examples/notebooks
-aws s3 cp examples/mxnet/scripts/mnist_mxnet.py s3://tornasole-external-preview-use1/frameworks/mxnet/examples/scripts
+aws s3 cp examples/mxnet/scripts/mnist_mxnet.py s3://tornasole-external-preview-use1/frameworks/mxnet/examples/scripts/

 # TF EXAMPLES
 aws s3 sync examples/tensorflow/sagemaker-notebooks s3://tornasole-external-preview-use1/frameworks/tensorflow/examples/notebooks
-aws s3 cp examples/tensorflow/scripts/simple.py s3://tornasole-external-preview-use1/frameworks/tensorflow/examples/scripts
+aws s3 sync examples/tensorflow/scripts s3://tornasole-external-preview-use1/frameworks/tensorflow/examples/scripts

 # PYTORCH EXAMPLES
 #aws s3 sync examples/pytorch s3://tornasole-external-preview-use1/frameworks/pytorch/examples

docs/tensorflow/examples/mnist.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Below we call out the changes for Tornasole in the above script and describe the

 **Importing TornasoleTF**
 ```
-import tornasole_tf as ts
+import tornasole.tensorflow as ts
 ```
 **Saving gradients**

docs/tensorflow/examples/resnet50.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Below we call out the changes for Tornasole in the above script and describe the

 **Importing TornasoleTF**
 ```
-import tornasole_tf as ts
+import tornasole.tensorflow as ts
 ```
 **Saving weights**
 ```
docs/tensorflow/examples/simple.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Below we call out the changes for Tornasole in the above script and describe the

 **Importing TornasoleTF**
 ```
-import tornasole_tf as ts
+import tornasole.tensorflow as ts
 ```
 **Saving all tensors**
 ```
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
# ResNet50 ImageNet Example
We provide an example script `train_imagenet_resnet_hvd.py`, which is a Tornasole-enabled TensorFlow training script for ResNet50/ImageNet.
**Please note that this script needs a GPU**.
It uses the Estimator interface of TensorFlow.
Here we show different scenarios of how to use Tornasole to
save different tensors during training for analysis.
Below are listed the changes we made to integrate these different
behaviors of Tornasole, as well as example commands for you to try.

## Integrating Tornasole
Below we call out the changes for Tornasole in the above script and describe them.

**Importing TornasoleTF**
```
import tornasole.tensorflow as ts
```
**Saving weights**
```
include_collections.append('weights')
```
**Saving gradients**

We need to wrap our optimizer with TornasoleOptimizer, and use this optimizer to minimize the loss.
This also lets us access the gradients during analysis without having to identify which of the saved tensors are the gradients.
```
opt = TornasoleOptimizer(opt)

include_collections.append('gradients')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
**Saving relu activations by variable**
```
x = tf.nn.relu(x + shortcut)
ts.add_to_collection('relu_activations', x)
...
include_collections.append('relu_activations')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
**Saving relu activations as reductions**
```
x = tf.nn.relu(x + shortcut)
ts.add_to_collection('relu_activations', x)
...
rnc = ts.ReductionConfig(reductions=reductions, abs_reductions=abs_reductions)
...
ts.TornasoleHook(..., reduction_config=rnc, ...)
```
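A reduction config saves summary statistics of a tensor instead of its full value. As a rough sketch of what the reductions named above compute, here is a plain-Python illustration (ours, not Tornasole's implementation):

```python
def reduce_tensor(values,
                  reductions=('min', 'max', 'mean', 'variance'),
                  abs_reductions=('mean', 'variance')):
    """Compute the named reductions over a flat list of activation values."""
    def stats(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        return {'min': min(xs), 'max': max(xs), 'mean': mean, 'variance': var}

    full = stats(values)                        # reductions over the raw values
    absolute = stats([abs(x) for x in values])  # reductions over absolute values
    out = {name: full[name] for name in reductions}
    out.update({'abs_' + name: absolute[name] for name in abs_reductions})
    return out

print(reduce_tensor([-1.0, 0.0, 1.0]))
```

Saving a handful of scalars per tensor keeps the saved data small while still supporting analysis that only needs statistics.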
**Saving by regex**
```
ts.get_collection('default').include(FLAGS.tornasole_include)
include_collections.append('default')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
**Setting save interval**
```
ts.TornasoleHook(..., save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_step_interval), ...)
```
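The save interval controls which global steps the hook writes tensors for. A minimal sketch of that gating, assuming a simple modulo check (our illustration, not Tornasole's actual logic):

```python
def should_save(step, save_interval):
    """Return True on steps whose tensors would be written."""
    return step % save_interval == 0

# With save_interval=100, steps 0, 100, 200, 300 fall within range(350).
print([s for s in range(350) if should_save(s, 100)])  # [0, 100, 200, 300]
```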
**Setting the right mode**

You will see in the code that the appropriate mode has been set before the train or evaluate function calls.
For example, the line:
```
hook.set_mode(ts.modes.TRAIN)
```

**Adding the hook**
```
training_hooks = []
...
training_hooks.append(hook)
classifier.train(
    input_fn=lambda: make_dataset(...),
    max_steps=nstep,
    hooks=training_hooks)
```

## Running the example
Here we provide example hyperparameter dictionaries to run this script in different scenarios from within SageMaker. You can replace the resnet_hyperparams dictionary in the notebook we provided with the following hyperparameter dictionaries to run jobs in these scenarios.

### Run with synthetic or real data
By default the following commands run with synthetic data. If you have ImageNet data prepared in tfrecord format,
you can pass the path to it with the parameter `data_dir`.

### Saving weights and gradients with Tornasole
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_save_gradients': True,
    'tornasole_step_interval': 100
}
```
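In SageMaker script mode, entries of the hyperparameters dictionary reach the training script as command-line arguments. The helper below is a hypothetical sketch of that translation for illustration only; SageMaker performs it for you:

```python
def to_cli_args(hyperparams):
    """Flatten a hyperparameter dict into script-mode style CLI arguments."""
    args = []
    for key, value in hyperparams.items():
        args += ['--' + key, str(value)]
    return args

hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_step_interval': 100,
}
print(to_cli_args(hyperparams))
# ['--enable_tornasole', 'True', '--tornasole_save_weights', 'True', '--tornasole_step_interval', '100']
```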

### Simulating gradients which 'vanish'
We simulate the scenario of gradients becoming very small (vanishing) by initializing weights with a small constant.

```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_save_gradients': True,
    'tornasole_step_interval': 100,
    'constant_initializer': 0.01
}
```
#### Rule: VanishingGradient
To monitor this condition for the first 10000 training steps, you can set up a VanishingGradient rule as follows:

```
rule_specifications = [
    {
        "RuleName": "VanishingGradient",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "end-step": "10000"
        }
    }
]
```
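The rule specification dictionaries here and below share the same shape, so a small helper can cut the repetition. `make_rule_spec` is our hypothetical convenience, not part of the Tornasole API:

```python
def make_rule_spec(rule_name, instance_type="ml.c5.4xlarge", **runtime_configs):
    """Build one rule specification dict; runtime configurations are optional."""
    spec = {"RuleName": rule_name, "InstanceType": instance_type}
    if runtime_configs:
        # Keys use dashes and values are strings, matching the examples here.
        spec["RuntimeConfigurations"] = {
            key.replace('_', '-'): str(value)
            for key, value in runtime_configs.items()
        }
    return spec

rule_specifications = [make_rule_spec("VanishingGradient", end_step=10000)]
```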
#### Saving activations of RELU layers in full
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_relu_activations': True,
    'tornasole_step_interval': 200,
}
```
#### Saving activations of RELU layers as reductions
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_relu_activations': True,
    'tornasole_step_interval': 200,
    'tornasole_relu_reductions': 'min,max,mean,variance',
    'tornasole_relu_reductions_abs': 'mean,variance',
}
```
#### Saving weights every step
If you want to compute and track the ratio of weights to updates,
you can do that by saving weights every step as follows:
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_step_interval': 1
}
```
##### Rule: WeightUpdateRatio
To monitor the weights and updates during training, you can set up a WeightUpdateRatio rule as follows:

```
rule_specifications = [
    {
        "RuleName": "WeightUpdateRatio",
        "InstanceType": "ml.c5.4xlarge"
    }
]
```

##### Rule: UnchangedTensor
You can also invoke this rule to monitor whether tensors stop changing from step to step. Here we pass '.*' as the tensor_regex to monitor all tensors.
```
rule_specifications = [
    {
        "RuleName": "UnchangedTensor",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "tensor_regex": ".*"
        }
    }
]
```
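Conceptually, a rule like UnchangedTensor compares a tensor's saved values across consecutive steps. A minimal sketch of that check (our illustration, not the rule's actual implementation):

```python
def unchanged(values, steps=3):
    """True if the last `steps` saved values of a tensor are all identical."""
    if len(values) < steps:
        return False
    tail = values[-steps:]
    return all(v == tail[0] for v in tail)

print(unchanged([0.5, 0.5, 0.5, 0.5]))  # True: the tensor has stopped changing
print(unchanged([0.5, 0.4, 0.3, 0.2]))  # False: still updating
```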
