Commit a0be1eb

Add losses support to Tensorflow (aws#135)
* add loss support
* remove accidentally included file
* add loss not decreasing rule
* add loss rule and test
* Added integration tests
* fix tests
* fix doc
* fix test
* fix test
* fix test
* fix test
* stdout level
* fix train
* fix integration test
* bring back correct logging behavior
* add log
* fix log name
* rename job log
* address review
* fix integration test
* change behavior of required tensors, and simplify loss rule
* introduce a class for required tensors and use old one as for each trial
* undo changes of analysis
* add integration tests
* remove new file
* remove loss rule
* remove test
* change utils method
* add losses to sagemaker doc as well
* fix loss test
1 parent 8399717 commit a0be1eb

17 files changed: +157 -40 lines changed

docs/tensorflow/README.md

Lines changed: 19 additions & 0 deletions
@@ -431,6 +431,25 @@ Then, you need to pass `gradients` in the `include_collections` parameter of the
 import tornasole.tensorflow as ts
 hook = ts.TornasoleHook(..., include_collections = ['gradients'], ...)
 ```
+#### Losses
+If you are using the default loss functions in Tensorflow, Tornasole can automatically pick up these losses from Tensorflow's losses collection.
+In such a case, we only need to specify 'losses' in the `include_collections` argument of the hook.
+If you do not pass this argument to the hook, it will save losses by default.
+If you are using your custom loss function, you can either add this to Tensorflow's losses collection or Tornasole's losses collection as follows:
+
+```
+import tornasole.tensorflow as ts
+
+# if your loss function is not a default TF loss function,
+# but is a custom loss function
+# then add to the collection losses
+loss = ...
+ts.add_to_collection('losses', loss)
+
+# specify losses in include_collections
+# Note that this is included by default
+hook = ts.TornasoleHook(..., include_collections = ['losses'..], ...)
+```
 
 #### Optimizer Variables
 Optimizer variables such as momentum can also be saved easily with the

docs/tensorflow/examples/mnist.md

Lines changed: 10 additions & 2 deletions
@@ -18,9 +18,17 @@ This will also enable us to access the gradients during analysis without having
 ```
 opt = TornasoleOptimizer(opt)
 optimizer_op = optimizer.minimize(loss, global_step=increment_global_step_op)
-
-ts.TornasoleHook(..., include_collections=[..,'gradients'], ...)
 ```
+Note that here since by default Tornasole tries to save weights, gradients and losses
+we didn't need to specify 'gradients' in the include_collections argument of the hook.
+
+**Saving losses**
+
+Since we use a default loss function from Tensorflow here,
+we would only need to indicate to the hook that we want to include losses.
+But since the hook by default saves losses if include_collections argument was not set,
+we need not do anything.
+
 **Setting save interval**
 
 You can set different save intervals for different modes.

docs/tensorflow/examples/resnet50.md

Lines changed: 16 additions & 0 deletions
@@ -17,7 +17,11 @@ import tornasole.tensorflow as ts
 **Saving weights**
 ```
 include_collections.append('weights')
+ts.TornasoleHook(..., include_collections=include_collections, ...)
 ```
+Note that the above line of include_collections is not required
+because by default Tornasole tries to save weights, gradients and losses.
+
 **Saving gradients**
 
 We need to wrap our optimizer with TornasoleOptimizer, and use this optimizer to minimize loss.
@@ -28,6 +32,18 @@ opt = TornasoleOptimizer(opt)
 include_collections.append('gradients')
 ts.TornasoleHook(..., include_collections=include_collections, ...)
 ```
+Note that if include_collections is not passed to the hook,
+by default Tornasole tries to save weights, gradients and losses.
+
+**Saving losses**
+
+Since we use a default loss function from Tensorflow, we only need to indicate to the hook that we want to include losses.
+In the code, you will see the following line to do so.
+```
+include_collections=['losses']
+ts.TornasoleHook(..., include_collections=include_collections, ...)
+```
+
 **Saving relu activations by variable**
 ```
 x = tf.nn.relu(x + shortcut)

docs/tensorflow/examples/simple.md

Lines changed: 12 additions & 0 deletions
@@ -24,6 +24,18 @@ optimizer_op = optimizer.minimize(loss, global_step=increment_global_step_op)
 
 ts.TornasoleHook(..., include_collections=[..,'gradients'], ...)
 ```
+**Saving losses**
+
+Since we are not using a default loss function from Tensorflow,
+we need to tell Tornasole to add our loss to the losses collection as follows
+```
+ts.add_to_collection('losses', loss)
+```
+In the code, you will see the following line to do so.
+```
+ts.TornasoleHook(..., include_collections=[...,'losses'], ...)
+```
+
 **Setting save interval**
 ```
 ts.TornasoleHook(...,save_config=ts.SaveConfig(save_interval=args.tornasole_frequency)...)

examples/tensorflow/scripts/mnist.py

Lines changed: 12 additions & 1 deletion
@@ -1,5 +1,6 @@
 import argparse
 import numpy as np
+import random
 import tensorflow as tf
 import tornasole.tensorflow as ts
 
@@ -9,6 +10,8 @@
                     help="How often to save TS data", default=50)
 parser.add_argument('--tornasole_eval_frequency', type=int,
                     help="How often to save TS data", default=10)
+parser.add_argument('--lr', type=float, default=0.001)
+parser.add_argument('--random_seed', type=bool, default=False)
 parser.add_argument('--num_epochs', type=int, default=5,
                     help="Number of epochs to train for")
 parser.add_argument('--num_steps', type=int,
@@ -17,6 +20,14 @@
 parser.add_argument('--model_dir', type=str, default='/tmp/mnist_model')
 args = parser.parse_args()
 
+# these random seeds are only intended for test purpose.
+# for now, 2,2,12 could promise no assert failure when running tornasole_rules test_rules.py with config.yaml
+# if you wish to change the number, notice that certain steps' tensor value may be capable of variation
+if args.random_seed:
+    tf.set_random_seed(2)
+    np.random.seed(2)
+    random.seed(12)
+
 def cnn_model_fn(features, labels, mode):
     """Model function for CNN."""
     # Input Layer
@@ -67,7 +78,7 @@ def cnn_model_fn(features, labels, mode):
 
     # Configure the Training Op (for TRAIN mode)
     if mode == tf.estimator.ModeKeys.TRAIN:
-        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
+        optimizer = tf.train.GradientDescentOptimizer(learning_rate=args.lr)
         optimizer = ts.TornasoleOptimizer(optimizer)
         train_op = optimizer.minimize(
             loss=loss,

examples/tensorflow/scripts/simple.py

Lines changed: 2 additions & 1 deletion
@@ -37,6 +37,7 @@
 w0 = [[1], [1.]]
 y = tf.matmul(x, w0)
 loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2, name="loss")
+ts.add_to_collection('losses', loss)
 
 global_step = tf.Variable(17, name="global_step", trainable=False)
 increment_global_step_op = tf.assign(global_step, global_step+1)
@@ -56,7 +57,7 @@
 # Note that we are saving all tensors here by passing save_all=True
 hook = ts.TornasoleHook(out_dir=args.tornasole_path,
                         save_all=True,
-                        include_collections=['weights', 'gradients'],
+                        include_collections=['weights', 'gradients', 'losses'],
                         save_config=ts.SaveConfig(save_interval=args.tornasole_frequency),
                         reduction_config=rdnc)
 

examples/tensorflow/scripts/train_imagenet_resnet_hvd.py

Lines changed: 1 addition & 1 deletion
@@ -1020,7 +1020,7 @@ def get_tornasole_hook(FLAGS):
     else:
         rnc = None
 
-    include_collections = []
+    include_collections = ['losses']
 
     if FLAGS.tornasole_save_weights is True:
         include_collections.append('weights')

sagemaker-docs/DeveloperGuide_TF.md

Lines changed: 19 additions & 0 deletions
@@ -304,6 +304,25 @@ import tornasole.tensorflow as ts
 hook = ts.TornasoleHook(..., include_collections = ['gradients'], ...)
 ```
 
+#### Losses
+If you are using the default loss functions in Tensorflow, Tornasole can automatically pick up these losses from Tensorflow's losses collection.
+In such a case, we only need to specify 'losses' in the `include_collections` argument of the hook.
+If you do not pass this argument to the hook, it will save losses by default.
+If you are using your custom loss function, you can either add this to Tensorflow's losses collection or Tornasole's losses collection as follows:
+```
+import tornasole.tensorflow as ts
+
+# if your loss function is not a default TF loss function,
+# but is a custom loss function
+# then add to the collection losses
+loss = ...
+ts.add_to_collection('losses', loss)
+
+# specify losses in include_collections
+# Note that this is included by default
+hook = ts.TornasoleHook(..., include_collections = ['losses'..], ...)
+```
+
 #### Optimizer Variables
 Optimizer variables such as momentum can also be saved easily with the
 above approach of wrapping your optimizer with `TornasoleOptimizer`
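
Returning to the Losses section added in this hunk: the docs in this commit say that when `include_collections` is not passed, the hook saves weights, gradients and losses. The sketch below only illustrates that rule. It is not Tornasole's implementation; `DEFAULT_COLLECTIONS` and `resolve_collections` are hypothetical names, and the idea that an explicit list replaces the defaults is an assumption (suggested by the ResNet script in this commit, which adds `'losses'` to its own list).

```python
# Hypothetical sketch, not Tornasole's actual code.
DEFAULT_COLLECTIONS = ('weights', 'gradients', 'losses')


def resolve_collections(include_collections=None):
    """Return the collection names a hook would save."""
    if include_collections is None:
        # Nothing passed: fall back to the documented defaults.
        return list(DEFAULT_COLLECTIONS)
    # Assumption: an explicit list replaces the defaults entirely.
    return list(include_collections)


print(resolve_collections())            # ['weights', 'gradients', 'losses']
print(resolve_collections(['losses']))  # ['losses']
```

Under that assumption, a script that builds its own list has to name `'losses'` explicitly if it wants them saved.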

tests/analysis/config.yaml

Lines changed: 20 additions & 1 deletion
@@ -6,6 +6,8 @@
 - s3: True # run test cases in s3 mode
 - [simple.py: &simple
     $CODEBUILD_SRC_DIR/examples/tensorflow/scripts/simple.py,
+   tf_mnist.py: &tf_mnist
+    $CODEBUILD_SRC_DIR/examples/tensorflow/scripts/mnist.py,
    torch_simple.py: &torch_simple
     $CODEBUILD_SRC_DIR/examples/pytorch/scripts/simple.py,
    train_imagenet_resnet_hvd.py: &train_imagenet_resnet_hvd
@@ -82,7 +84,24 @@
      *invoker,
      --rule_name weightupdateratio --flag True --end_step 71
     ]
-
+-
+  - loss_not_decreasing/tf/true
+  - tensorflow
+  - *Enable
+  - [*tf_mnist,
+     --lr 0.001 --tornasole_train_frequency 10 --random_seed True,
+     *invoker,
+     --rule_name lossnotdecreasing --flag True --end_step 1000
+    ]
+-
+  - loss_not_decreasing/tf/false
+  - tensorflow
+  - *Enable
+  - [*simple,
+     --lr 0.05 --scale 1 --steps 1009 --tornasole_frequency 13 --random_seed True,
+     *invoker,
+     --rule_name lossnotdecreasing --flag False --num_steps 100 --min_difference 12
+    ]
 
 # test cases for mxnet
 -
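
The `--random_seed True` arguments in the new test cases rely on the flag that this commit adds to `mnist.py` with `type=bool`. One property of that pattern is worth knowing: argparse calls `bool()` on the raw string, so any non-empty value, including `False`, parses as `True`. The sketch below shows the behaviour next to one common alternative; `str2bool` is a hypothetical helper and not part of this commit.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: parse common spellings of true/false."""
    lowered = value.lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError(f'expected a boolean, got {value!r}')


# Same declaration style as the flag added in mnist.py.
naive = argparse.ArgumentParser()
naive.add_argument('--random_seed', type=bool, default=False)

# Alternative that interprets the text of the argument.
strict = argparse.ArgumentParser()
strict.add_argument('--random_seed', type=str2bool, default=False)

naive_true = naive.parse_args(['--random_seed', 'True']).random_seed
naive_false = naive.parse_args(['--random_seed', 'False']).random_seed
strict_false = strict.parse_args(['--random_seed', 'False']).random_seed

print(naive_true, naive_false, strict_false)  # True True False
```

The test configuration only ever passes `True`, so it is unaffected; the difference matters only if someone passes `--random_seed False` expecting seeding to be turned off.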

tests/analysis/utils.py

Lines changed: 2 additions & 2 deletions
@@ -18,9 +18,9 @@ def generate_data(path, trial, step, tname_prefix,
     if export_colls:
         c = CollectionManager()
         c.add("default")
-        c.get("default").tensor_names = [ tname_prefix + '_' + str(i) for i in range(num_tensors)]
+        c.get("default").tensor_names = [f'{tname_prefix}_{i}' for i in range(num_tensors)]
         c.add('gradients')
-        c.get("gradients").tensor_names = [ tname_prefix + '_' + str(i) for i in range(num_tensors)]
+        c.get("gradients").tensor_names = [f'{tname_prefix}_{i}' for i in range(num_tensors)]
         c.export(os.path.join(path, trial, "collections.ts"))
 
 
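
The change in this hunk swaps string concatenation for an f-string. A quick check, using made-up example values for `tname_prefix` and `num_tensors`, suggests the two forms build the same names, so the edit should be purely stylistic:

```python
# Example values only; the real ones come from generate_data's arguments.
tname_prefix = 'foo'
num_tensors = 3

# Form removed by the commit.
old_style = [tname_prefix + '_' + str(i) for i in range(num_tensors)]
# Form added by the commit.
new_style = [f'{tname_prefix}_{i}' for i in range(num_tensors)]

print(new_style)  # ['foo_0', 'foo_1', 'foo_2']
print(old_style == new_style)  # True
```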

tests/tensorflow/hooks/test_estimator_modes.py

Lines changed: 9 additions & 15 deletions
@@ -138,10 +138,9 @@ def test_mnist_local():
     assert len(tr.available_steps()) == 55
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 40
-    assert len(tr.tensors()) == 16
+    assert len(tr.tensors()) == 17
     shutil.rmtree(trial_dir)
 
-
 def test_mnist_local_json():
     out_dir = 'newlogsRunTest1/test_mnist_local_json_config'
     shutil.rmtree(out_dir, ignore_errors=True)
@@ -151,52 +150,47 @@ def test_mnist_local_json():
     assert len(tr.available_steps()) == 55
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 40
-    assert len(tr.tensors()) == 16
+    assert len(tr.tensors()) == 17
     shutil.rmtree(out_dir, ignore_errors=True)
 
-
 def test_mnist_s3():
     run_id = 'trial_' + datetime.now().strftime('%Y%m%d-%H%M%S%f')
     trial_dir = 's3://tornasole-testing/tornasole_tf/hooks/estimator_modes/' + run_id
     tr = help_test_mnist(trial_dir, ts.SaveConfig(save_interval=2))
     assert len(tr.available_steps()) == 55
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 40
-    assert len(tr.tensors()) == 16
-
+    assert len(tr.tensors()) == 17
 
 def test_mnist_local_multi_save_configs():
     run_id = 'trial_' + datetime.now().strftime('%Y%m%d-%H%M%S%f')
     trial_dir = os.path.join(TORNASOLE_TF_HOOK_TESTS_DIR, run_id)
     tr = help_test_mnist(trial_dir, {ts.modes.TRAIN: ts.SaveConfig(save_interval=2),
-                                     ts.modes.EVAL: ts.SaveConfig(save_interval=1)})
+                                     ts.modes.EVAL: ts.SaveConfig(save_interval=1)})
     assert len(tr.available_steps()) == 94
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 79
-    assert len(tr.tensors()) == 16
+    assert len(tr.tensors()) == 17
     shutil.rmtree(trial_dir)
 
-
 def test_mnist_s3_multi_save_configs():
     run_id = 'trial_' + datetime.now().strftime('%Y%m%d-%H%M%S%f')
     trial_dir = 's3://tornasole-testing/tornasole_tf/hooks/estimator_modes/' + run_id
     tr = help_test_mnist(trial_dir, {ts.modes.TRAIN: ts.SaveConfig(save_interval=2),
-                                     ts.modes.EVAL: ts.SaveConfig(save_interval=1)})
+                                     ts.modes.EVAL: ts.SaveConfig(save_interval=1)})
     assert len(tr.available_steps()) == 94
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 79
-    assert len(tr.tensors()) == 16
-
+    assert len(tr.tensors()) == 17
 
 def test_mnist_local_multi_save_configs_json():
     out_dir = 'newlogsRunTest1/test_save_config_modes_hook_config'
     shutil.rmtree(out_dir, ignore_errors=True)
-    os.environ[
-        TORNASOLE_CONFIG_FILE_PATH_ENV_STR] = 'tests/tensorflow/hooks/test_json_configs/test_save_config_modes_hook_config.json'
+    os.environ[TORNASOLE_CONFIG_FILE_PATH_ENV_STR] = 'tests/tensorflow/hooks/test_json_configs/test_save_config_modes_hook_config.json'
     hook = ts.TornasoleHook.hook_from_config()
     tr = help_test_mnist(out_dir, hook=hook)
     assert len(tr.available_steps()) == 94
     assert len(tr.available_steps(mode=ts.modes.TRAIN)) == 15
     assert len(tr.available_steps(mode=ts.modes.EVAL)) == 79
-    assert len(tr.tensors()) == 16
+    assert len(tr.tensors()) == 17
     shutil.rmtree(out_dir)

tests/tensorflow/hooks/test_losses.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+from .utils import *
+import tornasole.tensorflow as ts
+import shutil
+
+from .test_estimator_modes import help_test_mnist
+
+def test_mnist_local():
+    run_id = 'trial_' + datetime.now().strftime('%Y%m%d-%H%M%S%f')
+    trial_dir = os.path.join(TORNASOLE_TF_HOOK_TESTS_DIR, run_id)
+    tr = help_test_mnist(trial_dir, ts.SaveConfig(save_interval=2))
+    assert len(tr.collection('losses').get_tensor_names()) == 1
+    for t in tr.collection('losses').get_tensor_names():
+        assert len(tr.tensor(t).steps()) == 55
+    shutil.rmtree(trial_dir)

tests/tensorflow/hooks/test_reductions.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ def helper_test_reductions(trial_dir, hook):
     from tornasole.trials import create_trial
 
     tr = create_trial(trial_dir)
-    assert len(tr.tensors()) == 2
+    assert len(tr.tensors()) == 3
     for tname in tr.tensors():
         t = tr.tensor(tname)
         try:

tests/tensorflow/hooks/test_save_all_full.py

Lines changed: 5 additions & 3 deletions
@@ -1,5 +1,5 @@
 from .utils import *
-from tornasole.tensorflow import reset_collections, get_collections
+from tornasole.tensorflow import reset_collections, get_collections, CollectionManager, Collection
 import shutil, glob
 from tornasole.core.reader import FileReader
 from tornasole.core.json_config import TORNASOLE_CONFIG_FILE_PATH_ENV_STR
@@ -23,16 +23,18 @@ def test_save_all_full(hook=None, trial_dir=None):
     dirs, _ = get_dirs_files(os.path.join(trial_dir, 'events'))
 
     coll = get_collections()
-    assert len(coll) == 5
+    assert len(coll) == 6
     assert len(coll['weights'].tensor_names) == 1
     assert len(coll['gradients'].tensor_names) == 1
+    assert len(coll['losses'].tensor_names) == 1
 
     assert 'collections.ts' in files
     cm = CollectionManager.load(join(trial_dir, 'collections.ts'))
 
-    assert len(cm.collections) == 5
+    assert len(cm.collections) == 6
     assert len(cm.collections['weights'].tensor_names) == 1
     assert len(cm.collections['weights'].reduction_tensor_names) == 0
+    assert len(cm.collections['losses'].tensor_names) == 1
     assert len(cm.collections['gradients'].tensor_names) == 1
     assert len(cm.collections['gradients'].reduction_tensor_names) == 0
     # as we hadn't asked to be saved

tests/tensorflow/hooks/test_save_reductions.py

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
 from .utils import *
-from tornasole.tensorflow import reset_collections, get_collections
+from tornasole.tensorflow import reset_collections, get_collections, CollectionManager
 import shutil
 import glob
 from tornasole.core.reader import FileReader
@@ -10,13 +10,13 @@ def helper_save_reductions(trial_dir, hook):
     _, files = get_dirs_files(trial_dir)
     coll = get_collections()
 
-    assert len(coll) == 4
+    assert len(coll) == 5
     assert len(coll['weights'].reduction_tensor_names) == 1
     assert len(coll['gradients'].reduction_tensor_names) == 1
 
     assert 'collections.ts' in files
     cm = CollectionManager.load(join(trial_dir, 'collections.ts'))
-    assert len(cm.collections) == 4
+    assert len(cm.collections) == 5
     assert len(cm.collections['weights'].tensor_names) == 0
     assert len(cm.collections['weights'].reduction_tensor_names) == 1
     assert len(cm.collections['gradients'].tensor_names) == 0
@@ -45,8 +45,8 @@ def helper_save_reductions(trial_dir, hook):
             tensor_name, step, tensor_data, mode, mode_step = x
             i += 1
             size += tensor_data.nbytes if tensor_data is not None else 0
-    assert i == 32
-    assert size == 128
+    assert i == 48
+    assert size == 192
 
     shutil.rmtree(trial_dir)
 
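
The updated assertions in the last hunk are consistent with each other under one assumption that the diff does not state: every saved record holds a single 4-byte value. The arithmetic below is only a sanity check on the numbers in the assertions, not a description of how the test computes them.

```python
# Values copied from the removed and added assertions.
old_records, old_bytes = 32, 128
new_records, new_bytes = 48, 192

# Average payload per record before and after the change.
bytes_per_record_old = old_bytes / old_records
bytes_per_record_new = new_bytes / new_records

# What the commit adds on top of the previous totals.
added_records = new_records - old_records
added_bytes = new_bytes - old_bytes

print(bytes_per_record_old, bytes_per_record_new)  # 4.0 4.0
print(added_records, added_bytes)                  # 16 64
```

Both versions work out to 4 bytes per record, so the 16 extra records account for exactly the 64 extra bytes.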
