# ResNet50 ImageNet Example
We provide an example script `train_imagenet_resnet_hvd.py`, a Tornasole-enabled TensorFlow training script for ResNet50 on ImageNet.
**Please note that this script needs a GPU.**
It uses the Estimator interface of TensorFlow.
Here we show different scenarios of how to use Tornasole to
save different tensors during training for analysis.
Below we list the changes we made to integrate these
behaviors of Tornasole, as well as example commands for you to try.

## Integrating Tornasole
Below we call out the changes made for Tornasole in the above script and describe them.

**Importing TornasoleTF**
```
import tornasole.tensorflow as ts
```
**Saving weights**
```
include_collections.append('weights')
```
**Saving gradients**

We need to wrap our optimizer with TornasoleOptimizer, and use this optimizer to minimize the loss.
This also lets us access the gradients during analysis without having to identify which of the saved tensors are the gradients.
```
opt = TornasoleOptimizer(opt)

include_collections.append('gradients')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
**Saving relu activations by variable**
```
x = tf.nn.relu(x + shortcut)
ts.add_to_collection('relu_activations', x)
...
include_collections.append('relu_activations')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
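The collection mechanism used above can be sketched in plain Python. This is a hypothetical mock to illustrate the `add_to_collection`/`get_collection` semantics, not Tornasole's actual implementation:

```
# Minimal sketch of a collection registry (illustrative only; names mirror
# the ts.add_to_collection / ts.get_collection calls in the script).
_collections = {}

def add_to_collection(name, tensor):
    # Create the collection on first use, then record the tensor in it.
    _collections.setdefault(name, []).append(tensor)

def get_collection(name):
    return _collections.setdefault(name, [])

# Each relu output registered under 'relu_activations' is later saved by the hook.
add_to_collection('relu_activations', 'block1/relu:0')
add_to_collection('relu_activations', 'block2/relu:0')
print(get_collection('relu_activations'))
```

Registering each tensor as it is created, as the script does inside the ResNet block, avoids having to find the relu outputs by name later.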
**Saving relu activations as reductions**
```
x = tf.nn.relu(x + shortcut)
ts.add_to_collection('relu_activations', x)
...
rnc = ts.ReductionConfig(reductions=reductions, abs_reductions=abs_reductions)
...
ts.TornasoleHook(..., reduction_config=rnc, ...)
```
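To see why reductions save space, here is a plain-Python sketch of what storing min/max/mean/variance of a tensor (and the same statistics of its absolute values) amounts to. The computation is illustrative, not Tornasole's actual code:

```
# Instead of storing a full activation tensor, store a few scalar reductions.
def reduce_tensor(values, reductions, abs_reductions):
    def stats(vals):
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        return {'min': min(vals), 'max': max(vals),
                'mean': mean, 'variance': variance}
    full = stats(values)
    absolute = stats([abs(v) for v in values])
    out = {r: full[r] for r in reductions}
    out.update({'abs_' + r: absolute[r] for r in abs_reductions})
    return out

acts = [-2.0, 0.0, 1.0, 3.0]
print(reduce_tensor(acts, ['min', 'max', 'mean', 'variance'], ['mean']))
```

A handful of scalars per step replaces a tensor that, for ResNet50 activations, can have millions of elements.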
**Saving by regex**
```
ts.get_collection('default').include(FLAGS.tornasole_include)
include_collections.append('default')
ts.TornasoleHook(..., include_collections=include_collections, ...)
```
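A sketch of how regex-based selection behaves, assuming include patterns are matched against tensor names the way `re.search` would (the tensor names below are hypothetical examples):

```
import re

# Select tensors whose names match an include pattern (illustrative).
def select_tensors(names, pattern):
    return [n for n in names if re.search(pattern, n)]

names = ['conv1/kernel:0', 'conv1/Relu:0', 'dense/bias:0']
print(select_tensors(names, 'Relu'))
```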
**Setting save interval**
```
ts.TornasoleHook(..., save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_step_interval), ...)
```
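The save interval can be thought of as a simple modulus check on the step number. A minimal sketch of the assumed behavior (save on steps 0, N, 2N, ...):

```
# Assumed save-interval logic, for illustration.
def should_save(step, save_interval):
    return step % save_interval == 0

# With save_interval=100, the first 1000 steps save at 0, 100, ..., 900.
saved_steps = [s for s in range(1000) if should_save(s, 100)]
print(saved_steps)
```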
**Setting the right mode**

You will see in the code that the appropriate mode is set before the train or evaluate function calls.
For example, before training:
```
hook.set_mode(ts.modes.TRAIN)
```

**Adding the hook**
```
training_hooks = []
...
training_hooks.append(hook)
classifier.train(
    input_fn=lambda: make_dataset(...),
    max_steps=nstep,
    hooks=training_hooks)
```

## Running the example
Here we provide example hyperparameter dictionaries for running this script in different scenarios from within SageMaker. You can replace the resnet_hyperparams dictionary in the notebook we provided with any of the following dictionaries to run jobs in these scenarios.

### Run with synthetic or real data
By default the following configurations run with synthetic data. If you have ImageNet data prepared in TFRecord format,
you can pass its path with the parameter `data_dir`.

### Saving weights and gradients with Tornasole
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_save_gradients': True,
    'tornasole_step_interval': 100
}
```

### Simulating gradients which 'vanish'
We simulate the scenario of gradients becoming very small (vanishing) by initializing weights with a small constant.

```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_save_gradients': True,
    'tornasole_step_interval': 100,
    'constant_initializer': 0.01
}
```
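Why a small constant initializer makes gradients vanish: with every weight near 0.01, each layer multiplies the backpropagated signal by roughly 0.01, so the gradient shrinks geometrically with depth. A minimal numeric sketch using a deep linear chain (not the actual ResNet):

```
# For y = w_n * ... * w_1 * x, the gradient of y w.r.t. w_1 is
# x * (w_2 * ... * w_n). With all weights set to the same small constant,
# this product collapses toward zero as depth grows.
def grad_wrt_first_weight(x, init, depth):
    g = x
    for _ in range(depth - 1):   # the other depth-1 weights multiply in
        g *= init
    return g

print(grad_wrt_first_weight(1.0, 0.01, 10))   # about 1e-18: effectively vanished
```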
#### Rule: VanishingGradient
To monitor this condition for the first 10000 training steps, you can set up a VanishingGradient rule as follows:

```
rule_specifications = [
    {
        "RuleName": "VanishingGradient",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "end-step": "10000"
        }
    }
]
```
### Saving activations of RELU layers in full
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_relu_activations': True,
    'tornasole_step_interval': 200
}
```
### Saving activations of RELU layers as reductions
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_relu_activations': True,
    'tornasole_step_interval': 200,
    'tornasole_relu_reductions': 'min,max,mean,variance',
    'tornasole_relu_reductions_abs': 'mean,variance'
}
```
### Saving weights every step
If you want to compute and track the ratio of weight updates to weights,
you can do that by saving weights every step as follows:
```
hyperparams = {
    'enable_tornasole': True,
    'tornasole_save_weights': True,
    'tornasole_step_interval': 1
}
```
#### Rule: WeightUpdateRatio
To monitor the weights and updates during training, you can set up a WeightUpdateRatio rule as follows:

```
rule_specifications = [
    {
        "RuleName": "WeightUpdateRatio",
        "InstanceType": "ml.c5.4xlarge"
    }
]
```
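The quantity a rule like this watches can be sketched as the norm of the update between consecutive saved steps relative to the norm of the weights. This is an illustrative computation, not the rule's actual implementation:

```
# Ratio of update magnitude to weight magnitude between two saved steps.
def update_ratio(prev_weights, curr_weights, eps=1e-12):
    num = sum((c - p) ** 2 for p, c in zip(prev_weights, curr_weights)) ** 0.5
    den = sum(p ** 2 for p in prev_weights) ** 0.5
    return num / (den + eps)   # eps guards against all-zero weights

w0 = [1.0, -2.0, 2.0]
w1 = [1.01, -2.02, 2.02]   # a healthy update of about 1%
print(update_ratio(w0, w1))
```

Ratios far from a small healthy range (very large or near zero) are the kind of signal such a rule flags; computing this requires weights at every step, which is why the hyperparameters above use `tornasole_step_interval: 1`.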

#### Rule: UnchangedTensor
You can also invoke this rule to monitor whether tensors are not changing across steps. Here we pass '.*' as the tensor_regex to monitor all tensors.
```
rule_specifications = [
    {
        "RuleName": "UnchangedTensor",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "tensor_regex": ".*"
        }
    }
]
```
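The check such a rule performs can be sketched as comparing each selected tensor's values between two consecutive saved steps. The tensor names and values below are hypothetical, and this is not the rule's actual code:

```
import re

# Flag tensors (matching tensor_regex) whose values are identical
# between two saved steps (illustrative).
def unchanged_tensors(step_a, step_b, tensor_regex='.*'):
    pat = re.compile(tensor_regex)
    return sorted(
        name for name in step_a
        if pat.search(name) and step_a[name] == step_b.get(name)
    )

step10 = {'conv1/kernel:0': [0.1, 0.2], 'dense/bias:0': [0.0, 0.0]}
step11 = {'conv1/kernel:0': [0.1, 0.3], 'dense/bias:0': [0.0, 0.0]}
print(unchanged_tensors(step10, step11))   # only the bias never changed
```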