* Add custom rule
* updated notebook
* Add rule script
* Expanding rule monitoring section and improving BYOR notebook (aws#180)
* Adding sagemaker example notebook
* Removing unused training script
* Tornasole hook from config json (aws#104)
* creating tornasole hook from config
* making a quick variance fix (aws#99)
* Adding the change to convert ndarray to np.ndarray when operator is not available in mxnet.
* Cleanup and tests for TF and mxnet
* remove rmtree from s3 test
* Fixed the function invocation of get_numpy_reduction
* Changes to read from hardcoded path
* fixing pytorch test
* Setting SaveConfig per mode (aws#94)
* add doc for passing saveconfig specific to modes
* add save config for collection
* Create an option to build tornasole with no framework, TORNASOLE_FOR_RULES=1 (aws#95)
* add option to build only for rules
* Adding support to set save config per mode through json, also copying load collection method to all frameworks as that was missed
* remove set -ex from tests script since it prevents upload of reports
* move json config out of hooks
* Adding tests to create hook from tornasole configs for pytorch
* Change link of latest tornasole binaries (aws#120)
* change link to binary and introduce latest
* make container scripts working again
* remove -U
* fix path to ts binary in docker
* log when single process is to stdout
* Addressed the review comments. Added the correct asserts to check the reduction values. Added the test to test the training mode.
* Setup versioning (aws#119)
* added _version.py and support
* fixed __init__.py
* Improve PR template (aws#128)
* Setup versioning (aws#134)
* added _version.py and support
* fixed __init__.py
* using PEP 440 standard versioning
* Json Config Hook Tests (aws#129)
* added json config hook tests
* Add LossNotDecreasing rule and change how required tensors API works (aws#126)
* add loss rule and tests. refactoring rules api.
* Adding mxnet tests for hook_from_json (aws#143)
* Adding config file for reduce and save_all test scripts
* Fixing bug in mxnet reduction util
solved issue aws#142
* Update build script for PT container
- modified S3 path to pick up from PT folder
- added parameter to enable installation of sagemaker_pytorch_container.whl into image
* mode writer support (aws#144)
* Add sagemaker docs and notebooks (aws#133)
* Changing link of latest binaries for 0.3 (aws#122)
* uploaded sagemaker docs
update analysis docs
remove sagemaker docs
update TF doc
add sagemaker docs
update api docs
change link for rules binary
add files from s3 bucket
* refactor positions
* minor changes
* fix links in old examples
* fix paths in integration tests
* Update test_training_end.py
* Update test_training_end.py
* Update integration_testing_rules.py
* bring back examples section in analysis readme
* create sagemaker-notebooks directory
* fix links
* remove accidental include of key
* update links, and update dev guide rules after changes in alpha
* Add new regions for container images (aws#147)
* update regions
* add check for tag
* add regions
* Make required tensors optional (aws#148)
* make required tensors optional
* Update README.md
* add a directory to clean in build binaries script
* Updating the notebooks to include good and bad examples.
* Update scripts to build containers (aws#153)
* Update scripts to build containers
add a directory to clean in build binaries script
add policy
working container scripts for TF now added along with other frameworks
fix binary in container script
* Add script to tag as latest
* Sagemaker TF notebook (aws#145)
* updated notebook for tf
* fix name of rule
* Delete README.md
* remove rules scripts
* Update tensorflow-simple.ipynb
* Update tensorflow-simple.ipynb
* add pytorch notebook from s3 (aws#156)
* Changes for temp location and out_dir with Sagemaker in mind (aws#154)
* Make outdir optional arg, use default path in sagemaker environment, also change temp location when writing local files
* remove is_s3 import
* add tests and fix case when / is at the front of filepath
* add comments
* change to .tmp suffix
* update testing script to take a tag
* Updated the uploader script to include pytorch scripts
* Updating the paths to the examples in the notebooks.
* Removed unnecessary copy
* resolving warning mesg of loading yaml (aws#149)
* Fix out dir bug (aws#160)
* fix out dir bug
* print mode.name instead of mode
* print mode.name instead of mode
* print mode.name instead of mode
* parallelize builds for pytorch and mxnet (aws#162)
* TF notebook (aws#163)
* add sagemaker args
* add model dir to resnet
* remove action style args in script and reindent
* update resnet example
* make num epochs take priority over num_batches
* change name of tf notebook
* Add updated sagemaker tf notebook
* change scripts to include all scripts in tf examples
* change names of estimators
* update files
* Updating the mxnet notebook
* Updating the mxnet notebook.
* Updated notebook as per review.
* Update mxnet.ipynb
* Update mxnet.ipynb
* Fixed the type of container from TensorFlow to MXNet.
* Pytorch Notebook Updates (aws#170)
* pytorch notebook
* Update pytorch.ipynb
* Update pytorch.ipynb
* Pytorch (aws#171)
* pytorch notebook
* Update pytorch.ipynb
* Update pytorch.ipynb
* Heading fix
* Expanding rule section and modifying BYOR
* make tf notebook same as alpha
* undo changes for rules, as that's now going into a different PR
* Revert "Expanding rule monitoring section and improving BYOR notebook (aws#180)"
This reverts commit 7f7c17c0f73b95f614859fa9ed05b29e50166eec.
* Add first party rules file
* update cloudwatch section
### VanishingGradient

This rule helps you identify if you are running into a situation where your gradients vanish, i.e. have a really low or zero magnitude.

Here's how you import and instantiate this rule.
Note that it takes two parameters: `base_trial`, the trial whose execution will invoke the rule, and a `threshold`, which is used to determine whether a gradient is vanishing. Gradients whose mean of absolute values is lower than this threshold will cause the rule to return True when invoked.

The default threshold is `0.0000001`.

```
from tornasole.rules.generic import VanishingGradient
r = VanishingGradient(base_trial, threshold=0.0000001)
```
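The vanishing check described above can be sketched in plain NumPy. This is a minimal illustration, not tornasole code; the `is_vanishing` helper name is hypothetical.

```python
import numpy as np

def is_vanishing(gradient, threshold=0.0000001):
    # A gradient "vanishes" when the mean of its absolute
    # values falls below the threshold.
    return float(np.mean(np.abs(gradient))) < threshold

print(is_vanishing(np.array([1e-9, -2e-9, 3e-9])))  # True: mean |g| ~ 2e-9
print(is_vanishing(np.array([0.1, -0.2, 0.3])))     # False: healthy magnitudes
```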
### ExplodingTensor

This rule helps you identify if you are running into a situation where any tensor has non-finite values.

Note that it takes two parameters: `base_trial`, the trial whose execution will invoke the rule, and `only_nan`, which can be set to True if you only want to monitor for `nan` and not for `infinity`. By default, `only_nan` is set to False, which means the rule treats both `nan` and `infinity` as exploding.

Here's how you import and instantiate this rule.

```
from tornasole.rules.generic import ExplodingTensor
r = ExplodingTensor(base_trial)
```
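The non-finite check can be sketched as follows; this is a hypothetical `is_exploding` helper for illustration, not the rule's actual implementation.

```python
import numpy as np

def is_exploding(tensor, only_nan=False):
    # With only_nan=True, watch for NaN only; otherwise any
    # non-finite value (NaN or +/-inf) counts as exploding.
    if only_nan:
        return bool(np.isnan(tensor).any())
    return not bool(np.isfinite(tensor).all())

print(is_exploding(np.array([1.0, np.inf])))                 # True
print(is_exploding(np.array([1.0, np.inf]), only_nan=True))  # False: inf ignored
```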
### SimilarAcrossRuns

This rule helps you compare tensors across runs. Note that this rule takes two trials as inputs. The first trial is the `base_trial`, whose execution will invoke the rule; `other_trial` is the trial whose tensors are compared with this trial's tensors. The third argument is a regex pattern which can be used to restrict this comparison to certain tensors. If it is not passed, all tensors are included by default.

It returns `True` if tensors are different at a given step between the two trials.

```
from tornasole.rules.generic import SimilarAcrossRuns
r = SimilarAcrossRuns(base_trial, other_trial, include=None)
```
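The comparison can be sketched with plain dicts standing in for the two trials' tensors at one step. The `differs_across_runs` helper and the use of `numpy.allclose` for the comparison are assumptions for illustration only.

```python
import re
import numpy as np

def differs_across_runs(base_tensors, other_tensors, include=None):
    # base_tensors / other_tensors: dicts of tensor name -> value at one step.
    # include: optional regex restricting which tensors are compared.
    names = [n for n in base_tensors
             if include is None or re.search(include, n)]
    return any(not np.allclose(base_tensors[n], other_tensors[n])
               for n in names)

base = {"w1": np.array([1.0, 2.0]), "b1": np.array([0.5])}
other = {"w1": np.array([1.0, 2.0]), "b1": np.array([0.9])}
print(differs_across_runs(base, other))                 # True: b1 differs
print(differs_across_runs(base, other, include="w.*"))  # False: only w1 compared
```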
### WeightUpdateRatio

This rule helps you keep track of the ratio of the updates to weights during training. It takes the following arguments:

- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
- `large_threshold`: float, defaults to 10.0. The maximum value that the ratio can take before the rule returns True.
- `small_threshold`: float, defaults to 0.00000001. The minimum value that the ratio can take; the rule returns True if the ratio is lower than this.
- `epsilon`: float, defaults to 0.000000001. A small constant to ensure that we do not divide by 0 when computing the ratio.

This rule returns True if the ratio of updates to weights is larger than `large_threshold` or smaller than `small_threshold`.

It is a good sign for training when the updates are on a good scale compared to the gradients. Very large updates can push weights away from optima, and very small updates mean slow convergence.

**Note that for this rule to be executed, weights have to be available for two consecutive steps, so save_interval needs to be 1.**

```
from tornasole.rules.generic import WeightUpdateRatio
```
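Conceptually, the check compares the size of each update to the size of the weights. A minimal NumPy sketch, assuming the ratio is the norm of the update over the norm of the weights (the exact reduction tornasole uses may differ); the `update_ratio_alarms` helper is hypothetical.

```python
import numpy as np

def update_ratio_alarms(w_prev, w_curr, large_threshold=10.0,
                        small_threshold=0.00000001, epsilon=0.000000001):
    # Ratio of the size of the update to the size of the weights,
    # with epsilon guarding against division by zero.
    ratio = np.linalg.norm(w_curr - w_prev) / (np.linalg.norm(w_prev) + epsilon)
    return ratio > large_threshold or ratio < small_threshold

w = np.array([0.5, -0.3, 0.8])
print(update_ratio_alarms(w, w + 0.001))  # False: update on a reasonable scale
print(update_ratio_alarms(w, w * 100.0))  # True: update far larger than weights
```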
### AllZero

This rule helps to identify whether the tensors contain all zeros. It takes the following arguments:

- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
- `collection_names`: The list of collection names. The rule will inspect the tensors that belong to these collections.
- `tensor_regex`: The list of regex patterns. The rule will inspect the tensors that match the regex patterns specified in this list.

For this rule, users must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule will inspect the union of those tensors.
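The per-tensor check itself is simple; a one-line NumPy sketch with a hypothetical `is_all_zero` helper:

```python
import numpy as np

def is_all_zero(tensor):
    # True when every element of the tensor is exactly zero.
    return not np.any(tensor)

print(is_all_zero(np.zeros((2, 3))))        # True
print(is_all_zero(np.array([0.0, 1e-12])))  # False: tiny but nonzero
```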
### UnchangedTensor

This rule helps to identify whether a tensor is not changing across steps. It uses the `numpy.allclose` method to check whether the tensor is unchanged. It takes the following arguments:

- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
- `collection_names`: The list of collection names. The rule will inspect the tensors that belong to these collections.
- `tensor_regex`: The list of regex patterns. The rule will inspect the tensors that match the regex patterns specified in this list.
- `num_steps`: int (default is 3). The number of steps across which we check whether the tensor has changed. Note that this checks the last `num_steps` that are available; they need not be consecutive. If `num_steps` is 2, at step `s` the rule does not necessarily check steps `s-1` and `s`; if `s-1` is not available, it checks the last available step along with `s`.
- `rtol`: The relative tolerance parameter, as a float, to be passed to `numpy.allclose`.
- `atol`: The absolute tolerance parameter, as a float, to be passed to `numpy.allclose`.
- `equal_nan`: Whether to compare NaNs as equal. If True, NaNs in the first tensor will be considered equal to NaNs in the second. This is passed to `numpy.allclose`.

For this rule, users must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule will inspect the union of those tensors.

```
from tornasole.rules.generic import UnchangedTensor
ut = UnchangedTensor(base_trial=trial_obj, tensor_regex=['.*'], num_steps=3)
```
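The window logic can be sketched over a list of tensor snapshots. The `is_unchanged` helper is a hypothetical stand-in; it treats "unchanged" as every snapshot in the last `num_steps` being `allclose` to the most recent one.

```python
import numpy as np

def is_unchanged(values, num_steps=3, rtol=1e-05, atol=1e-08, equal_nan=False):
    # values: tensor snapshots at the last available steps
    # (need not be consecutive steps).
    window = values[-num_steps:]
    return all(np.allclose(v, window[-1], rtol=rtol, atol=atol,
                           equal_nan=equal_nan) for v in window)

snapshots = [np.array([1.0, 2.0])] * 3 + [np.array([1.0, 2.0 + 1e-12])]
print(is_unchanged(snapshots))                            # True: within tolerance
print(is_unchanged(snapshots + [np.array([1.0, 3.0])]))   # False: last step moved
```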
### LossNotDecreasing

This rule helps you identify if you are running into a situation where loss is not going down. Note that these losses have to be scalars. It takes the following arguments:

- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
- `collection_names`: The list of collection names. The rule will inspect the tensors that belong to these collections. Note that only scalar tensors will be picked.
- `tensor_regex`: The list of regex patterns. The rule will inspect the tensors that match the regex patterns specified in this list. Note that only scalar tensors will be picked.
- `use_losses_collection`: bool (default is True). If this is True, the rule looks for losses in the collection 'losses' if present.
- `num_steps`: int (default is 10). The minimum number of steps after which we check whether the loss has decreased. The rule evaluation happens every `num_steps`, and the rule compares the loss at the current step with the loss at the newest step that is at least `num_steps` behind it. For example, if the loss is being saved every 3 steps but `num_steps` is 10, then at step 21 the loss for step 21 is compared with the loss for step 9. The next step where loss is checked is 33, since 10 steps after 21 is 31, and loss is not saved at steps 31 and 32.
- `diff_percent`: float (default is 0.0, between 0.0 and 100.0). The minimum difference in percentage by which the loss should be lower. By default, the rule just checks whether loss is going down at all. If you want a stricter check that loss is going down fast enough, pass `diff_percent`.

```
from tornasole.rules.generic import LossNotDecreasing
```
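The `num_steps` bookkeeping can be sketched with a hypothetical helper (not tornasole code) that picks the comparison step from the list of steps at which loss was saved:

```python
def comparison_step(saved_steps, current_step, num_steps=10):
    # Newest saved step that is at least num_steps behind the current step.
    candidates = [s for s in saved_steps if s <= current_step - num_steps]
    return max(candidates) if candidates else None

saved = list(range(0, 34, 3))          # loss saved every 3 steps: 0, 3, ..., 33
print(comparison_step(saved, 21))      # 9: newest step at least 10 behind 21
print(comparison_step(saved, 33))      # 21
```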
We have end-to-end flow examples, from saving tensors to plotting using the saved tensors, for [MXNet](../../examples/mxnet/notebooks) and [PyTorch](../../examples/pytorch/notebooks).