|
| 1 | +# Tornasole for XGBoost |
| 2 | + |
| 3 | +Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. It helps you monitor training in near real time using rules, and alerts you once it detects an inconsistency in training. |
| 4 | + |
| 5 | +Using Tornasole is a two-step process: |
| 6 | + |
| 7 | +**Saving tensors** |
| 8 | +This needs the `tornasole` package built for the appropriate framework. This package lets you collect the tensors you want at the frequency that you want, and save them for analysis. |
| 9 | +Please follow the appropriate Readme page to install the correct version. This page is for using Tornasole with XGBoost. |
| 10 | + |
| 11 | +**Analysis** |
| 12 | +Please refer to [this page](../rules/README.md) for more details about how to run rules and other analysis |
| 13 | +on tensors collected from the job. That said, we provide a few example analysis commands below |
| 14 | +to give an end-to-end flow. The analysis of these tensors can be done on a separate machine |
| 15 | +in parallel with the training job. |
| 16 | + |
| 17 | +## Installation |
| 18 | + |
| 19 | +#### Prerequisites |
| 20 | + |
| 21 | +- **Python 3.6** |
| 22 | +- Tornasole can work in local mode or remote (S3) mode. You can skip this step if you only want to try the [local mode example](#tornasole-local-mode-example). |
| 23 | +This setup is required if you want to try the [S3 mode example](#tornasole-s3-mode-example). |
| 24 | +To run in S3 mode, make sure that the instance you are using has credentials with S3 write access. |
| 25 | +Try the command below: |
| 26 | +``` |
| 27 | + aws s3 ls |
| 28 | +``` |
| 29 | +If you see errors, your credentials are most likely not set properly. |
| 30 | +Please follow the [FAQ on S3](#s3access) to make sure that your instance has proper S3 access. |
| 31 | + |
| 32 | +#### Instructions |
| 33 | + |
| 34 | +**Make sure that your AWS account is whitelisted for Tornasole. [ContactUs](#contactus)**. |
| 35 | + |
| 36 | +Once your account is whitelisted, you should be able to install the `tornasole` package built for XGBoost as follows: |
| 37 | + |
| 38 | +``` |
| 39 | +aws s3 sync s3://tornasole-binaries-use1/tornasole_xgboost/py3/latest/ tornasole_xgboost/ |
| 40 | +pip install tornasole_xgboost/tornasole-* |
| 41 | +``` |
| 42 | + |
| 43 | +**Please note**: If you get a version conflict between botocore and boto3 while installing tornasole, |
| 44 | +you might need to run the following: |
| 45 | +``` |
| 46 | +pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1 |
| 47 | +``` |
| 48 | + |
| 49 | +## Quickstart |
| 50 | + |
| 51 | +If you want to quickly run some examples, you can jump to the [examples](#examples) section. You can also look at this [XGBoost notebook example](../../examples/xgboost/notebooks/xgboost_abalone.ipynb) to see tornasole working. |
| 52 | + |
| 53 | +Integrating Tornasole into the training job can be accomplished by following the steps below. |
| 54 | + |
| 55 | +### Import the Tornasole package |
| 56 | + |
| 57 | +Import the TornasoleHook class along with other helper classes in your training script as shown below: |
| 58 | + |
| 59 | +``` |
| 60 | +from tornasole.xgboost import TornasoleHook |
| 61 | +from tornasole import SaveConfig |
| 62 | +``` |
| 63 | + |
| 64 | +### Instantiate and initialize tornasole hook |
| 65 | + |
| 66 | +``` |
| 67 | + # Create SaveConfig that instructs engine to log graph tensors every 10 steps. |
| 68 | + save_config = SaveConfig(save_interval=10) |
| 69 | + # Create a hook that logs evaluation metrics and feature importances while training the model. |
| 70 | + output_s3_uri = 's3://my_xgboost_training_debug_bucket/12345678-abcd-1234-abcd-1234567890ab' |
| 71 | + hook = TornasoleHook(out_dir=output_s3_uri, save_config=save_config) |
| 72 | +``` |
| 73 | + |
| 74 | +Using the *Collection* object and/or the *include\_regex* parameter of TornasoleHook, users can control which tensors are stored by the TornasoleHook. |
| 75 | +The section [How to save tensors](#how-to-save-tensors) explains the various ways users can create a *Collection* object to store the required tensors. |
| 76 | + |
| 77 | +The *SaveConfig* object controls when these tensors are stored. The tensors can be stored at specific steps or at a regular interval of steps. If the *save\_config* parameter is not specified, the TornasoleHook stores tensors every 100 steps. |
| 78 | + |
| 79 | +For additional details on TornasoleHook, SaveConfig and Collection, please refer to the [API documentation](api.md). |
| 80 | + |
| 81 | +### Register the Tornasole hook with the model before training starts |
| 82 | + |
| 83 | +Users can use the hook as a callback function when training a booster. |
| 84 | + |
| 85 | +``` |
| 86 | +xgboost.train(params, dtrain, callbacks=[hook]) |
| 87 | +``` |
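To make the callback mechanism more concrete, here is a minimal, self-contained sketch of a hook-like callable. It is not Tornasole's implementation: `CallbackEnv` is a hypothetical stand-in that models only two fields of the environment object XGBoost's function-style callback interface passes to each callback after a boosting round, and `MetricRecorder` and the metric values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the environment object passed to callbacks;
# only the two fields used below are modeled.
CallbackEnv = namedtuple("CallbackEnv", ["iteration", "evaluation_result_list"])


class MetricRecorder:
    """Illustrative hook-like callable: records metrics every `save_interval` rounds."""

    def __init__(self, save_interval=10):
        self.save_interval = save_interval
        self.saved = {}  # step -> {metric name: value}

    def __call__(self, env):
        # Assumption: a step is saved when it is a multiple of the interval.
        if env.iteration % self.save_interval == 0:
            self.saved[env.iteration] = dict(env.evaluation_result_list)


# Simulate 25 boosting rounds with a made-up, steadily improving metric.
recorder = MetricRecorder(save_interval=10)
for i in range(25):
    recorder(CallbackEnv(i, [("validation-rmse", 1.0 / (i + 1))]))

print(sorted(recorder.saved))  # [0, 10, 20]
```

Because the object is just a callable, the training loop does not need to know anything about what is recorded or where it is written.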
| 88 | + |
| 89 | +## Examples |
| 90 | + |
| 91 | +### Tornasole local mode example |
| 92 | + |
| 93 | +The example [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) shows how Tornasole can detect when evaluation metrics such as validation error stop decreasing. |
| 94 | + |
| 95 | +``` |
| 96 | +python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --tornasole_path ~/tornasole-testing/basic-demo/trial-one |
| 97 | +``` |
| 98 | + |
| 99 | +You can monitor the job by using [rules](../rules/README.md). For example, you |
| 100 | +can monitor if the metrics such as `train-rmse` or `validation-rmse` in the |
| 101 | +`metric` collection stopped decreasing by doing the following: |
| 102 | + |
| 103 | +``` |
| 104 | +python3 -m tornasole.rules.rule_invoker --trial-dir ~/tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric' |
| 105 | +``` |
| 106 | + |
| 107 | +Note: You can also try some further analysis on the saved tensors by following the [programming model](../rules/README.md#the-programming-model) section of our Rules README. |
| 108 | + |
| 109 | +### Tornasole S3 mode example |
| 110 | + |
| 111 | +``` |
| 112 | +python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --output_uri s3://tornasole-testing/basic-demo/trial-one |
| 113 | +``` |
| 114 | + |
| 115 | +You can monitor the job for non-decreasing metrics by doing the following: |
| 116 | + |
| 117 | +``` |
| 118 | +python3 -m tornasole.rules.rule_invoker --trial-dir s3://tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric' |
| 119 | +``` |
| 120 | +Note: You can also try some further analysis on the saved tensors by following the [programming model](../rules/README.md#the-programming-model) section of our Rules README. |
| 121 | + |
| 122 | +## API |
| 123 | +Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support. |
| 124 | + |
| 125 | +#### Hook |
| 126 | + |
| 127 | +TornasoleHook is the entry point for Tornasole into your program. |
| 128 | +Some key parameters to consider when creating the TornasoleHook are the following: |
| 129 | + |
| 130 | +- `out_dir`: The path to which Tornasole writes its outputs. This can be a local path or an S3 prefix of the form `s3://bucket_name/prefix`. |
| 131 | +- `save_config`: A [SaveConfig](#saveconfig) object that specifies when tensors are stored. You can specify either the exact steps or the interval of steps at which tensors are stored. If not specified, it defaults to a SaveConfig that saves every 100 steps. |
| 132 | +- `include_collections`: The [collections](#collection) to be saved. Use this parameter to control which tensors are saved. |
| 133 | +- `include_regex`: The regex patterns of the names of tensors to save. Use this parameter to control which tensors are saved. |
| 134 | + |
| 135 | +**Examples** |
| 136 | + |
| 137 | +- Save evaluation metrics and feature importances every 10 steps to an S3 location: |
| 138 | + |
| 139 | +``` |
| 140 | +import tornasole.xgboost as tx |
| 141 | +tx.TornasoleHook(out_dir='s3://tornasole-testing/trial_job_dir', |
| 142 | + save_config=SaveConfig(save_interval=10), |
| 143 | + include_collections=['metric', 'feature_importance']) |
| 144 | +``` |
| 145 | + |
| 146 | +- Save custom tensors by regex pattern to a local path: |
| 147 | + |
| 148 | +``` |
| 149 | +import tornasole.xgboost as tx |
| 150 | +tx.TornasoleHook(out_dir='/home/ubuntu/tornasole-testing/trial_job_dir', |
| 151 | + include_regex=['validation*']) |
| 152 | +``` |
| 153 | + |
| 154 | +Refer to the [API](api.md) for all available parameters and detailed descriptions. |
| 155 | + |
| 156 | +#### Collection |
| 157 | + |
| 158 | +A Collection object groups tensors so that the tensors being saved are easier to handle. |
| 159 | +A collection has its own list of tensors, include regex patterns, and [save config](#saveconfig). |
| 160 | +This allows setting different save configs for different tensors. |
| 161 | +These collections are then also available during analysis. |
| 162 | +Tornasole saves the values of the tensors in a collection if the collection is included in the `include_collections` parameter of the [hook](#hook). |
| 163 | + |
| 164 | +Refer to [API](api.md) for all methods available when using collections such |
| 165 | +as setting SaveConfig for a specific collection or retrieving all collections. |
| 166 | + |
| 167 | +Please refer to [creating a collection](#creating-a-collection) for an overview of how to |
| 168 | +create a collection and add tensors to it. |
| 169 | + |
| 170 | +#### SaveConfig |
| 171 | + |
| 172 | +The SaveConfig class allows you to customize the frequency of saving tensors. |
| 173 | +The hook takes a SaveConfig object which is applied as |
| 174 | +default to all tensors included. |
| 175 | +A collection can also have its own SaveConfig object which is applied |
| 176 | +to the tensors belonging to that collection. |
| 177 | + |
| 178 | +SaveConfig also allows you to save tensors when certain tensors become NaN. |
| 179 | +The tensors to watch are passed as a list of strings containing the tensor names. |
| 180 | + |
| 181 | +The parameters taken by SaveConfig are: |
| 182 | + |
| 183 | +- `save_interval`: This allows you to save tensors every `n` steps |
| 184 | +- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved |
| 185 | + |
| 186 | +Refer to [API](api.md) for all parameters available and detailed descriptions for them, as well as example SaveConfig objects. |
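The interplay of `save_interval` and `save_steps` can be pictured with a small stand-alone function. This is only a sketch of the semantics described above, not the library's code: `should_save` is a hypothetical helper, and it assumes that steps are counted from 0, that a step is saved when it is a multiple of the interval, and that steps listed in `save_steps` are saved in addition to the interval steps.

```python
def should_save(step, save_interval=100, save_steps=None):
    """Return True if tensors would be saved at `step` (illustrative semantics only)."""
    # Assumption: explicitly listed steps are always saved.
    if save_steps is not None and step in save_steps:
        return True
    # Assumption: otherwise, save on every multiple of the interval.
    return save_interval is not None and step % save_interval == 0


saved = [s for s in range(31) if should_save(s, save_interval=10, save_steps=[3, 7])]
print(saved)  # [0, 3, 7, 10, 20, 30]
```

With the default interval of 100, a short training run of a few dozen rounds would only save step 0 under these assumptions, which is why the examples in this document pass a smaller `save_interval`.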
| 187 | + |
| 188 | +#### ReductionConfig |
| 189 | + |
| 190 | +ReductionConfig is not currently used in XGBoost Tornasole. |
| 191 | +When Tornasole is used with deep learning frameworks, such as MXNet, |
| 192 | +TensorFlow, or PyTorch, ReductionConfig allows the saving of certain |
| 193 | +reductions of tensors instead of saving the full tensor. |
| 194 | +By reduction here we mean an operation that converts the tensor to a scalar. |
| 195 | +However, in XGBoost, we currently support evaluation metrics, feature |
| 196 | +importances, and average SHAP values, which are all scalars and not tensors. |
| 197 | +Therefore, if the `reduction_config` parameter is set in |
| 198 | +`tornasole.xgboost.TornasoleHook`, it will be ignored and not used at all. |
| 199 | + |
| 200 | +### How to save tensors |
| 201 | + |
| 202 | +There are different ways to save tensors when using Tornasole. |
| 203 | +Tornasole provides easy ways to save certain standard tensors by way of default |
| 204 | +collections (a Collection represents a group of tensors). |
| 205 | +Examples of such collections are 'metric', 'feature\_importance', |
| 206 | +'average\_shap', and 'default'. |
| 207 | +Besides the tensors in above default collections, you can save tensors by name or regex patterns on those names. |
| 208 | +This section will take you through these ways in more detail. |
| 209 | + |
| 210 | +#### Saving the tensors with *include\_regex* |
| 211 | +The TornasoleHook API supports the *include\_regex* parameter. Users can specify regex patterns with this parameter, and the TornasoleHook stores the tensors whose names match them. With this approach, users can store tensors without explicitly creating a Collection object. The specified regex patterns are associated with the 'default' Collection and with the SaveConfig object of that collection. |
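As a rough illustration of name-based selection, the following stand-alone sketch filters a list of names with Python's `re` module. The tensor names and the `select_tensors` helper are hypothetical, and the use of `re.search` is an assumption; Tornasole's own matching logic may differ.

```python
import re


def select_tensors(tensor_names, include_regex):
    """Keep the names matched by at least one pattern (illustrative, uses re.search)."""
    compiled = [re.compile(pattern) for pattern in include_regex]
    return [name for name in tensor_names if any(p.search(name) for p in compiled)]


# Made-up tensor names for demonstration.
names = ["train-rmse", "validation-rmse", "validation-auc", "feature_importance/f0"]

print(select_tensors(names, ["validation.*"]))  # ['validation-rmse', 'validation-auc']
print(select_tensors(names, [".*-auc$"]))  # ['validation-auc']
```

Note that these are regular expressions rather than shell globs: a pattern that starts with `*`, such as `*-auc`, is not a valid Python regex and raises `re.error` when compiled.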
| 212 | + |
| 213 | +#### Default Collections |
| 214 | +Currently, the XGBoost TornasoleHook creates Collection objects for |
| 215 | +'metric', 'feature\_importance', 'average\_shap', and 'default'. These |
| 216 | +collections contain the regex patterns that match |
| 217 | +evaluation metrics, feature importances, and SHAP values. The regex pattern for |
| 218 | +the 'default' collection is set when the user specifies *include\_regex* with |
| 219 | +TornasoleHook or sets *save\_all=True*. These collections use the SaveConfig |
| 220 | +object passed to the TornasoleHook at initialization. The TornasoleHook |
| 221 | +stores the related tensors if the user does not specify any collection |
| 222 | +with the *include\_collections* parameter. If the user specifies a collection with |
| 223 | +*include\_collections*, the above default collections are not in effect. |
| 224 | +Please refer to [this document](api.md) for a description of all the default |
| 225 | +collections. |
| 226 | + |
| 227 | +#### Custom Collections |
| 228 | + |
| 229 | +You can also create customized collections yourself. |
| 230 | +You can create new collections as well as modify existing collections. |
| 231 | + |
| 232 | +##### Creating a collection |
| 233 | + |
| 234 | +Each collection should have a unique name (which is a string). You can create |
| 235 | +collections by invoking helper methods as described in the [API](api.md) documentation: |
| 236 | + |
| 237 | +``` |
| 238 | +from tornasole.xgboost import get_collection |
| 239 | +get_collection('metric').include(['validation-auc']) |
| 240 | +``` |
| 241 | + |
| 242 | +##### Adding tensors |
| 243 | + |
| 244 | +Tensors can be added to a collection by passing an include regex parameter to the collection. |
| 245 | +If you don't know the name of the tensors you want to add, you can also add the tensors to the collection |
| 246 | +by the variables representing the tensors in code. The following section describes the regex approach. |
| 247 | + |
| 248 | +##### Adding tensors by regex |
| 249 | +If you know the names of the tensors you want to save and can write regex |
| 250 | +patterns to match those tensor names, you can pass the regex patterns to the collection. |
| 251 | +The tensors which match these patterns are included and added to the collection. |
| 252 | + |
| 253 | +``` |
| 254 | +from tornasole.xgboost import get_collection |
| 255 | +get_collection('metric').include(["train.*", ".*-auc"]) |
| 256 | +``` |
| 257 | + |
| 258 | +#### Saving All Tensors |
| 259 | +Tornasole makes it easy to save all the tensors in the model. You just need to set the flag `save_all=True` when creating the hook. This creates a collection named 'all' and saves all the tensors under that collection. |
| 260 | +**NOTE : Storing all the tensors will slow down the training and will increase the storage consumption.** |
| 261 | + |
| 262 | + |
| 263 | +### More Examples |
| 264 | + |
| 265 | +| Example Type | Logging Evaluation Metrics | |
| 266 | +| -------------- | ------------------------ | |
| 267 | +| Link to Example | [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) | |
| 268 | + |
| 269 | +#### Logging evaluation metrics and feature importances of the model |
| 270 | + |
| 271 | +The [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) script shows an end-to-end example of how to create and register a Tornasole hook that logs performance metrics, feature importances, and SHAP values. |
| 272 | + |
| 273 | +Here is how to create a hook for this purpose: |
| 274 | + |
| 275 | +``` |
| 276 | +# Create a tornasole hook. The initialization of hook determines which tensors |
| 277 | +# are logged while training is in progress. |
| 278 | +# Following function shows the default initialization that enables logging of |
| 279 | +# evaluation metrics, feature importances, and SHAP values. |
| 280 | +def create_tornasole_hook(output_s3_uri, shap_data=None): |
| 281 | + |
| 282 | +    save_config = SaveConfig(save_interval=5) |
| 283 | +    hook = TornasoleHook( |
| 284 | +        out_dir=output_s3_uri, |
| 285 | +        save_config=save_config, |
| 286 | +        shap_data=shap_data) |
| 287 | + |
| 288 | +    return hook |
| 289 | +``` |
| 290 | + |
| 291 | +Here is how to use the hook as a callback function: |
| 292 | + |
| 293 | +``` |
| 294 | + bst = xgboost.train( |
| 295 | + params=params, dtrain=dtrain, |
| 296 | + ... |
| 297 | + callbacks=[hook]) |
| 298 | +``` |
| 299 | + |
| 300 | +The example can be invoked as shown below. **Ensure that the S3 bucket specified on the command line is accessible for read and write operations.** |
| 301 | + |
| 302 | +``` |
| 303 | +python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --output_uri s3://tornasole-testing/basic-xgboost-hook |
| 304 | +``` |
| 305 | + |
| 306 | +For detailed command-line help, run: |
| 307 | + |
| 308 | +``` |
| 309 | +python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --help |
| 310 | +``` |
| 311 | + |
| 312 | + |
| 313 | +## Analyzing the Results |
| 314 | + |
| 315 | +This library enables users to collect the desired tensors at the desired frequency |
| 316 | +while an XGBoost training job is running. |
| 317 | +The tensor data generated during this job can be analyzed with various |
| 318 | +rules that check for performance metrics, feature importances, etc. |
| 319 | +For example, the performance metrics generated in |
| 320 | +[xgboost_abalone.ipynb](../../examples/xgboost/notebooks/xgboost_abalone.ipynb) |
| 321 | +are analyzed by the 'LossNotDecreasing' rule, which shows the number of performance |
| 322 | +metrics that are not decreasing at regular step intervals. |
| 323 | + |
| 324 | +``` |
| 325 | +python3 -m tornasole.rules.rule_invoker --trial-dir s3://tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric' |
| 326 | +``` |
| 327 | + |
| 328 | +For details regarding how to analyze the tensor data, usage of existing rules or writing new rules, |
| 329 | +please refer to [Rules documentation](../rules/README.md). |
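To give an intuition for what a rule of this kind checks, the sketch below flags the saved steps at which a metric failed to improve on the best value seen so far. It is a simplified illustration under stated assumptions, not the actual 'LossNotDecreasing' rule: the function name, the `min_delta` parameter, and the metric values are all hypothetical.

```python
def steps_without_decrease(values_by_step, min_delta=0.0):
    """Return the steps whose metric did not drop below the best value seen so far."""
    flagged = []
    best = None
    for step in sorted(values_by_step):
        value = values_by_step[step]
        # Flag the step if it does not improve on the best value by more than min_delta.
        if best is not None and value >= best - min_delta:
            flagged.append(step)
        if best is None or value < best:
            best = value
    return flagged


# Made-up validation RMSE values saved every 10 steps.
rmse = {0: 3.2, 10: 2.5, 20: 2.1, 30: 2.1, 40: 2.3, 50: 1.9}
print(steps_without_decrease(rmse))  # [30, 40]
```

Because the check only needs the saved values and their step numbers, it can run on a separate machine while training is still in progress, reading new steps as they appear.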
| 330 | + |
| 331 | + |
| 332 | +## FAQ |
| 333 | +#### Logging |
| 334 | +You can control the logging from Tornasole by setting the appropriate |
| 335 | +level for the python logger `tornasole` using either of the following approaches. |
| 336 | + |
| 337 | +**In Python code** |
| 338 | +``` |
| 339 | +import logging |
| 340 | +logging.getLogger('tornasole').setLevel(logging.INFO) |
| 341 | +``` |
| 342 | + |
| 343 | +**Using environment variable** |
| 344 | +You can also set the environment variable `TORNASOLE_LOG_LEVEL` as below: |
| 345 | + |
| 346 | +``` |
| 347 | +export TORNASOLE_LOG_LEVEL=INFO |
| 348 | +``` |
| 349 | +Log levels available are 'INFO', 'DEBUG', 'WARNING', 'ERROR', 'CRITICAL', 'OFF'. |
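For readers who want to see how a level name like the ones above can be applied to a Python logger, here is a small sketch using only the standard library. How Tornasole itself reads `TORNASOLE_LOG_LEVEL` is not shown here; `configure_logger` is a hypothetical helper, and mapping 'OFF' to a level above `CRITICAL` is an assumption, since 'OFF' is not a standard `logging` level.

```python
import logging
import os


def configure_logger(name="tornasole", env_var="TORNASOLE_LOG_LEVEL", default="INFO"):
    """Set the named logger's level from an environment variable (illustrative only)."""
    level_name = os.environ.get(env_var, default).upper()
    logger = logging.getLogger(name)
    if level_name == "OFF":
        # Assumption: 'OFF' means "silence everything", so use a level above CRITICAL.
        logger.setLevel(logging.CRITICAL + 1)
    else:
        logger.setLevel(getattr(logging, level_name))
    return logger


os.environ["TORNASOLE_LOG_LEVEL"] = "WARNING"
logger = configure_logger()
print(logging.getLevelName(logger.level))  # WARNING
```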
| 350 | + |
| 351 | +#### S3Access |
| 352 | +The instance running Tornasole in S3 mode needs to have S3 access. There are different ways to give an instance access to S3 in your account. |
| 353 | +- If you are using an EC2 instance, launch it with an IAM role that grants S3 access: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html |
| 354 | +- If you are using a Mac or another machine, you can create an IAM user with S3 access by following this guide (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) and then configure your machine to use that user's AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as described here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html |
| 355 | +- Once you are done configuring, verify that the command below works and that the buckets returned are from the account and region you want to use. |
| 356 | +``` |
| 357 | +aws s3 ls |
| 358 | +``` |
| 359 | + |
| 360 | +## ContactUs |
| 361 | +We would like to hear from you. If you have any questions or feedback, please reach out to us at [email protected] |
| 362 | + |
| 363 | +## License |
| 364 | +This library is licensed under the Apache 2.0 License. |