
Commit 8399717

Edward J Kim authored and rahul003 committed
Add readme, api docs and notebooks for xgboost hook (aws#179)
* Add readme and api docs for xgboost hook
* Update sagemaker notebook
* Update XGBoost docs and notebooks
* Remove modes section
1 parent 3daeb5a commit 8399717

File tree

6 files changed

+3089
-0
lines changed


README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ Please follow the appropriate Readme page to install the correct version.
1919
#### [Tornasole TensorFlow](docs/tensorflow/README.md)
2020
#### [Tornasole MXNet](docs/mxnet/README.md)
2121
#### [Tornasole PyTorch](docs/pytorch/README.md)
22+
#### [Tornasole XGBoost](docs/xgboost/README.md)
2223

2324
### Analysis
2425
Please refer to **[this page](docs/rules/README.md)** for more details about how to analyze.

docs/xgboost/README.md

Lines changed: 364 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,364 @@
1+
# Tornasole for XGBoost
2+
3+
Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. Tornasole helps you monitor your training in near real time using rules, and alerts you once it detects an inconsistency in training.
4+
5+
Using Tornasole is a two-step process:
6+
7+
**Saving tensors**
8+
This needs the `tornasole` package built for the appropriate framework. This package lets you collect the tensors you want at the frequency that you want, and save them for analysis.
9+
Please follow the appropriate Readme page to install the correct version. This page is for using Tornasole with XGBoost.
10+
11+
**Analysis**
12+
Please refer to [this page](../rules/README.md) for more details about how to run rules and other analysis
13+
on the tensors collected from the job. That said, we provide a few example analysis commands below
14+
to show an end-to-end flow. The analysis of these tensors can be done on a separate machine
15+
in parallel with the training job.
16+
17+
## Installation
18+
19+
#### Prerequisites
20+
21+
- **Python 3.6**
22+
- Tornasole can work in local mode or remote (S3) mode. You can skip the S3 setup described here if you only want to try the [local mode example](#tornasole-local-mode-example).
23+
This setup is required if you want to try the [S3 mode example](#tornasole-s3-mode-example).
24+
To run in S3 mode, make sure that the instance you are using has credentials that grant S3 write access.
25+
Try the command below:
26+
```
27+
aws s3 ls
28+
```
29+
If you see errors, your credentials are most likely not set properly.
30+
Please follow [FAQ on S3](#s3access) to make sure that your instance has proper S3 access.
31+
32+
#### Instructions
33+
34+
**Make sure that your AWS account is whitelisted for Tornasole. [ContactUs](#contactus)**.
35+
36+
Once your account is whitelisted, you should be able to install the `tornasole` package built for XGBoost as follows:
37+
38+
```
39+
aws s3 sync s3://tornasole-binaries-use1/tornasole_xgboost/py3/latest/ tornasole_xgboost/
40+
pip install tornasole_xgboost/tornasole-*
41+
```
42+
43+
**Please note** : If, while installing tornasole, you get a version conflict issue between botocore and boto3,
44+
you might need to run the following:
45+
```
46+
pip uninstall -y botocore boto3 aioboto3 aiobotocore && pip install botocore==1.12.91 boto3==1.9.91 aiobotocore==0.10.2 aioboto3==6.4.1
47+
```
48+
49+
## Quickstart
50+
51+
If you want to quickly run some examples, you can jump to the [examples](#examples) section. You can also see this [XGBoost notebook example](../../examples/xgboost/notebooks/xgboost_abalone.ipynb) to see Tornasole working.
52+
53+
Integrating Tornasole into the training job can be accomplished by following the steps below.
54+
55+
### Import the Tornasole package
56+
57+
Import the TornasoleHook class along with other helper classes in your training script as shown below:
58+
59+
```
60+
from tornasole.xgboost import TornasoleHook
61+
from tornasole import SaveConfig
62+
```
63+
64+
### Instantiate and initialize tornasole hook
65+
66+
```
67+
# Create a SaveConfig that instructs the hook to save tensors every 10 steps.
68+
save_config = SaveConfig(save_interval=10)
69+
# Create a hook that logs evaluation metrics and feature importances while training the model.
70+
output_s3_uri = 's3://my_xgboost_training_debug_bucket/12345678-abcd-1234-abcd-1234567890ab'
71+
hook = TornasoleHook(out_dir=output_s3_uri, save_config=save_config)
72+
```
73+
74+
Using the *Collection* object and/or the *include\_regex* parameter of TornasoleHook, users can control which tensors will be stored by the TornasoleHook.
75+
The section [How to save tensors](#how-to-save-tensors) explains the ways users can create a *Collection* object to store the required tensors.
76+
77+
The *SaveConfig* object controls when these tensors are stored. The tensors can be stored at specific steps or at a regular interval of steps. If the *save\_config* parameter is not specified, the TornasoleHook will store tensors every 100 steps.
78+
79+
For additional details on TornasoleHook, SaveConfig and Collection, please refer to the [API documentation](api.md).
80+
81+
### Register the Tornasole hook with the model before training starts
82+
83+
Users can use the hook as a callback function when training a booster.
84+
85+
```
86+
xgboost.train(params, dtrain, callbacks=[hook])
87+
```
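
To make the callback mechanism more concrete, here is a minimal sketch of a callable that behaves like a hook. It assumes the legacy XGBoost callback interface, in which each callback is called once per boosting round with an environment object exposing `iteration` and `evaluation_result_list`. `FakeEnv`, `RecordingHook`, and the metric values are hypothetical stand-ins for illustration only; this is not how `TornasoleHook` is implemented.

```python
from collections import namedtuple

# Hypothetical stand-in for the environment object that the legacy XGBoost
# callback interface passes to each callback after every boosting round.
FakeEnv = namedtuple("FakeEnv", ["iteration", "evaluation_result_list"])


class RecordingHook:
    """Illustrative callback that records metrics every `save_interval` rounds."""

    def __init__(self, save_interval=10):
        self.save_interval = save_interval
        self.saved = {}  # step -> {metric name: value}

    def __call__(self, env):
        if env.iteration % self.save_interval == 0:
            self.saved[env.iteration] = dict(env.evaluation_result_list)


# Simulate 25 boosting rounds with a made-up, steadily decreasing metric.
hook = RecordingHook(save_interval=10)
for step in range(25):
    hook(FakeEnv(step, [("train-rmse", 10.0 - 0.1 * step)]))

print(sorted(hook.saved))  # [0, 10, 20]
```

Because the hook is just a callable, it can sit in the `callbacks` list next to any other callbacks the training script already uses.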
88+
89+
## Examples
90+
91+
### Tornasole local mode example
92+
93+
The example [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) shows how Tornasole can detect when evaluation metrics such as validation error stop decreasing.
94+
95+
```
96+
python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --tornasole_path ~/tornasole-testing/basic-demo/trial-one
97+
```
98+
99+
You can monitor the job by using [rules](../rules/README.md). For example, you
100+
can monitor if the metrics such as `train-rmse` or `validation-rmse` in the
101+
`metric` collection stopped decreasing by doing the following:
102+
103+
```
104+
python3 -m tornasole.rules.rule_invoker --trial-dir ~/tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric'
105+
```
106+
107+
Note: You can also try some further analysis on the saved tensors by following the [programming model](../rules/README.md#the-programming-model) section of our Rules README.
108+
109+
### Tornasole S3 mode example
110+
111+
```
112+
python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --output_uri s3://tornasole-testing/basic-demo/trial-one
113+
```
114+
115+
You can monitor the job for non-decreasing metrics by doing the following:
116+
117+
```
118+
python3 -m tornasole.rules.rule_invoker --trial-dir s3://tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric'
119+
```
120+
Note: You can also try some further analysis on the saved tensors by following the [programming model](../rules/README.md#the-programming-model) section of our Rules README.
121+
122+
## API
123+
Please refer to [this document](api.md) for a description of all the functions and parameters that our APIs support.
124+
125+
#### Hook
126+
127+
TornasoleHook is the entry point for Tornasole into your program.
128+
Some key parameters to consider when creating the TornasoleHook are the following:
129+
130+
- `out_dir`: The path to which the outputs of Tornasole are written. This can be a local path or an S3 prefix of the form `s3://bucket_name/prefix`.
131+
- `save_config`: This is an object of [SaveConfig](#saveconfig). The SaveConfig lets you specify when the tensors are stored, either as a list of specific steps or as an interval of steps. If not specified, it defaults to a SaveConfig which saves every 100 steps.
132+
- `include_collections`: This represents the [collections](#collection) to be saved. With this parameter, user can control which tensors are to be saved.
133+
- `include_regex`: This represents the regex patterns of names of tensors to save. With this parameter, user can control which tensors are to be saved.
134+
135+
**Examples**
136+
137+
- Save evaluation metrics and feature importances every 10 steps to an S3 location:
138+
139+
```
import tornasole.xgboost as tx
from tornasole import SaveConfig

# Save the 'metric' and 'feature_importance' collections every 10 steps.
hook = tx.TornasoleHook(out_dir='s3://tornasole-testing/trial_job_dir',
                        save_config=SaveConfig(save_interval=10),
                        include_collections=['metric', 'feature_importance'])
```
145+
146+
- Save custom tensors by regex pattern to a local path:
147+
148+
```
import tornasole.xgboost as tx

# Save every tensor whose name matches the regex, using the default SaveConfig.
hook = tx.TornasoleHook(out_dir='/home/ubuntu/tornasole-testing/trial_job_dir',
                        include_regex=['validation.*'])
```
153+
154+
Refer to the [API](api.md) for all available parameters and detailed descriptions.
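
Since `out_dir` accepts either a local path or an S3 prefix, it can help to see how the two forms differ. The sketch below uses only the Python standard library to tell them apart and to split an S3 URI into bucket and prefix. The helper name `split_out_dir` is hypothetical and is not part of the Tornasole API.

```python
from urllib.parse import urlparse


def split_out_dir(out_dir):
    """Return ('s3', bucket, prefix) for S3 URIs, else ('local', None, path)."""
    parsed = urlparse(out_dir)
    if parsed.scheme == "s3":
        return "s3", parsed.netloc, parsed.path.lstrip("/")
    return "local", None, out_dir


print(split_out_dir("s3://tornasole-testing/trial_job_dir"))
print(split_out_dir("/home/ubuntu/tornasole-testing/trial_job_dir"))
```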
155+
156+
#### Collection
157+
158+
A Collection object groups tensors so that the tensors being saved are easier to handle.
159+
A collection has its own list of tensors, include regex patterns, and [save config](#saveconfig).
160+
This allows setting of different save configs for different tensors.
161+
These collections are then also available during analysis.
162+
Tornasole will save the values of the tensors in a collection if the collection is included in the `include_collections` param of the [hook](#hook).
163+
164+
Refer to [API](api.md) for all methods available when using collections such
165+
as setting SaveConfig for a specific collection or retrieving all collections.
166+
167+
Please refer to [creating a collection](#creating-a-collection) for an overview of how to
168+
create a collection and add tensors to it.
169+
170+
#### SaveConfig
171+
172+
SaveConfig class allows you to customize the frequency of saving tensors.
173+
The hook takes a SaveConfig object which is applied as
174+
default to all tensors included.
175+
A collection can also have its own SaveConfig object which is applied
176+
to the tensors belonging to that collection.
177+
178+
SaveConfig also allows you to save tensors when certain tensors become NaN.
179+
This list of tensors to watch for is taken as a list of strings representing names of tensors.
180+
181+
The parameters taken by SaveConfig are:
182+
183+
- `save_interval`: This allows you to save tensors every `n` steps
184+
- `save_steps`: Allows you to pass a list of step numbers at which tensors should be saved
185+
186+
Refer to [API](api.md) for all parameters available and detailed descriptions for them, as well as example SaveConfig objects.
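
As a rough illustration of how the two parameters above could decide whether a given step is saved, consider the following sketch. It assumes that steps are counted from 0 and that an explicit `save_steps` list takes precedence over `save_interval`; both are assumptions made for this example, and `should_save` is a hypothetical helper, not the library's implementation.

```python
def should_save(step, save_interval=100, save_steps=None):
    """Illustrative check: save at listed steps, else every `save_interval` steps."""
    if save_steps is not None:
        return step in save_steps
    return step % save_interval == 0


# Steps saved out of the first 30 under each configuration.
print([s for s in range(30) if should_save(s, save_interval=10)])    # [0, 10, 20]
print([s for s in range(30) if should_save(s, save_steps=[1, 5, 7])])  # [1, 5, 7]
```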
187+
188+
#### ReductionConfig
189+
190+
ReductionConfig is not currently used in XGBoost Tornasole.
191+
When Tornasole is used with deep learning frameworks, such as MXNet,
192+
TensorFlow, or PyTorch, ReductionConfig allows the saving of certain
193+
reductions of tensors instead of saving the full tensor.
194+
By reduction here we mean an operation that converts the tensor to a scalar.
195+
However, in XGBoost, we currently support evaluation metrics, feature
196+
importances, and average SHAP values, which are all scalars and not tensors.
197+
Therefore, if the `reduction_config` parameter is set in
198+
`tornasole.xgboost.TornasoleHook`, it will be ignored and not used at all.
199+
200+
### How to save tensors
201+
202+
There are different ways to save tensors when using Tornasole.
203+
Tornasole provides easy ways to save certain standard tensors by way of default
204+
collections (a Collection represents a group of tensors).
205+
Examples of such collections are 'metric', 'feature\_importance',
206+
'average\_shap', and 'default'.
207+
Besides the tensors in the above default collections, you can save tensors by name or by regex patterns on those names.
208+
This section will take you through these ways in more detail.
209+
210+
#### Saving the tensors with *include\_regex*
211+
The TornasoleHook API supports the *include\_regex* parameter. Users can specify a regex pattern with this parameter, and the TornasoleHook will store the tensors whose names match it. With this approach, users can store tensors without explicitly creating a Collection object. The specified regex pattern is associated with the 'default' Collection and with the SaveConfig object of the 'default' collection.
212+
213+
#### Default Collections
214+
Currently, the XGBoost TornasoleHook creates Collection objects for
215+
'metric', 'feature\_importance', 'average\_shap', and 'default'. These
216+
collections contain the regex patterns that match
217+
evaluation metrics, feature importances, and SHAP values. The regex pattern for
218+
the 'default' collection is set when user specifies *include\_regex* with
219+
TornasoleHook or sets *save_all=True*. These collections use the SaveConfig
220+
parameter provided with the TornasoleHook initialization. The TornasoleHook
221+
will store the related tensors, if user does not specify any special collection
222+
with *include\_collections* parameter. If user specifies a collection with
223+
*include\_collections* the above default collections will not be in effect.
224+
Please refer to [this document](api.md) for a description of all the default
225+
collections.
226+
227+
#### Custom Collections
228+
229+
You can also create any other customized collection yourself.
230+
You can create new collections as well as modify existing collections.
231+
232+
##### Creating a collection
233+
234+
Each collection should have a unique name (which is a string). You can create
235+
collections by invoking helper methods as described in the [API](api.md) documentation.
236+
237+
```
238+
from tornasole.xgboost import get_collection
239+
get_collection('metric').include(['validation-auc'])
240+
```
241+
242+
##### Adding tensors
243+
244+
Tensors can be added to a collection by passing an include regex parameter to the collection.
245+
If you don't know the name of the tensors you want to add, you can also add the tensors to the collection
246+
by the variables representing the tensors in code. The following sections describe these two scenarios.
247+
248+
##### Adding tensors by regex
249+
If you know the name of the tensors you want to save and can write regex
250+
patterns to match those tensor names, you can pass the regex patterns to the collection.
251+
The tensors which match these patterns are included and added to the collection.
252+
253+
```
254+
from tornasole.xgboost import get_collection
255+
get_collection('metric').include(["train.*", ".*-auc"])
256+
```
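
To check which tensor names a set of patterns would pick up before starting a training job, you can try the patterns with Python's `re` module. The sketch below assumes that patterns are applied with `re.search` semantics; the helper `select_tensors` and the metric names are hypothetical and only serve as an illustration. Note that the patterns are regular expressions, not shell globs: a pattern that starts with `*` is not a valid regular expression in Python, so `.*` is used for "any characters".

```python
import re


def select_tensors(names, include_regex):
    """Return the names matched by at least one of the regex patterns."""
    patterns = [re.compile(p) for p in include_regex]
    return [n for n in names if any(p.search(n) for p in patterns)]


# Made-up metric names of the kind XGBoost reports for train/validation sets.
names = ["train-rmse", "validation-rmse", "train-auc", "validation-auc"]
print(select_tensors(names, ["train.*", ".*-auc"]))
# ['train-rmse', 'train-auc', 'validation-auc']
```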
257+
258+
#### Saving All Tensors
259+
Tornasole makes it easy to save all the tensors in the model. You just need to set the flag `save_all=True` when creating the hook. This creates a collection named 'all' and saves all the tensors under that collection.
260+
**NOTE : Storing all the tensors will slow down the training and will increase the storage consumption.**
261+
262+
263+
### More Examples
264+
265+
| Example Type | Logging Evaluation Metrics |
| -------------- | ------------------------ |
| Link to Example | [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) |
268+
269+
#### Logging evaluation metrics and feature importances of the model
270+
271+
The [xgboost\_abalone\_basic\_hook\_demo.py](../../examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py) script is an end-to-end example of how to create and register a Tornasole hook that logs performance metrics, feature importances, and SHAP values.
272+
273+
Here is how to create a hook for this purpose:
274+
275+
```
from tornasole import SaveConfig
from tornasole.xgboost import TornasoleHook


# Create a tornasole hook. The initialization of the hook determines which tensors
# are logged while training is in progress.
# The following function shows the default initialization that enables logging of
# evaluation metrics, feature importances, and SHAP values.
def create_tornasole_hook(output_s3_uri, shap_data=None):

    save_config = SaveConfig(save_interval=5)
    hook = TornasoleHook(
        out_dir=output_s3_uri,
        save_config=save_config,
        shap_data=shap_data)

    return hook
```
290+
291+
Here is how to use the hook as a callback function:
292+
293+
```
bst = xgboost.train(
    params=params, dtrain=dtrain,
    ...
    callbacks=[hook])
```
299+
300+
The example can be invoked as shown below. **Ensure that the S3 bucket specified on the command line is accessible for read and write operations.**
301+
302+
```
303+
python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --output_uri s3://tornasole-testing/basic-xgboost-hook
304+
```
305+
306+
For detailed command-line help, run:
307+
308+
```
309+
python3 examples/xgboost/scripts/xgboost_abalone_basic_hook_demo.py --help
310+
```
311+
312+
313+
## Analyzing the Results
314+
315+
This library enables users to collect the desired tensors at the desired frequency
316+
while an XGBoost training job is running.
317+
The tensor data generated during this job can be analyzed with various
318+
rules that check for performance metrics, feature importances, etc.
319+
For example, the performance metrics generated in
320+
[xgboost_abalone.ipynb](../../examples/xgboost/notebooks/xgboost_abalone.ipynb)
321+
are analyzed by the 'LossNotDecreasing' rule, which shows the number of performance
322+
metrics that are not decreasing at regular step intervals.
323+
324+
```
325+
python3 -m tornasole.rules.rule_invoker --trial-dir s3://tornasole-testing/basic-demo/trial-one --rule-name LossNotDecreasing --use_loss_collection False --collection_names 'metric'
326+
```
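
The idea behind a rule of this kind can be sketched in a few lines of plain Python. The function below flags the positions at which a metric failed to decrease compared with the previously saved value. It is a simplified illustration with made-up values and a hypothetical `min_delta` threshold, not the actual implementation of the `LossNotDecreasing` rule.

```python
def steps_without_decrease(values, min_delta=0.0):
    """Return the indices where the metric failed to drop by more than `min_delta`."""
    return [
        i for i in range(1, len(values))
        if values[i - 1] - values[i] <= min_delta
    ]


# Made-up validation-rmse values, one per saved step.
validation_rmse = [3.2, 2.9, 2.7, 2.7, 2.8, 2.6]
print(steps_without_decrease(validation_rmse))  # [3, 4]
```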
327+
328+
For details regarding how to analyze the tensor data, usage of existing rules or writing new rules,
329+
please refer to [Rules documentation](../rules/README.md).
330+
331+
332+
## FAQ
333+
#### Logging
334+
You can control the logging from Tornasole by setting the appropriate
335+
level for the python logger `tornasole` using either of the following approaches.
336+
337+
**In Python code**
338+
```
339+
import logging
340+
logging.getLogger('tornasole').setLevel(logging.INFO)
341+
```
342+
343+
**Using environment variable**
344+
You can also set the environment variable `TORNASOLE_LOG_LEVEL` as shown below:
345+
346+
```
347+
export TORNASOLE_LOG_LEVEL=INFO
348+
```
349+
Log levels available are 'INFO', 'DEBUG', 'WARNING', 'ERROR', 'CRITICAL', 'OFF'.
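
For readers curious how an environment variable like this can be mapped onto Python's `logging` levels, here is a minimal sketch. It assumes that 'OFF' means disabling the logger entirely, since `logging` has no built-in level of that name; `configure_logger` is a hypothetical helper and does not describe how Tornasole itself reads the variable.

```python
import logging
import os


def configure_logger(name="tornasole", default="INFO"):
    """Set the named logger's level from TORNASOLE_LOG_LEVEL (illustrative only)."""
    level_name = os.environ.get("TORNASOLE_LOG_LEVEL", default).upper()
    logger = logging.getLogger(name)
    if level_name == "OFF":
        # Assumption: 'OFF' silences the logger rather than naming a level.
        logger.disabled = True
    else:
        logger.disabled = False
        logger.setLevel(getattr(logging, level_name))
    return logger


os.environ["TORNASOLE_LOG_LEVEL"] = "DEBUG"
logger = configure_logger()
print(logging.getLevelName(logger.level))  # DEBUG
```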
350+
351+
#### S3Access
352+
The instance running Tornasole in S3 mode needs to have S3 access. There are different ways to give an instance access to S3 in your account.
353+
- If you are using an EC2 instance, you should launch your instance with an IAM role that grants access to S3. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
354+
- If you are using a Mac or another machine, you can create an IAM user with S3 access for your account by following this guide (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) and then configure your machine to use your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as described here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
355+
- Once you are done configuring, please verify that the command below works and that the buckets returned are from the account and region you want to use.
356+
```
357+
aws s3 ls
358+
```
359+
360+
## ContactUs
361+
We would like to hear from you. If you have any questions or feedback, please reach out to us at [email protected]
362+
363+
## License
364+
This library is licensed under the Apache 2.0 License.
