Commit c9bd1fc

add mtry interpretation docs

1 parent 64e7125 commit c9bd1fc

7 files changed: +69 additions, -12 deletions

man/rmd/boost_tree_lightgbm.Rmd

Lines changed: 5 additions & 0 deletions
@@ -69,6 +69,11 @@ boost_tree(

Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
### Verbosity

bonsai quiets much of the logging output from [lightgbm::lgb.train()] by default. With default settings, logged warnings and errors will still be passed on to the user. To silence these as well, set `quiet = TRUE`.
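As an illustration of the verbosity note above, here is a minimal sketch of muting the remaining logging; passing `quiet` through `set_engine()` is an assumption here, based on it being an argument of bonsai's wrapper:

```r
library(parsnip)
library(bonsai)

# Sketch: forward `quiet` to bonsai's lightgbm wrapper to mute the
# logging that is still emitted under the default settings.
quiet_spec <- boost_tree(trees = 100) %>%
  set_engine("lightgbm", quiet = TRUE) %>%
  set_mode("regression")
```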

man/rmd/boost_tree_lightgbm.md

Lines changed: 17 additions & 4 deletions
@@ -29,7 +29,7 @@ Note that parsnip's translation can be overridden via the `counts` argument, sup

## Translation from parsnip to the original package (regression)

-
+The **bonsai** extension package is required to fit this model.


```r
@@ -59,12 +59,13 @@ boost_tree(
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
## feature_fraction = integer(), num_iterations = integer(),
## min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
-## min_gain_to_split = numeric(), verbose = -1)
+## min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
+## seed = sample.int(10^5, 1), deterministic = TRUE)
```

## Translation from parsnip to the original package (classification)

-
+The **bonsai** extension package is required to fit this model.


```r
@@ -94,7 +95,8 @@ boost_tree(
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
## feature_fraction = integer(), num_iterations = integer(),
## min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
-## min_gain_to_split = numeric(), verbose = -1)
+## min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
+## seed = sample.int(10^5, 1), deterministic = TRUE)
```

[train_lightgbm()] is a wrapper around [lightgbm::lgb.train()] (and other functions) that makes it easier to run this model.
@@ -108,6 +110,17 @@ This engine does not require any special encoding of the predictors. Categorical

Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
### Verbosity

bonsai quiets much of the logging output from [lightgbm::lgb.train()] by default. With default settings, logged warnings and errors will still be passed on to the user. To silence these as well, set `quiet = TRUE`.
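To make the `counts` behavior above concrete, here is a minimal sketch (assuming the parsnip and bonsai packages are installed) of supplying `mtry` as a proportion to the `"lightgbm"` engine:

```r
library(parsnip)
library(bonsai)

# With counts = FALSE, mtry = 0.5 is a proportion: sample roughly
# half of the predictors at each split.
prop_spec <- boost_tree(mtry = 0.5, trees = 200) %>%
  set_engine("lightgbm", counts = FALSE) %>%
  set_mode("regression")

fit(prop_spec, mpg ~ ., data = mtcars)
```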

man/rmd/boost_tree_xgboost.Rmd

Lines changed: 5 additions & 2 deletions
@@ -24,8 +24,6 @@ This model has `r nrow(param)` tuning parameters:
param$item
```

-The `mtry` parameter is related to the number of predictors. The default is to use all predictors. [xgboost::xgb.train()] encodes this as a real number between zero and one. parsnip translates the number of columns to this type of value. The user should give the argument to `boost_tree()` as an integer (not a real number).
-
## Translation from parsnip to the original package (regression)

```{r xgboost-reg}
@@ -70,6 +68,11 @@ xgboost requires the data to be in a sparse format. If your predictor data are a

By default, the model is trained without parallel processing. This can be changed by passing the `nthread` parameter to [set_engine()]. However, it is unwise to combine this with external parallel processing when using the \pkg{tune} package.

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
### Early stopping

The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
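As a quick illustration of the parallelism note in the section above, a minimal sketch of enabling xgboost's internal threading; the thread count is arbitrary:

```r
library(parsnip)

# Let xgboost use four threads during training. Avoid combining this
# with external parallelism from the tune package.
threaded_spec <- boost_tree(trees = 200) %>%
  set_engine("xgboost", nthread = 4) %>%
  set_mode("regression")
```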

man/rmd/boost_tree_xgboost.md

Lines changed: 11 additions & 2 deletions
@@ -25,8 +25,6 @@ This model has 8 tuning parameters:

- `stop_iter`: # Iterations Before Stopping (type: integer, default: Inf)

-The `mtry` parameter is related to the number of predictors. The default is to use all predictors. [xgboost::xgb.train()] encodes this as a real number between zero and one. parsnip translates the number of columns to this type of value. The user should give the argument to `boost_tree()` as an integer (not a real number).
-
## Translation from parsnip to the original package (regression)

@@ -117,6 +115,17 @@ xgboost requires the data to be in a sparse format. If your predictor data are a

By default, the model is trained without parallel processing. This can be changed by passing the `nthread` parameter to [set_engine()]. However, it is unwise to combine this with external parallel processing when using the \pkg{tune} package.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
### Early stopping

The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
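A minimal sketch of the early stopping setup described above; the `validation` engine argument (the proportion of data held out to monitor the objective) is an assumption based on parsnip's xgboost wrapper:

```r
library(parsnip)

# Stop boosting if the objective has not improved in 10 iterations,
# monitored on a 20% holdout.
early_stop_spec <- boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")
```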

man/rmd/rule_fit_xrf.Rmd

Lines changed: 8 additions & 1 deletion
@@ -83,11 +83,18 @@ Also, there are several configuration differences in how `xrf()` is fit between

These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Additionally, **rules** can set `penalty` directly, whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).

-## Preprocessing requirements
+## Other details
+
+### Preprocessing requirements

```{r child = "template-makes-dummies.Rmd"}
```

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
## References

- Friedman and Popescu. "Predictive learning via rule ensembles." Ann. Appl. Stat. 2 (3), 916-954, September 2008.
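To illustrate the `penalty` note above, a minimal sketch (assuming the rules package is installed) of setting `penalty` directly instead of relying on xrf's internal cross-validation:

```r
library(parsnip)
library(rules)

# Fix the glmnet penalty rather than letting xrf choose it via its
# internal 5-fold cross-validation.
xrf_spec <- rule_fit(trees = 100, penalty = 0.01) %>%
  set_engine("xrf") %>%
  set_mode("classification")
```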

man/rmd/rule_fit_xrf.md

Lines changed: 16 additions & 3 deletions
@@ -65,7 +65,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -111,7 +111,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -132,11 +132,24 @@ Also, there are several configuration differences in how `xrf()` is fit between

These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Additionally, **rules** can set `penalty` directly, whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).

-## Preprocessing requirements
+## Other details
+
+### Preprocessing requirements


Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via \code{\link[=fit.model_spec]{fit()}}, parsnip will convert factor columns to indicators.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
## References

- Friedman and Popescu. "Predictive learning via rule ensembles." Ann. Appl. Stat. 2 (3), 916-954, September 2008.
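Since `rule_fit()` takes `mtry` as a main argument, the proportion parameterization described above applies to this engine as well; a minimal sketch:

```r
library(parsnip)
library(rules)

# With counts = FALSE, mtry = 0.75 samples three quarters of the
# predictors at each split in the boosted stage of the rule ensemble.
xrf_prop_spec <- rule_fit(mtry = 0.75, trees = 100) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("classification")
```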

man/rmd/template-mtry-prop.Rmd

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
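A minimal sketch contrasting the two interpretations of `mtry` with the `"xgboost"` engine:

```r
library(parsnip)

# Default (counts = TRUE): mtry is a number of predictors.
count_spec <- boost_tree(mtry = 5) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# counts = FALSE: mtry is a proportion within [0, 1].
prop_spec <- boost_tree(mtry = 0.5) %>%
  set_engine("xgboost", counts = FALSE) %>%
  set_mode("regression")
```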
