Commit c9bd1fc

add mtry interpretation docs

1 parent 64e7125 commit c9bd1fc

7 files changed: +69 additions, -12 deletions

man/rmd/boost_tree_lightgbm.Rmd

Lines changed: 5 additions & 0 deletions
@@ -69,6 +69,11 @@ boost_tree(

Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
### Verbosity

bonsai quiets much of the logging output from [lightgbm::lgb.train()] by default. With default settings, logged warnings and errors will still be passed on to the user. To silence these as well, set `quiet = TRUE`.
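As an illustration of the verbosity note above, here is a minimal sketch of muting the remaining logging; passing `quiet` through `set_engine()` is an assumption here, based on it being an argument of bonsai's wrapper:

```r
library(parsnip)
library(bonsai)

# Sketch: forward `quiet` to bonsai's lightgbm wrapper to mute the
# logging that is still emitted under the default settings.
quiet_spec <- boost_tree(trees = 100) %>%
  set_engine("lightgbm", quiet = TRUE) %>%
  set_mode("regression")
```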

man/rmd/boost_tree_lightgbm.md

Lines changed: 17 additions & 4 deletions
@@ -29,7 +29,7 @@ Note that parsnip's translation can be overridden via the `counts` argument, sup

## Translation from parsnip to the original package (regression)

-
+The **bonsai** extension package is required to fit this model.


```r
@@ -59,12 +59,13 @@ boost_tree(
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
## feature_fraction = integer(), num_iterations = integer(),
## min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
-## min_gain_to_split = numeric(), verbose = -1)
+## min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
+## seed = sample.int(10^5, 1), deterministic = TRUE)
```

## Translation from parsnip to the original package (classification)

-
+The **bonsai** extension package is required to fit this model.


```r
@@ -94,7 +95,8 @@ boost_tree(
## bonsai::train_lightgbm(x = missing_arg(), y = missing_arg(),
## feature_fraction = integer(), num_iterations = integer(),
## min_data_in_leaf = integer(), max_depth = integer(), learning_rate = numeric(),
-## min_gain_to_split = numeric(), verbose = -1)
+## min_gain_to_split = numeric(), verbose = -1, num_threads = 0,
+## seed = sample.int(10^5, 1), deterministic = TRUE)
```

[train_lightgbm()] is a wrapper around [lightgbm::lgb.train()] (and other functions) that makes it easier to run this model.
@@ -108,6 +110,17 @@ This engine does not require any special encoding of the predictors. Categorical

Non-numeric predictors (i.e., factors) are internally converted to numeric. In the classification context, non-numeric outcomes (i.e., factors) are also internally converted to numeric.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
### Verbosity

bonsai quiets much of the logging output from [lightgbm::lgb.train()] by default. With default settings, logged warnings and errors will still be passed on to the user. To silence these as well, set `quiet = TRUE`.
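To make the `counts` behavior above concrete, here is a minimal sketch (assuming the parsnip and bonsai packages are installed) of supplying `mtry` as a proportion to the `"lightgbm"` engine:

```r
library(parsnip)
library(bonsai)

# With counts = FALSE, mtry = 0.5 is a proportion: sample roughly
# half of the predictors at each split.
prop_spec <- boost_tree(mtry = 0.5, trees = 200) %>%
  set_engine("lightgbm", counts = FALSE) %>%
  set_mode("regression")

fit(prop_spec, mpg ~ ., data = mtcars)
```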

man/rmd/boost_tree_xgboost.Rmd

Lines changed: 5 additions & 2 deletions
@@ -24,8 +24,6 @@ This model has `r nrow(param)` tuning parameters:
param$item
```

-The `mtry` parameter is related to the number of predictors. The default is to use all predictors. [xgboost::xgb.train()] encodes this as a real number between zero and one. parsnip translates the number of columns to this type of value. The user should give the argument to `boost_tree()` as an integer (not a real number).
-
## Translation from parsnip to the original package (regression)

```{r xgboost-reg}
@@ -70,6 +68,11 @@ xgboost requires the data to be in a sparse format. If your predictor data are a

By default, the model is trained without parallel processing. This can be changed by passing the `nthread` parameter to [set_engine()]. However, it is unwise to combine this with external parallel processing when using the \pkg{tune} package.

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
### Early stopping

The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
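As a quick illustration of the parallelism note in the section above, a minimal sketch of enabling xgboost's internal threading; the thread count is arbitrary:

```r
library(parsnip)

# Let xgboost use four threads during training. Avoid combining this
# with external parallelism from the tune package.
threaded_spec <- boost_tree(trees = 200) %>%
  set_engine("xgboost", nthread = 4) %>%
  set_mode("regression")
```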

man/rmd/boost_tree_xgboost.md

Lines changed: 11 additions & 2 deletions
@@ -25,8 +25,6 @@ This model has 8 tuning parameters:

- `stop_iter`: # Iterations Before Stopping (type: integer, default: Inf)

-The `mtry` parameter is related to the number of predictors. The default is to use all predictors. [xgboost::xgb.train()] encodes this as a real number between zero and one. parsnip translates the number of columns to this type of value. The user should give the argument to `boost_tree()` as an integer (not a real number).
-
## Translation from parsnip to the original package (regression)

@@ -117,6 +115,17 @@ xgboost requires the data to be in a sparse format. If your predictor data are a

By default, the model is trained without parallel processing. This can be changed by passing the `nthread` parameter to [set_engine()]. However, it is unwise to combine this with external parallel processing when using the \pkg{tune} package.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
### Early stopping

The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
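A minimal sketch of the early stopping setup described above; the `validation` engine argument (the proportion of data held out to monitor the objective) is an assumption based on parsnip's xgboost wrapper:

```r
library(parsnip)

# Stop boosting if the objective has not improved in 10 iterations,
# monitored on a 20% holdout.
early_stop_spec <- boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")
```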

man/rmd/rule_fit_xrf.Rmd

Lines changed: 8 additions & 1 deletion
@@ -83,11 +83,18 @@ Also, there are several configuration differences in how `xrf()` is fit between

These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Additionally, **rules** can set `penalty` directly, whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).

-## Preprocessing requirements
+## Other details
+
+### Preprocessing requirements

```{r child = "template-makes-dummies.Rmd"}
```

+### Interpreting `mtry`
+
+```{r child = "template-mtry-prop.Rmd"}
+```
+
## References

- Friedman and Popescu. "Predictive learning via rule ensembles." Ann. Appl. Stat. 2 (3), 916-954, September 2008.
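To illustrate the `penalty` note above, a minimal sketch (assuming the rules package is installed) of setting `penalty` directly instead of relying on xrf's internal cross-validation:

```r
library(parsnip)
library(rules)

# Fix the glmnet penalty rather than letting xrf choose it via its
# internal 5-fold cross-validation.
xrf_spec <- rule_fit(trees = 100, penalty = 0.01) %>%
  set_engine("xrf") %>%
  set_mode("classification")
```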

man/rmd/rule_fit_xrf.md

Lines changed: 16 additions & 3 deletions
@@ -65,7 +65,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -111,7 +111,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -132,11 +132,24 @@ Also, there are several configuration differences in how `xrf()` is fit between

These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Additionally, **rules** can set `penalty` directly, whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).

-## Preprocessing requirements
+## Other details
+
+### Preprocessing requirements


Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via \code{\link[=fit.model_spec]{fit()}}, parsnip will convert factor columns to indicators.

+### Interpreting `mtry`
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
## References

- Friedman and Popescu. "Predictive learning via rule ensembles." Ann. Appl. Stat. 2 (3), 916-954, September 2008.
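Since `rule_fit()` takes `mtry` as a main argument, the proportion parameterization described above applies to this engine as well; a minimal sketch:

```r
library(parsnip)
library(rules)

# With counts = FALSE, mtry = 0.75 samples three quarters of the
# predictors at each split in the boosted stage of the rule ensemble.
xrf_prop_spec <- rule_fit(mtry = 0.75, trees = 100) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("classification")
```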

man/rmd/template-mtry-prop.Rmd

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful: interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count, while `FALSE` indicates that it will be interpreted as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation (currently `"xgboost"`, `"xrf"` via the rules package, and `"lightgbm"` via the bonsai package), the user can pass `counts = FALSE` to `set_engine()` to supply `mtry` values within $[0, 1]$.
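A minimal sketch contrasting the two interpretations of `mtry` with the `"xgboost"` engine:

```r
library(parsnip)

# Default (counts = TRUE): mtry is a number of predictors.
count_spec <- boost_tree(mtry = 5) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# counts = FALSE: mtry is a proportion within [0, 1].
prop_spec <- boost_tree(mtry = 0.5) %>%
  set_engine("xgboost", counts = FALSE) %>%
  set_mode("regression")
```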
