
Commit 68a4a49

add stop_iter as a main argument to rule_fit() (#749)
1 parent 4473786 commit 68a4a49

8 files changed, +51 -16 lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 Package: parsnip
 Title: A Common API to Modeling and Analysis Functions
-Version: 0.2.1.9002
+Version: 0.2.1.9003
 Authors@R: c(
     person("Max", "Kuhn", , "[email protected]", role = c("aut", "cre")),
     person("Davis", "Vaughan", , "[email protected]", role = "aut"),

R/rule_fit.R

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,7 @@ rule_fit <-
          tree_depth = NULL, learn_rate = NULL,
          loss_reduction = NULL,
          sample_size = NULL,
+         stop_iter = NULL,
          penalty = NULL,
          engine = "xrf") {

@@ -51,6 +52,7 @@ rule_fit <-
     learn_rate = enquo(learn_rate),
     loss_reduction = enquo(loss_reduction),
     sample_size = enquo(sample_size),
+    stop_iter = enquo(stop_iter),
     penalty = enquo(penalty)
   )
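
With this change, `stop_iter` can be given directly to `rule_fit()` rather than through `set_engine()`. A minimal, illustrative sketch (values are arbitrary and not part of the diff; the xrf engine implementation lives in the rules package):

```r
library(parsnip)
library(rules)  # provides the "xrf" engine used when fitting rule_fit() models

# stop_iter as a main argument: stop boosting if the objective has not
# improved after 10 iterations (illustrative values, not defaults)
rf_spec <-
  rule_fit(trees = 100, tree_depth = 4, stop_iter = 10, penalty = 0.01) %>%
  set_engine("xrf") %>%
  set_mode("regression")
```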

man/rmd/boost_tree_xgboost.Rmd

Lines changed: 2 additions & 5 deletions
@@ -75,11 +75,8 @@ By default, the model is trained without parallel processing. This can be change
 
 ### Early stopping
 
-The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
-
-The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of [xgb_train()] via the parsnip [set_engine()] function. This is the proportion of the training set that should be reserved for measuring performance (and stop early).
-
-If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
+```{r child = "template-early-stopping.Rmd"}
+```
 
 ### Objective function

man/rmd/boost_tree_xgboost.md

Lines changed: 2 additions & 1 deletion
@@ -130,9 +130,10 @@ parsnip and its extensions accommodate this parameterization using the `counts`
 
 ### Early stopping
 
+
 The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
 
-The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of [xgb_train()] via the parsnip [set_engine()] function. This is the proportion of the training set that should be reserved for measuring performance (and stop early).
+The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).
 
 If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.

man/rmd/rule_fit_xrf.Rmd

Lines changed: 10 additions & 5 deletions
@@ -7,8 +7,8 @@
 
 ```{r xrf-param-info, echo = FALSE}
 defaults <- 
-  tibble::tibble(parsnip = c("tree_depth", "trees", "learn_rate", "mtry", "min_n", "loss_reduction", "sample_size", "penalty"),
-                 default = c("6L", "15L", "0.3", "1.0", "1L", "0.0", "1.0", "0.1"))
+  tibble::tibble(parsnip = c("tree_depth", "trees", "learn_rate", "mtry", "min_n", "loss_reduction", "sample_size", "stop_iter", "penalty"),
+                 default = c("6L", "15L", "0.3", "see below", "1L", "0.0", "1.0", "Inf", "0.1"))
 
 param <-
   rule_fit() %>%

@@ -83,18 +83,23 @@ Also, there are several configuration differences in how `xrf()` is fit between
 
 These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Also, **rules** can also set `penalty` whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).
 
-## Other details
-
-### Preprocessing requirements
+## Preprocessing requirements
 
 ```{r child = "template-makes-dummies.Rmd"}
 ```
 
+## Other details
+
 ### Interpreting `mtry`
 
 ```{r child = "template-mtry-prop.Rmd"}
 ```
 
+### Early stopping
+
+```{r child = "template-early-stopping.Rmd"}
+```
+
 ## Case weights
 
 ```{r child = "template-no-case-weights.Rmd"}

man/rmd/rule_fit_xrf.md

Lines changed: 25 additions & 4 deletions
@@ -9,7 +9,7 @@ For this engine, there are multiple modes: classification and regression
 
 This model has 8 tuning parameters:
 
-- `mtry`: Proportion Randomly Selected Predictors (type: double, default: 1.0)
+- `mtry`: Proportion Randomly Selected Predictors (type: double, default: see below)
 
 - `trees`: # Trees (type: integer, default: 15L)
 

@@ -65,7 +65,7 @@ rule_fit(
 ## Computational engine: xrf 
 ## 
 ## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
 ##     colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
 ##     max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
 ##     subsample = numeric(1), lambda = numeric(1))

@@ -111,7 +111,7 @@ rule_fit(
 ## Computational engine: xrf 
 ## 
 ## Model fit template:
-## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
+## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
 ##     colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
 ##     max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
 ##     subsample = numeric(1), lambda = numeric(1))

@@ -134,9 +134,30 @@ These differences will create a disparity in the values of the `penalty` argumen
 
 ## Preprocessing requirements
 
-
 Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via \code{\link[=fit.model_spec]{fit()}}, parsnip will convert factor columns to indicators.
 
+## Other details
+
+### Interpreting `mtry`
+
+
+The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.
+
+Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful---interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.
+
+parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted in its sense as a count, `FALSE` indicates that the argument will be interpreted in its sense as a proportion.
+
+`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation---currently `"xgboost"`, `"xrf"` (via the rules package), and `"lightgbm"` (via the bonsai package)---the user can pass the `counts = FALSE` argument to `set_engine()` to supply `mtry` values within $[0, 1]$.
+
+### Early stopping
+
+
+The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
+
+The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).
+
+If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
+
 ## Case weights
 
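For the proportion interpretation described in the added text above, a hedged sketch (illustrative only, not part of the commit) of passing `counts = FALSE` through `set_engine()` so that `mtry` is read as a value in $[0, 1]$:

```r
library(parsnip)
library(rules)  # xrf engine implementation

# mtry = 0.5: sample half of the predictors at each split;
# counts = FALSE tells the engine to treat mtry as a proportion, not a count
xrf_spec <-
  rule_fit(mtry = 0.5, trees = 50) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("classification")
```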

man/rmd/template-early-stopping.Rmd

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.
+
+The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).
+
+If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
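
As a usage sketch (illustrative values, not part of the diff): the `validation` proportion is supplied as an engine argument while `stop_iter` stays a main argument, shown here with `boost_tree()` and the xgboost engine that `xgb_train()` backs.

```r
library(parsnip)

# hold out 20% of the training set internally for measuring performance;
# stop if the objective has not improved in 10 iterations (stop_iter < trees)
xgb_spec <-
  boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")
```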

man/rule_fit.Rd

Lines changed: 4 additions & 0 deletions
Some generated files are not rendered by default.
