
add stop_iter as a main argument to rule_fit() #749


Merged
4 commits merged on Jun 8, 2022
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: parsnip
Title: A Common API to Modeling and Analysis Functions
Version: 0.2.1.9002
Version: 0.2.1.9003
Authors@R: c(
person("Max", "Kuhn", , "[email protected]", role = c("aut", "cre")),
person("Davis", "Vaughan", , "[email protected]", role = "aut"),
2 changes: 2 additions & 0 deletions R/rule_fit.R
@@ -40,6 +40,7 @@ rule_fit <-
tree_depth = NULL, learn_rate = NULL,
loss_reduction = NULL,
sample_size = NULL,
stop_iter = NULL,
penalty = NULL,
engine = "xrf") {

@@ -51,6 +52,7 @@ rule_fit <-
learn_rate = enquo(learn_rate),
loss_reduction = enquo(loss_reduction),
sample_size = enquo(sample_size),
stop_iter = enquo(stop_iter),
penalty = enquo(penalty)
)
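With this change, `stop_iter` can be set directly in the model specification rather than through `set_engine()`. A minimal sketch of the intended usage, assuming the rules and xrf packages are installed (values are illustrative):

```r
library(parsnip)
library(rules)  # registers the "xrf" engine for rule_fit()

# stop_iter now sits alongside the other main arguments
rule_spec <-
  rule_fit(trees = 100, penalty = 0.01, stop_iter = 10) %>%
  set_engine("xrf") %>%
  set_mode("regression")

rule_res <- fit(rule_spec, mpg ~ ., data = mtcars)
```

As a main argument, it could also be marked for tuning, e.g. `rule_fit(stop_iter = tune())`.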

7 changes: 2 additions & 5 deletions man/rmd/boost_tree_xgboost.Rmd
@@ -75,11 +75,8 @@ By default, the model is trained without parallel processing. This can be changed

### Early stopping

The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.

The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of [xgb_train()] via the parsnip [set_engine()] function. This is the proportion of the training set that should be reserved for measuring performance (and stop early).

If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
```{r child = "template-early-stopping.Rmd"}
```

### Objective function

3 changes: 2 additions & 1 deletion man/rmd/boost_tree_xgboost.md
@@ -130,9 +130,10 @@ parsnip and its extensions accommodate this parameterization using the `counts`

### Early stopping


The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.

The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of [xgb_train()] via the parsnip [set_engine()] function. This is the proportion of the training set that should be reserved for measuring performance (and stop early).
The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).

If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
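As a concrete sketch of the pattern described above (parameter values are illustrative, not recommendations):

```r
library(parsnip)

# Stop boosting when the objective has not improved for 10 iterations,
# measured on an internal validation set. `validation` is an argument of
# xgb_train(), so it is supplied via set_engine().
bt_spec <-
  boost_tree(trees = 500, stop_iter = 10) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("classification")

bt_res <- fit(bt_spec, Species ~ ., data = iris)
```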

15 changes: 10 additions & 5 deletions man/rmd/rule_fit_xrf.Rmd
@@ -7,8 +7,8 @@

```{r xrf-param-info, echo = FALSE}
defaults <-
tibble::tibble(parsnip = c("tree_depth", "trees", "learn_rate", "mtry", "min_n", "loss_reduction", "sample_size", "penalty"),
default = c("6L", "15L", "0.3", "1.0", "1L", "0.0", "1.0", "0.1"))
tibble::tibble(parsnip = c("tree_depth", "trees", "learn_rate", "mtry", "min_n", "loss_reduction", "sample_size", "stop_iter", "penalty"),
default = c("6L", "15L", "0.3", "see below", "1L", "0.0", "1.0", "Inf", "0.1"))

param <-
rule_fit() %>%
@@ -83,18 +83,23 @@ Also, there are several configuration differences in how `xrf()` is fit between

These differences will create a disparity in the values of the `penalty` argument that **glmnet** uses. Also, **rules** can set `penalty` whereas **xrf** uses an internal 5-fold cross-validation to determine it (by default).

## Other details

### Preprocessing requirements
## Preprocessing requirements

```{r child = "template-makes-dummies.Rmd"}
```

## Other details

### Interpreting `mtry`

```{r child = "template-mtry-prop.Rmd"}
```

### Early stopping

```{r child = "template-early-stopping.Rmd"}
```

## Case weights

```{r child = "template-no-case-weights.Rmd"}
29 changes: 25 additions & 4 deletions man/rmd/rule_fit_xrf.md
@@ -9,7 +9,7 @@ For this engine, there are multiple modes: classification and regression

This model has 8 tuning parameters:

- `mtry`: Proportion Randomly Selected Predictors (type: double, default: 1.0)
- `mtry`: Proportion Randomly Selected Predictors (type: double, default: see below)

- `trees`: # Trees (type: integer, default: 15L)

@@ -65,7 +65,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -111,7 +111,7 @@ rule_fit(
## Computational engine: xrf
##
## Model fit template:
## rules::xrf_fit(object = missing_arg(), data = missing_arg(),
## rules::xrf_fit(formula = missing_arg(), data = missing_arg(),
## colsample_bytree = numeric(1), nrounds = integer(1), min_child_weight = integer(1),
## max_depth = integer(1), eta = numeric(1), gamma = numeric(1),
## subsample = numeric(1), lambda = numeric(1))
@@ -134,9 +134,30 @@ These differences will create a disparity in the values of the `penalty` argument

## Preprocessing requirements


Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via \code{\link[=fit.model_spec]{fit()}}, parsnip will convert factor columns to indicators.
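For instance, a small sketch of this behavior using `iris`, where `Species` is a factor predictor (illustrative only):

```r
library(parsnip)
library(rules)

xrf_spec <-
  rule_fit(trees = 50) %>%
  set_engine("xrf") %>%
  set_mode("regression")

# The formula method expands the factor `Species` into indicator columns
# before the data reach the xrf engine.
xrf_res <- fit(xrf_spec, Sepal.Length ~ ., data = iris)
```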

## Other details

### Interpreting `mtry`


The `mtry` argument denotes the number of predictors that will be randomly sampled at each split when creating tree models.

Some engines, such as `"xgboost"`, `"xrf"`, and `"lightgbm"`, interpret their analogue to the `mtry` argument as the _proportion_ of predictors that will be randomly sampled at each split rather than the _count_. In some settings, such as when tuning over preprocessors that influence the number of predictors, this parameterization is quite helpful---interpreting `mtry` as a proportion means that $[0, 1]$ is always a valid range for that parameter, regardless of input data.

parsnip and its extensions accommodate this parameterization using the `counts` argument: a logical indicating whether `mtry` should be interpreted as the number of predictors that will be randomly sampled at each split. `TRUE` indicates that `mtry` will be interpreted as a count; `FALSE` indicates that it will be interpreted as a proportion.

`mtry` is a main model argument for \code{\link[=boost_tree]{boost_tree()}} and \code{\link[=rand_forest]{rand_forest()}}, and thus should not have an engine-specific interface. So, regardless of engine, `counts` defaults to `TRUE`. For engines that support the proportion interpretation---currently `"xgboost"`, `"xrf"` (via the rules package), and `"lightgbm"` (via the bonsai package)---the user can pass the `counts = FALSE` argument to `set_engine()` to supply `mtry` values within $[0, 1]$.
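A brief sketch of the proportion interpretation described above (values are illustrative):

```r
library(parsnip)
library(rules)

# counts = FALSE tells the engine to read mtry as a proportion in [0, 1]
# rather than as a number of predictors sampled at each split.
prop_spec <-
  rule_fit(trees = 50, mtry = 0.5) %>%
  set_engine("xrf", counts = FALSE) %>%
  set_mode("regression")
```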

### Early stopping


The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.

The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).

If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.

## Case weights


5 changes: 5 additions & 0 deletions man/rmd/template-early-stopping.Rmd
@@ -0,0 +1,5 @@
The `stop_iter()` argument allows the model to prematurely stop training if the objective function does not improve within `early_stop` iterations.

The best way to use this feature is in conjunction with an _internal validation set_. To do this, pass the `validation` parameter of \code{\link[=xgb_train]{xgb_train()}} via the parsnip \code{\link[=set_engine]{set_engine()}} function. This is the proportion of the training set that should be reserved for measuring performance (and stopping early).

If the model specification has `early_stop >= trees`, `early_stop` is converted to `trees - 1` and a warning is issued.
4 changes: 4 additions & 0 deletions man/rule_fit.Rd

Some generated files are not rendered by default.