xgboost mtry parameter swap for #495 (#499)

Merged: 2 commits, May 21, 2021. Changes shown from 1 commit.
NEWS.md (4 changes: 1 addition & 3 deletions)
@@ -11,9 +11,7 @@

 * The `liquidSVM` engine for `svm_rbf()` was deprecated due to that package's removal from CRAN. (#425)
 
-* New model specification `survival_reg()` for the new mode `"censored regression"` (#444). `surv_reg()` is now soft-deprecated (#448).
-
-* New model specification `proportional_hazards()` for the `"censored regression"` mode (#451).
+* The xgboost engine for boosted trees was translating `mtry` to xgboost's `colsample_bytree`. We now map `mtry` to `colsample_bynode` since that is more consistent with how random forest works. `colsample_bytree` can still be optimized by passing it in as an engine argument. (#495)
 
 ## Other Changes
 
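In practice, the NEWS entry above means `mtry` now drives per-split column sampling, while per-tree sampling moves to an engine argument. A minimal sketch of both (not part of the diff; assumes parsnip with this patch and the xgboost package installed, values illustrative):

library(parsnip)

# `mtry` now maps to xgboost's `colsample_bynode` (per-split sampling).
spec <- boost_tree(mtry = 0.7, trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# `colsample_bytree` (per-tree sampling) is still available as an engine argument.
spec_bytree <- boost_tree(mtry = 0.7, trees = 100) %>%
  set_engine("xgboost", colsample_bytree = 0.8) %>%
  set_mode("regression")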
R/boost_tree.R (16 changes: 13 additions & 3 deletions)
@@ -264,7 +264,9 @@ check_args.boost_tree <- function(object) {
 #' @param max_depth An integer for the maximum depth of the tree.
 #' @param nrounds An integer for the number of boosting iterations.
 #' @param eta A numeric value between zero and one to control the learning rate.
-#' @param colsample_bytree Subsampling proportion of columns.
+#' @param colsample_bytree Subsampling proportion of columns for each tree.
+#' @param colsample_bynode Subsampling proportion of columns for each node
+#'  within each tree.
 #' @param min_child_weight A numeric value for the minimum sum of instance
 #'  weights needed in a child to continue to split.
 #' @param gamma A number for the minimum loss reduction required to make a
@@ -290,8 +292,8 @@ check_args.boost_tree <- function(object) {
 #' @export
 xgb_train <- function(
   x, y,
-  max_depth = 6, nrounds = 15, eta = 0.3, colsample_bytree = 1,
-  min_child_weight = 1, gamma = 0, subsample = 1, validation = 0,
+  max_depth = 6, nrounds = 15, eta = 0.3, colsample_bynode = 1,
+  colsample_bytree = 1, min_child_weight = 1, gamma = 0, subsample = 1, validation = 0,
   early_stop = NULL, objective = NULL,
   event_level = c("first", "second"),
   ...) {
@@ -346,6 +348,13 @@ xgb_train <- function(
     colsample_bytree <- 1
   }
 
+  if (colsample_bynode > 1) {
+    colsample_bynode <- colsample_bynode/p
+  }
+  if (colsample_bynode > 1) {
+    colsample_bynode <- 1
+  }
+
   if (min_child_weight > n) {
     msg <- paste0(min_child_weight, " samples were requested but there were ",
                   n, " rows in the data. ", n, " will be used.")
@@ -358,6 +367,7 @@
     max_depth = max_depth,
     gamma = gamma,
     colsample_bytree = colsample_bytree,
+    colsample_bynode = colsample_bynode,
     min_child_weight = min(min_child_weight, n),
     subsample = subsample,
     objective = objective
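The new block in `xgb_train()` mirrors the existing `colsample_bytree` handling: a value above 1 is treated as a column count, rescaled by the number of predictors `p`, and then clamped to 1. A standalone sketch of that arithmetic (hypothetical values, outside the package):

p <- 10                                      # predictors in the training set
colsample_bynode <- 7                        # mtry supplied as a count
if (colsample_bynode > 1) {
  colsample_bynode <- colsample_bynode / p   # 7/10 = 0.7
}
if (colsample_bynode > 1) {                  # only triggers when the count exceeded p
  colsample_bynode <- 1
}
colsample_bynode                             # 0.7, passed to xgboost as a proportion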
R/boost_tree_data.R (2 changes: 1 addition & 1 deletion)
@@ -37,7 +37,7 @@ set_model_arg(
   model = "boost_tree",
   eng = "xgboost",
   parsnip = "mtry",
-  original = "colsample_bytree",
+  original = "colsample_bynode",
   func = list(pkg = "dials", fun = "mtry"),
   has_submodel = FALSE
 )
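Since `set_model_arg()` defines the parsnip-to-engine mapping, the swap can be checked from a spec with `translate()`; a sketch (exact printed template varies by parsnip version):

library(parsnip)

boost_tree(mtry = 5, trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()
# The fit template should now list `colsample_bynode = 5`
# where it previously listed `colsample_bytree = 5`.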
man/boost_tree.Rd (2 changes: 1 addition & 1 deletion)

Some generated files are not rendered by default.

man/rmd/boost-tree.Rmd (5 changes: 2 additions & 3 deletions)
@@ -38,8 +38,7 @@ mod_param <-
   update(sample_size = sample_prop(c(0.4, 0.9)))
 ```
 
-For this engine, tuning over `trees` is very efficient since the same model
-object can be used to make predictions over multiple values of `trees`.
+For this engine, tuning over `trees` is very efficient since the same model object can be used to make predictions over multiple values of `trees`.
 
 Note that `xgboost` models require that non-numeric predictors (e.g., factors) be converted to dummy variables or some other numeric representation. By default, when using `fit()` with `xgboost`, a one-hot encoding is used to convert factor predictors to indicator variables.
 
@@ -89,7 +88,7 @@ get_defaults_boost_tree <- function() {
"boost_tree", "xgboost", "tree_depth", "max_depth", get_arg("parsnip", "xgb_train", "max_depth"),
"boost_tree", "xgboost", "trees", "nrounds", get_arg("parsnip", "xgb_train", "nrounds"),
"boost_tree", "xgboost", "learn_rate", "eta", get_arg("parsnip", "xgb_train", "eta"),
"boost_tree", "xgboost", "mtry", "colsample_bytree", get_arg("parsnip", "xgb_train", "colsample_bytree"),
"boost_tree", "xgboost", "mtry", "colsample_bynode", get_arg("parsnip", "xgb_train", "colsample_bynode"),
"boost_tree", "xgboost", "min_n", "min_child_weight", get_arg("parsnip", "xgb_train", "min_child_weight"),
"boost_tree", "xgboost", "loss_reduction", "gamma", get_arg("parsnip", "xgb_train", "gamma"),
"boost_tree", "xgboost", "sample_size", "subsample", get_arg("parsnip", "xgb_train", "subsample"),
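The efficiency note in the Rmd refers to `multi_predict()`, which reuses a single fitted booster to predict at several values of `trees`; a minimal sketch, assuming the xgboost package is installed:

library(parsnip)

fit_xgb <- boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

# One model object, predictions at three ensemble sizes; no refitting.
multi_predict(fit_xgb, new_data = mtcars[1:3, -1], trees = c(10, 50, 100))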
man/xgb_train.Rd (6 changes: 5 additions & 1 deletion)

Some generated files are not rendered by default.

tests/testthat/test_boost_tree_xgboost.R (17 changes: 15 additions & 2 deletions)
@@ -414,9 +414,9 @@ test_that('argument checks for data dimensions', {
     xy_fit <- spec %>% fit_xy(x = penguins_dummy, y = penguins$species),
     "1000 samples were requested"
   )
-  expect_equal(f_fit$fit$params$colsample_bytree, 1)
+  expect_equal(f_fit$fit$params$colsample_bynode, 1)
   expect_equal(f_fit$fit$params$min_child_weight, nrow(penguins))
-  expect_equal(xy_fit$fit$params$colsample_bytree, 1)
+  expect_equal(xy_fit$fit$params$colsample_bynode, 1)
   expect_equal(xy_fit$fit$params$min_child_weight, nrow(penguins))
 
 })
@@ -482,3 +482,16 @@ test_that("fit and prediction with `event_level`", {
   expect_equal(pred_p_2[[".pred_male"]], pred_xgb_2)
 
 })
+
+test_that("mtry parameters", {
+  skip_if_not_installed("xgboost")
+  fit <-
+    boost_tree(mtry = .7, trees = 4) %>%
+    set_engine("xgboost") %>%
+    set_mode("regression") %>%
+    fit(mpg ~ ., data = mtcars)
+  expect_equal(fit$fit$params$colsample_bytree, 1)
+  expect_equal(fit$fit$params$colsample_bynode, 0.7)
+})