
Add argument for one hot encoding to parsnip #332


Merged: 31 commits, merged Jul 2, 2020
Conversation

juliasilge (Member)

This PR connects tidymodels/hardhat#140 to parsnip, adding one_hot as an encoding option when registering a model.

```r
library(parsnip)

c("boost_tree",
  "decision_tree",
  "linear_reg",
  "logistic_reg",
  "mars",
  "mlp",
  "multinom_reg",
  "nearest_neighbor",
  "null_model",
  "rand_forest",
  "surv_reg",
  "svm_poly",
  "svm_rbf") %>%
  purrr::map_dfr(get_encoding) %>%
  knitr::kable()
```
| model | engine | mode | predictor_indicators | one_hot |
|---|---|---|---|---|
| boost_tree | xgboost | regression | TRUE | FALSE |
| boost_tree | xgboost | classification | TRUE | FALSE |
| boost_tree | C5.0 | classification | FALSE | FALSE |
| boost_tree | spark | regression | TRUE | FALSE |
| boost_tree | spark | classification | TRUE | FALSE |
| decision_tree | rpart | regression | FALSE | FALSE |
| decision_tree | rpart | classification | FALSE | FALSE |
| decision_tree | C5.0 | classification | FALSE | FALSE |
| decision_tree | spark | regression | TRUE | FALSE |
| decision_tree | spark | classification | TRUE | FALSE |
| linear_reg | lm | regression | TRUE | FALSE |
| linear_reg | glmnet | regression | TRUE | TRUE |
| linear_reg | stan | regression | TRUE | FALSE |
| linear_reg | spark | regression | TRUE | FALSE |
| linear_reg | keras | regression | TRUE | FALSE |
| logistic_reg | glm | classification | TRUE | FALSE |
| logistic_reg | glmnet | classification | TRUE | TRUE |
| logistic_reg | spark | classification | TRUE | FALSE |
| logistic_reg | keras | classification | TRUE | FALSE |
| logistic_reg | stan | classification | TRUE | FALSE |
| mars | earth | regression | FALSE | FALSE |
| mars | earth | classification | FALSE | FALSE |
| mlp | keras | regression | TRUE | FALSE |
| mlp | keras | classification | TRUE | FALSE |
| mlp | nnet | regression | TRUE | FALSE |
| mlp | nnet | classification | TRUE | FALSE |
| multinom_reg | glmnet | classification | TRUE | TRUE |
| multinom_reg | spark | classification | TRUE | FALSE |
| multinom_reg | keras | classification | TRUE | FALSE |
| multinom_reg | nnet | classification | TRUE | FALSE |
| nearest_neighbor | kknn | regression | TRUE | FALSE |
| nearest_neighbor | kknn | classification | TRUE | FALSE |
| null_model | parsnip | regression | FALSE | FALSE |
| null_model | parsnip | classification | FALSE | FALSE |
| rand_forest | ranger | classification | FALSE | FALSE |
| rand_forest | ranger | regression | FALSE | FALSE |
| rand_forest | randomForest | classification | FALSE | FALSE |
| rand_forest | randomForest | regression | FALSE | FALSE |
| rand_forest | spark | classification | TRUE | FALSE |
| rand_forest | spark | regression | TRUE | FALSE |
| surv_reg | flexsurv | regression | TRUE | FALSE |
| surv_reg | survival | regression | TRUE | FALSE |
| svm_poly | kernlab | regression | FALSE | FALSE |
| svm_poly | kernlab | classification | FALSE | FALSE |
| svm_rbf | kernlab | regression | FALSE | FALSE |
| svm_rbf | kernlab | classification | FALSE | FALSE |
| svm_rbf | liquidSVM | regression | FALSE | FALSE |
| svm_rbf | liquidSVM | classification | FALSE | FALSE |

Created on 2020-06-18 by the reprex package (v0.3.0.9001)

I have not yet worked on any changes to form_xy() or convert_form_to_xy_fit() to use the one_hot argument. I believe convert_form_to_xy_fit() will need to use some of the new hardhat work, like contrasts, etc.
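For intuition, here is what the two indicator schemes produce for a single factor predictor, sketched with base R's `model.matrix()` only; this is an illustration, not parsnip's or hardhat's actual implementation:

```r
# Sketch of "traditional" vs "one_hot" indicators for one factor,
# using only base R (illustrative; the real code path may differ).
df <- data.frame(x = factor(c("a", "b", "c")))

# Traditional treatment contrasts drop one reference level.
traditional <- model.matrix(~ x, data = df)
# columns: (Intercept), xb, xc

# One-hot keeps an indicator column for every level.
one_hot <- model.matrix(
  ~ x,
  data = df,
  contrasts.arg = list(x = contr.treatment(levels(df$x), contrasts = FALSE))
)
# columns: (Intercept), xa, xb, xc
```

The `contrasts.arg` trick (an identity matrix instead of treatment contrasts) is one base-R way to get the full set of indicators; hardhat's blueprint machinery handles this more generally.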

@juliasilge (Member Author)

Ah, actually, I just realized I may not have our plan laid out clearly in my head. parsnip does not currently depend on hardhat at all. Let's chat more about the plan for, for example, contrasts.

@juliasilge (Member Author)

Closes #326

@topepo (Member) commented Jun 24, 2020

After doing some tests, I think that the glmnet models should be changed to the traditional indicator scheme.

My recollection about this was half right: while glmnet does not create an intercept along with the other coefficients (or regularize one), an intercept is calculated after parameter estimation.

I did some sanity checking with a simple model containing a single factor predictor. Using a one-hot encoding would result in incorrect and inaccurate parameter estimates (unless parsnip were to choose the glmnet option to not estimate the intercept, and that's too much of a deviation).
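The underlying problem can be reproduced in base R: with an intercept present, a full one-hot expansion of a factor is exactly collinear with the intercept column, so the design matrix is rank deficient. A minimal sketch (not the actual sanity-check code from this PR):

```r
# One-hot indicator columns sum to the intercept column, so the
# design matrix is rank deficient; this is why models that compute
# an intercept (like glmnet, post-estimation) are safer with
# traditional, one-level-dropped indicators.
df <- data.frame(x = factor(c("a", "b", "c", "a")))
mm <- model.matrix(
  ~ x,
  data = df,
  contrasts.arg = list(x = contr.treatment(levels(df$x), contrasts = FALSE))
)
all(rowSums(mm[, -1]) == mm[, "(Intercept)"])  # TRUE: exact collinearity
qr(mm)$rank < ncol(mm)                         # TRUE: rank deficient
```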

I'll update the PR to change those back.

Also, I thought that we were going to do one-hot for xgboost. Am I misremembering that?

@juliasilge (Member Author)

The latest version here only handles the "traditional" indicators when going to convert_form_to_xy_fit(), via:

```r
indicators <- indicators == "traditional"
```

I'm leaving this as a draft because we still need to handle the one-hot case, with different contrasts and so on.
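As a thought experiment, a conversion step could branch on the indicator setting roughly like this; the helper name and structure below are entirely hypothetical, not parsnip's code:

```r
# Hypothetical helper (illustrative only): build the predictor
# matrix under either indicator scheme using base R.
make_predictors <- function(formula, data,
                            indicators = c("traditional", "one_hot")) {
  indicators <- match.arg(indicators)
  contrasts.arg <- NULL
  if (indicators == "one_hot") {
    mf <- stats::model.frame(formula, data)
    fac <- names(mf)[vapply(mf, is.factor, logical(1))]
    # Identity "contrasts" keep an indicator for every factor level.
    contrasts.arg <- lapply(
      stats::setNames(fac, fac),
      function(nm) stats::contr.treatment(levels(data[[nm]]), contrasts = FALSE)
    )
  }
  stats::model.matrix(formula, data, contrasts.arg = contrasts.arg)
}

df <- data.frame(y = 1:4, x = factor(c("a", "b", "a", "b")))
ncol(make_predictors(y ~ x, df, "traditional"))  # 2: intercept + 1 dummy
ncol(make_predictors(y ~ x, df, "one_hot"))      # 3: intercept + 2 dummies
```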

The current results of get_encoding() are:

```r
library(parsnip)

c("boost_tree",
  "decision_tree",
  "linear_reg",
  "logistic_reg",
  "mars",
  "mlp",
  "multinom_reg",
  "nearest_neighbor",
  "null_model",
  "rand_forest",
  "surv_reg",
  "svm_poly",
  "svm_rbf") %>%
  purrr::map_dfr(get_encoding) %>%
  knitr::kable()
```
| model | engine | mode | predictor_indicators |
|---|---|---|---|
| boost_tree | xgboost | regression | one_hot |
| boost_tree | xgboost | classification | one_hot |
| boost_tree | C5.0 | classification | none |
| boost_tree | spark | regression | traditional |
| boost_tree | spark | classification | traditional |
| decision_tree | rpart | regression | none |
| decision_tree | rpart | classification | none |
| decision_tree | C5.0 | classification | none |
| decision_tree | spark | regression | traditional |
| decision_tree | spark | classification | traditional |
| linear_reg | lm | regression | traditional |
| linear_reg | glmnet | regression | traditional |
| linear_reg | stan | regression | traditional |
| linear_reg | spark | regression | traditional |
| linear_reg | keras | regression | traditional |
| logistic_reg | glm | classification | traditional |
| logistic_reg | glmnet | classification | traditional |
| logistic_reg | spark | classification | traditional |
| logistic_reg | keras | classification | traditional |
| logistic_reg | stan | classification | traditional |
| mars | earth | regression | none |
| mars | earth | classification | none |
| mlp | keras | regression | traditional |
| mlp | keras | classification | traditional |
| mlp | nnet | regression | traditional |
| mlp | nnet | classification | traditional |
| multinom_reg | glmnet | classification | traditional |
| multinom_reg | spark | classification | traditional |
| multinom_reg | keras | classification | traditional |
| multinom_reg | nnet | classification | traditional |
| nearest_neighbor | kknn | regression | traditional |
| nearest_neighbor | kknn | classification | traditional |
| null_model | parsnip | regression | none |
| null_model | parsnip | classification | none |
| rand_forest | ranger | classification | none |
| rand_forest | ranger | regression | none |
| rand_forest | randomForest | classification | none |
| rand_forest | randomForest | regression | none |
| rand_forest | spark | classification | traditional |
| rand_forest | spark | regression | traditional |
| surv_reg | flexsurv | regression | traditional |
| surv_reg | survival | regression | traditional |
| svm_poly | kernlab | regression | none |
| svm_poly | kernlab | classification | none |
| svm_rbf | kernlab | regression | none |
| svm_rbf | kernlab | classification | none |
| svm_rbf | liquidSVM | regression | none |
| svm_rbf | liquidSVM | classification | none |

Created on 2020-06-26 by the reprex package (v0.3.0.9001)

@topepo topepo marked this pull request as ready for review July 1, 2020 18:13
topepo and others added 2 commits July 1, 2020 19:21
Co-authored-by: Julia Silge <[email protected]>
topepo and others added 2 commits July 1, 2020 20:35
Co-authored-by: Julia Silge <[email protected]>
@github-actions (bot) commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
@juliasilge juliasilge deleted the one-hot-encoding branch June 27, 2021 16:08