
Add argument for one hot encoding to parsnip #332


Merged: 31 commits, merged Jul 2, 2020
Conversation

juliasilge (Member)

This PR connects tidymodels/hardhat#140 to parsnip, adding one_hot as an encoding option when registering a model.

```r
library(parsnip)

c("boost_tree",
  "decision_tree",
  "linear_reg",
  "logistic_reg",
  "mars",
  "mlp",
  "multinom_reg",
  "nearest_neighbor",
  "null_model",
  "rand_forest",
  "surv_reg",
  "svm_poly",
  "svm_rbf") %>%
  purrr::map_dfr(get_encoding) %>%
  knitr::kable()
```
| model | engine | mode | predictor_indicators | one_hot |
|---|---|---|---|---|
| boost_tree | xgboost | regression | TRUE | FALSE |
| boost_tree | xgboost | classification | TRUE | FALSE |
| boost_tree | C5.0 | classification | FALSE | FALSE |
| boost_tree | spark | regression | TRUE | FALSE |
| boost_tree | spark | classification | TRUE | FALSE |
| decision_tree | rpart | regression | FALSE | FALSE |
| decision_tree | rpart | classification | FALSE | FALSE |
| decision_tree | C5.0 | classification | FALSE | FALSE |
| decision_tree | spark | regression | TRUE | FALSE |
| decision_tree | spark | classification | TRUE | FALSE |
| linear_reg | lm | regression | TRUE | FALSE |
| linear_reg | glmnet | regression | TRUE | TRUE |
| linear_reg | stan | regression | TRUE | FALSE |
| linear_reg | spark | regression | TRUE | FALSE |
| linear_reg | keras | regression | TRUE | FALSE |
| logistic_reg | glm | classification | TRUE | FALSE |
| logistic_reg | glmnet | classification | TRUE | TRUE |
| logistic_reg | spark | classification | TRUE | FALSE |
| logistic_reg | keras | classification | TRUE | FALSE |
| logistic_reg | stan | classification | TRUE | FALSE |
| mars | earth | regression | FALSE | FALSE |
| mars | earth | classification | FALSE | FALSE |
| mlp | keras | regression | TRUE | FALSE |
| mlp | keras | classification | TRUE | FALSE |
| mlp | nnet | regression | TRUE | FALSE |
| mlp | nnet | classification | TRUE | FALSE |
| multinom_reg | glmnet | classification | TRUE | TRUE |
| multinom_reg | spark | classification | TRUE | FALSE |
| multinom_reg | keras | classification | TRUE | FALSE |
| multinom_reg | nnet | classification | TRUE | FALSE |
| nearest_neighbor | kknn | regression | TRUE | FALSE |
| nearest_neighbor | kknn | classification | TRUE | FALSE |
| null_model | parsnip | regression | FALSE | FALSE |
| null_model | parsnip | classification | FALSE | FALSE |
| rand_forest | ranger | classification | FALSE | FALSE |
| rand_forest | ranger | regression | FALSE | FALSE |
| rand_forest | randomForest | classification | FALSE | FALSE |
| rand_forest | randomForest | regression | FALSE | FALSE |
| rand_forest | spark | classification | TRUE | FALSE |
| rand_forest | spark | regression | TRUE | FALSE |
| surv_reg | flexsurv | regression | TRUE | FALSE |
| surv_reg | survival | regression | TRUE | FALSE |
| svm_poly | kernlab | regression | FALSE | FALSE |
| svm_poly | kernlab | classification | FALSE | FALSE |
| svm_rbf | kernlab | regression | FALSE | FALSE |
| svm_rbf | kernlab | classification | FALSE | FALSE |
| svm_rbf | liquidSVM | regression | FALSE | FALSE |
| svm_rbf | liquidSVM | classification | FALSE | FALSE |

Created on 2020-06-18 by the reprex package (v0.3.0.9001)

I have not yet worked on any changes to form_xy() or convert_form_to_xy_fit() to use the one_hot argument. I believe convert_form_to_xy_fit() will need to use some of the new hardhat work, like contrasts, etc.
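For intuition, here is what the two indicator schemes produce for a single factor predictor, sketched with base R's `model.matrix()` only; this is an illustration, not parsnip's or hardhat's actual implementation:

```r
# Sketch of "traditional" vs "one_hot" indicators for one factor,
# using only base R (illustrative; the real code path may differ).
df <- data.frame(x = factor(c("a", "b", "c")))

# Traditional treatment contrasts drop one reference level.
traditional <- model.matrix(~ x, data = df)
# columns: (Intercept), xb, xc

# One-hot keeps an indicator column for every level.
one_hot <- model.matrix(
  ~ x,
  data = df,
  contrasts.arg = list(x = contr.treatment(levels(df$x), contrasts = FALSE))
)
# columns: (Intercept), xa, xb, xc
```

The `contrasts.arg` trick (an identity matrix instead of treatment contrasts) is one base-R way to get the full set of indicators; hardhat's blueprint machinery handles this more generally.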

@juliasilge (Member Author)

Ah, actually, I just realized I may not have our plan laid out clearly in my head. parsnip does not currently depend on hardhat at all. Let's chat more about the plan for, for example, contrasts.

@juliasilge (Member Author)

Closes #326

@topepo (Member) commented Jun 24, 2020

After doing some tests, I think that the glmnet models should be changed to the traditional indicator scheme.

My recollection about this was half right: while glmnet does not create an intercept along with the other coefficients (or regularize one), an intercept is calculated after parameter estimation.

I did some sanity checking with a simple model containing a single factor predictor. Using a one-hot encoding would result in incorrect and inaccurate parameter estimates (unless parsnip were to choose the glmnet option to not estimate the intercept, and that's too much of a deviation).
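The underlying problem can be reproduced in base R: with an intercept present, a full one-hot expansion of a factor is exactly collinear with the intercept column, so the design matrix is rank deficient. A minimal sketch (not the actual sanity-check code from this PR):

```r
# One-hot indicator columns sum to the intercept column, so the
# design matrix is rank deficient; this is why models that compute
# an intercept (like glmnet, post-estimation) are safer with
# traditional, one-level-dropped indicators.
df <- data.frame(x = factor(c("a", "b", "c", "a")))
mm <- model.matrix(
  ~ x,
  data = df,
  contrasts.arg = list(x = contr.treatment(levels(df$x), contrasts = FALSE))
)
all(rowSums(mm[, -1]) == mm[, "(Intercept)"])  # TRUE: exact collinearity
qr(mm)$rank < ncol(mm)                         # TRUE: rank deficient
```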

I'll update the PR to change those back.

Also, I thought that we were going to do one-hot for xgboost. Am I misremembering that?

@juliasilge (Member Author)

The latest version here only handles the "traditional" indicators when going to convert_form_to_xy_fit(), via:

```r
indicators <- indicators == "traditional"
```

I'm leaving this as a draft because we still need to handle the one-hot case, with different contrasts and so on.
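As a thought experiment, a conversion step could branch on the indicator setting roughly like this; the helper name and structure below are entirely hypothetical, not parsnip's code:

```r
# Hypothetical helper (illustrative only): build the predictor
# matrix under either indicator scheme using base R.
make_predictors <- function(formula, data,
                            indicators = c("traditional", "one_hot")) {
  indicators <- match.arg(indicators)
  contrasts.arg <- NULL
  if (indicators == "one_hot") {
    mf <- stats::model.frame(formula, data)
    fac <- names(mf)[vapply(mf, is.factor, logical(1))]
    # Identity "contrasts" keep an indicator for every factor level.
    contrasts.arg <- lapply(
      stats::setNames(fac, fac),
      function(nm) stats::contr.treatment(levels(data[[nm]]), contrasts = FALSE)
    )
  }
  stats::model.matrix(formula, data, contrasts.arg = contrasts.arg)
}

df <- data.frame(y = 1:4, x = factor(c("a", "b", "a", "b")))
ncol(make_predictors(y ~ x, df, "traditional"))  # 2: intercept + 1 dummy
ncol(make_predictors(y ~ x, df, "one_hot"))      # 3: intercept + 2 dummies
```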

The current results of get_encoding() are:

```r
library(parsnip)

c("boost_tree",
  "decision_tree",
  "linear_reg",
  "logistic_reg",
  "mars",
  "mlp",
  "multinom_reg",
  "nearest_neighbor",
  "null_model",
  "rand_forest",
  "surv_reg",
  "svm_poly",
  "svm_rbf") %>%
  purrr::map_dfr(get_encoding) %>%
  knitr::kable()
```
| model | engine | mode | predictor_indicators |
|---|---|---|---|
| boost_tree | xgboost | regression | one_hot |
| boost_tree | xgboost | classification | one_hot |
| boost_tree | C5.0 | classification | none |
| boost_tree | spark | regression | traditional |
| boost_tree | spark | classification | traditional |
| decision_tree | rpart | regression | none |
| decision_tree | rpart | classification | none |
| decision_tree | C5.0 | classification | none |
| decision_tree | spark | regression | traditional |
| decision_tree | spark | classification | traditional |
| linear_reg | lm | regression | traditional |
| linear_reg | glmnet | regression | traditional |
| linear_reg | stan | regression | traditional |
| linear_reg | spark | regression | traditional |
| linear_reg | keras | regression | traditional |
| logistic_reg | glm | classification | traditional |
| logistic_reg | glmnet | classification | traditional |
| logistic_reg | spark | classification | traditional |
| logistic_reg | keras | classification | traditional |
| logistic_reg | stan | classification | traditional |
| mars | earth | regression | none |
| mars | earth | classification | none |
| mlp | keras | regression | traditional |
| mlp | keras | classification | traditional |
| mlp | nnet | regression | traditional |
| mlp | nnet | classification | traditional |
| multinom_reg | glmnet | classification | traditional |
| multinom_reg | spark | classification | traditional |
| multinom_reg | keras | classification | traditional |
| multinom_reg | nnet | classification | traditional |
| nearest_neighbor | kknn | regression | traditional |
| nearest_neighbor | kknn | classification | traditional |
| null_model | parsnip | regression | none |
| null_model | parsnip | classification | none |
| rand_forest | ranger | classification | none |
| rand_forest | ranger | regression | none |
| rand_forest | randomForest | classification | none |
| rand_forest | randomForest | regression | none |
| rand_forest | spark | classification | traditional |
| rand_forest | spark | regression | traditional |
| surv_reg | flexsurv | regression | traditional |
| surv_reg | survival | regression | traditional |
| svm_poly | kernlab | regression | none |
| svm_poly | kernlab | classification | none |
| svm_rbf | kernlab | regression | none |
| svm_rbf | kernlab | classification | none |
| svm_rbf | liquidSVM | regression | none |
| svm_rbf | liquidSVM | classification | none |

Created on 2020-06-26 by the reprex package (v0.3.0.9001)

@topepo topepo marked this pull request as ready for review July 1, 2020 18:13
topepo and others added 2 commits July 1, 2020 19:21
Co-authored-by: Julia Silge <[email protected]>
topepo and others added 2 commits July 1, 2020 20:35
Co-authored-by: Julia Silge <[email protected]>
@github-actions (bot) commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
@juliasilge juliasilge deleted the one-hot-encoding branch June 27, 2021 16:08