Handling alternative data argument names and fixing calls #316

topepo · 2020-05-20T23:48:20Z

As mentioned in #315:

In some cases, a model function that parsnip is calling has non-standard argument names for the data-oriented components.

For example, a function foo() that only has the x/y interface might have a signature like foo(X, Y) instead of the standard foo(x, y).

Some real examples:

sparklyr models use x as the data slot (and put it in the first position)
kernlab::ksvm(x, data) where x is a formula.
MASS::lda(x, grouping)

This PR allows for these types of arguments.

Model definitions have the option to add a data element to the definition that maps the argument name to the objects that are created by fit() or fit_xy().

For example, for formula-based ksvm() models, the new definition uses

set_fit(
  model = "svm_rbf",
  eng = "kernlab",
  mode = "regression",
  value = list(
    interface = "formula",
    data = c(formula = "x", data = "data"),    # <---- the new bit here
    protect = c("x", "data"),
    func = c(pkg = "kernlab", fun = "ksvm"),
    defaults = list(kernel = "rbfdot")
  )
)

This is optional so most models are unaffected. The model definitions that were changed in the PR:

svm_rbf() and svm_poly() with kernlab engines (for the above reason)
mars() with earth engines (for other reasons)

Bonus Feature 1 🎉

Previously, when the arguments were translated to a call, the actual formula was not used. It contained a generic symbol called formula. In this PR we substitute the user-provided formula to the call.

> mod <-
+   decision_tree() %>%
+   set_engine("rpart", method = "anova") %>%
+   set_mode("regression")
> 
> mod_fit <- fit(mod, mpg ~ ., data = mtcars)
> 
> # previously
> mod_fit$fit$call
rpart::rpart(formula = formula, data = data, method = ~"anova")
> 
> # now
> mod_fit$fit$call
rpart::rpart(formula = mpg ~ ., data = data, method = ~"anova")

This should help with #274. @StephenMilborrow, @apreshill, and others have mentioned this issue of borked calls in the past. This and the extra special bonus feature below should make this more tolerable.

Bonus Feature 2 🎉 🎉

The above change doesn't really solve #274 since 1) the generic data symbol does not link to the user's data object and 2) extra arguments (like anova above) are quosures and not the real object.

This version contains a function called repair_call() that fixes both of these issues if the user provides the original data to the function (just to get its name).

For the example above:

> mod <-
+   decision_tree() %>%
+   set_engine("rpart", method = "anova") %>%
+   set_mode("regression")
> 
> mod_fit <- fit(mod, mpg ~ ., data = mtcars)
> 
> # old and busted
> mod_fit$fit$call
rpart::rpart(formula = mpg ~ ., data = data, method = ~"anova")
> 
> # new hotness
> repair_call(mod_fit, mtcars)$fit$call
rpart::rpart(formula = mpg ~ ., data = mtcars, method = "anova")
>
> mod2 <- repair_call(mod_fit, mtcars)
> partykit::as.party(mod2$fit)

Model formula:
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

Fitted party:
[1] root
|   [2] cyl >= 5
|   |   [3] hp >= 192.5: 13.414 (n = 7, err = 28.8)
|   |   [4] hp < 192.5: 18.264 (n = 14, err = 59.9)
|   [5] cyl < 5: 26.664 (n = 11, err = 203.4)

Number of inner nodes:    2
Number of terminal nodes: 3

You might wonder why we need to manually call this extra function. Can't this function be invoked at the end of fit() to happen automatically?

The problem with automating this is that, once tune and workflows are used, the data that are passed internally to fit() might not available to the user (i.e. the might not match the user's data object). The easiest example is when a recipe is used in a workflow; the processed data are not the same as the original data.

In this case, the user could run the recipe on the data that was used to build the model and then repair the call with that object.

juliasilge · 2020-05-20T23:57:16Z

Aaaaah, this is great.

StephenMilborrow · 2020-05-21T16:17:04Z

Thank you for this. This will help compatibility with existing tools. I love the idea of a uniform interface to models (everyone does), but past efforts I've seen to achieve this (e.g. the mlr package) have done so in a way that borked existing tools. The caret package is a noteable exception. I tried to summarize existing old-school practices in Guidelines for S3 Regression Models .

github-actions · 2021-03-07T00:28:50Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

topepo added 10 commits May 20, 2020 14:37

added code to use alternate data argument names #315

3b818b1

changes to some model definitions for #315

3c95d2e

This function along with others changes would solve #274

fa59be1

removed some debugging code

c15ec47

Merge branch 'master' into alt-argument-names

376af2c

more changes for non-standard arguments for #315

6b8ad19

test_path()

d645c0a

adjust tolerance for better testing

3e46c0d

model-specific changes for #315

a871609

call-related changes

9e580d5

topepo merged commit bf507af into master May 22, 2020

This was referenced May 26, 2020

non-standard data arguments #315

Closed

Add engine specification field for predictor encodings #319

Merged

coercing rpart model fit with parsnip to partykit fails #274

Closed

Add partykit plot back in? tidymodels/tidymodels.org-legacy#174

Closed

github-actions bot locked and limited conversation to collaborators Mar 7, 2021

hfrick deleted the alt-argument-names branch September 17, 2021 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handling alternative data argument names and fixing calls #316

Handling alternative data argument names and fixing calls #316

Uh oh!

topepo commented May 20, 2020

Uh oh!

juliasilge commented May 20, 2020

Uh oh!

StephenMilborrow commented May 21, 2020

Uh oh!

github-actions bot commented Mar 7, 2021

Uh oh!

Uh oh!

Handling alternative data argument names and fixing calls #316

Handling alternative data argument names and fixing calls #316

Uh oh!

Conversation

topepo commented May 20, 2020

Uh oh!

juliasilge commented May 20, 2020

Uh oh!

StephenMilborrow commented May 21, 2020

Uh oh!

github-actions bot commented Mar 7, 2021

Uh oh!

Uh oh!