Skip to content

Handling alternative data argument names and fixing calls #316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 22, 2020

Conversation

topepo
Copy link
Member

@topepo topepo commented May 20, 2020

As mentioned in #315:

In some cases, a model function that parsnip is calling has non-standard argument names for the data-oriented components.

For example, a function foo() that only has the x/y interface might have a signature like foo(X, Y) instead of the standard foo(x, y).

Some real examples:

  • sparklyr models use x as the data slot (and put it in the first position)
  • kernlab::ksvm(x, data) where x is a formula.
  • MASS::lda(x, grouping)

This PR allows for these types of arguments.

Model definitions have the option to add a data element to the definition that maps the argument name to the objects that are created by fit() or fit_xy().

For example, for formula-based ksvm() models, the new definition uses

set_fit(
  model = "svm_rbf",
  eng = "kernlab",
  mode = "regression",
  value = list(
    interface = "formula",
    data = c(formula = "x", data = "data"),    # <---- the new bit here
    protect = c("x", "data"),
    func = c(pkg = "kernlab", fun = "ksvm"),
    defaults = list(kernel = "rbfdot")
  )
)

This is optional so most models are unaffected. The model definitions that were changed in the PR:

  • svm_rbf() and svm_poly() with kernlab engines (for the above reason)

  • mars() with earth engines (for other reasons)

Bonus Feature 1 🎉

Previously, when the arguments were translated to a call, the actual formula was not used. It contained a generic symbol called formula. In this PR we substitute the user-provided formula to the call.

> mod <-
+   decision_tree() %>%
+   set_engine("rpart", method = "anova") %>%
+   set_mode("regression")
> 
> mod_fit <- fit(mod, mpg ~ ., data = mtcars)
> 
> # previously
> mod_fit$fit$call
rpart::rpart(formula = formula, data = data, method = ~"anova")
> 
> # now
> mod_fit$fit$call
rpart::rpart(formula = mpg ~ ., data = data, method = ~"anova")

This should help with #274. @StephenMilborrow, @apreshill, and others have mentioned this issue of borked calls in the past. This and the extra special bonus feature below should make this more tolerable.

Bonus Feature 2 🎉 🎉

The above change doesn't really solve #274 since 1) the generic data symbol does not link to the user's data object and 2) extra arguments (like anova above) are quosures and not the real object.

This version contains a function called repair_call() that fixes both of these issues if the user provides the original data to the function (just to get its name).

For the example above:

> mod <-
+   decision_tree() %>%
+   set_engine("rpart", method = "anova") %>%
+   set_mode("regression")
> 
> mod_fit <- fit(mod, mpg ~ ., data = mtcars)
> 
> # old and busted
> mod_fit$fit$call
rpart::rpart(formula = mpg ~ ., data = data, method = ~"anova")
> 
> # new hotness
> repair_call(mod_fit, mtcars)$fit$call
rpart::rpart(formula = mpg ~ ., data = mtcars, method = "anova")
>
> mod2 <- repair_call(mod_fit, mtcars)
> partykit::as.party(mod2$fit)

Model formula:
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

Fitted party:
[1] root
|   [2] cyl >= 5
|   |   [3] hp >= 192.5: 13.414 (n = 7, err = 28.8)
|   |   [4] hp < 192.5: 18.264 (n = 14, err = 59.9)
|   [5] cyl < 5: 26.664 (n = 11, err = 203.4)

Number of inner nodes:    2
Number of terminal nodes: 3

You might wonder why we need to manually call this extra function. Can't this function be invoked at the end of fit() to happen automatically?

The problem with automating this is that, once tune and workflows are used, the data that are passed internally to fit() might not available to the user (i.e. the might not match the user's data object). The easiest example is when a recipe is used in a workflow; the processed data are not the same as the original data.

In this case, the user could run the recipe on the data that was used to build the model and then repair the call with that object.

@juliasilge
Copy link
Member

Aaaaah, this is great.

@StephenMilborrow
Copy link

Thank you for this. This will help compatibility with existing tools. I love the idea of a uniform interface to models (everyone does), but past efforts I've seen to achieve this (e.g. the mlr package) have done so in a way that borked existing tools. The caret package is a noteable exception. I tried to summarize existing old-school practices in Guidelines for S3 Regression Models .

@github-actions
Copy link

github-actions bot commented Mar 7, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 7, 2021
@hfrick hfrick deleted the alt-argument-names branch September 17, 2021 14:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants