Adding CATE and GATE within IRM for one variable #156


Closed · wants to merge 8 commits

Conversation


@CarlesRieraA CarlesRieraA commented Jul 28, 2022

Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.

Description

Added CATE and GATE implementation within the IRM model for one conditional variable. (Orthogonal) polynomial and spline approximations are implemented.
Added a synthetic example to serve as the basic example of how to use it.

Reference to Issues or PRs

CATE estimation #46 (on the doubleml-hiwi-sandbox repo)

Comments

I am opening this PR to serve as a basis for discussing how to incorporate CATE into the package. Further discussion is welcome and I will do my best to adapt it.
No unit tests have been developed so far for the CATE/GATE. We would need to discuss that as well.
After running the current tests, the following ones failed: test_fetch_bonus_poly and test_irm_exception_with_missings. I'm not sure that's related to my implementation, but it would be great to discuss it.
It would also be great to discuss how best to document it.

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.


lgtm-com bot commented Jul 28, 2022

This pull request introduces 6 alerts when merging c0fc156 into b6d130a - view on LGTM.com

new alerts:

  • 6 for Unused import

@CarlesRieraA CarlesRieraA marked this pull request as draft July 28, 2022 14:28
Comment on lines +149 to +151
def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
Member

In the code base, we so far don't work with type hints. Such a change of code style should not come in together with a new feature. Therefore, please remove the type hints.

Suggested change
def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
def cate(self, cate_var, method, alpha, n_grid_nodes,
n_samples_bootstrap, cv, poly_degree=None, splines_knots=None,
splines_degree=None, ortho=False, x_grid=None):

General remark on type hints: Adding type hints is a bigger decision and would also require numerous adaptions in the contributing guidelines, etc. Some of the larger Python packages encourage the use of type hints (PEP 484), see for example pandas. Others do not yet, see e.g. the discussions scikit-learn/scikit-learn#16705 & fairlearn/fairlearn#592.

Comment on lines +158 to +168
cate_var: variable whose CATE is calculated
method: "poly" or "splines", chooses which method to approximate with
alpha: p-value for the confidence intervals
n_grid_nodes: number of points of X for which we wish to calculate the CATE
n_samples_bootstrap: how many samples to use to calculate the t-statistics described in 2.6
cv: whether to perform cross-validation to find the degree of the polynomials / number of knots
poly_degree: maximum degree of the polynomial approximation
splines_knots: max knots for the splines approximation
splines_degree: degree of the polynomials used in the splines
ortho: whether to use orthogonal polynomials
x_grid: grid points on which to evaluate the CATE
Member

Please align the docstring with the other functionality in the DoubleML package, see for example

Parameters
----------
obj_dml_data : :class:`DoubleMLData` object
The :class:`DoubleMLData` object providing the data and specifying the variables for the causal model.
ml_l : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`\\ell_0(X) = E[Y|X]`.
ml_m : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`m_0(X) = E[D|X]`.
For binary treatment variables :math:`D` (with values 0 and 1), a classifier implementing ``fit()`` and
``predict_proba()`` can also be specified. If :py:func:`sklearn.base.is_classifier` returns ``True``,
``predict_proba()`` is used otherwise ``predict()``.
ml_g : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function
:math:`g_0(X) = E[Y - D \\theta_0|X]`.
Note: The learner `ml_g` is only required for the score ``'IV-type'``. Optionally, it can be specified and
estimated for callable scores.
n_folds : int
Number of folds.
Default is ``5``.
n_rep : int
Number of repetitions for the sample splitting.
Default is ``1``.
score : str or callable
A str (``'partialling out'`` or ``'IV-type'``) specifying the score function
or a callable object / function with signature ``psi_a, psi_b = score(y, d, l_hat, m_hat, g_hat, smpls)``.
Default is ``'partialling out'``.
dml_procedure : str
A str (``'dml1'`` or ``'dml2'``) specifying the double machine learning algorithm.
Default is ``'dml2'``.
draw_sample_splitting : bool
Indicates whether the sample splitting should be drawn during initialization of the object.
Default is ``True``.
apply_cross_fitting : bool
Indicates whether cross-fitting should be applied.
Default is ``True``.

This means:

  • Add the type in the first line of the docstring
  • Add the description of the variable on the following line (with indentation)
  • If a default value is specified, state it in the last line
  • Add a newline between parameters
  • ...
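Applied to the PR's signature, a sketch of the first few `cate` parameters in that style (descriptions taken from the current docstring; the exact wording is an assumption) might look like:

```python
def cate(self, cate_var, method, alpha, n_grid_nodes,
         n_samples_bootstrap, cv, poly_degree=None, splines_knots=None,
         splines_degree=None, ortho=False, x_grid=None):
    """Estimate the conditional average treatment effect (CATE).

    Parameters
    ----------
    cate_var : str
        Name of the variable whose CATE is calculated.

    method : str
        A str (``'poly'`` or ``'splines'``) specifying the approximation method.

    alpha : float
        Significance level for the confidence intervals.

    poly_degree : int or None
        Maximum degree of the polynomial approximation.
        Default is ``None``.
    """
```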

Comment on lines +170 to +173
Returns
-------
A dictionary containing the estimated CATE (g_hat), with upper and lower confidence bounds (both simultaneous as
well as pointwise), fitted linear model and grid of X for which the CATE was calculated
Member

Should be the same format as Parameters, see e.g.

Returns
-------
self : object
Returned if ``return_tune_res`` is ``False``.
tune_res: list
A list containing detailed tuning results and the proposed hyperparameters.
Returned if ``return_tune_res`` is ``True``.
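Restated in that format, the `cate` return value (key names inferred from the PR's description of the returned dict; the exact keys are an assumption) could be documented as:

```python
def cate(cate_var, method, alpha):
    """Estimate the CATE (signature abridged for illustration).

    Returns
    -------
    res : dict
        Dictionary containing the estimated CATE (``'g_hat'``), simultaneous
        and pointwise confidence bounds (``'g_hat_lower'``, ``'g_hat_upper'``,
        ``'g_hat_lower_point'``, ``'g_hat_upper_point'``), the fitted linear
        model and the grid of X on which the CATE was calculated.
    """
```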


"""
X = np.array(self._dml_data.data[cate_var])
Member

Please add exception handling for every input variable. The type as well as the value should be checked. For an example, see

if (not isinstance(param_grids, dict)) | (not all(k in param_grids for k in self.learner_names)):
    raise ValueError('Invalid param_grids ' + str(param_grids) + '. '
                     'param_grids must be a dictionary with keys ' + ' and '.join(self.learner_names) + '.')
if scoring_methods is not None:
    if (not isinstance(scoring_methods, dict)) | (not all(k in self.learner_names for k in scoring_methods)):
        raise ValueError('Invalid scoring_methods ' + str(scoring_methods) + '. ' +
                         'scoring_methods must be a dictionary. ' +
                         'Valid keys are ' + ' and '.join(self.learner_names) + '.')
    if not all(k in scoring_methods for k in self.learner_names):
        # if there are learners for which no scoring_method was set, we fall back to None, i.e., default scoring
        for learner in self.learner_names:
            if learner not in scoring_methods:
                scoring_methods[learner] = None
if not isinstance(tune_on_folds, bool):
    raise TypeError('tune_on_folds must be True or False. '
                    f'Got {str(tune_on_folds)}.')
if not isinstance(n_folds_tune, int):
    raise TypeError('The number of folds used for tuning must be of int type. '
                    f'{str(n_folds_tune)} of type {str(type(n_folds_tune))} was passed.')
if n_folds_tune < 2:
    raise ValueError('The number of folds used for tuning must be at least two. '
                     f'{str(n_folds_tune)} was passed.')
if (not isinstance(search_mode, str)) | (search_mode not in ['grid_search', 'randomized_search']):
    raise ValueError('search_mode must be "grid_search" or "randomized_search". '
                     f'Got {str(search_mode)}.')
if not isinstance(n_iter_randomized_search, int):
    raise TypeError('The number of parameter settings sampled for the randomized search must be of int type. '
                    f'{str(n_iter_randomized_search)} of type '
                    f'{str(type(n_iter_randomized_search))} was passed.')
if n_iter_randomized_search < 2:
    raise ValueError('The number of parameter settings sampled for the randomized search must be at least two. '
                     f'{str(n_iter_randomized_search)} was passed.')
if n_jobs_cv is not None:
    if not isinstance(n_jobs_cv, int):
        raise TypeError('The number of CPUs used to fit the learners must be of int type. '
                        f'{str(n_jobs_cv)} of type {str(type(n_jobs_cv))} was passed.')
if not isinstance(set_as_params, bool):
    raise TypeError('set_as_params must be True or False. '
                    f'Got {str(set_as_params)}.')
if not isinstance(return_tune_res, bool):
    raise TypeError('return_tune_res must be True or False. '
                    f'Got {str(return_tune_res)}.')
Each exception handling should additionally be equipped with unit tests, to make sure that throwing the exceptions works as intended, see e.g. https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/tests/test_doubleml_exceptions.py.

Comment on lines +178 to +182
if x_grid is None:
print("Automatically generating grid for CATE evaluation")
x_grid = np.linspace(np.round(np.min(X), 2), np.round(np.max(X), 2), n_grid_nodes).reshape(-1, 1)
else:
print("User-provided grid will be used for CATE evaluation")
Member

I prefer not to have such print statements. If they are really essential, we should add a verbosity variable so that users still have the option to turn off printing to the console.

@@ -1,9 +1,15 @@
import patsy
Member

If we really need the patsy package, we also need to list it as a requirement of the package.

Comment on lines +462 to +463


Member

Please remove these additional blank lines.

@@ -0,0 +1,141 @@
import numpy as np
Member

Please remove this file synthetic_example.py from the PR. I don't think that the repo containing the package code is the right place for such a demonstration / example. We can transfer it into an example for the gallery (https://docs.doubleml.org/stable/examples/index.html). The source code for the example gallery can be found in the doubleml-docs repo, see https://github.com/DoubleML/doubleml-docs/tree/master/doc/examples.

Comment on lines +1 to +18
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import RegressionResults

import doubleml as dml
from doubleml.datasets import fetch_401K

from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

from scipy.stats import norm
from scipy.linalg import sqrtm

import patsy

import matplotlib.pyplot as plt
Member

Please check which imports are in fact used. I think you can remove the following

from statsmodels.regression.linear_model import RegressionResults
from doubleml.datasets import fetch_401K
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
from scipy.stats import norm
from scipy.linalg import sqrtm
import patsy

Comment on lines +280 to +289
results_dict = {
"g_hat": g_hat,
"g_hat_lower": g_hat_lower,
"g_hat_upper": g_hat_upper,
"g_hat_lower_point": g_hat_lower_point,
"g_hat_upper_point": g_hat_upper_point,
"fitted_model": fitted_model,
"gate_type": gate_type,
"n_quantiles": n_quantiles,
"class_name": class_names
Member

Also for the GATE, I would prefer that the result is an object of a class named GATEResult, EstimatedGATE or just GATE.
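A minimal sketch of such a result class (attribute names taken from the PR's results_dict; the class name follows the suggestion, everything else is an assumption):

```python
class GATE:
    """Container for GATE estimation results (illustrative sketch)."""

    def __init__(self, g_hat, g_hat_lower, g_hat_upper,
                 g_hat_lower_point, g_hat_upper_point,
                 fitted_model, gate_type, n_quantiles, class_names):
        self.g_hat = g_hat
        self.g_hat_lower = g_hat_lower              # simultaneous lower bound
        self.g_hat_upper = g_hat_upper              # simultaneous upper bound
        self.g_hat_lower_point = g_hat_lower_point  # pointwise lower bound
        self.g_hat_upper_point = g_hat_upper_point  # pointwise upper bound
        self.fitted_model = fitted_model
        self.gate_type = gate_type
        self.n_quantiles = n_quantiles
        self.class_names = class_names

    def __repr__(self):
        return f'GATE(gate_type={self.gate_type!r}, n_quantiles={self.n_quantiles})'
```

Compared to a plain dict, a result class gives attribute access, a readable repr, and a natural place for later additions such as plotting or summary methods.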

@SvenKlaassen
Member

I will close this PR since the implementation of CATE and GATE for the IRM model was done in PR #169.
