Adding CATE and GATE within IRM for one variable #156
Conversation
This pull request introduces 6 alerts when merging c0fc156 into b6d130a - view on LGTM.com

new alerts:
def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
         n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
         splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
In the code base, we so far don't work with type hints. Such a change of the code style should not come together with a new feature. Therefore, please remove the type hints.
Suggested change:
- def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
-          n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
-          splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
+ def cate(self, cate_var, method, alpha, n_grid_nodes,
+          n_samples_bootstrap, cv, poly_degree=None, splines_knots=None,
+          splines_degree=None, ortho=False, x_grid=None):
General remark on type hints: Adding type hints is a bigger decision and would also require numerous adaptations in the contributing guidelines, etc. Some of the larger Python packages encourage the use of type hints (PEP 484), see for example pandas. Others don't yet, see e.g. the discussions scikit-learn/scikit-learn#16705 & fairlearn/fairlearn#592.
cate_var: variable whose CATE is calculated
method: "poly" or "splines", chooses which method to approximate with
alpha: p-value for the confidence intervals
n_grid_nodes: number of points of X for which we wish to calculate the CATE
n_samples_bootstrap: how many samples to use to calculate the t-statistics described in 2.6
cv: whether to perform cross-validation to find the degree of the polynomials / number of knots
poly_degree: maximum degree of the polynomial approximation
splines_knots: max knots for the splines approximation
splines_degree: degree of the polynomials used in the splines
ortho: whether to use orthogonal polynomials
x_grid: grid points on which to evaluate the CATE
Please align the docstring with the other functionality in the DoubleML package, see for example doubleml-for-py/doubleml/double_ml_plr.py, Lines 32 to 78 in b6d130a:
Parameters
----------
obj_dml_data : :class:`DoubleMLData` object
    The :class:`DoubleMLData` object providing the data and specifying the variables for the causal model.

ml_l : estimator implementing ``fit()`` and ``predict()``
    A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
    :py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`\\ell_0(X) = E[Y|X]`.

ml_m : estimator implementing ``fit()`` and ``predict()``
    A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
    :py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`m_0(X) = E[D|X]`.
    For binary treatment variables :math:`D` (with values 0 and 1), a classifier implementing ``fit()`` and
    ``predict_proba()`` can also be specified. If :py:func:`sklearn.base.is_classifier` returns ``True``,
    ``predict_proba()`` is used otherwise ``predict()``.

ml_g : estimator implementing ``fit()`` and ``predict()``
    A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
    :py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function
    :math:`g_0(X) = E[Y - D \\theta_0|X]`.
    Note: The learner `ml_g` is only required for the score ``'IV-type'``. Optionally, it can be specified and
    estimated for callable scores.

n_folds : int
    Number of folds.
    Default is ``5``.

n_rep : int
    Number of repetitions for the sample splitting.
    Default is ``1``.

score : str or callable
    A str (``'partialling out'`` or ``'IV-type'``) specifying the score function
    or a callable object / function with signature ``psi_a, psi_b = score(y, d, l_hat, m_hat, g_hat, smpls)``.
    Default is ``'partialling out'``.

dml_procedure : str
    A str (``'dml1'`` or ``'dml2'``) specifying the double machine learning algorithm.
    Default is ``'dml2'``.

draw_sample_splitting : bool
    Indicates whether the sample splitting should be drawn during initialization of the object.
    Default is ``True``.

apply_cross_fitting : bool
    Indicates whether cross-fitting should be applied.
    Default is ``True``.
This means (a sketch for cate follows below):
- Add the type in the first line of the docstring
- Add the description of the variable in the following line (with indentation)
- If a default value is specified, state it in the last line
- Add a newline between parameters
- ...
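For illustration, the Parameters section of cate could then be sketched along these lines (names taken from the signature above; types and wording are my suggestion, to be confirmed by the author):

Parameters
----------
cate_var : str
    The variable whose CATE is calculated.

method : str
    A str (``'poly'`` or ``'splines'``) specifying the approximation method.

alpha : float
    The significance level for the confidence intervals.

n_grid_nodes : int
    Number of grid points of X for which the CATE is calculated.

n_samples_bootstrap : int
    Number of bootstrap samples used to calculate the t-statistics.

cv : bool
    Indicates whether cross-validation is used to choose the polynomial degree / number of knots.

poly_degree : int or None
    Maximum degree of the polynomial approximation.
    Default is ``None``.

splines_knots : int or None
    Maximum number of knots for the splines approximation.
    Default is ``None``.

splines_degree : int or None
    Degree of the polynomials used in the splines.
    Default is ``None``.

ortho : bool
    Indicates whether orthogonal polynomials are used.
    Default is ``False``.

x_grid : :class:`numpy.ndarray` or None
    Grid points on which to evaluate the CATE.
    Default is ``None``.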
Returns
-------
A dictionary containing the estimated CATE (g_hat), with upper and lower confidence bounds (both simultaneous as
well as pointwise), fitted linear model and grid of X for which the CATE was calculated
Should be the same format as Parameters, see e.g. doubleml-for-py/doubleml/double_ml.py, Lines 724 to 731 in b6d130a:
Returns
-------
self : object
    Returned if ``return_tune_res`` is ``False``.

tune_res : list
    A list containing detailed tuning results and the proposed hyperparameters.
    Returned if ``return_tune_res`` is ``True``.
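Applied to cate, the Returns section could then be sketched as follows (again only a suggestion):

Returns
-------
res : dict
    A dictionary containing the estimated CATE (``g_hat``), the simultaneous and pointwise upper and lower
    confidence bounds, the fitted linear model and the grid of X for which the CATE was calculated.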
"""
X = np.array(self._dml_data.data[cate_var])
Please add exception handling for every input variable. The type as well as the value should be checked. For an example, see doubleml-for-py/doubleml/double_ml.py, Lines 734 to 783 in b6d130a:
if (not isinstance(param_grids, dict)) | (not all(k in param_grids for k in self.learner_names)):
    raise ValueError('Invalid param_grids ' + str(param_grids) + '. '
                     'param_grids must be a dictionary with keys ' + ' and '.join(self.learner_names) + '.')
if scoring_methods is not None:
    if (not isinstance(scoring_methods, dict)) | (not all(k in self.learner_names for k in scoring_methods)):
        raise ValueError('Invalid scoring_methods ' + str(scoring_methods) + '. ' +
                         'scoring_methods must be a dictionary. ' +
                         'Valid keys are ' + ' and '.join(self.learner_names) + '.')
    if not all(k in scoring_methods for k in self.learner_names):
        # if there are learners for which no scoring_method was set, we fall back to None, i.e., default scoring
        for learner in self.learner_names:
            if learner not in scoring_methods:
                scoring_methods[learner] = None
if not isinstance(tune_on_folds, bool):
    raise TypeError('tune_on_folds must be True or False. '
                    f'Got {str(tune_on_folds)}.')
if not isinstance(n_folds_tune, int):
    raise TypeError('The number of folds used for tuning must be of int type. '
                    f'{str(n_folds_tune)} of type {str(type(n_folds_tune))} was passed.')
if n_folds_tune < 2:
    raise ValueError('The number of folds used for tuning must be at least two. '
                     f'{str(n_folds_tune)} was passed.')
if (not isinstance(search_mode, str)) | (search_mode not in ['grid_search', 'randomized_search']):
    raise ValueError('search_mode must be "grid_search" or "randomized_search". '
                     f'Got {str(search_mode)}.')
if not isinstance(n_iter_randomized_search, int):
    raise TypeError('The number of parameter settings sampled for the randomized search must be of int type. '
                    f'{str(n_iter_randomized_search)} of type '
                    f'{str(type(n_iter_randomized_search))} was passed.')
if n_iter_randomized_search < 2:
    raise ValueError('The number of parameter settings sampled for the randomized search must be at least two. '
                     f'{str(n_iter_randomized_search)} was passed.')
if n_jobs_cv is not None:
    if not isinstance(n_jobs_cv, int):
        raise TypeError('The number of CPUs used to fit the learners must be of int type. '
                        f'{str(n_jobs_cv)} of type {str(type(n_jobs_cv))} was passed.')
if not isinstance(set_as_params, bool):
    raise TypeError('set_as_params must be True or False. '
                    f'Got {str(set_as_params)}.')
if not isinstance(return_tune_res, bool):
    raise TypeError('return_tune_res must be True or False. '
                    f'Got {str(return_tune_res)}.')
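Transferred to the arguments of cate, the checks could start roughly as follows (a hypothetical snippet mirroring the style of the quoted code; the exact messages and conditions are up to the author):

if not isinstance(cate_var, str):
    raise TypeError('cate_var must be of str type. '
                    f'{str(cate_var)} of type {str(type(cate_var))} was passed.')
if cate_var not in self._dml_data.data.columns:
    raise ValueError(f'Invalid cate_var {str(cate_var)}. '
                     'cate_var must be a variable of the data set.')
if (not isinstance(method, str)) | (method not in ['poly', 'splines']):
    raise ValueError('method must be "poly" or "splines". '
                     f'Got {str(method)}.')
if (not isinstance(alpha, float)) | (not 0 < alpha < 1):
    raise ValueError('alpha must be a float in (0, 1). '
                     f'Got {str(alpha)}.')
if not isinstance(n_grid_nodes, int):
    raise TypeError('The number of grid nodes must be of int type. '
                    f'{str(n_grid_nodes)} of type {str(type(n_grid_nodes))} was passed.')
if not isinstance(cv, bool):
    raise TypeError('cv must be True or False. '
                    f'Got {str(cv)}.')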
if x_grid is None:
    print("Automatically generating grid for CATE evaluation")
    x_grid = np.linspace(np.round(np.min(X), 2), np.round(np.max(X), 2), n_grid_nodes).reshape(-1, 1)
else:
    print("User-provided grid will be used for CATE evaluation")
I prefer not to have such print statements. If they are really essential, we should add a verbosity variable such that users still have the option to turn off printing to the console.
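For illustration, a minimal sketch of such a verbosity option applied to the grid generation above (the verbose parameter and the helper name _build_x_grid are hypothetical, not part of the existing API):

import numpy as np

def _build_x_grid(x, n_grid_nodes, x_grid=None, verbose=False):
    # verbose=False silences the console output by default
    if x_grid is None:
        if verbose:
            print("Automatically generating grid for CATE evaluation")
        # evenly spaced grid between the (rounded) min and max of x
        x_grid = np.linspace(np.round(np.min(x), 2), np.round(np.max(x), 2),
                             n_grid_nodes).reshape(-1, 1)
    elif verbose:
        print("User-provided grid will be used for CATE evaluation")
    return x_grid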
@@ -1,9 +1,15 @@
import patsy
If we really need the patsy package, we also need to list it as a requirement of the package.
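For illustration, assuming the dependencies are declared via install_requires in setup.py, the addition could look like this (the surrounding entries are placeholders, not the package's actual dependency list):

# setup.py (excerpt): declare patsy as a runtime dependency
install_requires=[
    'numpy',
    'pandas',
    'scipy',
    'scikit-learn',
    'statsmodels',
    'patsy',  # newly required by the CATE/GATE implementation
],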
[diff context: added blank lines]

Please remove these additional new lines.
@@ -0,0 +1,141 @@
import numpy as np
Please remove the file synthetic_example.py from the PR. I don't think that the repo containing the package code is the right place for such a demonstration / example. We can transfer it into an example for the gallery (https://docs.doubleml.org/stable/examples/index.html). The source code for the example gallery can be found in the doubleml-docs repo, see https://github.com/DoubleML/doubleml-docs/tree/master/doc/examples.
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import RegressionResults

import doubleml as dml
from doubleml.datasets import fetch_401K

from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

from scipy.stats import norm
from scipy.linalg import sqrtm

import patsy

import matplotlib.pyplot as plt
Please check which imports are in fact used. I think you can remove the following:
from statsmodels.regression.linear_model import RegressionResults
from doubleml.datasets import fetch_401K
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
from scipy.stats import norm
from scipy.linalg import sqrtm
import patsy
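After dropping the imports listed above, the import block reduces to:

import numpy as np
import pandas as pd

import doubleml as dml

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

import matplotlib.pyplot as plt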
results_dict = {
    "g_hat": g_hat,
    "g_hat_lower": g_hat_lower,
    "g_hat_upper": g_hat_upper,
    "g_hat_lower_point": g_hat_lower_point,
    "g_hat_upper_point": g_hat_upper_point,
    "fitted_model": fitted_model,
    "gate_type": gate_type,
    "n_quantiles": n_quantiles,
    "class_name": class_names
}
Also for the GATE, I would prefer that the result is an object of a class named GATEResult, EstimatedGATE or just GATE.
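For illustration, a minimal sketch of such a result class built from the fields of the dictionary above (the class name GATE and the plain attributes are only a suggestion):

class GATE:
    # Hypothetical container for GATE estimation results.
    def __init__(self, g_hat, g_hat_lower, g_hat_upper,
                 g_hat_lower_point, g_hat_upper_point,
                 fitted_model, gate_type, n_quantiles, class_names):
        # point estimates and simultaneous confidence bounds
        self.g_hat = g_hat
        self.g_hat_lower = g_hat_lower
        self.g_hat_upper = g_hat_upper
        # pointwise confidence bounds
        self.g_hat_lower_point = g_hat_lower_point
        self.g_hat_upper_point = g_hat_upper_point
        # fitted linear model and grouping information
        self.fitted_model = fitted_model
        self.gate_type = gate_type
        self.n_quantiles = n_quantiles
        self.class_names = class_names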
I will close this issue since the implementation of CATE and GATE for the IRM model was done with PR #169.
Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.
Description
Added CATE and GATE implementation within the IRM model for one conditional variable. (Orthogonal) polynomial and spline approximations are implemented.
A synthetic example serves as the basic example of how to use it.
Reference to Issues or PRs
CATE estimation #46 (on the doubleml-hiwi-sandbox repo)
Comments
This PR is meant to serve as a basis for discussion on how to incorporate CATE into the package. Further discussion is welcome and I will do my best to adapt the implementation.
No unit tests have been developed so far for the CATE/GATE. We would need to talk about that as well.
After running the current tests, the following ones failed: test_fetch_bonus_poly, test_irm_exception_with_missings. I'm not sure whether that's related to my implementation, but it would be great to talk about it.
It would also be great to have a discussion on how best to document it.
PR Checklist
Please fill out this PR checklist (see our contributing guidelines for details).