Adding CATE and GATE within IRM for one variable #156


Closed · wants to merge 8 commits

Conversation


@CarlesRieraA CarlesRieraA commented Jul 28, 2022

Thanks for contributing to DoubleML.
Before submitting a PR, please take a look at our contribution guidelines.
Additionally, please fill out the PR checklist below.

Description

Added CATE and GATE implementation within the IRM model for one conditional variable. (Orthogonal) polynomial and spline approximations are implemented.
Added a synthetic example to serve as the basic example of how to use it.

Reference to Issues or PRs

CATE estimation #46 (on the doubleml-hiwi-sandbox repo)

Comments

I am opening this PR to serve as a basis for discussing how to incorporate CATE into the package. Further discussion is welcome and I will do my best to adapt it.
No unit tests have been developed so far for the CATE/GATE. We would need to discuss that as well.
After running the current tests, the following ones failed: test_fetch_bonus_poly and test_irm_exception_with_missings. I'm not sure that's related to my implementation, but it would be great to discuss it.
It would also be great to discuss how best to document it.

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.


lgtm-com bot commented Jul 28, 2022

This pull request introduces 6 alerts when merging c0fc156 into b6d130a - view on LGTM.com

new alerts:

  • 6 for Unused import

@CarlesRieraA CarlesRieraA marked this pull request as draft July 28, 2022 14:28
Comment on lines +149 to +151
def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
Member

In the code base, we so far don't work with type hints. Such a change of code style should not come in together with a new feature. Therefore, please remove the type hints.

Suggested change
def cate(self, cate_var: str, method: str, alpha: float, n_grid_nodes: int,
n_samples_bootstrap: int, cv: bool, poly_degree: int = None, splines_knots: int = None,
splines_degree: int = None, ortho: bool = False, x_grid: np.array = None) -> dict:
def cate(self, cate_var, method, alpha, n_grid_nodes,
n_samples_bootstrap, cv, poly_degree=None, splines_knots=None,
splines_degree=None, ortho=False, x_grid=None):

General remark on type hints: Adding type hints is a bigger decision and would also require numerous adaptions in the contributing guidelines, etc. Some of the larger Python packages encourage the use of type hints (PEP 484), see for example pandas. Others do not yet, see e.g. the discussions scikit-learn/scikit-learn#16705 & fairlearn/fairlearn#592.

Comment on lines +158 to +168
cate_var: variable whose CATE is calculated
method: "poly" or "splines", chooses which method to approximate with
alpha: p-value for the confidence intervals
n_grid_nodes: number of points of X for which we wish to calculate the CATE
n_samples_bootstrap: how many samples to use to calculate the t-statistics described in 2.6
cv: whether to perform cross-validation to find the degree of the polynomials / number of knots
poly_degree: maximum degree of the polynomial approximation
splines_knots: max knots for the splines approximation
splines_degree: degree of the polynomials used in the splines
ortho: whether to use orthogonal polynomials
x_grid: grid points on which to evaluate the CATE
Member

Please align the docstring with the other functionality in the DoubleML package, see for example

Parameters
----------
obj_dml_data : :class:`DoubleMLData` object
The :class:`DoubleMLData` object providing the data and specifying the variables for the causal model.
ml_l : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`\\ell_0(X) = E[Y|X]`.
ml_m : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function :math:`m_0(X) = E[D|X]`.
For binary treatment variables :math:`D` (with values 0 and 1), a classifier implementing ``fit()`` and
``predict_proba()`` can also be specified. If :py:func:`sklearn.base.is_classifier` returns ``True``,
``predict_proba()`` is used otherwise ``predict()``.
ml_g : estimator implementing ``fit()`` and ``predict()``
A machine learner implementing ``fit()`` and ``predict()`` methods (e.g.
:py:class:`sklearn.ensemble.RandomForestRegressor`) for the nuisance function
:math:`g_0(X) = E[Y - D \\theta_0|X]`.
Note: The learner `ml_g` is only required for the score ``'IV-type'``. Optionally, it can be specified and
estimated for callable scores.
n_folds : int
Number of folds.
Default is ``5``.
n_rep : int
Number of repetitions for the sample splitting.
Default is ``1``.
score : str or callable
A str (``'partialling out'`` or ``'IV-type'``) specifying the score function
or a callable object / function with signature ``psi_a, psi_b = score(y, d, l_hat, m_hat, g_hat, smpls)``.
Default is ``'partialling out'``.
dml_procedure : str
A str (``'dml1'`` or ``'dml2'``) specifying the double machine learning algorithm.
Default is ``'dml2'``.
draw_sample_splitting : bool
Indicates whether the sample splitting should be drawn during initialization of the object.
Default is ``True``.
apply_cross_fitting : bool
Indicates whether cross-fitting should be applied.
Default is ``True``.

This means:

  • Add the type in the first line of the docstring
  • Add the description of the variable on the following line (with indentation)
  • If a default value is specified, state it in the last line
  • Add a newline between parameters
  • ...
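Applied to the PR's signature, a sketch of the first few `cate` parameters in that style (descriptions taken from the current docstring; the exact wording is an assumption) might look like:

```python
def cate(self, cate_var, method, alpha, n_grid_nodes,
         n_samples_bootstrap, cv, poly_degree=None, splines_knots=None,
         splines_degree=None, ortho=False, x_grid=None):
    """Estimate the conditional average treatment effect (CATE).

    Parameters
    ----------
    cate_var : str
        Name of the variable whose CATE is calculated.

    method : str
        A str (``'poly'`` or ``'splines'``) specifying the approximation method.

    alpha : float
        Significance level for the confidence intervals.

    poly_degree : int or None
        Maximum degree of the polynomial approximation.
        Default is ``None``.
    """
```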

Comment on lines +170 to +173
Returns
-------
A dictionary containing the estimated CATE (g_hat), with upper and lower confidence bounds (both simultaneous as
well as pointwise), fitted linear model and grid of X for which the CATE was calculated
Member

Should be the same format as Parameters, see e.g.

Returns
-------
self : object
Returned if ``return_tune_res`` is ``False``.
tune_res: list
A list containing detailed tuning results and the proposed hyperparameters.
Returned if ``return_tune_res`` is ``True``.
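Restated in that format, the `cate` return value (key names inferred from the PR's description of the returned dict; the exact keys are an assumption) could be documented as:

```python
def cate(cate_var, method, alpha):
    """Estimate the CATE (signature abridged for illustration).

    Returns
    -------
    res : dict
        Dictionary containing the estimated CATE (``'g_hat'``), simultaneous
        and pointwise confidence bounds (``'g_hat_lower'``, ``'g_hat_upper'``,
        ``'g_hat_lower_point'``, ``'g_hat_upper_point'``), the fitted linear
        model and the grid of X on which the CATE was calculated.
    """
```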


"""
X = np.array(self._dml_data.data[cate_var])
Member

Please add exception handling for every input variable. The type as well as the value should be checked. For an example, see

if (not isinstance(param_grids, dict)) | (not all(k in param_grids for k in self.learner_names)):
    raise ValueError('Invalid param_grids ' + str(param_grids) + '. '
                     'param_grids must be a dictionary with keys ' + ' and '.join(self.learner_names) + '.')
if scoring_methods is not None:
    if (not isinstance(scoring_methods, dict)) | (not all(k in self.learner_names for k in scoring_methods)):
        raise ValueError('Invalid scoring_methods ' + str(scoring_methods) + '. ' +
                         'scoring_methods must be a dictionary. ' +
                         'Valid keys are ' + ' and '.join(self.learner_names) + '.')
    if not all(k in scoring_methods for k in self.learner_names):
        # if there are learners for which no scoring_method was set, we fall back to None, i.e., default scoring
        for learner in self.learner_names:
            if learner not in scoring_methods:
                scoring_methods[learner] = None
if not isinstance(tune_on_folds, bool):
    raise TypeError('tune_on_folds must be True or False. '
                    f'Got {str(tune_on_folds)}.')
if not isinstance(n_folds_tune, int):
    raise TypeError('The number of folds used for tuning must be of int type. '
                    f'{str(n_folds_tune)} of type {str(type(n_folds_tune))} was passed.')
if n_folds_tune < 2:
    raise ValueError('The number of folds used for tuning must be at least two. '
                     f'{str(n_folds_tune)} was passed.')
if (not isinstance(search_mode, str)) | (search_mode not in ['grid_search', 'randomized_search']):
    raise ValueError('search_mode must be "grid_search" or "randomized_search". '
                     f'Got {str(search_mode)}.')
if not isinstance(n_iter_randomized_search, int):
    raise TypeError('The number of parameter settings sampled for the randomized search must be of int type. '
                    f'{str(n_iter_randomized_search)} of type '
                    f'{str(type(n_iter_randomized_search))} was passed.')
if n_iter_randomized_search < 2:
    raise ValueError('The number of parameter settings sampled for the randomized search must be at least two. '
                     f'{str(n_iter_randomized_search)} was passed.')
if n_jobs_cv is not None:
    if not isinstance(n_jobs_cv, int):
        raise TypeError('The number of CPUs used to fit the learners must be of int type. '
                        f'{str(n_jobs_cv)} of type {str(type(n_jobs_cv))} was passed.')
if not isinstance(set_as_params, bool):
    raise TypeError('set_as_params must be True or False. '
                    f'Got {str(set_as_params)}.')
if not isinstance(return_tune_res, bool):
    raise TypeError('return_tune_res must be True or False. '
                    f'Got {str(return_tune_res)}.')
Each exception handling should additionally be equipped with unit tests, to make sure that throwing the exceptions works as intended, see e.g. https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/tests/test_doubleml_exceptions.py.

Comment on lines +178 to +182
if x_grid is None:
print("Automatically generating grid for CATE evaluation")
x_grid = np.linspace(np.round(np.min(X), 2), np.round(np.max(X), 2), n_grid_nodes).reshape(-1, 1)
else:
print("User-provided grid will be used for CATE evaluation")
Member

I prefer not to have such print statements. If they are really essential, we should add a verbosity variable so that users still have the option to turn off printing to the console.

@@ -1,9 +1,15 @@
import patsy
Member

If we really need the patsy package, we also need to list it as a requirement of the package.

Comment on lines +462 to +463


Member

Please remove these additional blank lines.

@@ -0,0 +1,141 @@
import numpy as np
Member

Please remove this file synthetic_example.py from the PR. I don't think that the repo containing the package code is the right place for such a demonstration / example. We can transfer it into an example for the gallery (https://docs.doubleml.org/stable/examples/index.html). The source code for the example gallery can be found in the doubleml-docs repo, see https://github.com/DoubleML/doubleml-docs/tree/master/doc/examples.

Comment on lines +1 to +18
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import RegressionResults

import doubleml as dml
from doubleml.datasets import fetch_401K

from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

from scipy.stats import norm
from scipy.linalg import sqrtm

import patsy

import matplotlib.pyplot as plt
Member

Please check which imports are in fact used. I think you can remove the following

from statsmodels.regression.linear_model import RegressionResults
from doubleml.datasets import fetch_401K
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
from scipy.stats import norm
from scipy.linalg import sqrtm
import patsy

Comment on lines +280 to +289
results_dict = {
"g_hat": g_hat,
"g_hat_lower": g_hat_lower,
"g_hat_upper": g_hat_upper,
"g_hat_lower_point": g_hat_lower_point,
"g_hat_upper_point": g_hat_upper_point,
"fitted_model": fitted_model,
"gate_type": gate_type,
"n_quantiles": n_quantiles,
"class_name": class_names
Member

Also for the GATE, I would prefer that the result is an object of a class named GATEResult, EstimatedGATE or just GATE.
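A minimal sketch of such a result class (attribute names taken from the PR's results_dict; the class name follows the suggestion, everything else is an assumption):

```python
class GATE:
    """Container for GATE estimation results (illustrative sketch)."""

    def __init__(self, g_hat, g_hat_lower, g_hat_upper,
                 g_hat_lower_point, g_hat_upper_point,
                 fitted_model, gate_type, n_quantiles, class_names):
        self.g_hat = g_hat
        self.g_hat_lower = g_hat_lower              # simultaneous lower bound
        self.g_hat_upper = g_hat_upper              # simultaneous upper bound
        self.g_hat_lower_point = g_hat_lower_point  # pointwise lower bound
        self.g_hat_upper_point = g_hat_upper_point  # pointwise upper bound
        self.fitted_model = fitted_model
        self.gate_type = gate_type
        self.n_quantiles = n_quantiles
        self.class_names = class_names

    def __repr__(self):
        return f'GATE(gate_type={self.gate_type!r}, n_quantiles={self.n_quantiles})'
```

Compared to a plain dict, a result class gives attribute access, a readable repr, and a natural place for later additions such as plotting or summary methods.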

@SvenKlaassen
Member

I will close this PR since the implementation of CATE and GATE for the IRM model was done in PR #169.
