Implementation of sample selection estimators #231

Merged
38 commits
7a29196
Move sample selection model to double_ml_for_py from templates
Nov 11, 2023
f5e2a78
Move sample selection model to double_ml_for_py from templates
Nov 11, 2023
5ce921f
Nuisance estimation for MAR sample selection
Nov 17, 2023
98cde59
Add pi and p propensity scores estimation
Nov 28, 2023
254a1dc
Fix estimation of nuisance functions
Dec 11, 2023
65ac448
fix estimation error caused by S column
Jan 14, 2024
0c7705f
Fix issues with wrong sampling
Jan 28, 2024
6c0a2e5
Start working on nonignorable nonresponse
Feb 3, 2024
ba22c03
Save before rebasing
Feb 9, 2024
64c04f9
Implement estimator for selection under nonignorable nonresponse
Feb 11, 2024
c443a96
Fix estimation errors in MAR and nonignorable nonresponse estimators …
Feb 16, 2024
ce21cc0
Implement weights normalization and start writing tests
Feb 17, 2024
e45d1e4
Write initial tests for MAR
Feb 18, 2024
c9658c2
Write initial tests for nonignorable nonresponse estimator
Feb 18, 2024
694389f
Remove unnecessary imports
Feb 18, 2024
02029f6
Add DoubleMLS to __init__.py for relative imports
Feb 25, 2024
3a9be3c
Add minor comments and change sample size in simulated data in confte…
Feb 25, 2024
38a65fa
Fix formatting according to PEP8
Feb 25, 2024
83f8143
Change model name to DoubleMLSSM and remove line redefining pi_hat
Mar 7, 2024
bf267fd
Raise warning instead of error when instrument is present and MAR sel…
Mar 8, 2024
c005549
Create DGP for sample selection in datasets.py
Mar 9, 2024
f90804d
Add default value tests for sample selection models
Mar 9, 2024
ecf6485
Add tests for SSM return types
Mar 10, 2024
26aab36
Remove z column from data returned in case of MAR
Mar 10, 2024
20ef6af
Fix formatting and change default number of CV folds to 5
Mar 10, 2024
ad19b97
Raise NotImplementedError for sequential conditional independence
Mar 10, 2024
209662e
Add exception tests for SSM
Mar 10, 2024
1f6e341
Change DGP for sample selection tests
Mar 10, 2024
315fe8b
Add binary outcome check for classifier mu and trim observations in _…
Mar 10, 2024
93ed42e
Remove unused imports and variables
Mar 10, 2024
d6e0af7
Change names of nuisance functions
Mar 17, 2024
fc3bf2d
Change score name 'mar' to 'missing-at-random' and set default ipw no…
Mar 17, 2024
b428015
Use _check_score from utilities and allow multiple instruments in cas…
Mar 17, 2024
6d29055
Refactor to use only one splitting procedure and correct ordering, re…
Mar 23, 2024
0e1969b
Rename selection to ssm
Mar 23, 2024
fb6a38f
Fix return type tests after renaming nuisance functions
Mar 24, 2024
7bd83d7
Add .coverage
Mar 27, 2024
6115f25
Save fitted models under nonignorable nonresponse
Mar 27, 2024
Binary file modified .coverage
Binary file not shown.
4 changes: 3 additions & 1 deletion doubleml/__init__.py
@@ -13,6 +13,7 @@
from .double_ml_lpq import DoubleMLLPQ
from .double_ml_cvar import DoubleMLCVAR
from .double_ml_policytree import DoubleMLPolicyTree
from .double_ml_ssm import DoubleMLSSM

__all__ = ['DoubleMLPLR',
'DoubleMLPLIV',
@@ -27,6 +28,7 @@
'DoubleMLQTE',
'DoubleMLLPQ',
'DoubleMLCVAR',
-'DoubleMLPolicyTree']
+'DoubleMLPolicyTree',
+'DoubleMLSSM']

__version__ = get_distribution('doubleml').version
88 changes: 88 additions & 0 deletions doubleml/datasets.py
@@ -1345,3 +1345,91 @@ def treatment_effect(x):
'effects': te,
'treatment_effect': treatment_effect}
return res_dict


def make_ssm_data(n_obs=8000, dim_x=100, theta=1, mar=True, return_type='DoubleMLData'):
    """
    Generates data from a sample selection model (SSM).
    The data generating process is defined as

    .. math::

        y_i &= \\theta d_i + x_i' \\beta + u_i,

        s_i &= 1\\left\\lbrace d_i + \\gamma z_i + x_i' \\beta + v_i > 0 \\right\\rbrace,

        d_i &= 1\\left\\lbrace x_i' \\beta + w_i > 0 \\right\\rbrace,

    with :math:`y_i` being observed if :math:`s_i = 1` and covariates
    :math:`x_i \\sim \\mathcal{N}(0, \\Sigma^2_x)`, where
    :math:`\\Sigma^2_x` is a matrix with entries
    :math:`\\Sigma_{kj} = 0.5^{|j-k|}`.
    :math:`\\beta` is a `dim_x`-vector with entries :math:`\\beta_j=\\frac{0.4}{j^2}`,
    :math:`z_i \\sim \\mathcal{N}(0, 1)`,
    :math:`(u_i,v_i) \\sim \\mathcal{N}(0, \\Sigma^2_{u,v})`, and
    :math:`w_i \\sim \\mathcal{N}(0, 1)`.
    If ``mar=True``, :math:`\\gamma = 0` and :math:`u_i` and :math:`v_i` are uncorrelated; otherwise
    :math:`\\gamma = 1` and their correlation is 0.8.


    The data generating process is inspired by a process used in the simulation study (see Appendix E) of Bia,
    Huber and Lafférs (2023).
    Parameters
    ----------
    n_obs :
        The number of observations to simulate.
    dim_x :
        The number of covariates.
    theta :
        The value of the causal parameter.
    mar :
        Boolean. Indicates whether missingness at random holds.
    return_type :
        If ``'DoubleMLData'`` or ``DoubleMLData``, returns a ``DoubleMLData`` object.

        If ``'DataFrame'``, ``'pd.DataFrame'`` or ``pd.DataFrame``, returns a ``pd.DataFrame``.

        If ``'array'``, ``'np.ndarray'``, ``'np.array'`` or ``np.ndarray``, returns ``np.ndarray``'s ``(x, y, d, z, s)``.

        If ``mar=True``, the instrument ``z`` is omitted from the ``pd.DataFrame`` and ``DoubleMLData`` outputs.

    References
    ----------
    Michela Bia, Martin Huber & Lukáš Lafférs (2023) Double Machine Learning for Sample Selection Models,
    Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071
    """
    # Under MAR the errors are uncorrelated and the instrument z does not enter the selection equation
    if mar:
        sigma = np.array([[1, 0], [0, 1]])
        gamma = 0
    else:
        sigma = np.array([[1, 0.8], [0.8, 1]])
        gamma = 1

    # e[0] enters the selection equation, e[1] the outcome equation
    e = np.random.multivariate_normal(mean=[0, 0], cov=sigma, size=n_obs).T

    # Covariates with Toeplitz covariance 0.5^{|j-k|}
    cov_mat = toeplitz([np.power(0.5, k) for k in range(dim_x)])
    x = np.random.multivariate_normal(np.zeros(dim_x), cov_mat, size=[n_obs, ])

    beta = [0.4 / (k**2) for k in range(1, dim_x + 1)]

    d = np.where(np.dot(x, beta) + np.random.randn(n_obs) > 0, 1, 0)
    z = np.random.randn(n_obs)
    s = np.where(np.dot(x, beta) + d + gamma * z + e[0] > 0, 1, 0)

    y = np.dot(x, beta) + theta * d + e[1]
    # Outcomes of unselected units are unobserved and set to 0
    y[s == 0] = 0

    if return_type in _array_alias:
        return x, y, d, z, s
    elif return_type in _data_frame_alias + _dml_data_alias:
        x_cols = [f'X{i + 1}' for i in np.arange(dim_x)]
        if mar:
            data = pd.DataFrame(np.column_stack((x, y, d, s)),
                                columns=x_cols + ['y', 'd', 's'])
        else:
            data = pd.DataFrame(np.column_stack((x, y, d, z, s)),
                                columns=x_cols + ['y', 'd', 'z', 's'])
        if return_type in _data_frame_alias:
            return data
        else:
            if mar:
                return DoubleMLData(data, 'y', 'd', x_cols, None, 's')
            return DoubleMLData(data, 'y', 'd', x_cols, 'z', 's')
    else:
        raise ValueError('Invalid return_type.')
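
To make the role of the `mar` flag in `make_ssm_data` concrete, the following is a minimal, standard-library-only sketch of the same data generating process. It is not the code from this PR: `simulate_ssm` and `mean_outcome_error_among_selected` are hypothetical helper names, the correlated errors and the Toeplitz-covariance covariates are drawn with an AR(1)-style recursion rather than `np.random.multivariate_normal`, and the defaults (`n_obs=5000`, `dim_x=5`) are chosen only to keep the run short. The sketch assumes the outcome equation as implemented in the function body (`y = x'beta + theta * d + e[1]`).

```python
import math
import random


def simulate_ssm(n_obs=5000, dim_x=5, theta=1.0, mar=True, seed=0):
    """Stdlib-only sketch of the sample selection DGP (hypothetical helper)."""
    rng = random.Random(seed)
    rho = 0.0 if mar else 0.8    # error correlation, as in the sigma matrices above
    gamma = 0.0 if mar else 1.0  # the instrument z only enters when MAR fails
    beta = [0.4 / (j ** 2) for j in range(1, dim_x + 1)]

    rows = []
    for _obs in range(n_obs):
        # AR(1) recursion: unit variances and Cov(x_j, x_k) = 0.5 ** |j - k|
        x = [rng.gauss(0, 1)]
        for _col in range(dim_x - 1):
            x.append(0.5 * x[-1] + math.sqrt(0.75) * rng.gauss(0, 1))
        xb = sum(b * xj for b, xj in zip(beta, x))

        # Two standard normal errors with correlation rho
        e_sel = rng.gauss(0, 1)
        e_out = rho * e_sel + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)

        d = 1 if xb + rng.gauss(0, 1) > 0 else 0
        z = rng.gauss(0, 1)
        s = 1 if xb + d + gamma * z + e_sel > 0 else 0
        # Outcomes of unselected units are unobserved and set to 0
        y = xb + theta * d + e_out if s == 1 else 0.0
        rows.append({'x': x, 'y': y, 'd': d, 'z': z, 's': s, 'e_out': e_out})
    return rows


def mean_outcome_error_among_selected(rows):
    """Average outcome error over units with s == 1 (hypothetical helper)."""
    selected = [r['e_out'] for r in rows if r['s'] == 1]
    return sum(selected) / len(selected)


mar_rows = simulate_ssm(mar=True)
nonign_rows = simulate_ssm(mar=False)
print('MAR:', mean_outcome_error_among_selected(mar_rows))
print('nonignorable:', mean_outcome_error_among_selected(nonign_rows))
```

With `mar=True` the average outcome error among selected units should be close to zero, because selection is independent of the outcome error given the covariates and treatment. With `mar=False` it should be clearly positive, which is the selection bias that the nonignorable-nonresponse score and the instrument `z` are meant to address.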