Add sample selection models #235

Merged: 71 commits, May 29, 2024
Commits
7a29196
Move sample selection model to double_ml_for_py from templates
Nov 11, 2023
f5e2a78
Move sample selection model to double_ml_for_py from templates
Nov 11, 2023
5ce921f
Nuisance estimation for MAR sample selection
Nov 17, 2023
98cde59
Add pi and p propensity scores estimation
Nov 28, 2023
254a1dc
Fix estimation of nuisance functions
Dec 11, 2023
65ac448
fix estimation error caused by S column
Jan 14, 2024
0c7705f
Fix issues with wrong sampling
Jan 28, 2024
6c0a2e5
Start working on nonignorable nonresponse
Feb 3, 2024
ba22c03
Save before rebasing
Feb 9, 2024
64c04f9
Implement estimator for selection under nonignorable nonresponse
Feb 11, 2024
c443a96
Fix estimation errors in MAR and nonignorable nonresponse estimators …
Feb 16, 2024
ce21cc0
Implement weights normalization and start writing tests
Feb 17, 2024
e45d1e4
Write initial tests for MAR
Feb 18, 2024
c9658c2
Write initial tests for nonignorable nonresponse estimator
Feb 18, 2024
694389f
Remove unnecessary imports
Feb 18, 2024
02029f6
Add DoubleMLS to __init__.py for relative imports
Feb 25, 2024
3a9be3c
Add minor comments and change sample size in simulated data in confte…
Feb 25, 2024
38a65fa
Fix formatting according to PEP8
Feb 25, 2024
83f8143
Change model name to DoubleMLSSM and remove line redefining pi_hat
Mar 7, 2024
bf267fd
Raise warning instead of error when instrument is present and MAR sel…
Mar 8, 2024
c005549
Create DGP for sample selection in datasets.py
Mar 9, 2024
f90804d
Add default value tests for sample selection models
Mar 9, 2024
ecf6485
Add tests for SSM return types
Mar 10, 2024
26aab36
Remove z column from data returned in case of MAR
Mar 10, 2024
20ef6af
Fix formatting and change default number of CV folds to 5
Mar 10, 2024
ad19b97
Raise NotImplementedError for sequential conditional independence
Mar 10, 2024
209662e
Add exception tests for SSM
Mar 10, 2024
1f6e341
Change DGP for sample selection tests
Mar 10, 2024
315fe8b
Add binary outcome check for classifier mu and trim observations in _…
Mar 10, 2024
93ed42e
Remove unused imports and variables
Mar 10, 2024
d6e0af7
Change names of nuisance functions
Mar 17, 2024
fc3bf2d
Change score name 'mar' to 'missing-at-random' and set default ipw no…
Mar 17, 2024
b428015
Use _check_score from utilities and allow multiple instruments in cas…
Mar 17, 2024
6d29055
Refactor to use only one splitting procedure and correct ordering, re…
Mar 23, 2024
0e1969b
Rename selection to ssm
Mar 23, 2024
fb6a38f
Fix return type tests after renaming nuisance functions
Mar 24, 2024
7bd83d7
Add .coverage
Mar 27, 2024
6115f25
Save fitted models under nonignorable nonresponse
Mar 27, 2024
7dfeea8
Merge pull request #231 from mychaelka/causalweight_impl
SvenKlaassen Mar 28, 2024
1650d31
Merge branch 'main' into add-sample-selection-models
SvenKlaassen Apr 2, 2024
e04ca54
move and rename ssm.py
SvenKlaassen Apr 2, 2024
bb2e905
add dgps to irm conftest
SvenKlaassen Apr 2, 2024
ea6c3a9
move ssm tests and manual implementation to irm
SvenKlaassen Apr 2, 2024
ab8c142
reset test return_types to framework
SvenKlaassen Apr 2, 2024
49a5825
add returntypes tests for ssm
SvenKlaassen Apr 2, 2024
97024fd
remove dml procedure from ssm
SvenKlaassen Apr 2, 2024
5f01a88
remove apply_cross_fitting from ssm
SvenKlaassen Apr 2, 2024
26bfb25
update default and exception tests for ssm
SvenKlaassen Apr 2, 2024
36a967c
remove dml1 from utils_ssm
SvenKlaassen Apr 2, 2024
c7910f0
remove dml1 naming from return type tests
SvenKlaassen Apr 2, 2024
e0fb02e
fix format
SvenKlaassen Apr 2, 2024
8945e06
fix exception tests
SvenKlaassen Apr 3, 2024
eded2ed
add selection variable to dml data
SvenKlaassen Apr 16, 2024
a98f01f
change xcols setter
SvenKlaassen Apr 29, 2024
7952b03
Update double_ml_data.py
SvenKlaassen Apr 29, 2024
198b7cc
update make_ssm_data to sample selection indicator s
SvenKlaassen Apr 29, 2024
66a4431
update ssm to sample selection indicator s
SvenKlaassen Apr 29, 2024
b1c9448
extend ssm data tests
SvenKlaassen Apr 29, 2024
02c64a1
add tests for s_col_setter and disjoint sets
SvenKlaassen Apr 29, 2024
2d3db2f
remove sensitvity estimation (until theory is finished)
SvenKlaassen Apr 29, 2024
1f621dd
Update test_ssm_exceptions.py
SvenKlaassen Apr 29, 2024
9112848
Update ssm.py
SvenKlaassen Apr 29, 2024
b29eda9
add returntype tests for make_ssm_data
SvenKlaassen Apr 30, 2024
ba09bb4
Update test_ssm_exceptions.py
SvenKlaassen Apr 30, 2024
c9dc1e6
seperate disjoint check for t and s
SvenKlaassen Apr 30, 2024
a36fc28
simplify from_arrays classmethod
SvenKlaassen Apr 30, 2024
1103d48
extend dml_data test for t and s
SvenKlaassen Apr 30, 2024
5b09f66
add test for ssm tuning
SvenKlaassen Apr 30, 2024
a34a40f
Update test_dml_data.py
SvenKlaassen Apr 30, 2024
66f42b0
Update test_dml_data.py
SvenKlaassen Apr 30, 2024
e98547a
update docstring make_ssm_data
SvenKlaassen May 29, 2024
Binary file modified .coverage
Binary file not shown.
4 changes: 3 additions & 1 deletion doubleml/__init__.py
@@ -13,6 +13,7 @@
 from .irm.pq import DoubleMLPQ
 from .irm.lpq import DoubleMLLPQ
 from .irm.cvar import DoubleMLCVAR
+from .irm.ssm import DoubleMLSSM
 
 from .utils.blp import DoubleMLBLP
 from .utils.policytree import DoubleMLPolicyTree
@@ -32,6 +33,7 @@
 'DoubleMLLPQ',
 'DoubleMLCVAR',
 'DoubleMLBLP',
-'DoubleMLPolicyTree']
+'DoubleMLPolicyTree',
+'DoubleMLSSM']
 
 __version__ = get_distribution('doubleml').version
88 changes: 88 additions & 0 deletions doubleml/datasets.py
@@ -1345,3 +1345,91 @@ def treatment_effect(x):
'effects': te,
'treatment_effect': treatment_effect}
return res_dict


def make_ssm_data(n_obs=8000, dim_x=100, theta=1, mar=True, return_type='DoubleMLData'):
    """
    Generates data from a sample selection model (SSM).
    The data generating process is defined as

    .. math::

        y_i &= \\theta d_i + x_i' \\beta + u_i,

        s_i &= 1\\left\\lbrace d_i + \\gamma z_i + x_i' \\beta + v_i > 0 \\right\\rbrace,

        d_i &= 1\\left\\lbrace x_i' \\beta + w_i > 0 \\right\\rbrace,

    with :math:`y_i` being observed if :math:`s_i = 1` and covariates :math:`x_i \\sim \\mathcal{N}(0, \\Sigma_x)`, where
    :math:`\\Sigma_x` is a matrix with entries
    :math:`\\Sigma_{kj} = 0.5^{|j-k|}`.
    :math:`\\beta` is a `dim_x`-vector with entries :math:`\\beta_j=\\frac{0.4}{j^2}`,
    :math:`z_i \\sim \\mathcal{N}(0, 1)`,
    :math:`(u_i,v_i) \\sim \\mathcal{N}(0, \\Sigma_{u,v})`,
    :math:`w_i \\sim \\mathcal{N}(0, 1)`.
    Outcomes of unselected observations (:math:`s_i = 0`) are set to zero.

    The data generating process is inspired by a process used in the simulation study (see Appendix E) of Bia,
    Huber and Lafférs (2023).

    Parameters
    ----------
    n_obs :
        The number of observations to simulate.
    dim_x :
        The number of covariates.
    theta :
        The value of the causal parameter.
    mar :
        Boolean. Indicates whether missingness at random holds. If ``True``, :math:`\\gamma = 0` and
        :math:`u_i` and :math:`v_i` are uncorrelated; if ``False``, :math:`\\gamma = 1` and their correlation is 0.8.
    return_type :
        If ``'DoubleMLData'`` or ``DoubleMLData``, returns a ``DoubleMLData`` object.

        If ``'DataFrame'``, ``'pd.DataFrame'`` or ``pd.DataFrame``, returns a ``pd.DataFrame``.

        If ``'array'``, ``'np.ndarray'``, ``'np.array'`` or ``np.ndarray``, returns ``np.ndarray``'s ``(x, y, d, z, s)``.

        For the ``DoubleMLData`` and ``pd.DataFrame`` return types, the instrument ``z`` is only included
        if ``mar`` is ``False``.

    References
    ----------
    Michela Bia, Martin Huber & Lukáš Lafférs (2023) Double Machine Learning for Sample Selection Models,
    Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071
    """
    if mar:
        # Missing at random: independent errors and no effect of the instrument on selection.
        sigma = np.array([[1, 0], [0, 1]])
        gamma = 0
    else:
        # Nonignorable nonresponse: correlated errors and an instrument that shifts selection.
        sigma = np.array([[1, 0.8], [0.8, 1]])
        gamma = 1

    # e[0] is the selection error v_i, e[1] is the outcome error u_i.
    e = np.random.multivariate_normal(mean=[0, 0], cov=sigma, size=n_obs).T

    # Covariates with Toeplitz covariance, entries 0.5^{|j-k|}.
    cov_mat = toeplitz([np.power(0.5, k) for k in range(dim_x)])
    x = np.random.multivariate_normal(np.zeros(dim_x), cov_mat, size=[n_obs, ])

    beta = [0.4 / (k**2) for k in range(1, dim_x + 1)]

    d = np.where(np.dot(x, beta) + np.random.randn(n_obs) > 0, 1, 0)
    z = np.random.randn(n_obs)
    s = np.where(np.dot(x, beta) + d + gamma * z + e[0] > 0, 1, 0)

    y = np.dot(x, beta) + theta * d + e[1]
    # The outcome is only observed for selected units.
    y[s == 0] = 0

    if return_type in _array_alias:
        return x, y, d, z, s
    elif return_type in _data_frame_alias + _dml_data_alias:
        x_cols = [f'X{i + 1}' for i in np.arange(dim_x)]
        if mar:
            data = pd.DataFrame(np.column_stack((x, y, d, s)),
                                columns=x_cols + ['y', 'd', 's'])
        else:
            data = pd.DataFrame(np.column_stack((x, y, d, z, s)),
                                columns=x_cols + ['y', 'd', 'z', 's'])
        if return_type in _data_frame_alias:
            return data
        else:
            if mar:
                return DoubleMLData(data, 'y', 'd', x_cols, None, None, 's')
            return DoubleMLData(data, 'y', 'd', x_cols, 'z', None, 's')
    else:
        raise ValueError('Invalid return_type.')
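
To illustrate what the `mar` flag changes in the data generating process, the following is a minimal NumPy-only sketch, not the library function. The names `simulate_ssm` and `selected_error_mean` are hypothetical, the sketch uses `np.random.default_rng` with a fixed seed instead of the global `np.random` state, and `dim_x` is reduced to 5 to keep it fast. It compares the mean of the outcome error `u` among selected units (`s == 1`) under both branches.

```python
import numpy as np


def simulate_ssm(n_obs=20000, dim_x=5, theta=1.0, mar=True, seed=0):
    """Hypothetical NumPy-only re-implementation of the DGP (illustration only)."""
    rng = np.random.default_rng(seed)

    # Error correlation and instrument coefficient follow the two branches of make_ssm_data.
    rho, gamma = (0.0, 0.0) if mar else (0.8, 1.0)
    sigma = np.array([[1.0, rho], [rho, 1.0]])
    # e[0] is the selection error v, e[1] is the outcome error u.
    e = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma, size=n_obs).T

    # Toeplitz covariance with entries 0.5 ** |j - k|.
    idx = np.arange(dim_x)
    cov_mat = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(dim_x), cov_mat, size=n_obs)
    beta = 0.4 / (np.arange(1, dim_x + 1) ** 2)

    d = (x @ beta + rng.standard_normal(n_obs) > 0).astype(int)
    z = rng.standard_normal(n_obs)
    s = (x @ beta + d + gamma * z + e[0] > 0).astype(int)
    y = x @ beta + theta * d + e[1]
    return x, y, d, z, s, e[1]


def selected_error_mean(mar):
    """Mean of the outcome error u among selected units (s == 1)."""
    *_, s, u = simulate_ssm(mar=mar)
    return u[s == 1].mean()


bias_mar = selected_error_mean(mar=True)
bias_nonignorable = selected_error_mean(mar=False)
print(f"mean of u given s=1, MAR:          {bias_mar:.3f}")
print(f"mean of u given s=1, nonignorable: {bias_nonignorable:.3f}")
```

Under the MAR branch the selected-sample mean of `u` should be close to zero, since selection is independent of the outcome error. Under the nonignorable branch it is clearly positive, because units with a large `v` are more likely to be selected and `u` is positively correlated with `v`. This is the reason the nonignorable case needs the instrument `z`.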