Add sample selection models #235

SvenKlaassen · 2024-03-28T06:34:06Z

Description

This PR implements two estimators for sample selection models from Michela Bia, Martin Huber & Lukáš Lafférs (2023), Double Machine Learning for Sample Selection Models, Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071: identification under missingness at random (MAR) and under nonignorable nonresponse. It also adds basic tests on simulated data; for these, conftest.py was modified to include the DGP for both models. The original implementation of these estimators is available in the R causalweight package (https://cran.r-project.org/web/packages/causalweight/index.html).
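For readers unfamiliar with the MAR case, the following is a minimal sketch of how a doubly robust score for the average treatment effect could be evaluated. It is not the code added in this PR. It assumes the score has the augmented inverse-probability-weighting form written in the docstring (please check it against the paper), and it plugs in the true nuisance functions of a toy DGP instead of cross-fitted ML learners. All names (`mar_score_b`, `simulate_and_estimate`, the DGP coefficients) are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mar_score_b(y, d, s, g1, g0, m, pi1, pi0):
    """Assumed MAR score term for one observation:

    d*s*(y - g1)/(m*pi1) - (1-d)*s*(y - g0)/((1-m)*pi0) + g1 - g0

    with g_d = E[Y | D=d, S=1, X], m = P(D=1 | X), pi_d = P(S=1 | D=d, X).
    """
    treated = d * s * (y - g1) / (m * pi1)
    control = (1 - d) * s * (y - g0) / ((1 - m) * pi0)
    return treated - control + g1 - g0


def simulate_and_estimate(n=20000, theta=1.0, seed=123):
    """Simulate a toy MAR design and average the score with oracle nuisances."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        m = sigmoid(0.5 * x)                  # treatment propensity
        d = 1 if rng.random() < m else 0
        pi1 = sigmoid(1.0 + 0.5 * x)          # selection probability if treated
        pi0 = sigmoid(0.5 + 0.5 * x)          # selection probability if untreated
        s = 1 if rng.random() < (pi1 if d == 1 else pi0) else 0
        y = theta * d + x + rng.gauss(0.0, 1.0)
        # The outcome is unobserved when s == 0; the score multiplies it by s,
        # so any placeholder value works there.
        y_used = y if s == 1 else 0.0
        g1, g0 = theta + x, x                 # oracle outcome regressions
        scores.append(mar_score_b(y_used, d, s, g1, g0, m, pi1, pi0))
    theta_hat = sum(scores) / n
    variance = sum((psi - theta_hat) ** 2 for psi in scores) / n
    return theta_hat, math.sqrt(variance / n)


theta_hat, se = simulate_and_estimate()
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```

With oracle nuisances the estimate should land close to the true effect of 1.0. In practice the nuisance functions would be estimated with cross-fitting, which this sketch deliberately leaves out.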

Reference to Issues or PRs

Implemented by @mychaelka. Original PR: #231

Comments

These estimators require a sample selection indicator in the data (1 if the outcome is observed, 0 otherwise). The DoubleMLData interface does not have a selection indicator available yet, so the implementation uses the time indicator t in its place. The third estimator in the paper (identification under sequential conditional independence) is not implemented yet: it requires the covariates to be split into two parts, observed pre-treatment and observed post-treatment, which would mean changing the implementation of DoubleMLData.
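To make the role of the selection indicator concrete, here is a small sketch of the data layout it implies. The helper names `check_selection_indicator` and `observed_outcomes` are hypothetical and are not part of DoubleMLData; they only illustrate the convention that the outcome is used where the indicator is 1 and ignored where it is 0.

```python
import math


def check_selection_indicator(s):
    """Raise ValueError unless every entry of s is 0 or 1 (hypothetical helper)."""
    bad = sorted(set(s) - {0, 1})
    if bad:
        raise ValueError(f"Selection indicator must be binary (0/1), got {bad}.")


def observed_outcomes(y, s):
    """Return the outcomes with s == 1; rows with s == 0 count as unobserved."""
    check_selection_indicator(s)
    return [y_i for y_i, s_i in zip(y, s) if s_i == 1]


# Toy data: the outcome y is missing (NaN) exactly where s == 0.
y = [2.3, math.nan, 0.7, math.nan, 1.9]
d = [1, 0, 0, 1, 1]
s = [1, 0, 1, 0, 1]

y_obs = observed_outcomes(y, s)
share_observed = sum(s) / len(s)
print(y_obs, share_observed)
```

Treatment d and the covariates are still observed for every row, which is what lets the selection probability be modelled on the full sample.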

Additional Changes

Remove apply_crossfitting and dml_procedure
Update to use DoubleMLFramework class
Add selection indicator to DoubleMLData
Implement external predictions (not yet)
Implement sensitivity analysis (not yet)

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

The title of the pull request summarizes the changes made.
The PR contains a detailed description of all changes and additions.
References to related issues or PRs are added.
The code passes all (unit) tests.
Enhancements or new features are equipped with unit tests.
The changes adhere to the PEP8 standards.


SvenKlaassen · 2024-04-30T09:32:23Z

@mychaelka I have updated the data class with a selection indicator s and extended the unit tests.
Further, I have added sampling stratification for both scores at the top of the class definition:

doubleml-for-py/doubleml/irm/ssm.py

Line 138 in 5b09f66

    
           self._strata = self._dml_data.d.reshape(-1, 1) + 2 * self._dml_data.s.reshape(-1, 1)

Do you think the stratification by treatment and selection variable is fine for both scores?
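As an illustration of what this expression encodes: with binary d and s, d + 2 * s maps each (treatment, selection) pair to one of four strata labels 0 to 3. The sketch below uses plain Python lists rather than NumPy arrays, and the round-robin fold assignment is an illustrative stand-in, not the library's resampling code. The names `strata_labels` and `stratified_folds` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def strata_labels(d, s):
    """Combine binary treatment d and selection s into a label in {0, 1, 2, 3}."""
    return [d_i + 2 * s_i for d_i, s_i in zip(d, s)]


def stratified_folds(labels, n_folds, seed=0):
    """Assign observations to folds so that each stratum is spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_stratum.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this stratum out to the folds in turn.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_folds
    return fold_of


rng = random.Random(42)
d = [rng.randint(0, 1) for _ in range(1000)]
s = [rng.randint(0, 1) for _ in range(1000)]

labels = strata_labels(d, s)
folds = stratified_folds(labels, n_folds=5)
counts = Counter(zip(labels, folds))  # (stratum, fold) -> number of observations
```

Stratifying on the combined label keeps all four (d, s) cells represented in every fold, which matters when one cell, for example treated units with unobserved outcomes, is small.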

I would like to update the documentation over the next few weeks. Can you send me your simulation notebook (via mail if possible)?
I am sorry that the changes were so slow; I was quite busy over the last month.

mychaelka · 2024-05-02T07:51:26Z

@SvenKlaassen thank you, and yes, the stratification should be fine for both. I could send the notebook(s) tomorrow, but right now they contain only some simulations with a few comments. Over the weekend I can adjust them to look similar to the example notebooks you already have available, and send the final version next week.
And no need to apologize for being busy, I am in a similar situation right now :)

SvenKlaassen · 2024-05-02T10:42:44Z

Thank you.
A slightly adjusted version would be great, but you can take your time. It doesn't need to be next week.

Michaela Kecskésová added 30 commits February 9, 2024 10:25

Move sample selection model to double_ml_for_py from templates

7a29196

Move sample selection model to double_ml_for_py from templates

f5e2a78

Nuisance estimation for MAR sample selection

5ce921f

Add pi and p propensity scores estimation

98cde59

Fix estimation of nuisance functions

254a1dc

fix estimation error caused by S column

65ac448

Fix issues with wrong sampling

0c7705f

Start working on nonignorable nonresponse

6c0a2e5

Save before rebasing

ba22c03

Implement estimator for selection under nonignorable nonresponse

64c04f9

Fix estimation errors in MAR and nonignorable nonresponse estimators and create new clean class

c443a96

Implement weights normalization and start writing tests

ce21cc0

Write initial tests for MAR

e45d1e4

Write initial tests for nonignorable nonresponse estimator

c9658c2

Remove unnecessary imports

694389f

Add DoubleMLS to __init__.py for relative imports

02029f6

Add minor comments and change sample size in simulated data in conftest.py

3a9be3c

Fix formatting according to PEP8

38a65fa

Change model name to DoubleMLSSM and remove line redefining pi_hat

83f8143

Raise warning instead of error when instrument is present and MAR selected

bf267fd

Create DGP for sample selection in datasets.py

c005549

Add default value tests for sample selection models

f90804d

Add tests for SSM return types

ecf6485

Remove z column from data returned in case of MAR

26aab36

Fix formatting and change default number of CV folds to 5

20ef6af

Raise NotImplementedError for sequential conditional independence

ad19b97

Add exception tests for SSM

209662e

Change DGP for sample selection tests

1f6e341

Add binary outcome check for classifier mu and trim observations in _utils_selection_manual

315fe8b

Remove unused imports and variables

93ed42e

SvenKlaassen added 6 commits April 2, 2024 13:17

update default and exception tests for ssm

26bfb25

remove dml1 from utils_ssm

36a967c

remove dml1 naming from return type tests

c7910f0

fix format

e0fb02e

fix exception tests

8945e06

remove tests for plot_tree()

add selection variable to dml data

eded2ed

github-advanced-security bot found potential problems Apr 16, 2024

doubleml/double_ml_data.py: 5 alerts, all marked Fixed

SvenKlaassen added 15 commits April 29, 2024 06:18

change xcols setter

a98f01f

Update double_ml_data.py

7952b03

update make_ssm_data to sample selection indicator s

198b7cc

update ssm to sample selection indicator s

66a4431

extend ssm data tests

b1c9448

add tests for s_col_setter and disjoint sets

02c64a1

remove sensitvity estimation (until theory is finished)

2d3db2f

Update test_ssm_exceptions.py

1f621dd

Update ssm.py

9112848

add returntype tests for make_ssm_data

b29eda9

Update test_ssm_exceptions.py

ba09bb4

seperate disjoint check for t and s

c9dc1e6

simplify from_arrays classmethod

a36fc28

extend dml_data test for t and s

1103d48

add test for ssm tuning

5b09f66

github-advanced-security bot found potential problems Apr 30, 2024

doubleml/irm/ssm.py: 1 alert, marked Dismissed

SvenKlaassen added 2 commits April 30, 2024 12:21

Update test_dml_data.py

a34a40f

Update test_dml_data.py

66f42b0

update docstring make_ssm_data

e98547a

SvenKlaassen merged commit f04fef0 into main May 29, 2024
11 checks passed
