Implementation of sample selection estimators #231


Merged

Conversation


@mychaelka mychaelka commented Feb 25, 2024

Description

This PR contains an implementation of two estimators of sample selection models from Michela Bia, Martin Huber & Lukáš Lafférs (2023), "Double Machine Learning for Sample Selection Models", Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071: identification under missingness at random and identification under nonignorable nonresponse. Basic tests on simulated data are included; for testing, the file conftest.py was also modified to include the DGP for these models. The original implementation of these estimators is available in the R causalweight package (https://cran.r-project.org/web/packages/causalweight/index.html).

Reference to Issues or PRs

None

Comments

These estimators require a sample selection indicator to be present in the data (1 if the outcome is observed, 0 otherwise). The DoubleMLData interface does not provide a selection indicator yet, so the implementation uses the time indicator t in its place. The third estimator in the paper (identification under sequential conditional independence) is not implemented yet, because it requires the covariates to be split into two parts -- observed pre-treatment and observed post-treatment -- which would mean changing the implementation of DoubleMLData.
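
For readers of this PR, a minimal sketch of how the pieces described above fit together. The data layout and the learner names ml_mu, ml_pi and ml_p are taken from this PR; the exact DoubleMLS constructor signature is an assumption and is therefore left as a comment.

import numpy as np
import pandas as pd
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

# Toy data: outcome y, binary treatment d, covariates x1-x3 and a selection
# indicator s (1 if y is observed, 0 otherwise), as described in the comment above.
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=(n, 3))
d = rng.binomial(1, 0.5, size=n)
s = rng.binomial(1, 0.7, size=n)
y = x @ np.array([0.5, 0.2, -0.1]) + d + rng.normal(size=n)
y[s == 0] = 0.0  # unobserved outcomes set to a finite placeholder for this toy frame
df = pd.DataFrame(np.column_stack((y, d, s, x)),
                  columns=['y', 'd', 's', 'x1', 'x2', 'x3'])

# The selection indicator is passed via the time column t, as this PR does for now.
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d',
                            x_cols=['x1', 'x2', 'x3'], t_col='s')

# Nuisance learners follow the naming used in the PR: ml_mu (outcome regression),
# ml_pi (selection propensity), ml_p (treatment propensity).
ml_mu = RandomForestRegressor()
ml_pi = LogisticRegression()
ml_p = LogisticRegression()
# dml_sel = DoubleMLS(dml_data, ml_mu=ml_mu, ml_pi=ml_pi, ml_p=ml_p)  # signature assumed
# dml_sel.fit()
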

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

'preds': np.full(shape=self._dml_data.n_obs, fill_value=np.nan)
}
mu_hat_d0 = copy.deepcopy(mu_hat_d1)
pi_hat = copy.deepcopy(mu_hat_d1)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to 'pi_hat' is unnecessary as it is [redefined](1) before this value is used.
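
One way the alert could be addressed (a sketch only, not the PR's code) is to build each prediction container independently, so that pi_hat is assigned exactly once, right before it is used:

import numpy as np

# Hypothetical helper: each nuisance gets its own freshly built container,
# so no assignment is immediately overwritten.
def init_pred_container(n_obs):
    return {'preds': np.full(shape=n_obs, fill_value=np.nan)}

# mu_hat_d1 = init_pred_container(self._dml_data.n_obs)
# mu_hat_d0 = init_pred_container(self._dml_data.n_obs)
# pi_hat = init_pred_container(self._dml_data.n_obs)  # assigned once, where it is needed
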
from .double_ml_score_mixins import LinearScoreMixin


class DoubleMLS(LinearScoreMixin, DoubleML):

Check warning

Code scanning / CodeQL

Conflicting attributes in base classes

Base classes have conflicting values for attribute '_score_element_names': [Property _score_element_names](1) and [Property _score_element_names](2).
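
A sketch of one way to silence this alert: define the property explicitly on the subclass so the resolution between the two base classes is no longer ambiguous. The element names 'psi_a' and 'psi_b' are assumed to be what LinearScoreMixin expects for a linear score; this is not the PR's code.

from doubleml.double_ml import DoubleML
from doubleml.double_ml_score_mixins import LinearScoreMixin

class DoubleMLS(LinearScoreMixin, DoubleML):
    # Explicit override removes the ambiguity CodeQL flags.
    @property
    def _score_element_names(self):
        return ['psi_a', 'psi_b']
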
draw_sample_splitting,
apply_cross_fitting)

self._external_predictions_implemented = False

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _external_predictions_implemented, which was previously defined in superclass [DoubleML](1).
apply_cross_fitting)

self._external_predictions_implemented = False
self._sensitivity_implemented = True

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _sensitivity_implemented, which was previously defined in superclass [DoubleML](1).
Comment on lines 151 to 154
self._learner = {'ml_mu': clone(ml_mu),
'ml_pi': clone(ml_pi),
'ml_p': clone(ml_p),
}

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _learner, which was previously defined in superclass [DoubleML](1).
@SvenKlaassen
Member

Thank you very much for the contribution to the package.

I think the idea of using the time variable t as a sample selection indicator is fine.
Maybe we can move the PR to a new branch, and I can implement an optional argument in the DoubleMLData class before we merge this onto main.

@SvenKlaassen SvenKlaassen changed the base branch from main to add-sample-selection-models February 29, 2024 07:25
@mychaelka
Author

> Thank you very much for the contribution to the package.
>
> I think the idea of using the time variable t as a sample selection indicator is fine. Maybe we can move the PR to a new branch, and I can implement an optional argument in the DoubleMLData class before we merge this onto main.

Thank you Sven. There still remains one estimator in the sample selection paper (identification under sequential conditional independence) that I have not yet implemented, as mentioned in the comments. This estimator accounts for the covariates X being observed both pre-treatment and post-treatment. Would it be possible to also add such an (optional) distinction to DoubleMLData?

@SvenKlaassen
Member

I guess it would be possible. I will check out the paper and try to come up with a solution.
Today I merged the PR to restructure the package. I would like to include the sample selection models in a separate folder (like irm, did, etc.). Maybe you have a good suggestion for a short name?

@mychaelka
Author

I was actually thinking about that, since I don't really like the class name I am using now (DoubleMLS). Maybe we could change it to DoubleMLSEL, and the short folder name could be "sel"? What do you think?

@SvenKlaassen
Member

Both DoubleMLSEL and sel are fine for me.
Another option would be ssm for sample selection model.

I have thought a bit about the additional implementation of the covariates $X$ and $M$ for identification under sequential conditional independence and think this would fit better with a larger rewrite of the DoubleMLData class.
I was already thinking about an option to distinguish between covariates which are confounders and covariates which only affect the outcome.
My suggestion would be to focus on the implementation of missing at random and nonignorable nonresponse, and in an additional branch I will try to extend DoubleMLData to be able to handle multiple categories of covariates. But this might take some time.
On the branch for the current selection models I will add an option s for the sample selection indicator, similar to t for DiD.
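
To make the proposal above concrete, a purely hypothetical sketch of what such an option could look like from the user's side. The argument name s_col is an assumption mirroring t_col, not an existing part of DoubleMLData at the time of this PR, so the call is left as a comment.

import doubleml as dml

# df: a data frame with outcome y, treatment d, covariates x1-x3 and a
# selection indicator s (1 = outcome observed).
# dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d',
#                             x_cols=['x1', 'x2', 'x3'],
#                             s_col='s')  # hypothetical sample selection option, analogous to t_col
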

@mychaelka
Author

ssm also sounds good. I agree that for now we can focus on the first two estimators and leave the identification under sequential conditional independence until you have time to extend DoubleMLData. I will have some time to work on it this weekend, so I will look at the issues found by CodeQL and at the open points you mentioned by email (exception tests, example notebooks, etc.).

@mychaelka
Author

@SvenKlaassen I added model defaults and return type tests for the sample selection models to the existing files and created a new file with exception tests. I am also working on example notebooks with both simulated and real data.
I can see that there is some issue with the Codacy analysis, but I'm not sure whether I can do anything about it.

@SvenKlaassen
Member

Sorry for the late reply.
Thank you for all the updates.
I can resolve the Codacy issue later (or after we merge this onto the DoubleML branch).
Could you reply to my comments where you changed something? That would make it a bit easier for me to retrace the changes.
In particular, the comment regarding the ordering and variance estimation -- this is not changed yet, right? This might also help with the results in the example notebooks.

@mychaelka
Author

mychaelka commented Mar 15, 2024

Sorry, I couldn't see any comments on the code, nor did I receive any notification. Until now I thought you had been busy and did not have time to go through the code, so I was working on the things you mentioned by email. I still cannot see any comments on the code, though...

@SvenKlaassen
Member

I mean the comments in this PR. I am not sure why you are not able to see them.
We could also do a short call and discuss any open points.

@mychaelka
Author

mychaelka commented Mar 15, 2024

I can only see the comments in this conversation. I tried looking under changed files and commits, but I can only see the warnings from Codacy there. We can definitely do a short call -- I am currently away but will be available again today around 4pm. I will also be available during the weekend and for most of next week.

@SvenKlaassen
Member

Sorry, it was completely my fault. I forgot to submit the review...
Not all of the comments are still valid.

If some points need discussion, we can still talk (just ask via mail).

@mychaelka
Author

Thank you, I will try to go over the comments this weekend. I can already see that some of them will be a big help with the parts I was struggling with.

@mychaelka
Author

@SvenKlaassen I refactored the code to use only one sample splitting procedure for the nested estimation. I also fixed the ordering -- the order of the predictions should now match the input data. If you come across any other issue (or find that the refactored code still has bugs), please let me know.
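
As a generic illustration of the ordering fix described above (not the PR's actual code): writing out-of-fold predictions back into the positions given by each fold's test indices keeps the prediction vector aligned with the original row order of the data.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=200)

preds = np.full(y.shape[0], np.nan)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = clone(LinearRegression()).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])  # indexed assignment preserves input order

assert not np.isnan(preds).any()
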

@SvenKlaassen SvenKlaassen left a comment
Member


Looks really good. Only very small comments.
Once the models are included, we can merge this and I can then start to update the documentation.
I might also start to add external predictions and sensitivity analysis before merging it onto main.

@SvenKlaassen
Member

I will merge the PR onto the extra branch and try to merge it onto main later.
This might take some time, as I am quite busy right now and first have to include some changes to align the implementation with the new framework setting.
I will also update the documentation, so I might ask you in the future to check whether you are satisfied with the description (if you have a notebook for the example section, I will happily include it).

@SvenKlaassen SvenKlaassen marked this pull request as ready for review March 27, 2024 19:33
@SvenKlaassen SvenKlaassen merged commit 7dfeea8 into DoubleML:add-sample-selection-models Mar 28, 2024
@SvenKlaassen SvenKlaassen mentioned this pull request Mar 28, 2024
@mychaelka
Author

Of course, thank you. If you need any help with the documentation or anything else, please let me know. I already have some example notebooks, but I have to clean them up first (so far they have only been experimental).

@mychaelka mychaelka deleted the causalweight_impl branch March 28, 2024 08:15