Implementation of sample selection estimators #231


Merged

Conversation


@mychaelka mychaelka commented Feb 25, 2024

Description

This PR contains an implementation of two estimators of sample selection models from Michela Bia, Martin Huber & Lukáš Lafférs (2023), "Double Machine Learning for Sample Selection Models", Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071: identification under missingness at random and identification under nonignorable nonresponse. Basic tests on simulated data are included; for testing, the file conftest.py was also modified to include the DGP for these models. The original implementation of these estimators is available in the R causalweight package (https://cran.r-project.org/web/packages/causalweight/index.html).

Reference to Issues or PRs

None

Comments

These estimators require a sample selection indicator to be present in the data (1 if the outcome is observed, 0 otherwise). The DoubleMLData interface does not provide a selection indicator yet, so the implementation uses the time indicator t in its place. The third estimator in the paper (identification under sequential conditional independence) is not implemented yet, because it requires the covariates to be split into two parts -- observed pre-treatment and observed post-treatment -- which would mean changing the implementation of DoubleMLData.
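
For readers of this PR, a minimal sketch of how the pieces described above fit together. The data layout and the learner names ml_mu, ml_pi and ml_p are taken from this PR; the exact DoubleMLS constructor signature is an assumption and is therefore left as a comment.

import numpy as np
import pandas as pd
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

# Toy data: outcome y, binary treatment d, covariates x1-x3 and a selection
# indicator s (1 if y is observed, 0 otherwise), as described in the comment above.
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=(n, 3))
d = rng.binomial(1, 0.5, size=n)
s = rng.binomial(1, 0.7, size=n)
y = x @ np.array([0.5, 0.2, -0.1]) + d + rng.normal(size=n)
y[s == 0] = 0.0  # unobserved outcomes set to a finite placeholder for this toy frame
df = pd.DataFrame(np.column_stack((y, d, s, x)),
                  columns=['y', 'd', 's', 'x1', 'x2', 'x3'])

# The selection indicator is passed via the time column t, as this PR does for now.
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d',
                            x_cols=['x1', 'x2', 'x3'], t_col='s')

# Nuisance learners follow the naming used in the PR: ml_mu (outcome regression),
# ml_pi (selection propensity), ml_p (treatment propensity).
ml_mu = RandomForestRegressor()
ml_pi = LogisticRegression()
ml_p = LogisticRegression()
# dml_sel = DoubleMLS(dml_data, ml_mu=ml_mu, ml_pi=ml_pi, ml_p=ml_p)  # signature assumed
# dml_sel.fit()
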

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

'preds': np.full(shape=self._dml_data.n_obs, fill_value=np.nan)
}
mu_hat_d0 = copy.deepcopy(mu_hat_d1)
pi_hat = copy.deepcopy(mu_hat_d1)

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to 'pi_hat' is unnecessary as it is [redefined](1) before this value is used.
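
One way the alert could be addressed (a sketch only, not the PR's code) is to build each prediction container independently, so that pi_hat is assigned exactly once, right before it is used:

import numpy as np

# Hypothetical helper: each nuisance gets its own freshly built container,
# so no assignment is immediately overwritten.
def init_pred_container(n_obs):
    return {'preds': np.full(shape=n_obs, fill_value=np.nan)}

# mu_hat_d1 = init_pred_container(self._dml_data.n_obs)
# mu_hat_d0 = init_pred_container(self._dml_data.n_obs)
# pi_hat = init_pred_container(self._dml_data.n_obs)  # assigned once, where it is needed
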
from .double_ml_score_mixins import LinearScoreMixin


class DoubleMLS(LinearScoreMixin, DoubleML):

Check warning

Code scanning / CodeQL

Conflicting attributes in base classes

Base classes have conflicting values for attribute '_score_element_names': [Property _score_element_names](1) and [Property _score_element_names](2).
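
A sketch of one way to silence this alert: define the property explicitly on the subclass so the resolution between the two base classes is no longer ambiguous. The element names 'psi_a' and 'psi_b' are assumed to be what LinearScoreMixin expects for a linear score; this is not the PR's code.

from doubleml.double_ml import DoubleML
from doubleml.double_ml_score_mixins import LinearScoreMixin

class DoubleMLS(LinearScoreMixin, DoubleML):
    # Explicit override removes the ambiguity CodeQL flags.
    @property
    def _score_element_names(self):
        return ['psi_a', 'psi_b']
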
draw_sample_splitting,
apply_cross_fitting)

self._external_predictions_implemented = False

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _external_predictions_implemented, which was previously defined in superclass [DoubleML](1).
apply_cross_fitting)

self._external_predictions_implemented = False
self._sensitivity_implemented = True

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _sensitivity_implemented, which was previously defined in superclass [DoubleML](1).
Comment on lines 151 to 154
self._learner = {'ml_mu': clone(ml_mu),
'ml_pi': clone(ml_pi),
'ml_p': clone(ml_p),
}

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class

Assignment overwrites attribute _learner, which was previously defined in superclass [DoubleML](1).
@SvenKlaassen
Member

Thank you very much for the contribution to the package.

I think the idea of using the time variable t as a sample selection indicator is fine.
Maybe we can move the PR to a new branch, and I can implement an optional argument in the DoubleMLData class before we merge this onto main.

@SvenKlaassen SvenKlaassen changed the base branch from main to add-sample-selection-models February 29, 2024 07:25
@mychaelka
Author

> Thank you very much for the contribution to the package.
>
> I think the idea of using the time variable t as a sample selection indicator is fine. Maybe we can move the PR to a new branch, and I can implement an optional argument in the DoubleMLData class before we merge this onto main.

Thank you Sven. There still remains one estimator in the sample selection paper (identification under sequential conditional independence) that I have not yet implemented, as mentioned in the comments. This estimator accounts for the covariates X being observed both pre-treatment and post-treatment. Would it be possible to also add such an (optional) distinction to DoubleMLData?

@SvenKlaassen
Member

I guess it would be possible. I will check out the paper and try to come up with a solution.
Today I merged the PR to restructure the package. I would like to include the sample selection models in a separate folder (like irm, did, etc.). Maybe you have a good suggestion for a short name?

@mychaelka
Author

I was actually thinking about that, since I don't really like the class name I am using now (DoubleMLS). Maybe we could change it to DoubleMLSEL, and the short folder name could be "sel"? What do you think?

@SvenKlaassen
Member

Both DoubleMLSEL and sel are fine for me.
Another option would be ssm for sample selection model.

I have thought a bit about the additional implementation of the covariates $X$ and $M$ for identification under sequential conditional independence and think this would fit better with a larger rewrite of the DoubleMLData class.
I was already thinking about an option to distinguish between covariates which are confounders and covariates which only affect the outcome.
My suggestion would be to focus on the implementation of missing at random and nonignorable nonresponse, and in an additional branch I will try to extend DoubleMLData to be able to handle multiple categories of covariates. But this might take some time.
On the branch for the current selection models I will add an option s for the sample selection indicator, similar to t for DiD.
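
To make the proposal above concrete, a purely hypothetical sketch of what such an option could look like from the user's side. The argument name s_col is an assumption mirroring t_col, not an existing part of DoubleMLData at the time of this PR, so the call is left as a comment.

import doubleml as dml

# df: a data frame with outcome y, treatment d, covariates x1-x3 and a
# selection indicator s (1 = outcome observed).
# dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d',
#                             x_cols=['x1', 'x2', 'x3'],
#                             s_col='s')  # hypothetical sample selection option, analogous to t_col
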

@mychaelka
Author

ssm also sounds good. I agree that for now we can focus on the first two estimators and leave the identification under sequential conditional independence until you have time to extend DoubleMLData. I will have some time to work on it this weekend, so I will look at the issues found by CodeQL and at the open points you mentioned by email (exception tests, example notebooks, etc.).

@mychaelka
Author

@SvenKlaassen I added model defaults and return type tests for the sample selection models to the existing files and created a new file with exception tests. I am also working on example notebooks with both simulated and real data.
I can see that there is some issue with the Codacy analysis, but I'm not sure whether I can do anything about it.

@SvenKlaassen
Member

Sorry for the late reply.
Thank you for all the updates.
I can resolve the Codacy issue later (or after we merge this onto the DoubleML branch).
Could you reply to my comments where you changed something? That would make it a bit easier for me to retrace the changes.
In particular, the comment regarding the ordering and variance estimation -- this is not changed yet, right? This might also help with the results in the example notebooks.

@mychaelka
Author

mychaelka commented Mar 15, 2024

Sorry, I couldn't see any comments on the code, nor did I receive any notification. Until now I thought you had been busy and did not have time to go through the code, so I was working on the things you mentioned by email. I still cannot see any comments on the code, though...

@SvenKlaassen
Member

I mean the comments in this PR. I am not sure why you are not able to see them.
We could also do a short call and discuss any open points.

@mychaelka
Author

mychaelka commented Mar 15, 2024

I can only see the comments in this conversation. I tried looking under changed files and commits, but I can only see the warnings from Codacy there. We can definitely do a short call -- I am currently away but will be available again today around 4pm. I will also be available during the weekend and for most of next week.

@SvenKlaassen
Member

Sorry, it was completely my fault. I forgot to submit the review...
Not all of the comments are still valid.

If some points need discussion, we can still talk (just ask via mail).

@mychaelka
Author

Thank you, I will try to go over the comments this weekend. I can already see that some of them will be a big help with the parts I was struggling with.

@mychaelka
Author

@SvenKlaassen I refactored the code to use only one sample splitting procedure for the nested estimation. I also fixed the ordering -- the order of the predictions should now match the input data. If you come across any other issue (or find that the refactored code still has bugs), please let me know.
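
As a generic illustration of the ordering fix described above (not the PR's actual code): writing out-of-fold predictions back into the positions given by each fold's test indices keeps the prediction vector aligned with the original row order of the data.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=200)

preds = np.full(y.shape[0], np.nan)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = clone(LinearRegression()).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])  # indexed assignment preserves input order

assert not np.isnan(preds).any()
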

@SvenKlaassen SvenKlaassen left a comment
Member


Looks really good. Only very small comments.
Once the models are included, we can merge this and I can then start to update the documentation.
I might also start to add external predictions and sensitivity analysis before merging it onto main.

@SvenKlaassen
Member

I will merge the PR onto the extra branch and try to merge it onto main later.
This might take some time, as I am quite busy right now and first have to include some changes to align the implementation with the new framework setting.
I will also update the documentation, so I might ask you in the future to check whether you are satisfied with the description (if you have a notebook for the example section, I will happily include it).

@SvenKlaassen SvenKlaassen marked this pull request as ready for review March 27, 2024 19:33
@SvenKlaassen SvenKlaassen merged commit 7dfeea8 into DoubleML:add-sample-selection-models Mar 28, 2024
@SvenKlaassen SvenKlaassen mentioned this pull request Mar 28, 2024
@mychaelka
Author

Of course, thank you. If you need any help with the documentation or anything else, please let me know. I already have some example notebooks, but I have to clean them up first (so far they have only been experimental).

@mychaelka mychaelka deleted the causalweight_impl branch March 28, 2024 08:15