
Added notebook and data for LGCP tutorial #4087


Merged: 6 commits into pymc-devs:master on Sep 13, 2020

Conversation

@ckrapu (Contributor) commented Sep 8, 2020

This PR refers to issue #4084 and adds a new notebook showing how to fit a log Gaussian Cox process using a discrete grid approximation. It also includes an extension to model marked points. A small dataset is included as well.
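
For readers skimming the thread, here is a minimal sketch of the kind of model the notebook fits; the window size, grid resolution, kernel, and prior values below are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pymc3 as pm

# Hypothetical point pattern on a 100 x 100 observation window.
xy = np.random.uniform(0, 100, size=(500, 2))

# Discrete grid approximation: count the events falling in each cell and
# model the counts as Poisson with a log-intensity given by a Gaussian
# process evaluated at the cell centroids.
edges = np.linspace(0, 100, 21)
counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
centers = (edges[:-1] + edges[1:]) / 2
cx, cy = np.meshgrid(centers, centers, indexing="ij")  # matches counts[i, j]
centroids = np.column_stack([cx.ravel(), cy.ravel()])
cell_area = np.diff(edges)[0] ** 2

with pm.Model() as lgcp_model:
    mean = pm.Normal("lam_mean", mu=0.0, sd=1.0)
    ls = pm.InverseGamma("length_scale", alpha=3.0, beta=20.0)
    cov = pm.gp.cov.Matern52(2, ls=ls)
    gp = pm.gp.Latent(cov_func=cov)
    log_intensity = gp.prior("log_intensity", X=centroids)
    pm.Poisson("counts", mu=cell_area * pm.math.exp(mean + log_intensity),
               observed=counts.ravel())
```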


codecov bot commented Sep 8, 2020

Codecov Report

Merging #4087 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4087   +/-   ##
=======================================
  Coverage   88.48%   88.48%           
=======================================
  Files          88       88           
  Lines       13865    13865           
=======================================
  Hits        12268    12268           
  Misses       1597     1597           

@junpenglao (Member) commented:

LGTM overall, some nitpicks: could you replace the nested for-loop count with a plt.hist2d? It should simplify the code a bit.

@AlexAndorra linked an issue on Sep 8, 2020 that may be closed by this pull request
@ckrapu (Contributor, Author) commented Sep 9, 2020

LGTM overall, some nitpicks: could you replace the nested for-loop count with a plt.hist2d? It should simplify the code a bit.

I didn't realize there was an easy way to do it in 2D. Good idea! I've made the update.
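
For illustration, this is roughly what the suggestion amounts to (the coordinates and bin count here are made up, not taken from the notebook):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical event coordinates; in the notebook these come from the dataset.
x = np.random.uniform(0, 100, size=500)
y = np.random.uniform(0, 100, size=500)

# Instead of a nested for-loop that counts points cell by cell,
# plt.hist2d bins the points and draws the counts in one call.
counts, xedges, yedges, image = plt.hist2d(x, y, bins=20)
plt.colorbar(image, label="events per cell");
```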

@AlexAndorra (Contributor) left a comment:

This is a super nice advanced use case of GPs, thanks @ckrapu!
I think there are just a few tweaks and clarifications missing, as I wrote below. It shouldn't take long to add, and feel free to ask if anything is unclear 😉

review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:23Z
----------------------------------------------------------------

Typo: "In conjunction with a Bayesian analysis, can be used to..." -- word missing?


ckrapu commented on 2020-09-10T00:20:16Z
----------------------------------------------------------------

Yeah, I missed the rest of that statement.

review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:24Z
----------------------------------------------------------------

I think you can use more regularizing priors to improve sampling efficiency -- this is especially true for GP hyperparameters.

pm.Normal("lam_mean", sd=10) and pm.Uniform("length_scale", lower=0.1, upper=280) in particular are super wide


ckrapu commented on 2020-09-10T00:21:58Z
----------------------------------------------------------------

Agreed - now that I think about it, an SD of 10 on the log scale is way too wide. I'm not sure how to pick the length scale, though. I've tried using HalfNormal(sd=100) and various settings for an inverse-Gamma prior. None of those seemed to help cut down on the divergences. The rule of thumb I was taught was to make sure that the upper bound is less than the largest interpoint distance. Any ideas here?
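
For anyone following along, a hedged sketch of the kind of change being discussed; the distributions and parameter values are illustrative guesses, not the ones the notebook settled on.

```python
import pymc3 as pm

with pm.Model():
    # Original, very wide priors:
    # lam_mean = pm.Normal("lam_mean", sd=10)
    # length_scale = pm.Uniform("length_scale", lower=0.1, upper=280)

    # More regularizing alternatives: sd=1 on the log scale still allows
    # large intensity variation once exponentiated, and an InverseGamma
    # keeps the length scale away from both 0 and implausibly large values.
    lam_mean = pm.Normal("lam_mean", mu=0.0, sd=1.0)
    length_scale = pm.InverseGamma("length_scale", alpha=3.0, beta=20.0)
```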

review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:25Z
----------------------------------------------------------------

If you use return_inferencedata=True in pm.sample above, you don't need the model context here, and it'll make the calls to ArviZ functions drastically faster (pm.summary is actually just calling az.summary, so you can also call it directly).
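
A small sketch of the suggestion, assuming a model already defined as `model` (variable names are illustrative):

```python
import arviz as az
import pymc3 as pm

with model:
    # Returning an InferenceData object up front avoids converting the
    # trace every time an ArviZ function is called on it.
    trace = pm.sample(return_inferencedata=True)

# No model context is needed here, and az.summary can be called directly
# (pm.summary simply forwards to it).
az.summary(trace)
```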


review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:25Z
----------------------------------------------------------------

vars is deprecated, please use var_names
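
For example (a guess at the call in question; the exact function and variable names in the notebook may differ):

```python
# Deprecated keyword:
# ppc = pm.sample_posterior_predictive(trace, vars=[model["counts"]])

# Preferred keyword:
ppc = pm.sample_posterior_predictive(trace, var_names=["counts"])
```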


review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:26Z
----------------------------------------------------------------

I think this can be fixed by more regularizing priors upstream


review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:27Z
----------------------------------------------------------------

You can get rid of plt.tight_layout() and pass constrained_layout=True to plt.subplots(2, 3, figsize=(8, 5)) instead.

Also, add a ";" at the end of the last line of the cell to avoid matplotlib object printing in notebooks: plt.figlegend(...);


ckrapu commented on 2020-09-10T05:47:36Z
----------------------------------------------------------------

Got it!
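
A sketch of the combined suggestion; the plotting inside the loop is a placeholder for whatever the notebook actually draws:

```python
import matplotlib.pyplot as plt
import numpy as np

# constrained_layout replaces the separate plt.tight_layout() call.
fig, axes = plt.subplots(2, 3, figsize=(8, 5), constrained_layout=True)

for ax in axes.ravel():
    ax.plot(np.random.randn(50), label="posterior draw")  # placeholder

# The trailing semicolon suppresses the repr of the returned Legend object.
plt.figlegend(*axes[0, 0].get_legend_handles_labels());
```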

review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:28Z
----------------------------------------------------------------

Same comments on priors


review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:29Z
----------------------------------------------------------------

I think the last two lines deserve a bit more explanation, as this is not really a standard prior: IIUC, you're modeling the variation around the marks more than the marks themselves? If so, explaining why would be interesting.

Also, a line about what observed=mark_residuals is doing would be valuable for readers, as this is an advanced (and somewhat confusing) use of observed (mark_residuals are not actually observed).


ckrapu commented on 2020-09-10T05:48:25Z
----------------------------------------------------------------

This part didn't have to be expressed in a roundabout way; I cleaned that up.

AlexAndorra commented on 2020-09-10T08:06:25Z
----------------------------------------------------------------

Much clearer indeed, thanks!

review-notebook-app bot commented Sep 9, 2020

AlexAndorra commented on 2020-09-09T10:14:30Z
----------------------------------------------------------------

Same comment for model context



@AlexAndorra (Contributor) commented:

Thanks for the changes @ckrapu, this looks really neat now!

I'm not sure how to pick the length scale, though. I've tried using HalfNormal(sd=100) and various settings for an inverse-Gamma prior. None of those seemed to help cut down on the divergences. Any ideas here?

Yeah, that's always (very) tricky... HalfNormal(sd=100) seems already very wide, given that you exponentiate it. I think going with Gamma or InvGamma is indeed much safer, in addition to giving you more leeway to incorporate domain knowledge. I'm curious about any advice on this from the very knowledgeable @fonnesbeck and @bwengals!

Prior pred checks with spaghetti plots and ribbon plots can also be useful -- I'm doing exactly that in a NB right now; the code is still very messy and WIP, but hopefully this can help you 🤷‍♂️ There are also prior choices in the Mauna Loa NB.
Notice though that your changes already sped up the sampling by a factor of 2+!

The rule of thumb I was taught was to make sure that the upper bound is less than the largest interpoint distance.

Yeah but even that isn't that great IIUC, as 1) it picks priors based on the observed data, and 2) future data might imply larger interpoint distances.
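
As a concrete (and purely illustrative) example of the kind of prior predictive check mentioned above, one can plot the correlation decay implied by draws from a candidate length-scale prior; the kernel and parameter values here are assumptions, not the notebook's choices:

```python
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm

distances = np.linspace(0, 100, 200)

with pm.Model():
    pm.InverseGamma("length_scale", alpha=3.0, beta=20.0)  # candidate prior
    prior = pm.sample_prior_predictive(samples=50)

# Spaghetti plot: each curve is the correlation decay implied by one prior
# draw of the length scale (exponentiated-quadratic kernel for illustration).
for ls in prior["length_scale"]:
    plt.plot(distances, np.exp(-0.5 * (distances / ls) ** 2), color="C0", alpha=0.3)
plt.xlabel("distance between grid cells")
plt.ylabel("prior correlation");
```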

@MarcoGorelli (Contributor) commented:

Hi @ckrapu - a new CI check has been added; before your next commit, could you fetch and merge upstream/master? I'm pretty sure it will still pass, this is just to be sure.

@ckrapu (Contributor, Author) commented Sep 10, 2020

Notice though that your changes already sped up the sampling by a factor of 2+!

I spoke too soon about the priors. Those changes helped a ton with the marked model as you pointed out.

@ckrapu (Contributor, Author) commented Sep 10, 2020

Hi @ckrapu - a new CI check has been added; before your next commit, could you fetch and merge upstream/master? I'm pretty sure it will still pass, this is just to be sure.

Yup! Done.

@MarcoGorelli (Contributor) commented:

@ckrapu looks like some of the imports are out of order; can you make the following change, please?

+    "from itertools import product\n",
+    "\n",
     "import arviz as az\n",
     "import matplotlib.pyplot as plt\n",
     "import numpy as np\n",
     "import pandas as pd\n",
-    "import pymc3 as pm\n",
-    "\n",
-    "from itertools import product"
+    "import pymc3 as pm"


Or, just run isort via nbqa from the command-line, see the style guide:

nbqa isort docs/source/notebooks/log-gaussian-cox-process.ipynb --treat-comment-as-code '# %%' --lines-between-types 1 --nbqa-mutate

@AlexAndorra (Contributor) commented:

@MarcoGorelli is there a way to enforce this automatically in the GH action when the committer hasn't run nbqa (or, later, Black) on the NB?

@MarcoGorelli (Contributor) commented:

@MarcoGorelli is there a way to enforce this automatically in the GH action when the committer hasn't run nbqa (or, later, Black) on the NB?

As in, can we get the GitHub Action to push to the file? Maybe... I know that pre-commit's author is working on a paid product that would allow for that, but there may be a way to do this without it; I'll look into it.

@MarcoGorelli (Contributor) commented Sep 11, 2020

@AlexAndorra looks like it's not ready yet (https://pre-commit.ci/)... hopefully it'll be available for free for open-source projects. In the meantime I guess we can ask contributors to run this themselves, or a maintainer can add a commit to their branch.

@AlexAndorra (Contributor) commented:

As in, can we get the GitHub Action to push to the file?

Yes, as in: contributor pushes to PR --> GH action checking imports/Black on NB fails --> GH action runs nbqa itself and then pushes the changes automatically. In other words, all the formatting / style is taken care of automatically 🎉

looks like it's not ready yet https://pre-commit.ci/ ...hopefully it'll be available for free for open source projects

That'd be so awesome!

In the meantime I guess we can ask contributors to run this themselves

Definitely -- that's already what we do so it's actually no change

@ckrapu (Contributor, Author) commented Sep 12, 2020

The import order is fixed. Should we assume that the import order desired by nbqa is the default moving forward?

@MarcoGorelli (Contributor) commented:

Thanks for updating!

the import order desired by nbqa

Just to be clear - it's isort that determines the import order; nbqa is just a little adapter that lets it run on Jupyter notebooks (instead of Python files).

@AlexAndorra (Contributor) commented:

Should we assume that the import order desired by nbqa is the default moving forward?

Yep, but this was already the case (see the style guide) -- the only difference is that now it's checked automatically in CI 😉

@AlexAndorra (Contributor) left a comment:

Merging now, as we gave it a few days 🎉 Thanks @ckrapu for this nice addition to the docs!

@AlexAndorra merged commit 04ebd59 into pymc-devs:master on Sep 13, 2020
@ckrapu (Contributor, Author) commented Oct 27, 2020

This may be a dumb question, but I don't see a thumbnail for this notebook in the gallery. Did I forget to set something with this PR?

@twiecki (Member) commented Oct 27, 2020

@ckrapu it should be the last plot of the NB. @ColCarroll?

@MarcoGorelli (Contributor) commented:

@ckrapu are you referring to the "Gaussian Process" section here https://docs.pymc.io/nb_examples/index.html ?

If so, it seems that "conditional-autoregressive-model" is also missing

@ckrapu (Contributor, Author) commented Oct 28, 2020

Yup - I thought that putting its name in docs/source/notebooks/table_of_contents_examples.js would automatically place it in the gallery.

Successfully merging this pull request may close these issues:

Adding doc notebook for log Gaussian Cox process