
Commit 072bf44

Merge pull request #148 from StochasticTree/docs-update-0.1.1
Updating python package and docs
2 parents 8dc5a92 + 3906791 commit 072bf44

13 files changed: +90 -180 lines

NEWS.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# stochtree 0.1.1
+
+* Fixed initialization bug in several R package code examples for random effects models
+
 # stochtree 0.1.0
 
 * Initial release on CRAN.

README.md

Lines changed: 7 additions & 1 deletion
@@ -90,7 +90,13 @@ pip install matplotlib seaborn jupyterlab
 
 # R Package
 
-The package can be installed in R via
+The R package can be installed from CRAN via
+
+```
+install.packages("stochtree")
+```
+
+The development version of `stochtree` can be installed from Github via
 
 ```
 remotes::install_github("StochasticTree/stochtree", ref="r-dev")

demo/notebooks/causal_inference.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Causal Inference Demo Notebook"
+    "# Causal Inference"
    ]
   },
   {

demo/notebooks/causal_inference_feature_subsets.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Causal Inference with Feature Subsets Demo Notebook\n",
+    "# Causal Inference with Feature Subsets\n",
     "\n",
     "This is a duplicate of the main causal inference demo which shows how a user might decide to use only a subset of covariates in the treatment effect forest. \n",
     "Why might we want to do that? Well, in many cases it is plausible that some covariates (for example age, income, etc...) influence the outcome of interest \n",

demo/notebooks/heteroskedastic_supervised_learning.ipynb

Lines changed: 7 additions & 11 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Supervised Learning with Heteroskedasticity Demo Notebook"
+    "# Heteroskedastic Supervised Learning"
    ]
   },
   {
@@ -118,13 +118,6 @@
     "s_x_test = s_x[test_inds]\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Demo 1: Using `W` in a linear leaf regression"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -139,9 +132,12 @@
    "outputs": [],
    "source": [
     "bart_model = BARTModel()\n",
-    "bart_params = {'num_trees_mean': 100, 'num_trees_variance': 50, 'sample_sigma_global': True, 'sample_sigma_leaf': False}\n",
+    "global_params = {'sample_sigma2_global': True}\n",
+    "mean_params = {'num_trees': 100, 'sample_sigma2_leaf': False}\n",
+    "variance_params = {'num_trees': 50}\n",
     "bart_model.sample(X_train=X_train, y_train=y_train, X_test=X_test, basis_train=basis_train, basis_test=basis_test,\n",
-    "                  num_gfr=10, num_mcmc=100, params=bart_params)"
+    "                  num_gfr=10, num_mcmc=100, general_params=global_params, mean_forest_params=mean_params, \n",
+    "                  variance_forest_params=variance_params)"
    ]
   },
   {
@@ -171,7 +167,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "forest_preds_s_x_mcmc = bart_model.sigma_x_test\n",
+    "forest_preds_s_x_mcmc = np.sqrt(bart_model.sigma2_x_test)\n",
     "s_x_avg_mcmc = np.squeeze(forest_preds_s_x_mcmc).mean(axis = 1, keepdims = True)\n",
     "s_x_df_mcmc = pd.DataFrame(np.concatenate((np.expand_dims(s_x_test,1), s_x_avg_mcmc), axis = 1), columns=[\"True standard deviation\", \"Average estimated standard deviation\"])\n",
     "sns.scatterplot(data=s_x_df_mcmc, x=\"Average estimated standard deviation\", y=\"True standard deviation\")\n",

demo/notebooks/multivariate_treatment_causal_inference.ipynb

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Causal Inference with Multivariate Treatments Demo Notebook"
+    "# Multivariate Treatment Causal Inference"
    ]
   },
   {
@@ -45,7 +45,7 @@
     "rng = np.random.default_rng()\n",
     "\n",
     "# Generate covariates and basis\n",
-    "n = 5000\n",
+    "n = 500\n",
     "p_X = 5\n",
     "X = rng.uniform(0, 1, (n, p_X))\n",
     "pi_X = np.c_[0.25 + 0.5*X[:,0], 0.75 - 0.5*X[:,1]]\n",

demo/notebooks/prototype_interface.ipynb

Lines changed: 12 additions & 16 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Demo of the `StochTree` Prototype Interface"
+    "# Low-Level Interface"
    ]
   },
   {
@@ -106,7 +106,7 @@
     "rng = np.random.default_rng(random_seed)\n",
     "\n",
     "# Generate covariates and basis\n",
-    "n = 1000\n",
+    "n = 500\n",
     "p_X = 10\n",
     "p_W = 1\n",
     "X = rng.uniform(0, 1, (n, p_X))\n",
@@ -383,14 +383,14 @@
     "rng = np.random.default_rng(random_seed)\n",
     "\n",
     "# Generate covariates and basis\n",
-    "n = 1000\n",
+    "n = 500\n",
     "p_X = 5\n",
     "X = rng.uniform(0, 1, (n, p_X))\n",
-    "pi_X = 0.25 + 0.5*X[:,0]\n",
+    "pi_X = 0.35 + 0.3*X[:,0]\n",
     "Z = rng.binomial(1, pi_X, n).astype(float)\n",
     "\n",
     "# Define the outcome mean functions (prognostic and treatment effects)\n",
-    "mu_X = pi_X*5\n",
+    "mu_X = (pi_X - 0.5)*30\n",
     "# tau_X = np.sin(X[:,1]*2*np.pi)\n",
     "tau_X = X[:,1]*2\n",
     "\n",
@@ -423,24 +423,24 @@
     "min_samples_leaf_mu = 1\n",
     "num_trees_mu = 200\n",
     "cutpoint_grid_size_mu = 100\n",
-    "tau_init_mu = 1/200\n",
+    "tau_init_mu = 1/num_trees_mu\n",
     "leaf_prior_scale_mu = np.array([[tau_init_mu]], order='C')\n",
     "a_leaf_mu = 3.\n",
-    "b_leaf_mu = 1/200\n",
+    "b_leaf_mu = 1/num_trees_mu\n",
     "leaf_regression_mu = False\n",
     "feature_types_mu = np.repeat(0, p_X).astype(int) # 0 = numeric\n",
     "var_weights_mu = np.repeat(1/(p_X + 1), p_X + 1)\n",
     "\n",
     "# Treatment forest parameters\n",
-    "alpha_tau = 0.25\n",
+    "alpha_tau = 0.75\n",
     "beta_tau = 3.\n",
     "min_samples_leaf_tau = 1\n",
     "num_trees_tau = 50\n",
     "cutpoint_grid_size_tau = 100\n",
-    "tau_init_tau = 1/50\n",
+    "tau_init_tau = 1/num_trees_tau\n",
     "leaf_prior_scale_tau = np.array([[tau_init_tau]], order='C')\n",
     "a_leaf_tau = 3.\n",
-    "b_leaf_tau = 1/50\n",
+    "b_leaf_tau = 1/num_trees_tau\n",
     "leaf_regression_tau = True\n",
     "feature_types_tau = np.repeat(0, p_X).astype(int) # 0 = numeric\n",
     "var_weights_tau = np.repeat(1/p_X, p_X)\n",
@@ -466,7 +466,7 @@
    "source": [
     "# Prognostic Forest Dataset (covariates)\n",
     "dataset_mu = Dataset()\n",
-    "dataset_mu.add_covariates(np.c_[X,pi_X])\n",
+    "dataset_mu.add_covariates(np.c_[X, pi_X])\n",
     "\n",
     "# Treatment Forest Dataset (covariates and treatment variable)\n",
     "dataset_tau = Dataset()\n",
@@ -521,7 +521,7 @@
    "outputs": [],
    "source": [
     "num_warmstart = 10\n",
-    "num_mcmc = 500\n",
+    "num_mcmc = 100\n",
     "num_samples = num_warmstart + num_mcmc\n",
     "global_var_samples = np.concatenate((np.array([global_variance_init]), np.repeat(0, num_samples)))\n",
     "leaf_scale_samples_mu = np.concatenate((np.array([tau_init_mu]), np.repeat(0, num_samples)))\n",
@@ -562,8 +562,6 @@
     "    forest_sampler_tau.sample_one_iteration(forest_container_tau, active_forest_tau, dataset_tau, residual, cpp_rng, \n",
     "                                            feature_types_tau, cutpoint_grid_size_tau, leaf_prior_scale_tau, var_weights_tau, \n",
     "                                            0.0, 0.0, global_var_samples[i], 1, True, True, False)\n",
-    "    # leaf_scale_samples_tau[i+1] = leaf_var_model_tau.sample_one_iteration(forest_container_tau, cpp_rng, a_leaf_tau, b_leaf_tau)\n",
-    "    # leaf_prior_scale_tau[0,0] = leaf_scale_samples_tau[i+1]\n",
     "    tau_x = np.squeeze(active_forest_tau.predict_raw(dataset_tau))\n",
     "    s_tt0 = np.sum(tau_x*tau_x*(Z==0))\n",
     "    s_tt1 = np.sum(tau_x*tau_x*(Z==1))\n",
@@ -606,8 +604,6 @@
     "    forest_sampler_tau.sample_one_iteration(forest_container_tau, active_forest_tau, dataset_tau, residual, cpp_rng, \n",
     "                                            feature_types_tau, cutpoint_grid_size_tau, leaf_prior_scale_tau, var_weights_tau, \n",
     "                                            0.0, 0.0, global_var_samples[i], 1, True, False, False)\n",
-    "    # leaf_scale_samples_tau[i+1] = leaf_var_model_tau.sample_one_iteration(forest_container_tau, cpp_rng, a_leaf_tau, b_leaf_tau, i)\n",
-    "    # leaf_prior_scale_tau[0,0] = leaf_scale_samples_tau[i+1]\n",
     "    tau_x = np.squeeze(active_forest_tau.predict_raw(dataset_tau))\n",
     "    s_tt0 = np.sum(tau_x*tau_x*(Z==0))\n",
     "    s_tt1 = np.sum(tau_x*tau_x*(Z==1))\n",

demo/notebooks/serialization.ipynb

Lines changed: 20 additions & 3 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Serialization Demo Notebook"
+    "# Model Serialization"
    ]
   },
   {
@@ -29,6 +29,7 @@
    "source": [
     "import json\n",
     "import numpy as np\n",
+    "import os\n",
     "import pandas as pd\n",
     "import seaborn as sns\n",
     "import matplotlib.pyplot as plt\n",
@@ -120,7 +121,7 @@
    "outputs": [],
    "source": [
     "bart_model = BARTModel()\n",
-    "bart_model.sample(X_train=X_train, y_train=y_train, basis_train=basis_train, X_test=X_test, basis_test=basis_test, num_gfr=10, num_mcmc=100)"
+    "bart_model.sample(X_train=X_train, y_train=y_train, basis_train=basis_train, X_test=X_test, basis_test=basis_test, num_gfr=10, num_mcmc=10)"
    ]
   },
   {
@@ -150,7 +151,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "sigma_df_mcmc = pd.DataFrame(np.concatenate((np.expand_dims(np.arange(bart_model.num_samples - bart_model.num_gfr),axis=1), np.expand_dims(bart_model.global_var_samples,axis=1)), axis = 1), columns=[\"Sample\", \"Sigma\"])\n",
+    "sigma_df_mcmc = pd.DataFrame(np.concatenate((np.expand_dims(np.arange(bart_model.num_samples),axis=1), np.expand_dims(bart_model.global_var_samples,axis=1)), axis = 1), columns=[\"Sample\", \"Sigma\"])\n",
     "sns.scatterplot(data=sigma_df_mcmc, x=\"Sample\", y=\"Sigma\")\n",
     "plt.show()"
    ]
@@ -321,6 +322,22 @@
     "plt.axline((0, 0), slope=1, color=\"black\", linestyle=(0, (3,3)))\n",
     "plt.show()"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Clean up JSON file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "os.remove('bart.json')"
+   ]
   }
  ],
 "metadata": {

demo/notebooks/supervised_learning.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Supervised Learning Demo Notebook"
+    "# Supervised Learning"
    ]
   },
   {

demo/notebooks/tree_inspection.ipynb

Lines changed: 29 additions & 139 deletions
Large diffs are not rendered by default.

include/stochtree/leaf_model.h

Lines changed: 1 addition & 1 deletion
@@ -239,7 +239,7 @@ namespace StochTree {
  * \beta \sim N\left(0, \tau\right)
  * \f]
  *
- * Allowing for case / variance weights $w_i$ as above, we derive a reduced log marginal likelihood of
+ * Allowing for case / variance weights \f$w_i\f$ as above, we derive a reduced log marginal likelihood of
  *
  * \f[
  * L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wyx,\ell}^2}{2\sigma^2(s_{wxx,\ell} \tau + \sigma^2)}
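For readers of the doc comment, `s_{wxx,\ell}` and `s_{wyx,\ell}` are the weighted leaf sufficient statistics; under the weighted leaf regression model a plausible reading of the surrounding derivation (the hunk itself does not define them) is:

```latex
% Assumed definitions, with i ranging over the observations in leaf ell
% and w_i the case / variance weights:
s_{wxx,\ell} = \sum_{i \in \ell} w_i x_i^2, \qquad
s_{wyx,\ell} = \sum_{i \in \ell} w_i y_i x_i
```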

include/stochtree/mainpage.h

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
  * - <b>Leaf Model</b>: `stochtree`'s data structures are generalized to support a wide range of models, which are defined via specialized classes in the \ref leaf_model_group "leaf model layer".
  * - <b>Sampler</b>: helper functions that sample forests from training data comprise the \ref sampling_group "sampling layer" of `stochtree`.
  *
- * \section extending-stochtree Extending `stochtree`
+ * \section extending-stochtree Extending stochtree
  *
  * \subsection custom-leaf-models Custom Leaf Models
  *

setup.py

Lines changed: 4 additions & 3 deletions
@@ -36,6 +36,7 @@ def build_extension(self, ext: CMakeExtension) -> None:
 
         debug = int(os.environ.get("DEBUG", 0)) if self.debug is None else self.debug
         cfg = "Debug" if debug else "Release"
+        use_dbg = "ON" if debug else "OFF"
 
         # CMake lets you override the generator - we need to check this.
         # Can be set with Conda-Build, for example.
@@ -48,8 +49,8 @@ def build_extension(self, ext: CMakeExtension) -> None:
             f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}{os.sep}",
             f"-DPYTHON_EXECUTABLE={sys.executable}",
             f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
-            "-DUSE_DEBUG=OFF",
-            "-DUSE_SANITIZER=OFF",
+            f"-DUSE_DEBUG={use_dbg}",
+            "-DUSE_SANITIZER=OFF",
             "-DBUILD_TEST=OFF",
             "-DBUILD_DEBUG_TARGETS=OFF",
             "-DBUILD_PYTHON=ON",
@@ -151,7 +152,7 @@ def run(self):
 
 # The information here can also be placed in setup.cfg - better separation of
 # logic and declaration, and simpler if you include description/version in a file.
-__version__ = "0.0.1"
+__version__ = "0.1.1"
 
 setup(
     name="stochtree",
