Commit 3d90536

edit scientific_details_of_algorithms/lda_topic_modeling/LDA-Science.ipynb

1 parent 49276c6
File tree

1 file changed: +16 −16 lines

scientific_details_of_algorithms/lda_topic_modeling/LDA-Science.ipynb

Lines changed: 16 additions & 16 deletions
@@ -18,7 +18,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Introduction\n",
+    "## Introduction\n",
     "***\n",
     "\n",
     "Amazon SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Latent Dirichlet Allocation (LDA) is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.\n",
@@ -70,7 +70,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Setup\n",
+    "## Setup\n",
     "\n",
     "***\n",
     "\n",
@@ -114,7 +114,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## The LDA Model\n",
+    "### The LDA Model\n",
     "\n",
     "As mentioned above, LDA is a model for discovering latent topics describing a collection of documents. In this section we will give a brief introduction to the model. Let,\n",
     "\n",
@@ -143,11 +143,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Data Exploration\n",
+    "## Data Exploration\n",
     "\n",
     "---\n",
     "\n",
-    "## An Example Dataset\n",
+    "### An Example Dataset\n",
     "\n",
     "Before explaining further, let's get our hands dirty with an example dataset. The following synthetic data comes from [1] and has a very useful visual interpretation.\n",
     "\n",
@@ -312,7 +312,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Generating Documents\n",
+    "### Generating Documents\n",
     "\n",
     "LDA is a generative model, meaning that the LDA parameters $(\\alpha, \\beta)$ are used to construct documents word-by-word by drawing from the topic-word distributions. In fact, looking closely at the example documents above you can see that some documents sample more words from some topics than from others.\n",
     "\n",
@@ -340,7 +340,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Topic Mixtures\n",
+    "### Topic Mixtures\n",
     "\n",
     "For the documents we generated above, let's look at their corresponding topic mixtures, $\\theta \\in \\mathbb{R}^K$. The topic mixtures represent the probability that a given word of the document is sampled from a particular topic. For example, if the topic mixture of an input document $w$ is,\n",
     "\n",
@@ -446,7 +446,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Training\n",
+    "## Training\n",
     "\n",
     "***\n",
     "\n",
@@ -457,7 +457,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Topic Estimation using Tensor Decompositions\n",
+    "### Topic Estimation using Tensor Decompositions\n",
     "\n",
     "Given a document corpus, Amazon SageMaker LDA uses a spectral tensor decomposition technique to determine the LDA model $(\\alpha, \\beta)$ which most likely describes the corpus. See [1] for a primary reference on the theory behind the algorithm. The spectral decomposition itself is computed using the CPDecomp algorithm described in [2].\n",
     "\n",
@@ -483,7 +483,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Store Data on S3\n",
+    "### Store Data on S3\n",
     "\n",
     "Before we run training we need to prepare the data.\n",
     "\n",
@@ -534,7 +534,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Training Parameters\n",
+    "### Training Parameters\n",
     "\n",
     "Particular to a SageMaker LDA training job are the following hyperparameters:\n",
     "\n",
@@ -622,7 +622,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Inspecting the Trained Model\n",
+    "### Inspecting the Trained Model\n",
     "\n",
     "We know the LDA parameters $(\\alpha, \\beta)$ used to generate the example data. How does the learned model compare to the known one? In this section we will download the model data and measure how well SageMaker LDA did in learning the model.\n",
     "\n",
@@ -706,7 +706,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Inference\n",
+    "## Inference\n",
     "\n",
     "***\n",
     "\n",
@@ -813,7 +813,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Inference Analysis\n",
+    "### Inference Analysis\n",
     "\n",
     "Recall that although SageMaker LDA successfully learned the underlying topics which generated the sample data, the topics were in a different order. Before we compare to the known topic mixtures $\\theta \\in \\mathbb{R}^K$, we should also permute the inferred topic mixtures.\n"
    ]
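Continuing the hedged sketch from the model-inspection step: once perm maps known topics to learned ones, aligning the inferred mixtures is just a column reordering (inferred_thetas is a hypothetical (num_documents x num_topics) array of endpoint results):

# Column i of the aligned matrix is the weight on known topic i.
inferred_thetas_aligned = inferred_thetas[:, perm]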
@@ -1020,7 +1020,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Stop / Close the Endpoint\n",
+    "### Stop / Close the Endpoint\n",
     "\n",
     "Finally, we should delete the endpoint before we close the notebook.\n",
     "\n",
@@ -1040,7 +1040,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Epilogue\n",
+    "## Epilogue\n",
     "\n",
     "---\n",
     "\n",
