apacker
diff --git a/introduction_to_amazon_algorithms/README.md b/introduction_to_amazon_algorithms/README.md
Lines changed: 5 additions & 4 deletions
diff --git a/lda_topic_modeling/LDA - Rosetta Stone.ipynb renamed to introduction_to_amazon_algorithms/lda_topic_modeling/LDA-Introduction.ipynb
Lines changed: 290 additions & 260 deletions
diff --git a/lda_topic_modeling/README.md renamed to introduction_to_amazon_algorithms/lda_topic_modeling/README.md
Lines changed: 3 additions & 9 deletions
diff --git a/lda_topic_modeling/generate_example_data.py renamed to introduction_to_amazon_algorithms/lda_topic_modeling/generate_example_data.py
Lines changed: 53 additions & 11 deletions
diff --git a/lda_topic_modeling/img/img_documents.png renamed to introduction_to_amazon_algorithms/lda_topic_modeling/img/img_documents.png
diff --git a/lda_topic_modeling/img/img_topics.png renamed to introduction_to_amazon_algorithms/lda_topic_modeling/img/img_topics.png
@@ -3,10 +3,11 @@
 This directory includes introductory examples to Amazon SageMaker Algorithms that we have developed so far.  It seeks to provide guidance and examples on basic functionality rather than a detailed scientific review or an implementation on complex, real-world data.
 
 Example Notebooks include:
-- *linear_mnist*: Predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner.
 - *factorization_machines_mnist*: Predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Factorization Machines.
-- *pca_mnist*: Uses Amazon SageMaker Principal Components Analysis (PCA) to calculate eigendigits from MNIST.
+- *lda_topic_modeling*: Topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset.
+- *linear_mnist*: Predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner.
 - *ntm_synthetic*: Uses Amazon SageMaker Neural Topic Model (NTM) to uncover topics in documents from a synthetic data source, where topic distributions are known.
-- *xgboost_mnist*: Uses Amazon SageMaker XGBoost to classifiy handwritten digits from the MNIST dataset into one of the ten digits using a multi-class classifier. Both single machine and distributed use-cases are presented.
-- *xgboost_abalone*: Predicts the age of abalone ([Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)) using regression from Amazon SageMaker XGBoost.
+- *pca_mnist*: Uses Amazon SageMaker Principal Components Analysis (PCA) to calculate eigendigits from MNIST.
 - *seq2seq*: Seq2Seq algorithm is built on top of [Sockeye](https://github.com/awslabs/sockeye), a sequence-to-sequence framework for Neural Machine Translation based on MXNet. SageMaker Seq2Seq implements state-of-the-art encoder-decoder architectures which can also be used for tasks like Abstractive Summarization in addition to Machine Translation.
+- *xgboost_abalone*: Predicts the age of abalone ([Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)) using regression from Amazon SageMaker XGBoost.
+- *xgboost_mnist*: Uses Amazon SageMaker XGBoost to classify handwritten digits from the MNIST dataset into one of the ten digits using a multi-class classifier. Both single machine and distributed use-cases are presented.
@@ -1,18 +1,12 @@
 # Latent Dirichlet Allocation and Topic Modeling
 
-Example notebooks on using Amazon SageMaker to train and use LDA models.
+An introductory notebook on using Amazon SageMaker to train and use LDA models.
 
 <p align="center">
-<img src="https://github.com/awslabs/im-notebook-templates/blob/lda_topic_modeling/lda_topic_modeling/img/img_documents.png">
-<img src="https://github.com/awslabs/im-notebook-templates/blob/lda_topic_modeling/lda_topic_modeling/img/img_topics.png">
+<img src="https://github.com/awslabs/amazon-sagemaker-examples/blob/lda_topic_modeling/introduction_to_amazon_algorithms/lda_topic_modeling/img/img_documents.png">
+<img src="https://github.com/awslabs/amazon-sagemaker-examples/blob/lda_topic_modeling/introduction_to_amazon_algorithms/lda_topic_modeling/img/img_topics.png">
 </p>
 
-* **LDA - Rosetta Stone** - An end-to-end example of generating training data,
-  uploading to an S3 bucket, training an LDA model, turning the model into an
-  endpoint, and inferring topic mixtures using the endpoint.
-* **LDA - Science** - A deep dive into the science of LDA using Amazon
-  SageMaker.
-
 ## References
 
 The example used in this notebook comes from the following paper:
 
@@ -1,10 +1,20 @@
+# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at
+#
+#    http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
+
 import matplotlib
 import matplotlib.pyplot as plt
 import matplotlib.cm as cm
 import numpy as np
 import scipy as sp
 import scipy.stats
 
+from matplotlib.gridspec import GridSpec, GridSpecFromSubplotSpec
+
 def generate_griffiths_data(num_documents=5000, average_document_length=150,
                             num_topics=5, alpha=None, eta=None, seed=0):
     """Returns example documents from Griffiths-Steyvers [1].
@@ -46,7 +56,7 @@ def generate_griffiths_data(num_documents=5000, average_document_length=150,
     theta : Numpy NDArray
         A matrix of size `num_documents` x `num_topics` equal to the topic
         mixtures used to generate the output `documents`.
-    
+
     References
     ----------
     [1] Thomas L Griffiths and Mark Steyvers. "Finding Scientific Topics."
@@ -56,7 +66,7 @@ def generate_griffiths_data(num_documents=5000, average_document_length=150,
     """
     vocabulary_size = 25
     image_dim = np.int(np.sqrt(vocabulary_size))
-    
+
     # perform checks on input
     assert num_topics in [5,10], 'Example data only available for 5 or 10 topics'
     if alpha:
@@ -75,7 +85,7 @@ def generate_griffiths_data(num_documents=5000, average_document_length=150,
     dirichlet_eta = sp.stats.dirichlet(eta)
 
     # initialize a known topic-word distribution (beta) using eta. these are
-    # the "row" and "column" topics, respectively. when num_topics = 5 only 
+    # the "row" and "column" topics, respectively. when num_topics = 5 only
     # create the col topics. when num_topics = 10 add the row topics as well
     #
     beta = np.zeros((num_topics,image_dim,image_dim), dtype=np.float)
@@ -111,22 +121,22 @@ def plot_lda(data, nrows, ncols, with_colorbar=True, cmap=cm.viridis):
     fig, ax = plt.subplots(nrows, ncols, figsize=(ncols,nrows))
     vmin = 0
     vmax = data.max()
-    
+
     V = len(data[0])
     n = int(np.sqrt(V))
     for i in range(nrows):
         for j in range(ncols):
             index = i*ncols + j
-            
+
             if nrows > 1:
                 im = ax[i,j].matshow(data[index].reshape(n,n), cmap=cmap, vmin=vmin, vmax=vmax)
             else:
                 im = ax[j].matshow(data[index].reshape(n,n), cmap=cmap, vmin=vmin, vmax=vmax)
-                
+
     for axi in ax.ravel():
         axi.set_xticks([])
         axi.set_yticks([])
-        
+
     if with_colorbar:
         fig.colorbar(im, ax=ax.ravel().tolist(), orientation='horizontal', fraction=0.2)
     return fig
@@ -136,18 +146,50 @@ def match_estimated_topics(topics_known, topics_estimated):
     K, V = topics_known.shape
     permutation = -1*np.ones(K, dtype=np.int)
     unmatched_estimated_topics = []
-    
+
     for estimated_topic_index, t in enumerate(topics_estimated):
         matched_known_topic_index = np.argmin([np.linalg.norm(known_topic - t) for known_topic in topics_known])
         if permutation[matched_known_topic_index] == -1:
             permutation[matched_known_topic_index] = estimated_topic_index
         else:
             unmatched_estimated_topics.append(estimated_topic_index)
-            
+
     for estimated_topic_index in unmatched_estimated_topics:
         for i in range(K):
             if permutation[i] == -1:
                 permutation[i] = estimated_topic_index
                 break
-                
-    return permutation, (topics_estimated[permutation,:]).copy()
+
+    return permutation, (topics_estimated[permutation,:]).copy()
+
+def _document_with_topic(fig, gsi, index, document, topic_mixture=None,
+                         vmin=0, vmax=32):
+    ax_doc = fig.add_subplot(gsi[:5,:])
+    ax_doc.matshow(document.reshape(5,5), cmap='gray_r',
+                   vmin=vmin, vmax=vmax)
+    ax_doc.set_xticks([])
+    ax_doc.set_yticks([])
+
+    if topic_mixture is not None:
+        ax_topic = fig.add_subplot(gsi[-1,:])
+        ax_topic.matshow(topic_mixture.reshape(1,-1), cmap='Reds',
+                         vmin=0, vmax=1)
+        ax_topic.set_xticks([])
+        ax_topic.set_yticks([])
+
+def plot_lda_topics(documents, nrows, ncols, with_colorbar=True,
+                    topic_mixtures=None, cmap='viridis', dpi=160):
+    fig = plt.figure()
+    gs = GridSpec(nrows, ncols)
+
+    vmin, vmax = (0, documents.max())
+
+    for i in range(nrows):
+        for j in range(ncols):
+            index = i*ncols + j
+            gsi = GridSpecFromSubplotSpec(6, 5, subplot_spec=gs[i,j])
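+            # topic_mixtures is optional; only index into it when provided
+            topic_mixture = None if topic_mixtures is None else topic_mixtures[index]
+            _document_with_topic(fig, gsi, index, documents[index],
+                                 topic_mixture=topic_mixture,
+                                 vmin=vmin, vmax=vmax)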
+
+    return fig
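
Note on `match_estimated_topics`: the body shown in the hunk above pairs each estimated topic with its nearest known topic (Euclidean distance), and assigns any estimated topic whose nearest known topic is already taken to the first free slot. The sketch below restates that logic as a standalone snippet so the behavior can be checked on a toy input. It is an illustration rather than the file's exact contents: the toy matrices are made up for this example, and `dtype=int` is used in place of `np.int`, which is an assumption about running on NumPy 1.24 or later, where that alias no longer exists.

```python
import numpy as np


def match_estimated_topics(topics_known, topics_estimated):
    """Greedily pair each estimated topic with its nearest known topic."""
    K, V = topics_known.shape
    # -1 marks a known topic that has no estimated topic assigned yet.
    # Builtin int is used here; the np.int alias was removed in NumPy 1.24.
    permutation = -1 * np.ones(K, dtype=int)
    unmatched_estimated_topics = []

    for estimated_topic_index, t in enumerate(topics_estimated):
        distances = [np.linalg.norm(known_topic - t) for known_topic in topics_known]
        matched_known_topic_index = np.argmin(distances)
        if permutation[matched_known_topic_index] == -1:
            permutation[matched_known_topic_index] = estimated_topic_index
        else:
            # nearest known topic is already taken; resolve after the first pass
            unmatched_estimated_topics.append(estimated_topic_index)

    # place leftovers into the first still-empty slots, in order
    for estimated_topic_index in unmatched_estimated_topics:
        for i in range(K):
            if permutation[i] == -1:
                permutation[i] = estimated_topic_index
                break

    return permutation, topics_estimated[permutation, :].copy()


# Toy input (hypothetical): three one-hot "known" topics and three
# estimated topics that are noisy, shuffled versions of them.
topics_known = np.eye(3)
topics_estimated = np.array([
    [0.10, 0.10, 0.80],   # closest to known topic 2
    [0.90, 0.05, 0.05],   # closest to known topic 0
    [0.10, 0.80, 0.10],   # closest to known topic 1
])

permutation, reordered = match_estimated_topics(topics_known, topics_estimated)
print(permutation)               # index of the estimated topic assigned to each known topic
print(reordered.argmax(axis=1))  # dominant word of each reordered topic
```

Because the matching is greedy and order-dependent, it is not guaranteed to find the assignment with the smallest total distance; when estimated topics are poorly separated, an optimal assignment method such as `scipy.optimize.linear_sum_assignment` may be worth considering instead.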