|
27 | 27 | "source": [
|
28 | 28 | "## Introduction\n",
|
29 | 29 | "\n",
|
30 |  | - "Welcome to our example introducing Amazon SageMaker's PCA Algorithm! Today, we're analyzing the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset which consists of images of handwritten digits, from zero to nine. We'll ignore the true labels for the time being and instead focus on what information we can obtain from the image pixels along.\n",
  | 30 | + "Welcome to our example introducing Amazon SageMaker's PCA Algorithm! Today, we're analyzing the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset which consists of images of handwritten digits, from zero to nine. We'll ignore the true labels for the time being and instead focus on what information we can obtain from the image pixels alone.\n",
31 | 31 | "\n",
|
32 |  | - "The method that we'll look at today is called Principal Components Analysis (PCA). PCA is an unsupervised learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.\n",
  | 32 | + "The method that we'll look at today is called Principal Components Analysis (PCA). PCA is an unsupervised learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of feature dimensions called principal components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.\n",
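To make the ordering of components concrete, here is a minimal sketch using scikit-learn on random data; this is an illustrative stand-in rather than the SageMaker PCA algorithm the notebook trains later, and the array shape and `n_components` value are assumptions.

```python
# Minimal sketch with scikit-learn, not the SageMaker PCA algorithm used later.
# The data shape and n_components value are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)          # 1000 rows, 50 original features
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)      # each row is now described by 5 components

print(X_reduced.shape)                # (1000, 5)
print(pca.explained_variance_ratio_)  # decreasing: the first component explains the most variance
```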
33 | 33 | "\n",
|
34 |  | - "PCA is most commonly used as a pre-processing step. Statistically, many models assume low dimensional data and high dimensional noise. In those cases, the output of PCA will actually include much less of the noise and subsequent models can be more accurate. Taking datasets with a huge number of features and reducing them down can be shown to not hurt the accuracy of the clustering while enjoying significantly improved performance. In addition, using PCA in advance of a linear model can make overfitting due to multi-collinearity less likely.\n",
  | 34 | + "PCA is most commonly used as a pre-processing step. Statistically, many models assume data to be low-dimensional. In those cases, the output of PCA will actually include much less of the noise and subsequent models can be more accurate. Taking datasets with a huge number of features and reducing them down can be shown to not hurt the accuracy of the clustering while enjoying significantly improved performance. In addition, using PCA in advance of a linear model can make overfitting due to multi-collinearity less likely.\n",
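As a rough sketch of that pre-processing pattern, the snippet below chains PCA with a linear model; scikit-learn, the synthetic data, and the chosen component count are all illustrative assumptions, not the notebook's own workflow.

```python
# Sketch only: PCA as a pre-processing step ahead of a linear model.
# scikit-learn, the synthetic data, and n_components=20 are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = np.random.rand(500, 200)                  # many, possibly collinear, features
y = np.random.randint(0, 2, size=500)         # placeholder binary labels

model = make_pipeline(PCA(n_components=20),   # fit the linear model on 20 uncorrelated components
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
```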
35 | 35 | "\n",
|
36 | 36 | "For our current use case though, we focus purely on the output of PCA. [Eigenfaces](https://en.wikipedia.org/wiki/Eigenface) have been used for years in facial recognition and computer vision. The eerie images represent a large library of photos as a smaller subset. These eigenfaces are not necessarily clusters, but instead highlight key features that, when combined, can represent most of the variation in faces throughout the entire library. We'll follow an analogous path and develop eigendigits from our handwritten digit dataset.\n",
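For intuition, the sketch below reshapes PCA components fit on flattened 28x28 images back into image form, which is what an "eigendigit" visualization amounts to; the scikit-learn/matplotlib usage and the placeholder `images` array are assumptions, not the SageMaker workflow used later in the notebook.

```python
# Sketch: view PCA components ("eigendigits") as 28x28 images.
# `images` is a placeholder for an (N, 784) array of flattened digits;
# scikit-learn and matplotlib are used here only for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

images = np.random.rand(1000, 784)            # stand-in for real digit data

pca = PCA(n_components=10).fit(images)
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, component in zip(axes.ravel(), pca.components_):
    ax.imshow(component.reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
```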
|
37 | 37 | "\n",
|
|
79 | 79 | "source": [
|
80 | 80 | "### Data ingestion\n",
|
81 | 81 | "\n",
|
82 |  | - "Next, we read the dataset from an online URL into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets."
  | 82 | + "Next, we read the dataset from an online URL into memory, for preprocessing prior to training. This processing could be done *in-situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present at the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets such as this one, reading into memory isn't onerous, though it would be for larger datasets."
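A minimal sketch of that ingestion step is shown below; the notebook's own download code is not part of this diff, so the URL and the pickled (train, valid, test) layout are hypothetical assumptions for illustration only.

```python
# Sketch of reading MNIST from a URL into memory. The URL and the pickled
# (train, valid, test) layout are assumptions for illustration only.
import gzip
import pickle
import urllib.request

urllib.request.urlretrieve("https://example.com/mnist.pkl.gz", "mnist.pkl.gz")  # hypothetical URL
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")

train_images, train_labels = train_set
print(train_images.shape)  # e.g. (50000, 784) flattened 28x28 images
```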
83 | 83 | ]
|
84 | 84 | },
|
85 | 85 | {
|
|