|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Ingest Tabular Data\n", |
| 8 | + "\n", |
| 9 | + "When ingesting structured data from an existing S3 bucket into a SageMaker Notebook, there are multiple ways to handle it. We will introduce the following methods to access your data from the notebook:\n", |
| 10 | + "\n", |
| 11 | + "* Copying your data to your instance. If you are dealing with a normal size of data or are simply experimenting, you can copy the files into the SageMaker instance and just use it as a file system in your local machine. \n", |
| 12 | + "* Using Python packages to directly access your data without copying it. One downside of copying your data to your instance is: if you are done with your notebook instance and delete it, all the data is gone with it unless you store it elsewhere. We will introduce several methods to solve this problem in this notebook, and using python packages is one of them. Also, if you have large data sets (for example, with millions of rows), you can directly read data from S3 utilizing S3 compatible python libraries with built-in functions.\n", |
| 13 | + "* Using AWS native methods to directly access your data. You can also use AWS native packages like `s3fs` and `aws data wrangler` to access your data directly. \n", |
| 14 | + "\n", |
| 15 | + "We will demonstrate how to ingest the following tabular (structured) into a notebook for further analysis:\n", |
| 16 | + "## Tabular data: Boston Housing Data\n", |
| 17 | + "The [Boston House](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. We will use the data set to showcase how to ingest tabular data into S3, and for further pre-processing and feature engineering. The dataset contains the following columns (506 rows):\n", |
| 18 | + "* `CRIM` - per capita crime rate by town\n", |
| 19 | + "* `ZN` - proportion of residential land zoned for lots over 25,000 sq.ft.\n", |
| 20 | + "* `INDUS` - proportion of non-retail business acres per town.\n", |
| 21 | + "* `CHAS` - Charles River dummy variable (1 if tract bounds river; 0 otherwise)\n", |
| 22 | + "* `NOX` - nitric oxides concentration (parts per 10 million)\n", |
| 23 | + "* `RM` - average number of rooms per dwelling\n", |
| 24 | + "* `AGE` - proportion of owner-occupied units built prior to 1940\n", |
| 25 | + "* `DIS` - weighted distances to five Boston employment centres\n", |
| 26 | + "* `RAD` - index of accessibility to radial highways\n", |
| 27 | + "* `TAX` - full-value property-tax rate per \\$10,000\n", |
| 28 | + "* `PTRATIO` - pupil-teacher ratio by town\n", |
| 29 | + "* `B` - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", |
| 30 | + "* `LSTAT` - \\% lower status of the population" |
| 31 | + ] |
| 32 | + }, |
| 33 | + { |
| 34 | + "cell_type": "markdown", |
| 35 | + "metadata": {}, |
| 36 | + "source": [ |
| 37 | + "## Download data from online resources and write data to S3" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": null, |
| 43 | + "metadata": {}, |
| 44 | + "outputs": [], |
| 45 | + "source": [ |
| 46 | + "%pip install -qU 'sagemaker>=2.15.0' 's3fs==0.4.2' 'awswrangler==1.2.0'\n", |
| 47 | + "# you would need s3fs version > 0.4.0 for aws data wrangler to work correctly" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "code", |
| 52 | + "execution_count": null, |
| 53 | + "metadata": {}, |
| 54 | + "outputs": [], |
| 55 | + "source": [ |
| 56 | + "import awswrangler as wr\n", |
| 57 | + "import pandas as pd\n", |
| 58 | + "import s3fs\n", |
| 59 | + "import sagemaker\n", |
| 60 | + "# to load the boston housing dataset\n", |
| 61 | + "from sklearn.datasets import *" |
| 62 | + ] |
| 63 | + }, |
| 64 | + { |
| 65 | + "cell_type": "code", |
| 66 | + "execution_count": null, |
| 67 | + "metadata": {}, |
| 68 | + "outputs": [], |
| 69 | + "source": [ |
| 70 | + "# Get SageMaker session & default S3 bucket\n", |
| 71 | + "sagemaker_session = sagemaker.Session()\n", |
| 72 | + "s3 = sagemaker_session.boto_session.resource('s3')\n", |
| 73 | + "bucket = sagemaker_session.default_bucket() #replace with your own bucket name if you have one\n", |
| 74 | + "prefix = 'data/tabular/boston_house'\n", |
| 75 | + "filename = 'boston_house.csv'" |
| 76 | + ] |
| 77 | + }, |
| 78 | + { |
| 79 | + "cell_type": "code", |
| 80 | + "execution_count": null, |
| 81 | + "metadata": {}, |
| 82 | + "outputs": [], |
| 83 | + "source": [ |
| 84 | + "#helper functions to upload data to s3\n", |
| 85 | + "def write_to_s3(filename, bucket, prefix):\n", |
| 86 | + " #put one file in a separate folder. This is helpful if you read and prepare data with Athena\n", |
| 87 | + " filename_key = filename.split('.')[0]\n", |
| 88 | + " key = \"{}/{}/{}\".format(prefix,filename_key,filename)\n", |
| 89 | + " return s3.Bucket(bucket).upload_file(filename,key)\n", |
| 90 | + "\n", |
| 91 | + "def upload_to_s3(bucket, prefix, filename):\n", |
| 92 | + " url = 's3://{}/{}/{}'.format(bucket, prefix, filename)\n", |
| 93 | + " print('Writing to {}'.format(url))\n", |
| 94 | + " write_to_s3(filename, bucket, prefix)" |
| 95 | + ] |
| 96 | + }, |
| 97 | + { |
| 98 | + "cell_type": "code", |
| 99 | + "execution_count": null, |
| 100 | + "metadata": {}, |
| 101 | + "outputs": [], |
| 102 | + "source": [ |
| 103 | + "#download files from tabular data source location\n", |
| 104 | + "tabular_data = load_boston()\n", |
| 105 | + "tabular_data_full = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)\n", |
| 106 | + "tabular_data_full['target'] = pd.DataFrame(tabular_data.target)\n", |
| 107 | + "tabular_data_full.to_csv('boston_house.csv', index = False)" |
| 108 | + ] |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "code", |
| 112 | + "execution_count": null, |
| 113 | + "metadata": {}, |
| 114 | + "outputs": [], |
| 115 | + "source": [ |
| 116 | + "upload_to_s3(bucket, 'data/tabular', filename)" |
| 117 | + ] |
| 118 | + }, |
| 119 | + { |
| 120 | + "cell_type": "markdown", |
| 121 | + "metadata": {}, |
| 122 | + "source": [ |
| 123 | + "## Ingest Tabular Data from S3 bucket\n", |
| 124 | + "### Method 1: Copying data to the Instance\n", |
| 125 | + "You can use AWS Command Line Interface (CLI) to copy your data from s3 to your SageMaker instance and copy files between your S3 buckets. This is a quick and easy approach when you are dealing with medium-sized data files, or you are experimenting and doing exploratory analysis. The documentation can be found [here](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html)." |
| 126 | + ] |
| 127 | + }, |
| 128 | + { |
| 129 | + "cell_type": "code", |
| 130 | + "execution_count": null, |
| 131 | + "metadata": {}, |
| 132 | + "outputs": [], |
| 133 | + "source": [ |
| 134 | + "#copy data to your sagemaker instance using AWS CLI\n", |
| 135 | + "!aws s3 cp s3://$bucket/$prefix/ $prefix/ --recursive" |
| 136 | + ] |
| 137 | + }, |
| 138 | + { |
| 139 | + "cell_type": "code", |
| 140 | + "execution_count": null, |
| 141 | + "metadata": {}, |
| 142 | + "outputs": [], |
| 143 | + "source": [ |
| 144 | + "data_location = \"{}/{}\".format(prefix, filename)\n", |
| 145 | + "tabular_data = pd.read_csv(data_location, nrows = 5)\n", |
| 146 | + "tabular_data.head()" |
| 147 | + ] |
| 148 | + }, |
| 149 | + { |
| 150 | + "cell_type": "markdown", |
| 151 | + "metadata": {}, |
| 152 | + "source": [ |
| 153 | + "### Method 2: Use AWS compatible Python Packages\n", |
| 154 | + "When you are dealing with large data sets, or do not want to lose any data when you delete your SageMaker Notebook Instance, you can use pre-built packages to access your files in S3 without copying files into your instance. These packages, such as `Pandas`, have implemented options to access data with a specified path string: while you will use `file://` on your local file system, you will use `s3://` instead to access the data through the AWS boto library. For `pandas`, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.You can find additional documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). " |
| 155 | + ] |
| 156 | + }, |
| 157 | + { |
| 158 | + "cell_type": "code", |
| 159 | + "execution_count": null, |
| 160 | + "metadata": {}, |
| 161 | + "outputs": [], |
| 162 | + "source": [ |
| 163 | + "data_s3_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n", |
| 164 | + "s3_tabular_data = pd.read_csv(data_s3_location, nrows = 5)\n", |
| 165 | + "s3_tabular_data.head()" |
| 166 | + ] |
| 167 | + }, |
| 168 | + { |
| 169 | + "cell_type": "markdown", |
| 170 | + "metadata": {}, |
| 171 | + "source": [ |
| 172 | + "### Method 3: Use AWS native methods\n", |
| 173 | + "#### 3.1 s3fs \n", |
| 174 | + "\n", |
| 175 | + "[S3Fs](https://s3fs.readthedocs.io/en/latest/) is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3. " |
| 176 | + ] |
| 177 | + }, |
| 178 | + { |
| 179 | + "cell_type": "code", |
| 180 | + "execution_count": null, |
| 181 | + "metadata": {}, |
| 182 | + "outputs": [], |
| 183 | + "source": [ |
| 184 | + "fs = s3fs.S3FileSystem()\n", |
| 185 | + "data_s3fs_location = \"s3://{}/{}/\".format(bucket, prefix)\n", |
| 186 | + "# To List all files in your accessible bucket\n", |
| 187 | + "fs.ls(data_s3fs_location)" |
| 188 | + ] |
| 189 | + }, |
| 190 | + { |
| 191 | + "cell_type": "code", |
| 192 | + "execution_count": null, |
| 193 | + "metadata": {}, |
| 194 | + "outputs": [], |
| 195 | + "source": [ |
| 196 | + "# open it directly with s3fs\n", |
| 197 | + "data_s3fs_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n", |
| 198 | + "with fs.open(data_s3fs_location) as f:\n", |
| 199 | + " print(pd.read_csv(f, nrows = 5))" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "markdown", |
| 204 | + "metadata": {}, |
| 205 | + "source": [ |
| 206 | + "#### 3.2 AWS Data Wrangler\n", |
| 207 | + "[AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) is an open-source Python library that extends the power of the Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, etc), which we will cover in later sections. It is built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, and offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases. Note that you would need `s3fs version > 0.4.0` for the `awswrangler csv reader` to work." |
| 208 | + ] |
| 209 | + }, |
| 210 | + { |
| 211 | + "cell_type": "code", |
| 212 | + "execution_count": null, |
| 213 | + "metadata": {}, |
| 214 | + "outputs": [], |
| 215 | + "source": [ |
| 216 | + "data_wr_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n", |
| 217 | + "wr_data = wr.s3.read_csv(path=data_wr_location, nrows = 5)\n", |
| 218 | + "wr_data.head()" |
| 219 | + ] |
| 220 | + }, |
| 221 | + { |
| 222 | + "cell_type": "markdown", |
| 223 | + "metadata": {}, |
| 224 | + "source": [ |
| 225 | + "### Citation\n", |
| 226 | + "Boston Housing data, Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978." |
| 227 | + ] |
| 228 | + } |
| 229 | + ], |
| 230 | + "metadata": { |
| 231 | + "kernelspec": { |
| 232 | + "display_name": "conda_python3", |
| 233 | + "language": "python", |
| 234 | + "name": "conda_python3" |
| 235 | + }, |
| 236 | + "language_info": { |
| 237 | + "codemirror_mode": { |
| 238 | + "name": "ipython", |
| 239 | + "version": 3 |
| 240 | + }, |
| 241 | + "file_extension": ".py", |
| 242 | + "mimetype": "text/x-python", |
| 243 | + "name": "python", |
| 244 | + "nbconvert_exporter": "python", |
| 245 | + "pygments_lexer": "ipython3", |
| 246 | + "version": "3.6.10" |
| 247 | + } |
| 248 | + }, |
| 249 | + "nbformat": 4, |
| 250 | + "nbformat_minor": 4 |
| 251 | +} |