1 | 1 | {
2 | 2 | "cells": [
3 |   | - {
4 |   | - "cell_type": "markdown",
5 |   | - "metadata": {},
6 |   | - "source": [
7 |   | - "**HOW TO INSTALL THE PROXY PLUGIN ON MEAD**\n",
8 |   | - "**ATTENTION** THIS SECTION WILL REMOVED AFTER THE MEAD PROXY PLUGIN IS INSTALLED IN MEAD USER DATA. THESE INSTRUCTIONS WILL NOT BE PART OF THE NOTEBOOK FOR GA.\n",
9 |   | - "\n",
10 |   | - "OPEN A TERMINAL IN JUPYTER:\n",
11 |   | - "File->Open->New->Terminal\n",
12 |   | - "\n",
13 |   | - "```\n",
14 |   | - "sudo su\n",
15 |   | - "source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv\n",
16 |   | - "pip install git+https://github.com/jupyterhub/[email protected]\n",
17 |   | - "jupyter serverextension enable --py nbserverproxy --sys-prefix\n",
18 |   | - "source deactivate\n",
19 |   | - "restart part-003\n",
20 |   | - "```\n",
21 |   | - "\n",
22 |   | - "```restart part-003``` will restart the jupyter notebook and install the required plugin to run tensorboard."
23 |   | - ]
24 |   | - },
25 | 3 | {
26 | 4 | "cell_type": "markdown",
27 | 5 | "metadata": {},
28 | 6 | "source": [
29 | 7 | "# ResNet CIFAR-10 with tensorboard\n",
30 | 8 | "\n",
31 |   | - "This notebook details how to use TensorBoard, and how the training job writes checkpoints to a external bucket.\n",
32 |   | - "The model used for this notebook is a RestNet model, against the CIFAR-10 dataset.\n",
   | 9 | + "This notebook shows how to use TensorBoard, and how the training job writes checkpoints to an external bucket.\n",
   | 10 | + "The model used for this notebook is a ResNet model, trained on the CIFAR-10 dataset.\n",
33 | 11 | "See the following papers for more background:\n",
34 | 12 | "\n",
35 | 13 | "[Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Dec 2015.\n",
41 | 19 | "cell_type": "markdown",
42 | 20 | "metadata": {},
43 | 21 | "source": [
44 |   | - "### Let's start by setting up the environment."
   | 22 | + "### Set up the environment"
45 | 23 | ]
46 | 24 | },
47 | 25 | {
66 | 44 | "cell_type": "markdown",
67 | 45 | "metadata": {},
68 | 46 | "source": [
69 |   | - "### Downloading CIFAR-10 dataset\n",
   | 47 | + "### Download the CIFAR-10 dataset\n",
70 | 48 | "Downloading the test and training data will take around 5 minutes."
71 | 49 | ]
72 | 50 | },
95 | 73 | "cell_type": "markdown",
96 | 74 | "metadata": {},
97 | 75 | "source": [
98 |   | - "### Uploading the data to a S3 bucket"
   | 76 | + "### Upload the data to an S3 bucket"
99 | 77 | ]
100 | 78 | },
101 | 79 | {
120 | 98 | "cell_type": "markdown",
121 | 99 | "metadata": {},
122 | 100 | "source": [
123 |   | - "### Complete source code"
124 |   | - ]
125 |   | - },
126 |   | - {
127 |   | - "cell_type": "code",
128 |   | - "execution_count": 4,
129 |   | - "metadata": {
130 |   | - "scrolled": false
131 |   | - },
132 |   | - "outputs": [
133 |   | - {
134 |   | - "name": "stdout",
135 |   | - "output_type": "stream",
136 |   | - "text": [
137 |   | - ".\r\n",
138 |   | - "├── __init__.py\r\n",
139 |   | - "├── __pycache__\r\n",
140 |   | - "│   └── utils.cpython-36.pyc\r\n",
141 |   | - "├── source_dir\r\n",
142 |   | - "│   ├── __init__.py\r\n",
143 |   | - "│   ├── resnet_cifar_10.py\r\n",
144 |   | - "│   └── resnet_model.py\r\n",
145 |   | - "├── tensorflow_resnet_cifar10_with_tensorboard.ipynb\r\n",
146 |   | - "└── utils.py\r\n",
147 |   | - "\r\n",
148 |   | - "2 directories, 7 files\r\n"
149 |   | - ]
150 |   | - }
151 |   | - ],
152 |   | - "source": [
153 |   | - "!tree"
   | 101 | + "### Complete source code\n",
   | 102 | + "- [source_dir/resnet_model.py](source_dir/resnet_model.py): ResNet model\n",
   | 103 | + "- [source_dir/resnet_cifar_10.py](source_dir/resnet_cifar_10.py): main script used for training and hosting"
154 | 104 | ]
155 | 105 | },
156 | 106 | {
157 | 107 | "cell_type": "markdown",
158 | 108 | "metadata": {},
159 | 109 | "source": [
160 |   | - "## Running TensorFlow training on SageMaker"
   | 110 | + "## Create a training job using the sagemaker.TensorFlow estimator"
161 | 111 | ]
162 | 112 | },
163 | 113 | {
164 | 114 | "cell_type": "code",
165 | 115 | "execution_count": null,
166 | 116 | "metadata": {
   | 117 | + "collapsed": true,
167 | 118 | "scrolled": false
168 | 119 | },
169 | 120 | "outputs": [],
170 | 121 | "source": [
171 | 122 | "from sagemaker.tensorflow import TensorFlow\n",
172 | 123 | "\n",
173 | 124 | "\n",
174 |   | - "sorce_dir = os.path.join(os.getcwd(), 'source_dir')\n",
   | 125 | + "source_dir = os.path.join(os.getcwd(), 'source_dir')\n",
175 | 126 | "estimator = TensorFlow(entry_point='resnet_cifar_10.py',\n",
176 |   | - " source_dir=sorce_dir,\n",
   | 127 | + " source_dir=source_dir,\n",
177 | 128 | " role=role,\n",
178 |   | - " hyperparameters={'training_steps': 1000, 'evaluation_steps': 100},\n",
179 |   | - " train_instance_count=2, train_instance_type='ml.p2.xlarge', \n",
   | 129 | + " training_steps=1000, evaluation_steps=100,\n",
   | 130 | + " train_instance_count=1, train_instance_type='ml.p2.xlarge', \n",
180 | 131 | " base_job_name='tensorboard-example')\n",
181 | 132 | "\n",
182 | 133 | "estimator.fit(inputs, run_tensorboard_locally=True)"
186 | 137 | "cell_type": "markdown",
187 | 138 | "metadata": {},
188 | 139 | "source": [
189 |   | - "The **```fit```** method will create a training job named **```tensorboard-example-{unique identifier}```** with 2 p2 instances. These instances will be writing checkpoints to the s3 bucket **```sagemaker-{your aws account number}```**, if you don't have this bucket yet, sagemaker_session will create it for you. These checkpoints can be used for restoring the training job, and to analyze training job metrics using **TensorBoard**. \n",
   | 140 | + "The **```fit```** method will create a training job named **```tensorboard-example-{unique identifier}```** on a p2 instance. That instance will write checkpoints to the S3 bucket **```sagemaker-{your aws account number}```**.\n",
   | 141 | + "\n",
   | 142 | + "If you don't have this bucket yet, **```sagemaker_session```** will create it for you. These checkpoints can be used to restore the training job, and to analyze training job metrics using **TensorBoard**.\n",
190 | 143 | "\n",
191 |   | - "The parameter **```run_tensorboard_locally=True```** will run **TensorBoard** in the machine that this notebook is running. Everytime a new checkpoint is created by the training job in the S3 bucket, **fit** will download the checkpoint to the temp folder that **TensorBoard** is pointing to.\n",
   | 144 | + "The parameter **```run_tensorboard_locally=True```** will run **TensorBoard** on the machine where this notebook is running. Every time the training job creates a new checkpoint in the S3 bucket, **```fit```** will download the checkpoint to the temp folder that **TensorBoard** is pointing to.\n",
192 | 145 | "\n",
193 |   | - "When the **```fit```** method starts the training, it will log the port that **TensorBoard** is using to display the metrics. The default port is **6006**, but another port can be choosen depending on its availability.\n",
   | 146 | + "When the **```fit```** method starts the training, it will log the port that **TensorBoard** is using to display the metrics. The default port is **6006**; if that port is taken, the port number is incremented until an available port is found, and the chosen port is printed to stdout.\n",
194 | 147 | "\n",
195 |   | - "**TensorBoard** will take some minutes to start displaying metrics, depending on how long the training job container take to start their jobs.\n",
   | 148 | + "It takes a few minutes to provision containers and start the training job. **TensorBoard** will start to display metrics shortly after that.\n",
196 | 149 | "\n",
197 |   | - "You can access **Tensorboard** locally [http://localhost:6006](http://localhost:6006) or using your SakeMaker workspace [proxy/6006](/proxy/6006)"
   | 150 | + "You can access **TensorBoard** locally at [http://localhost:6006](http://localhost:6006) or through your SageMaker workspace at [proxy/6006](/proxy/6006). If TensorBoard started on a different port, adjust these URLs to match."
198 | 151 | ]
199 | 152 | },
200 | 153 | {
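The port-selection behavior described in the added text above (start at 6006 and increment until a free port is found) can be sketched with the standard library alone. This is an illustrative approximation, not the SageMaker SDK's actual implementation; the function name `find_available_port` is hypothetical:

```python
import socket

def find_available_port(start=6006, max_tries=100):
    """Probe ports start, start+1, ... and return the first one that can be bound."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use; try the next one
            return port  # bind succeeded, so this port is free
    raise RuntimeError("no available port in range")

# The chosen port is what fit() would report for TensorBoard in its log output.
print("TensorBoard port:", find_available_port())
```

If 6006 is free, this prints 6006; if another process (for example an earlier TensorBoard) holds it, the next free port is reported, which is why the URLs above may need adjusting.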
223 | 176 | "cell_type": "markdown",
224 | 177 | "metadata": {},
225 | 178 | "source": [
226 |   | - "# Deleting the endpoint"
   | 179 | + "# Deleting the endpoint\n",
   | 180 | + "**Important** "
227 | 181 | ]
228 | 182 | },
229 | 183 | {