Commit 4f6a7ce

committed
pr comments
1 parent 03a3a9d commit 4f6a7ce

File tree

1 file changed

+23
-69
lines changed


sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb

Lines changed: 23 additions & 69 deletions
Original file line number | Diff line number | Diff line change
@@ -1,35 +1,13 @@
11
{
22
"cells": [
3-
{
4-
"cell_type": "markdown",
5-
"metadata": {},
6-
"source": [
7-
"**HOW TO INSTALL THE PROXY PLUGIN ON MEAD**\n",
8-
"**ATTENTION** THIS SECTION WILL REMOVED AFTER THE MEAD PROXY PLUGIN IS INSTALLED IN MEAD USER DATA. THESE INSTRUCTIONS WILL NOT BE PART OF THE NOTEBOOK FOR GA.\n",
9-
"\n",
10-
"OPEN A TERMINAL IN JUPYTER:\n",
11-
"File->Open->New->Terminal\n",
12-
"\n",
13-
"```\n",
14-
"sudo su\n",
15-
"source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv\n",
16-
"pip install git+https://github.com/jupyterhub/[email protected]\n",
17-
"jupyter serverextension enable --py nbserverproxy --sys-prefix\n",
18-
"source deactivate\n",
19-
"restart part-003\n",
20-
"```\n",
21-
"\n",
22-
"```restart part-003``` will restart the jupyter notebook and install the required plugin to run tensorboard."
23-
]
24-
},
253
{
264
"cell_type": "markdown",
275
"metadata": {},
286
"source": [
297
"# ResNet CIFAR-10 with tensorboard\n",
308
"\n",
31-
"This notebook details how to use TensorBoard, and how the training job writes checkpoints to a external bucket.\n",
32-
"The model used for this notebook is a RestNet model, against the CIFAR-10 dataset.\n",
9+
"This notebook shows how to use TensorBoard, and how the training job writes checkpoints to a external bucket.\n",
10+
"The model used for this notebook is a RestNet model, trained with the CIFAR-10 dataset.\n",
3311
"See the following papers for more background:\n",
3412
"\n",
3513
"[Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Dec 2015.\n",
@@ -41,7 +19,7 @@
4119
"cell_type": "markdown",
4220
"metadata": {},
4321
"source": [
44-
"### Let's start by setting up the environment."
22+
"### Set up the environment"
4523
]
4624
},
4725
{
@@ -66,7 +44,7 @@
6644
"cell_type": "markdown",
6745
"metadata": {},
6846
"source": [
69-
"### Downloading CIFAR-10 dataset\n",
47+
"### Download the CIFAR-10 dataset\n",
7048
"Downloading the test and training data will take around 5 minutes."
7149
]
7250
},
@@ -95,7 +73,7 @@
9573
"cell_type": "markdown",
9674
"metadata": {},
9775
"source": [
98-
"### Uploading the data to a S3 bucket"
76+
"### Upload the data to a S3 bucket"
9977
]
10078
},
10179
{
@@ -120,63 +98,36 @@
12098
"cell_type": "markdown",
12199
"metadata": {},
122100
"source": [
123-
"### Complete source code"
124-
]
125-
},
126-
{
127-
"cell_type": "code",
128-
"execution_count": 4,
129-
"metadata": {
130-
"scrolled": false
131-
},
132-
"outputs": [
133-
{
134-
"name": "stdout",
135-
"output_type": "stream",
136-
"text": [
137-
".\r\n",
138-
"├── __init__.py\r\n",
139-
"├── __pycache__\r\n",
140-
"│   └── utils.cpython-36.pyc\r\n",
141-
"├── source_dir\r\n",
142-
"│   ├── __init__.py\r\n",
143-
"│   ├── resnet_cifar_10.py\r\n",
144-
"│   └── resnet_model.py\r\n",
145-
"├── tensorflow_resnet_cifar10_with_tensorboard.ipynb\r\n",
146-
"└── utils.py\r\n",
147-
"\r\n",
148-
"2 directories, 7 files\r\n"
149-
]
150-
}
151-
],
152-
"source": [
153-
"!tree"
101+
"### Complete source code\n",
102+
"- [source_dir/resnet_model.py](source_dir/resnet_model.py): ResNet model\n",
103+
"- [source_dir/resnet_cifar_10.py](source_dir/resnet_cifar_10.py): main script used for training and hosting"
154104
]
155105
},
156106
{
157107
"cell_type": "markdown",
158108
"metadata": {},
159109
"source": [
160-
"## Running TensorFlow training on SageMaker"
110+
"## Create a training job using the sagemaker.TensorFlow estimator"
161111
]
162112
},
163113
{
164114
"cell_type": "code",
165115
"execution_count": null,
166116
"metadata": {
117+
"collapsed": true,
167118
"scrolled": false
168119
},
169120
"outputs": [],
170121
"source": [
171122
"from sagemaker.tensorflow import TensorFlow\n",
172123
"\n",
173124
"\n",
174-
"sorce_dir = os.path.join(os.getcwd(), 'source_dir')\n",
125+
"source_dir = os.path.join(os.getcwd(), 'source_dir')\n",
175126
"estimator = TensorFlow(entry_point='resnet_cifar_10.py',\n",
176-
" source_dir=sorce_dir,\n",
127+
" source_dir=source_dir,\n",
177128
" role=role,\n",
178-
" hyperparameters={'training_steps': 1000, 'evaluation_steps': 100},\n",
179-
" train_instance_count=2, train_instance_type='ml.p2.xlarge', \n",
129+
" training_steps=1000, evaluation_steps=100,\n",
130+
" train_instance_count=1, train_instance_type='ml.p2.xlarge', \n",
180131
" base_job_name='tensorboard-example')\n",
181132
"\n",
182133
"estimator.fit(inputs, run_tensorboard_locally=True)"
@@ -186,15 +137,17 @@
186137
"cell_type": "markdown",
187138
"metadata": {},
188139
"source": [
189-
"The **```fit```** method will create a training job named **```tensorboard-example-{unique identifier}```** with 2 p2 instances. These instances will be writing checkpoints to the s3 bucket **```sagemaker-{your aws account number}```**, if you don't have this bucket yet, sagemaker_session will create it for you. These checkpoints can be used for restoring the training job, and to analyze training job metrics using **TensorBoard**. \n",
140+
"The **```fit```** method will create a training job named **```tensorboard-example-{unique identifier}```**in a p2 instance. That instance will write checkpoints to the s3 bucket **```sagemaker-{your aws account number}```**.\n",
141+
"\n",
142+
"If you don't have this bucket yet, **```sagemaker_session```** will create it for you. These checkpoints can be used for restoring the training job, and to analyze training job metrics using **TensorBoard**. \n",
190143
"\n",
191-
"The parameter **```run_tensorboard_locally=True```** will run **TensorBoard** in the machine that this notebook is running. Everytime a new checkpoint is created by the training job in the S3 bucket, **fit** will download the checkpoint to the temp folder that **TensorBoard** is pointing to.\n",
144+
"The parameter **```run_tensorboard_locally=True```** will run **TensorBoard** in the machine that this notebook is running. Everytime a new checkpoint is created by the training job in the S3 bucket, **```fit```** will download the checkpoint to the temp folder that **TensorBoard** is pointing to.\n",
192145
"\n",
193-
"When the **```fit```** method starts the training, it will log the port that **TensorBoard** is using to display the metrics. The default port is **6006**, but another port can be choosen depending on its availability.\n",
146+
"When the **```fit```** method starts the training, it will log the port that **TensorBoard** is using to display the metrics. The default port is **6006**, but another port can be choosen depending on its availability. The port number will increase until finds an available port. After that the port number will printed in stdout.\n",
194147
"\n",
195-
"**TensorBoard** will take some minutes to start displaying metrics, depending on how long the training job container take to start their jobs.\n",
148+
"It takes a few minutes to provision containers and start the training job.**TensorBoard** will start to display metrics shortly after that.\n",
196149
"\n",
197-
"You can access **Tensorboard** locally [http://localhost:6006](http://localhost:6006) or using your SakeMaker workspace [proxy/6006](/proxy/6006)"
150+
"You can access **Tensorboard** locally at [http://localhost:6006](http://localhost:6006) or using your SakeMaker workspace [proxy/6006](/proxy/6006). If TensorBoard started on a different port, adjust these URLs to match."
198151
]
199152
},
200153
{
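Editor's note: the port fallback described above (incrementing from 6006 until a free port is found) can be sketched with the standard library. `find_open_port` is an illustrative helper written for this note, not part of the SageMaker SDK or TensorBoard.

```python
import socket

def find_open_port(start=6006, max_tries=100):
    """Probe ports upward from `start` and return the first one that
    can be bound, mirroring how a fallback port is chosen when the
    default TensorBoard port 6006 is already in use."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("localhost", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port busy, try the next one
    raise RuntimeError("no open port found in range")

print(find_open_port())
```

Whatever port this logic settles on is the one printed to stdout, and the one to substitute into the TensorBoard URLs below.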
@@ -223,7 +176,8 @@
223176
"cell_type": "markdown",
224177
"metadata": {},
225178
"source": [
226-
"# Deleting the endpoint"
179+
"# Deleting the endpoint\n",
180+
"**Important** "
227181
]
228182
},
229183
{
