aws
diff --git a/‎training/distributed_training/pytorch/model_parallel/mnist/pytorch_smmodelparallel_mnist.ipynb
Lines changed: 226 additions & 0 deletions b/‎training/distributed_training/pytorch/model_parallel/mnist/pytorch_smmodelparallel_mnist.ipynb
Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch Training Job with Model Parallelization\n",
+    "\n",
+    "SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
+    "\n",
+    "Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
+    "\n",
+    "\n",
+    "### Additional Resources\n",
+    "\n",
+    "If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with Pytorch. \n",
+    "\n",
+    "* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed\n",
+    "](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-model-parallel.html).\n",
+    "\n",
+    "* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries)\n",
+    "\n",
+    "* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms\n",
+    "](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Amazon SageMaker Initialization\n",
+    "\n",
+    "Run the following cell to initialize the notebook instance. Get the SageMaker execution role used to run this notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "import sagemaker\n",
+    "from sagemaker import get_execution_role\n",
+    "from sagemaker.pytorch import PyTorch\n",
+    "from smexperiments.experiment import Experiment\n",
+    "from smexperiments.trial import Trial\n",
+    "import boto3\n",
+    "from time import gmtime, strftime\n",
+    "\n",
+    "role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role\n",
+    "print(f'SageMaker Execution Role:{role}')\n",
+    "\n",
+    "session = boto3.session.Session()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prepare your training script\n",
+    "\n",
+    "Run the following cell to view an example-training script for PyTorch 1.6"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run this cell to see an example of a training scripts that you can use to configure -\n",
+    "# SageMaker Distributed Model Parallel with PyTorch version 1.6\n",
+    "!cat utils/pt_mnist.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define SageMaker Training Job\n",
+    "\n",
+    "Next, you will use SageMaker Estimator API to define a SageMaker Training Job. You will use an [`Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances. \n",
+    "\n",
+    "You can update the following:\n",
+    "* `processes_per_host`\n",
+    "* `entry_point`\n",
+    "* `instance_count`\n",
+    "* `instance_type`\n",
+    "* `base_job_name`\n",
+    "\n",
+    "In addition, you can supply and modify configuration parameters for the SageMaker Distributed Model Parallel library. These parameters will be passed in through the `distributions` argument, as shown below.\n",
+    "\n",
+    "### Update the Type and Number of EC2 Instances Used\n",
+    "\n",
+    "Specify your `processes_per_host`. Note that it must be a multiple of your partitions, which by default is 2.\n",
+    "\n",
+    "The instance type and number of instances you specify in `instance_type` and `instance_count` respectively will determine the number of GPUs Amazon SageMaker uses during training. Explicitly, `instance_type` will determine the number of GPUs on a single instance and that number will be multiplied by `instance_count`. \n",
+    "\n",
+    "You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script. \n",
+    "\n",
+    "\n",
+    "To look up instances types, see [Amazon EC2 Instance Types](https://aws.amazon.com/sagemaker/pricing/).\n",
+    "\n",
+    "\n",
+    "### Uploading Checkpoint During Training or Resuming Checkpoint from Previous Training\n",
+    "We also provide a custom way for users to upload checkpoints during training or resume checkpoints from previous training. We have integrated this into our `pt_mnist.py` example script. Please see the functions `aws_s3_sync`, `sync_local_checkpoints_to_s3`, and `sync_s3_checkpoints_to_local`. For the purpose of this example, we are only uploading a checkpoint during training, by using `sync_local_checkpoints_to_s3`. \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After you have updated `entry_point`, `instance_count`, `instance_type` and `base_job_name`, run the following to create an estimator. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sagemaker_session = sagemaker.session.Session(boto_session=session)\n",
+    "mpioptions = \"-verbose -x orte_base_help_aggregate=0 \"\n",
+    "mpioptions += \"--mca btl_vader_single_copy_mechanism none \"\n",
+    "\n",
+    "all_experiment_names = [exp.experiment_name for exp in Experiment.list()]\n",
+    "\n",
+    "#choose an experiment name (only need to create it once)\n",
+    "experiment_name = \"SM-MP-DEMO\"\n",
+    "\n",
+    "# Load the experiment if it exists, otherwise create \n",
+    "if experiment_name not in all_experiment_names:\n",
+    "    customer_churn_experiment = Experiment.create(\n",
+    "        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
+    "    )\n",
+    "else:\n",
+    "    customer_churn_experiment = Experiment.load(\n",
+    "        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
+    "    )\n",
+    "\n",
+    "# Create a trial for the current run\n",
+    "trial = Trial.create(\n",
+    "        trial_name=\"SMD-MP-demo-{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())),\n",
+    "        experiment_name=customer_churn_experiment.experiment_name,\n",
+    "        sagemaker_boto_client=boto3.client(\"sagemaker\"),\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "smd_mp_estimator = PyTorch(\n",
+    "          entry_point=\"pt_mnist.py\", # Pick your train script\n",
+    "          source_dir='utils',\n",
+    "          role=role,\n",
+    "          instance_type='ml.p3.16xlarge',\n",
+    "          sagemaker_session=sagemaker_session,\n",
+    "          framework_version='1.6.0',\n",
+    "          py_version='py3',\n",
+    "          instance_count=1,\n",
+    "          distribution={\n",
+    "              \"smdistributed\": {\n",
+    "                  \"modelparallel\": {\n",
+    "                      \"enabled\":True,\n",
+    "                      \"parameters\": {\n",
+    "                          \"microbatches\": 4,\n",
+    "                          \"placement_strategy\": \"spread\",\n",
+    "                          \"pipeline\": \"interleaved\",\n",
+    "                          \"optimize\": \"speed\",\n",
+    "                          \"partitions\": 2,\n",
+    "                          \"ddp\": 1,\n",
+    "                      }\n",
+    "                  }\n",
+    "              },\n",
+    "              \"mpi\": {\n",
+    "                    \"enabled\": True,\n",
+    "                    \"processes_per_host\": 2, # Pick your processes_per_host\n",
+    "                    \"custom_mpi_options\": mpioptions \n",
+    "              },\n",
+    "          },\n",
+    "          base_job_name=\"SMD-MP-demo\",\n",
+    "      )\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, you will use the estimator to launch the SageMaker training job."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "smd_mp_estimator.fit(\n",
+    "        experiment_config={\n",
+    "            \"ExperimentName\": customer_churn_experiment.experiment_name,\n",
+    "            \"TrialName\": trial.trial_name,\n",
+    "            \"TrialComponentDisplayName\": \"Training\",\n",
+    "        })"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "conda_python3",
+   "language": "python",
+   "name": "conda_python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}