|
133 | 133 | "Please note that it is common practice to split words into subwords using Byte Pair Encoding (BPE). Refer to [this](https://github.com/awslabs/sockeye/tree/master/tutorials/wmt) tutorial if you are interested in performing BPE."
|
134 | 134 | ]
|
135 | 135 | },
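To make the BPE idea above concrete, here is a toy sketch of the merge-learning loop on the classic four-word example from the BPE literature. This is illustrative only; `learn_bpe` and `merge_word` are hypothetical helpers, and real pipelines use the subword-nmt or Sockeye tooling linked above.

```python
from collections import Counter

def merge_word(symbols, pair):
    # Merge every adjacent occurrence of `pair` into a single symbol.
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    # word_freqs: {tuple of symbols (characters + '</w>'): corpus frequency}
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(merge_word(list(w), best)): f for w, f in vocab.items()}
    return merges

corpus = {('l', 'o', 'w', '</w>'): 5,
          ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
          ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
          ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3}
merges = learn_bpe(corpus, 3)
```

On this corpus the first learned merges build up the frequent `est</w>` suffix, which is exactly how BPE discovers reusable subwords.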
|
| 136 | + { |
| 137 | + "cell_type": "markdown", |
| 138 | + "metadata": {}, |
| 139 | + "source": [ |
| 140 | + "Since training on the whole dataset might take several hours or days, let us train on the **first 10,000 lines only** for this demo. Don't run the next cell if you want to train on the complete dataset." |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "code", |
| 145 | + "execution_count": null, |
| 146 | + "metadata": { |
| 147 | + "collapsed": true |
| 148 | + }, |
| 149 | + "outputs": [], |
| 150 | + "source": [ |
| 151 | + "!head -n 10000 corpus.tc.en > corpus.tc.en.small\n", |
| 152 | + "!head -n 10000 corpus.tc.de > corpus.tc.de.small" |
| 153 | + ] |
| 154 | + }, |
136 | 155 | {
|
137 | 156 | "cell_type": "markdown",
|
138 | 157 | "metadata": {},
|
|
155 | 174 | "cell_type": "markdown",
|
156 | 175 | "metadata": {},
|
157 | 176 | "source": [
|
158 |
| - "Let's do the preprocessing now. Sit back and relax as the script below might take around 10-15 min" |
| 177 | + "The cell below does the preprocessing. If you are using the complete dataset, the script might take around 10 to 15 minutes on an ml.m4.xlarge notebook instance. Remove \".small\" from the file names to train on the full dataset." |
159 | 178 | ]
|
160 | 179 | },
|
161 | 180 | {
|
162 | 181 | "cell_type": "code",
|
163 | 182 | "execution_count": null,
|
164 |
| - "metadata": { |
165 |
| - "collapsed": true |
166 |
| - }, |
| 183 | + "metadata": {}, |
167 | 184 | "outputs": [],
|
168 | 185 | "source": [
|
169 | 186 | "%%time\n",
|
170 | 187 | "%%bash\n",
|
171 | 188 | "python3 create_vocab_proto.py \\\n",
|
172 |
| - " --train-source corpus.tc.en \\\n", |
173 |
| - " --train-target corpus.tc.de \\\n", |
| 189 | + " --train-source corpus.tc.en.small \\\n", |
| 190 | + " --train-target corpus.tc.de.small \\\n", |
174 | 191 | " --val-source validation/newstest2014.tc.en \\\n",
|
175 | 192 | " --val-target validation/newstest2014.tc.de"
|
176 | 193 | ]
|
|
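As a rough illustration of what the `vocab.src.json`/`vocab.trg.json` files produced above contain (a token-to-integer-id mapping), here is a hedged sketch. The actual special tokens, ordering, and protobuf output of `create_vocab_proto.py` are not reproduced here, and `build_vocab` is a hypothetical helper.

```python
from collections import Counter

def build_vocab(lines, num_words=50000):
    # Count whitespace tokens, then assign ids after reserved tokens.
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {'<pad>': 0, '<unk>': 1, '<s>': 2, '</s>': 3}
    for tok, _ in counts.most_common(num_words):
        vocab.setdefault(tok, len(vocab))
    return vocab

lines = ['the cat sat .', 'the dog sat .']
vocab = build_vocab(lines)
```

Frequent tokens get small ids, and anything outside the top `num_words` is mapped to `<unk>` at lookup time.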
222 | 239 | {
|
223 | 240 | "cell_type": "code",
|
224 | 241 | "execution_count": null,
|
225 |
| - "metadata": { |
226 |
| - "collapsed": true |
227 |
| - }, |
| 242 | + "metadata": {}, |
228 | 243 | "outputs": [],
|
229 | 244 | "source": [
|
230 | 245 | "containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/seq2seq:latest',\n",
|
|
245 | 260 | {
|
246 | 261 | "cell_type": "code",
|
247 | 262 | "execution_count": null,
|
248 |
| - "metadata": { |
249 |
| - "collapsed": true |
250 |
| - }, |
| 263 | + "metadata": {}, |
251 | 264 | "outputs": [],
|
252 | 265 | "source": [
|
253 |
| - "job_name = 'seq2seq-en-de-small-p2-16x-' + strftime(\"%Y-%m-%d-%H\", gmtime())\n", |
| 266 | + "job_name = 'seq2seq-en-de-p2-xlarge-' + strftime(\"%Y-%m-%d-%H\", gmtime())\n", |
254 | 267 | "print(\"Training job\", job_name)\n",
|
255 | 268 | "\n",
|
256 | 269 | "create_training_params = \\\n",
|
|
266 | 279 | " \"ResourceConfig\": {\n",
|
267 | 280 | " # Seq2Seq does not support multiple machines. Currently, it only supports single machine, multiple GPUs\n",
|
268 | 281 | " \"InstanceCount\": 1,\n",
|
269 |
| - " \"InstanceType\": \"ml.p2.16xlarge\", # We suggest one of [\"ml.p2.16xlarge\", \"ml.p2.8xlarge\", \"ml.p2.xlarge\"]\n", |
| 282 | + " \"InstanceType\": \"ml.p2.xlarge\", # We suggest one of [\"ml.p2.16xlarge\", \"ml.p2.8xlarge\", \"ml.p2.xlarge\"]\n", |
270 | 283 | " \"VolumeSizeInGB\": 50\n",
|
271 | 284 | " },\n",
|
272 | 285 | " \"TrainingJobName\": job_name,\n",
|
|
275 | 288 | " \"max_seq_len_source\": \"60\",\n",
|
276 | 289 | " \"max_seq_len_target\": \"60\",\n",
|
277 | 290 | " \"optimized_metric\": \"bleu\",\n",
|
278 |
| - " \"batch_size\": \"256\",\n", |
| 291 | + " \"batch_size\": \"64\", # Please use a larger batch size (256 or 512) if using ml.p2.8xlarge or ml.p2.16xlarge\n", |
279 | 292 | " \"checkpoint_frequency_num_batches\": \"1000\",\n",
|
280 | 293 | " \"rnn_num_hidden\": \"512\",\n",
|
281 | 294 | " \"num_layers_encoder\": \"1\",\n",
|
282 | 295 | " \"num_layers_decoder\": \"1\",\n",
|
283 | 296 | " \"num_embed_source\": \"512\",\n",
|
284 | 297 | " \"num_embed_target\": \"512\",\n",
|
285 | 298 | " \"checkpoint_threshold\": \"3\",\n",
|
| 299 | + " \"max_num_batches\": \"2100\"\n", |
| 300 | + " # Training will stop after 2100 iterations/batches.\n", |
| 301 | + " # This is just for demo purposes. Remove the above parameter if you want a better model.\n", |
286 | 302 | " },\n",
|
287 | 303 | " \"StoppingCondition\": {\n",
|
288 | 304 | " \"MaxRuntimeInSeconds\": 48 * 3600\n",
|
|
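To see what the demo-sized hyperparameters above imply, a quick back-of-envelope check, assuming the 10,000-line subset and the `batch_size` of 64 set above:

```python
# Assumes the 10,000-line demo subset and batch_size 64 from the cell above.
num_examples = 10_000
batch_size = 64
batches_per_epoch = num_examples // batch_size
# max_num_batches of 2100 then corresponds to roughly this many epochs:
epochs_covered = 2100 / batches_per_epoch
print(batches_per_epoch, round(epochs_covered, 1))
```

So the 2100-batch cap gives the demo model roughly 13 passes over the small subset, enough for a quick sanity check but well short of a converged translation model.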
331 | 347 | {
|
332 | 348 | "cell_type": "code",
|
333 | 349 | "execution_count": null,
|
334 |
| - "metadata": { |
335 |
| - "collapsed": true |
336 |
| - }, |
| 350 | + "metadata": {}, |
337 | 351 | "outputs": [],
|
338 | 352 | "source": [
|
339 | 353 | "status = sagemaker_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']\n",
|
|
352 | 366 | "> Now wait for the training job to complete and proceed to the next step after you see model artifacts in your S3 bucket."
|
353 | 367 | ]
|
354 | 368 | },
|
| 369 | + { |
| 370 | + "cell_type": "markdown", |
| 371 | + "metadata": {}, |
| 372 | + "source": [ |
| 373 | + "You can jump to [Use a pretrained model](#Use-a-pretrained-model) as training might take some time." |
| 374 | + ] |
| 375 | + }, |
355 | 376 | {
|
356 | 377 | "cell_type": "markdown",
|
357 | 378 | "metadata": {},
|
|
366 | 387 | "- Perform Inference - Perform inference on some input data using the endpoint.\n",
|
367 | 388 | "\n",
|
368 | 389 | "### Create model\n",
|
369 |
| - "We now create a SageMaker Model from the training output. Using the model, we can then create an Endpoint Configuration.\n", |
370 |
| - "\n", |
371 |
| - "#### Note: Please uncomment and run the lines below if you want to use a pretrained model, as training might take several hours/days to complete." |
| 390 | + "We now create a SageMaker Model from the training output. Using the model, we can then create an Endpoint Configuration." |
372 | 391 | ]
|
373 | 392 | },
|
374 | 393 | {
|
|
379 | 398 | },
|
380 | 399 | "outputs": [],
|
381 | 400 | "source": [
|
| 401 | + "use_pretrained_model = False" |
| 402 | + ] |
| 403 | + }, |
| 404 | + { |
| 405 | + "cell_type": "markdown", |
| 406 | + "metadata": {}, |
| 407 | + "source": [ |
| 408 | + "### Use a pretrained model\n", |
| 409 | + "#### Please uncomment and run the cell below if you want to use a pretrained model, as training might take several hours/days to complete." |
| 410 | + ] |
| 411 | + }, |
| 412 | + { |
| 413 | + "cell_type": "code", |
| 414 | + "execution_count": null, |
| 415 | + "metadata": {}, |
| 416 | + "outputs": [], |
| 417 | + "source": [ |
| 418 | + "# use_pretrained_model = True\n", |
| 419 | + "# model_name = \"pretrained-en-de-model\"\n", |
382 | 420 | "# !curl https://s3-us-west-2.amazonaws.com/gsaur-seq2seq-data/seq2seq/eng-german/full-nb-translation-eng-german-p2-16x-2017-11-24-22-25-53/output/model.tar.gz > model.tar.gz\n",
|
383 | 421 | "# !curl https://s3-us-west-2.amazonaws.com/gsaur-seq2seq-data/seq2seq/eng-german/full-nb-translation-eng-german-p2-16x-2017-11-24-22-25-53/output/vocab.src.json > vocab.src.json\n",
|
384 | 422 | "# !curl https://s3-us-west-2.amazonaws.com/gsaur-seq2seq-data/seq2seq/eng-german/full-nb-translation-eng-german-p2-16x-2017-11-24-22-25-53/output/vocab.trg.json > vocab.trg.json\n",
|
385 | 423 | "# upload_to_s3(bucket, prefix, 'pretrained_model', 'model.tar.gz')\n",
|
386 |
| - "# use_pretrained_model = True\n", |
387 | 424 | "# model_data = \"s3://{}/{}/pretrained_model/model.tar.gz\".format(bucket, prefix)"
|
388 | 425 | ]
|
389 | 426 | },
|
390 | 427 | {
|
391 | 428 | "cell_type": "code",
|
392 | 429 | "execution_count": null,
|
393 |
| - "metadata": { |
394 |
| - "collapsed": true |
395 |
| - }, |
| 430 | + "metadata": {}, |
396 | 431 | "outputs": [],
|
397 | 432 | "source": [
|
398 | 433 | "%%time\n",
|
399 | 434 | "\n",
|
400 | 435 | "sage = boto3.client('sagemaker')\n",
|
401 | 436 | "\n",
|
402 |
| - "model_name=job_name\n", |
403 |
| - "print(model_name)\n", |
404 |
| - "\n", |
405 |
| - "info = sage.describe_training_job(TrainingJobName=job_name)\n", |
406 |
| - "\n", |
407 |
| - "try:\n", |
408 |
| - " if use_pretrained_model:\n", |
409 |
| - " model_data\n", |
410 |
| - "except:\n", |
| 437 | + "if not use_pretrained_model:\n", |
| 438 | + " info = sage.describe_training_job(TrainingJobName=job_name)\n", |
| 439 | + " model_name = job_name\n", |
411 | 440 | " model_data = info['ModelArtifacts']['S3ModelArtifacts']\n",
|
412 |
| - " \n", |
| 441 | + "\n", |
| 442 | + "print(model_name)\n", |
413 | 443 | "print(model_data)\n",
|
414 | 444 | "\n",
|
415 | 445 | "primary_container = {\n",
|
|
438 | 468 | {
|
439 | 469 | "cell_type": "code",
|
440 | 470 | "execution_count": null,
|
441 |
| - "metadata": { |
442 |
| - "collapsed": true |
443 |
| - }, |
| 471 | + "metadata": {}, |
444 | 472 | "outputs": [],
|
445 | 473 | "source": [
|
446 | 474 | "from time import gmtime, strftime\n",
|
|
469 | 497 | {
|
470 | 498 | "cell_type": "code",
|
471 | 499 | "execution_count": null,
|
472 |
| - "metadata": { |
473 |
| - "collapsed": true |
474 |
| - }, |
| 500 | + "metadata": {}, |
475 | 501 | "outputs": [],
|
476 | 502 | "source": [
|
477 | 503 | "%%time\n",
|
|
547 | 573 | {
|
548 | 574 | "cell_type": "code",
|
549 | 575 | "execution_count": null,
|
550 |
| - "metadata": { |
551 |
| - "collapsed": true |
552 |
| - }, |
| 576 | + "metadata": {}, |
553 | 577 | "outputs": [],
|
554 | 578 | "source": [
|
555 | 579 | "sentences = [\"you are so good !\",\n",
|
556 | 580 | " \"can you drive a car ?\",\n",
|
| 581 | + " \"i want to watch a movie .\"\n", |
557 | 582 | " ]\n",
|
558 | 583 | "\n",
|
559 | 584 | "payload = {\"instances\" : []}\n",
|
|
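The `payload` dict started above gets one entry per sentence; a minimal sketch of the finished request body, assuming the `{"instances": [{"data": ...}]}` JSON shape used in this notebook (the endpoint invocation itself is omitted):

```python
import json

sentences = ["you are so good !",
             "can you drive a car ?"]

# Build the request body in the shape the cell above starts constructing.
payload = {"instances": [{"data": sent} for sent in sentences]}
body = json.dumps(payload)
```

This serialized `body` is what gets passed to the runtime `invoke_endpoint` call later in the notebook.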
586 | 611 | {
|
587 | 612 | "cell_type": "code",
|
588 | 613 | "execution_count": null,
|
589 |
| - "metadata": { |
590 |
| - "collapsed": true |
591 |
| - }, |
| 614 | + "metadata": {}, |
592 | 615 | "outputs": [],
|
593 | 616 | "source": [
|
594 | 617 | "sentence = 'can you drive a car ?'\n",
|
|
639 | 662 | {
|
640 | 663 | "cell_type": "code",
|
641 | 664 | "execution_count": null,
|
642 |
| - "metadata": { |
643 |
| - "collapsed": true |
644 |
| - }, |
| 665 | + "metadata": {}, |
645 | 666 | "outputs": [],
|
646 | 667 | "source": [
|
647 | 668 | "plot_matrix(attention_matrix, target, source)"
|
|
666 | 687 | {
|
667 | 688 | "cell_type": "code",
|
668 | 689 | "execution_count": null,
|
669 |
| - "metadata": { |
670 |
| - "collapsed": true |
671 |
| - }, |
| 690 | + "metadata": {}, |
672 | 691 | "outputs": [],
|
673 | 692 | "source": [
|
674 | 693 | "import io\n",
|
|
772 | 791 | {
|
773 | 792 | "cell_type": "code",
|
774 | 793 | "execution_count": null,
|
775 |
| - "metadata": { |
776 |
| - "collapsed": true |
777 |
| - }, |
| 794 | + "metadata": {}, |
778 | 795 | "outputs": [],
|
779 | 796 | "source": [
|
780 | 797 | "targets = _parse_proto_response(response)\n",
|
|
795 | 812 | {
|
796 | 813 | "cell_type": "code",
|
797 | 814 | "execution_count": null,
|
798 |
| - "metadata": { |
799 |
| - "collapsed": true |
800 |
| - }, |
| 815 | + "metadata": {}, |
801 | 816 | "outputs": [],
|
802 | 817 | "source": [
|
803 | 818 | "# sage.delete_endpoint(EndpointName=endpoint_name)"
|
|