Ensure MXNet notebooks run in distributed mode. #191
Conversation
Improved the notebooks so that they split the training data.
Except for the cifar example, these updates currently fail with the py3 versions of the containers. I think you just need to use floor division in those cases and it should be fine.
We might also want to make a note that customers could use ShardedByS3Key for data parallel training. But this would require pre-processing the data to split it into even chunks before writing to multiple objects in S3. I think we could just add this as a comment in the code.
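A minimal sketch of what the in-code sharding could look like once floor division is used, so it behaves the same under py2 and py3. The helper name `shard` and the host-list handling are illustrative, not part of this PR; the variable names follow the snippets below:

```python
def shard(data, hosts, current_host):
    """Give each host a contiguous slice of the training data."""
    # Floor division returns an int on both py2 and py3, so it is safe
    # to use as a slice index (plain / returns a float on py3).
    shard_size = len(data) // len(hosts)
    start = hosts.index(current_host) * shard_size
    # Let the last host absorb the len(data) % len(hosts) leftover items
    # so no samples are silently dropped.
    end = start + shard_size if current_host != hosts[-1] else len(data)
    return data[start:end]

# Alternative: configure the train channel with ShardedByS3Key so SageMaker
# hands each host a disjoint subset of the S3 objects. That requires
# pre-splitting the data into evenly sized objects before uploading.
```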
    # distributed training.
    if len(hosts) > 1:
        train_data = [x for x in train_data]
        shard_size = len(train_data) / len(hosts)
These don't work when py_version is set to 'py3'. Also, they leave a small amount of data out in py2. Not sure that's a huge issue though since len(hosts) is typically going to be pretty small.
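To make the failure mode concrete: on py3, `/` always returns a float, and slicing a list with a float raises `TypeError`, while on py2 integer `/` silently truncates and the remainder items are never assigned to any host. A small self-contained illustration (py3 semantics, hypothetical ten-item dataset):

```python
train_data = list(range(10))
hosts = ['algo-1', 'algo-2', 'algo-3']

bad_shard_size = len(train_data) / len(hosts)
# py3: bad_shard_size == 3.333..., so train_data[:bad_shard_size] raises
#   TypeError: slice indices must be integers
# py2: bad_shard_size == 3, and the 10 % 3 == 1 leftover item is dropped.

shard_size = len(train_data) // len(hosts)  # int on both py2 and py3
first_shard = train_data[:shard_size]       # [0, 1, 2]
```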
@@ -46,7 +46,14 @@ def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir
    train_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in train_sentences]
    val_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in val_sentences]

    train_iterator = BucketSentenceIter(train_sentences, train_labels, batch_size)
    shard_size = len(train_sentences) / len(hosts)
Same note as for mxnet_gluon_mnist/mnist.py
    (train_labels, train_images) = load_data(os.path.join(channel_input_dirs['train']))
    (test_labels, test_images) = load_data(os.path.join(channel_input_dirs['test']))

    shard_size = len(train_images) / len(hosts)
Same note as for mxnet_gluon_mnist/mnist.py.
Also added a comment about ShardedByS3Key.