
Ensure MXNet notebooks run in distributed mode. #191

Merged
merged 2 commits into from
Mar 12, 2018

Conversation

iquintero

Improved the notebooks so that they split the training data across hosts when training in distributed mode.

Contributor

@djarpin djarpin left a comment

Except for the cifar example, these updates currently fail with the py3 versions of the containers. I think you just need to round the division in those cases and it should be fine.

We might also want to make a note that customers could use ShardedByS3Key for data parallel training. But this would require pre-processing the data to split it into even chunks before writing to multiple objects in S3. I think we could just add this as a comment in the code.
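As a rough illustration of the pre-processing mentioned above, the sketch below splits a dataset into one evenly sized chunk per host and builds an input channel that asks SageMaker to shard objects by S3 key. The helper names, the bucket name, and the `part-NNNNN` key layout are hypothetical and not taken from this PR. The channel dictionary follows the `CreateTrainingJob` request shape, and the upload step is deliberately left out.

```python
def split_into_chunks(records, num_chunks):
    """Split records into num_chunks contiguous, nearly equal chunks."""
    base, extra = divmod(len(records), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional record.
        end = start + base + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def sharded_channel(s3_prefix, channel_name="train"):
    """Build an input channel whose objects are distributed across hosts."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_prefix,
                # Each host receives a subset of the objects under the prefix.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }


records = list(range(10))  # stand-in for real training records
num_hosts = 2
chunks = split_into_chunks(records, num_hosts)
# One object per chunk; these keys are an assumed naming scheme.
keys = ["train/part-{:05d}".format(i) for i in range(len(chunks))]
# "example-bucket" is a placeholder bucket name.
channel = sharded_channel("s3://example-bucket/train/")
```

Writing at least as many objects as there are hosts, each of similar size, should keep the per-host workload balanced, since the sharding happens at the object level rather than the record level.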

# distributed training.
if len(hosts) > 1:
    train_data = [x for x in train_data]
    shard_size = len(train_data) / len(hosts)

Contributor

These don't work when py_version is set to 'py3', where `/` returns a float. Also, they leave a small amount of data out in py2, since integer division drops the remainder. I'm not sure that's a huge issue, though, since len(hosts) is typically going to be pretty small.
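One possible way to address both points is sketched below. It uses `divmod`, which does integer arithmetic under both Python 2 and Python 3, and hands the leftover items to the first few hosts so that nothing is dropped. The helper `shard_for_host` and the sample host names are illustrative assumptions, not code from this PR.

```python
def shard_for_host(data, hosts, current_host):
    """Return the contiguous slice of `data` assigned to `current_host`.

    Hypothetical helper: assumes every host sees the same `hosts` list
    and the same ordering of `data`.
    """
    index = sorted(hosts).index(current_host)
    # divmod returns integers on both Python 2 and Python 3.
    base, extra = divmod(len(data), len(hosts))
    # The first `extra` hosts each take one additional item.
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return data[start:end]


hosts = ["algo-1", "algo-2", "algo-3"]  # sample host names
data = list(range(10))                   # stand-in for training data
shards = [shard_for_host(data, hosts, h) for h in hosts]
```

With 10 items and 3 hosts, the shards have sizes 4, 3, and 3, and together they cover the whole dataset. If uneven shard sizes are a problem for synchronous training, plain `//` division (which drops the remainder) is the simpler alternative.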

@@ -46,7 +46,14 @@ def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir
train_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in train_sentences]
val_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in val_sentences]

train_iterator = BucketSentenceIter(train_sentences, train_labels, batch_size)
shard_size = len(train_sentences) / len(hosts)

Contributor

Same note as for mxnet_gluon_mnist/mnist.py

(train_labels, train_images) = load_data(os.path.join(channel_input_dirs['train']))
(test_labels, test_images) = load_data(os.path.join(channel_input_dirs['test']))

shard_size = len(train_images) / len(hosts)

Contributor

Same note as for mxnet_gluon_mnist/mnist.py.

Also added a comment about ShardedByS3Key.
@djarpin djarpin merged commit a240855 into aws:master Mar 12, 2018
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022