Ensure MXNet notebooks run in distributed mode. #191
Conversation
Improved the notebooks so that they split the training data.
Except for the cifar example, these updates currently fail with the py3 versions of the containers. I think you just need to use floor division in those cases and it should be fine.
We might also want to make a note that customers could use ShardedByS3Key for data parallel training. But this would require pre-processing the data to split it into even chunks before writing to multiple objects in S3. I think we could just add this as a comment in the code.
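A minimal sketch of what the in-code sharding could look like once floor division is used, so it behaves the same under py2 and py3. The helper name `shard` and the host-list handling are illustrative, not part of this PR; the variable names follow the snippets below:

```python
def shard(data, hosts, current_host):
    """Give each host a contiguous slice of the training data."""
    # Floor division returns an int on both py2 and py3, so it is safe
    # to use as a slice index (plain / returns a float on py3).
    shard_size = len(data) // len(hosts)
    start = hosts.index(current_host) * shard_size
    # Let the last host absorb the len(data) % len(hosts) leftover items
    # so no samples are silently dropped.
    end = start + shard_size if current_host != hosts[-1] else len(data)
    return data[start:end]

# Alternative: configure the train channel with ShardedByS3Key so SageMaker
# hands each host a disjoint subset of the S3 objects. That requires
# pre-splitting the data into evenly sized objects before uploading.
```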
    # distributed training.
    if len(hosts) > 1:
        train_data = [x for x in train_data]
        shard_size = len(train_data) / len(hosts)
These don't work when py_version is set to 'py3'. Also, they leave a small amount of data out in py2. Not sure that's a huge issue though since len(hosts) is typically going to be pretty small.
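To make the failure mode concrete: on py3, `/` always returns a float, and slicing a list with a float raises `TypeError`, while on py2 integer `/` silently truncates and the remainder items are never assigned to any host. A small self-contained illustration (py3 semantics, hypothetical ten-item dataset):

```python
train_data = list(range(10))
hosts = ['algo-1', 'algo-2', 'algo-3']

bad_shard_size = len(train_data) / len(hosts)
# py3: bad_shard_size == 3.333..., so train_data[:bad_shard_size] raises
#   TypeError: slice indices must be integers
# py2: bad_shard_size == 3, and the 10 % 3 == 1 leftover item is dropped.

shard_size = len(train_data) // len(hosts)  # int on both py2 and py3
first_shard = train_data[:shard_size]       # [0, 1, 2]
```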
@@ -46,7 +46,14 @@ def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir
    train_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in train_sentences]
    val_sentences = [[vocab.get(token, 1) for token in line if len(line)>0] for line in val_sentences]

    train_iterator = BucketSentenceIter(train_sentences, train_labels, batch_size)
    shard_size = len(train_sentences) / len(hosts)
Same note as for mxnet_gluon_mnist/mnist.py
    (train_labels, train_images) = load_data(os.path.join(channel_input_dirs['train']))
    (test_labels, test_images) = load_data(os.path.join(channel_input_dirs['test']))

    shard_size = len(train_images) / len(hosts)
Same note as for mxnet_gluon_mnist/mnist.py.
Also added a comment about ShardedByS3Key.