-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add Local Mode support #115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
02dd3a7
c38f464
967eb31
5126211
6a48ddb
245cbbe
95e99c7
49dc17d
3c84845
f841623
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ | |
from sagemaker.fw_utils import tar_and_upload_dir | ||
from sagemaker.fw_utils import parse_s3_url | ||
from sagemaker.fw_utils import UploadedCode | ||
from sagemaker.local.local_session import LocalSession | ||
from sagemaker.model import Model | ||
from sagemaker.model import (SCRIPT_PARAM_NAME, DIR_PARAM_NAME, CLOUDWATCH_METRICS_PARAM_NAME, | ||
CONTAINER_LOG_LEVEL_PARAM_NAME, JOB_NAME_PARAM_NAME, SAGEMAKER_REGION_PARAM_NAME) | ||
|
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type, | |
self.train_volume_size = train_volume_size | ||
self.train_max_run = train_max_run | ||
self.input_mode = input_mode | ||
self.sagemaker_session = sagemaker_session or Session() | ||
|
||
if self.train_instance_type in ('local', 'local_gpu'): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe we should have a enums containing the available instance types and a description of them. What do you think? |
||
self.local_mode = True | ||
if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Create an issue to support distributed local gpu in future, but that's going to be tough to do if the local machine only has a single gpu. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should use constants or the suggested enum instead of repeating the string here. |
||
raise RuntimeError("Distributed Training in Local GPU is not supported") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really want to throw an exception here? A more permissive alternative is to throw a warning educating the customer which scenarios that does not work and why. |
||
|
||
self.sagemaker_session = LocalSession() | ||
else: | ||
self.local_mode = False | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need the |
||
self.sagemaker_session = sagemaker_session or Session() | ||
|
||
self.base_job_name = base_job_name | ||
self._current_job_name = None | ||
self.output_path = output_path | ||
|
@@ -303,7 +314,7 @@ def start_new(cls, estimator, inputs): | |
"""Create a new Amazon SageMaker training job from the estimator. | ||
|
||
Args: | ||
estimator (sagemaker.estimator.Framework): Estimator object created by the user. | ||
estimator (sagemaker.estimator.EstimatorBase): Estimator object created by the user. | ||
inputs (str): Parameters used when called :meth:`~sagemaker.estimator.EstimatorBase.fit`. | ||
|
||
Returns: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"). You | ||
# may not use this file except in compliance with the License. A copy of | ||
# the License is located at | ||
# | ||
# http://aws.amazon.com/apache2.0/ | ||
# | ||
# or in the "license" file accompanying this file. This file is | ||
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF | ||
# ANY KIND, either express or implied. See the License for the specific | ||
# language governing permissions and limitations under the License. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - we've made changes to support estimators, but what about models? If I have existing (local) model data, I should be able to deploy locally without training. Create an issue for this. This may already be supported - but I don't see tests for it.
I should be able to train two containers simultaneously on the same machine. I don't see a test for this. In future - can we add a test for that or explicitly forbid it. Create an issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the first point, it is not entirely possible yet.
It is possible to train 2 containers simultaneously as they don't perform any action that is mutually exclusive. The exception to this would be training with GPU as there is a single hardware bottleneck.