Add Local Mode support #115
@@ -21,6 +21,7 @@
 from sagemaker.fw_utils import tar_and_upload_dir
 from sagemaker.fw_utils import parse_s3_url
 from sagemaker.fw_utils import UploadedCode
+from sagemaker.local_session import LocalSession
 from sagemaker.model import Model
 from sagemaker.model import (SCRIPT_PARAM_NAME, DIR_PARAM_NAME, CLOUDWATCH_METRICS_PARAM_NAME,
                              CONTAINER_LOG_LEVEL_PARAM_NAME, JOB_NAME_PARAM_NAME, SAGEMAKER_REGION_PARAM_NAME)
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
         self.train_volume_size = train_volume_size
         self.train_max_run = train_max_run
         self.input_mode = input_mode
-        self.sagemaker_session = sagemaker_session or Session()
+
+        if self.train_instance_type in ('local', 'local_gpu'):

Review comment: I believe we should have an enum containing the available instance types and a description of them. What do you think?

+            self.local_mode = True
+            if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:

Review comment: Create an issue to support distributed local GPU in the future, but that's going to be tough to do if the local machine only has a single GPU.

Review comment: You should use constants or the suggested enum instead of repeating the string here.

+                raise RuntimeError("Distributed Training in Local GPU is not supported")

Review comment: Do we really want to throw an exception here? A more permissive alternative is to raise a warning educating the customer about which scenarios do not work and why.

+            self.sagemaker_session = LocalSession()
+        else:
+            self.local_mode = False

Review comment: Do we need the …

+            self.sagemaker_session = sagemaker_session or Session()
+
         self.base_job_name = base_job_name
         self._current_job_name = None
         self.output_path = output_path
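The session-selection branch added to __init__ above can be mirrored as a standalone sketch. This is a simplification for illustration only: the session objects are stand-in strings, not the SDK's real LocalSession/Session classes.

```python
# Hedged sketch of the dispatch logic the diff adds to __init__.
# 'LocalSession' and 'Session' are placeholder strings, not real SDK objects.
def select_session(train_instance_type, train_instance_count, sagemaker_session=None):
    """Return (local_mode, session) the way the new __init__ branch does."""
    if train_instance_type in ('local', 'local_gpu'):
        if train_instance_type == 'local_gpu' and train_instance_count > 1:
            # Mirrors the PR: multi-instance local GPU training is rejected.
            raise RuntimeError("Distributed Training in Local GPU is not supported")
        return True, 'LocalSession'
    # Non-local instance types keep the old behavior: reuse a passed-in
    # session, or fall back to a fresh default one.
    return False, sagemaker_session or 'Session'
```

For example, `select_session('local', 1)` picks the local path, while `select_session('ml.m4.xlarge', 2)` falls through to the regular session.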
@@ -303,7 +314,7 @@ def start_new(cls, estimator, inputs):
         """Create a new Amazon SageMaker training job from the estimator.

         Args:
-            estimator (sagemaker.estimator.Framework): Estimator object created by the user.
+            estimator (sagemaker.estimator.EstimatorBase): Estimator object created by the user.
             inputs (str): Parameters used when called :meth:`~sagemaker.estimator.EstimatorBase.fit`.

         Returns:
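One of the review comments above asks for an enum of the available instance types instead of repeated string literals. A hypothetical sketch of what that could look like (the class and member names are illustrative, not part of the SDK or this PR):

```python
from enum import Enum

class LocalInstanceType(Enum):
    """Hypothetical enum for the local-mode instance types (illustration only)."""
    LOCAL = 'local'          # train in a local CPU container
    LOCAL_GPU = 'local_gpu'  # train in a local GPU container

    @classmethod
    def is_local(cls, instance_type):
        # True when the raw string names one of the local-mode types.
        return any(instance_type == member.value for member in cls)
```

With this, the checks in __init__ could compare against `LocalInstanceType.LOCAL_GPU.value` rather than repeating the `'local_gpu'` literal.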
@@ -45,19 +45,27 @@ def create_image_uri(region, framework, instance_type, framework_version, py_ver
         str: The appropriate image URI based on the given parameters.
     """
-    if not instance_type.startswith('ml.'):
+    # Handle Local Mode
+    if instance_type.startswith('local'):
+        if instance_type == 'local':

Review comment: You already know it's either 'local' or 'local_gpu'.

+            device_type = 'cpu'
+        elif instance_type == 'local_gpu':
+            device_type = 'gpu'
+    elif not instance_type.startswith('ml.'):
         raise ValueError('{} is not a valid SageMaker instance type. See: '
                          'https://aws.amazon.com/sagemaker/pricing/instance-types/'.format(instance_type))
-    family = instance_type.split('.')[1]
-
-    # For some frameworks, we have optimized images for specific families, e.g c5 or p3. In those cases,
-    # we use the family name in the image tag. In other cases, we use 'cpu' or 'gpu'.
-    if family in optimized_families:
-        device_type = family
-    elif family[0] in ['g', 'p']:
-        device_type = 'gpu'
-    else:
-        device_type = 'cpu'
+    else:
+        family = instance_type.split('.')[1]
+
+        # For some frameworks, we have optimized images for specific families, e.g c5 or p3. In those cases,
+        # we use the family name in the image tag. In other cases, we use 'cpu' or 'gpu'.
+        if family in optimized_families:
+            device_type = family
+        elif family[0] in ['g', 'p']:
+            device_type = 'gpu'
+        else:
+            device_type = 'cpu'

     tag = "{}-{}-{}".format(framework_version, device_type, py_version)
     return "{}.dkr.ecr.{}.amazonaws.com/sagemaker-{}:{}" \
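The device-type resolution in the hunk above can be re-expressed as a pure function. This is a hedged paraphrase of the diff's branching for illustration; the real create_image_uri goes on to build the full tag and ECR URI from the result.

```python
def resolve_device_type(instance_type, optimized_families=()):
    """Map an instance type to the device label used in the image tag,
    following the branching shown in the diff above."""
    if instance_type == 'local':
        return 'cpu'
    if instance_type == 'local_gpu':
        return 'gpu'
    if not instance_type.startswith('ml.'):
        raise ValueError('{} is not a valid SageMaker instance type.'.format(instance_type))
    family = instance_type.split('.')[1]
    if family in optimized_families:
        return family        # an optimized image exists for this family (e.g. c5, p3)
    if family[0] in ('g', 'p'):
        return 'gpu'         # GPU instance families start with 'g' or 'p'
    return 'cpu'
```

For instance, `resolve_device_type('ml.p2.xlarge')` maps to the GPU label, while passing `optimized_families=('c5',)` makes 'ml.c5.xlarge' resolve to its family name instead of a plain CPU label.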
Review comment: So - we've made changes to support estimators, but what about models? If I have existing (local) model data, I should be able to deploy locally without training. Create an issue for this. This may already be supported - but I don't see tests for it. I should also be able to train two containers simultaneously on the same machine. I don't see a test for this. In the future, can we add a test for that or explicitly forbid it? Create an issue for this.

Reply: On the first point, it is not entirely possible yet. It is possible to train two containers simultaneously, as they don't perform any action that is mutually exclusive. The exception to this would be training with a GPU, as there is a single hardware bottleneck.