Add Local Mode support #115

Merged: 10 commits into aws:master on Apr 1, 2018

Conversation

iquintero (Contributor)

When passing "local" as the instance type for any estimator,
training and deployment happens locally.

Similarly, using "local_gpu" will use nvidia-docker-compose and
enable GPU training.

Local Mode also works for Hosting, and inference calls can be served locally.

@iquintero iquintero requested review from owen-t and winstonaws March 30, 2018 23:09
@codecov-io commented Mar 31, 2018

Codecov Report

Merging #115 into master will decrease coverage by 1.02%.
The diff coverage is 84.61%.

@@            Coverage Diff            @@
##           master    #115      +/-   ##
=========================================
- Coverage   91.42%   90.4%   -1.03%     
=========================================
  Files          34      36       +2     
  Lines        2064    2407     +343     
=========================================
+ Hits         1887    2176     +289     
- Misses        177     231      +54
Impacted Files Coverage Δ
src/sagemaker/fw_utils.py 98.52% <100%> (+0.04%) ⬆️
src/sagemaker/__init__.py 100% <100%> (ø) ⬆️
src/sagemaker/estimator.py 86.01% <100%> (+0.42%) ⬆️
src/sagemaker/local/local_session.py 78.16% <78.16%> (ø)
src/sagemaker/session.py 87.19% <80%> (-0.13%) ⬇️
src/sagemaker/local/image.py 85.89% <85.89%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f92fbd...f841623.

@owen-t (Contributor) left a comment:

Suggest adding a functional test

What's the story with Windows support?

if not instance_type.startswith('ml.'):
    # Handle Local Mode
    if instance_type.startswith('local'):
        if instance_type == 'local':

device_type = 'cpu' if instance_type == 'local' else 'gpu'

you already know it's either 'local' or 'local_gpu'

logger.setLevel(logging.WARNING)


class SageMakerContainer(object):

This should have an API doc

class SageMakerContainer(object):

    def __init__(self, instance_type, instance_count, image, sagemaker_session=None):
        from sagemaker.local_session import LocalSession

This should have an API doc

# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']
    key = uri[len(bucket_name_with_prefix):]

It would be more robust to use urlparse here
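For illustration, a urlparse-based split (a sketch using Python 3's urllib.parse; `split_s3_uri` is a hypothetical helper, whereas the prefix-slicing above silently breaks if the URI formatting varies):

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    # Parse scheme/bucket/key explicitly instead of slicing the string
    # by the length of a precomputed bucket-and-prefix string.
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError('Expected an s3:// URI, got: %s' % uri)
    return parsed.netloc, parsed.path.lstrip('/')
```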

"""
self.container_root = _create_tmp_folder()

data_dir = tempfile.mkdtemp()

Problem here is that a customer might have different volumes. They should be able to specify a root for their data downloading.

For example - at home I have an SSD and a bunch of magnetic drives. This would almost certainly create a directory on my SSD, which I wouldn't want.
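A configurable root is straightforward with tempfile, since mkdtemp accepts a dir argument; a user-supplied path (e.g. from a config file) can redirect where the data lands (a sketch; `create_data_dir` and the prefix are hypothetical names):

```python
import tempfile

def create_data_dir(root=None):
    # dir=None falls back to the system default (TMPDIR, /tmp, ...);
    # a user-supplied root keeps large downloads off an unwanted drive.
    return tempfile.mkdtemp(prefix='sagemaker-local-', dir=root)
```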


This may not be cleaned up by the OS or the Python interpreter, and it may be large. It should be explicitly cleaned up after training.

return os.path.abspath(dir)


def _prepare_config_file_directory(root, host):

This creates the config directories. Can methods/functions in this file that do similar things be named in a similar way? E.g. if you are creating something, call it create_config_file_directory.

logger.setLevel(logging.WARNING)


class LocalSagemakerClient(object):

Add API docs. All public members should have API docs; I won't repeat this per member.

http = urllib3.PoolManager()
while True:
    i += 1


remove this whitespace

return {'Body': r, 'ContentType': Accept}


class LocalSession(Session):

Add API docs

@@ -756,6 +756,32 @@ def __init__(self, s3_data, distribution='FullyReplicated', compression=None,
        self.config['RecordWrapperType'] = record_wrapping


class Inputs(object):

Add API docs

@iquintero (Contributor, Author)

I have just posted an update. I have added a configuration file:

$HOME/.sagemaker/config.yaml

$ cat ~/.sagemaker/config.yaml
local:
    container_root: /Users/nacho/container_root
    serving_port: 8081

I will add docs regarding this to the README. It is completely optional; if the file doesn't exist, reasonable defaults are used.

I improved the cleanup of the containers and got rid of some methods that didn't make sense. There is now support for specifying a root for all the container data.
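The defaults merging for the optional config might look like the following sketch (the helper name and the None default for container_root are assumptions; only serving_port's 8080 default is visible in this PR):

```python
DEFAULTS = {'container_root': None, 'serving_port': 8080}

def resolve_local_config(user_config=None):
    # The 'local' section of ~/.sagemaker/config.yaml, already parsed
    # into a dict, is layered over reasonable defaults; a missing or
    # empty file simply leaves the defaults in place.
    config = dict(DEFAULTS)
    config.update(user_config or {})
    return config
```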

@owen-t (Contributor) left a comment:

Not all comments need to be addressed now.

Add a comment saying Windows support is experimental.

If not before launch, then immediately after:

  • Add more functional tests that do training / serving concurrently - i.e. serve multiple things and train multiple things simultaneously.
  • Add functional tests for training / serving failures.
  • Write documentation - put this in readme for now, but this should be broken out when the documentation is refactored.

# set the local config. This is optional and will use reasonable defaults
# if not present.
self.local_config = None
if self.sagemaker_session.config and 'local'in self.sagemaker_session.config:

add a space after 'local'

hyperparameters (dict): The HyperParameters for the training job.

Returns (str): Location of the trained model.


remove this whitespace

# mount the local directory to the container. For S3 Data we will download the S3 data
# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']

Discussed offline - we'll revisit this later.

for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))

This goes against the semantics of ShardedByS3Key - so we shouldn't support that option in local mode OR this should be redesigned.

for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))

In future, can we download S3 data straight into this folder to prevent copying? Can you create an issue for this?

iquintero (Author):
Done.


        return s3_model_artifacts

    def write_config_files(self, host, hyperparameters, input_data_config):

This should be a private method; it's unlikely a client will have a reason to call it. Also, the need to pass in a host argument is awkward: it can't be called outside of a train or serve command, so make it private.

iquintero (Author):

I agree with this; the only reason I made it public is that I wanted unit tests for it.

Contributor:

You can still test it. A leading underscore is just a convention in Python.


        return content

    def _compose(self, detached=False):

Can this be named _build_compose_command?

        self.role_arn = None
        self.created_endpoint = False

    def create_training_job(self, TrainingJobName, AlgorithmSpecification, RoleArn, InputDataConfig, OutputDataConfig,

We need to decide if we want to support multiple training jobs or not.

If we expect customers to only do local training via estimators, then we don't. If we expect customers to use a LocalSession to do training, then we do. Create an issue for this, should frame the ambiguity.

The same applies to the hosting methods as well, for the different hosting entities - endpoint, model, endpoint config. Create an issue for this, again framing ambiguity.

Even if we only support one endpoint per session (as it is now), we need to be able to run multiple endpoints in different sessions concurrently. This means we need to be flexible with ports in a way that we aren't now (I think). Create an issue for this.

iquintero (Author):
You are right; the port is configurable as of now, but it's not flexible at runtime. This is one of the things I considered improving, as it should be fairly transparent to users.
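One common way to make ports flexible at runtime is to ask the OS for a free port by binding to port 0 (a sketch, not what this PR implements; `find_free_port` is a hypothetical helper):

```python
import socket

def find_free_port():
    # Binding to port 0 makes the OS pick an unused port, so several
    # local endpoints could run concurrently instead of all competing
    # for a single configured port such as 8080.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]
```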

@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()


So - we've made changes to support estimators, but what about models? If I have existing (local) model data, I should be able to deploy locally without training. Create an issue for this. This may already be supported - but I don't see tests for it.

I should be able to train two containers simultaneously on the same machine. I don't see a test for this. In future - can we add a test for that or explicitly forbid it. Create an issue for this.

iquintero (Author):
On the first point, it is not entirely possible yet.

It is possible to train 2 containers simultaneously as they don't perform any action that is mutually exclusive. The exception to this would be training with GPU as there is a single hardware bottleneck.


if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:

Create an issue to support distributed local GPU in future, but that's going to be tough to do if the local machine only has a single GPU.

    return volumes

def _cleanup(self):
    _check_output('docker network prune -f')

In future - let's keep track of the networks created and just remove those.

@owen-t commented Apr 1, 2018

Let's also open an issue for billable seconds - should be None

@iquintero iquintero merged commit 6184b22 into aws:master Apr 1, 2018
@mvsusp (Contributor) left a comment:

We need integration tests for this feature.

@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()

        if self.train_instance_type in ('local', 'local_gpu'):

I believe we should have an enum containing the available instance types and a description of them. What do you think?


if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:

You should use constants or the suggested enum instead of repeating the string here.

if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
        raise RuntimeError("Distributed Training in Local GPU is not supported")

Do we really want to throw an exception here? A more permissive alternative is to emit a warning educating the customer about which scenarios do not work and why.
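The permissive alternative could look like this sketch (`check_local_gpu` and the single-instance fallback are hypothetical, not what the merged code does):

```python
import warnings

def check_local_gpu(instance_type, instance_count):
    # Instead of raising, warn and fall back to one instance:
    # 'local_gpu' shares a single physical GPU anyway, so a
    # multi-instance request cannot actually be parallelized.
    if instance_type == 'local_gpu' and instance_count > 1:
        warnings.warn('Distributed training is not supported with '
                      'local_gpu; falling back to a single instance.')
        return 1
    return instance_count
```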


    self.sagemaker_session = LocalSession()
else:
    self.local_mode = False

Do we need the self.local_mode given that the instance type determines it?
If that really is the case, let's rename it to _local_mode

logger.setLevel(logging.WARNING)


class _SageMakerContainer(object):

Should this class just be named _Container?

    'command': command
}

serving_port = 8080 if self.local_config is None else self.local_config.get('serving_port', 8080)

You can simplify this to:

serving_port = self.local_config.get('serving_port', 8080) if self.local_config else 8080

self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))

data_dir = self._create_tmp_folder()

data_dir can be a normal tmp directory, since it is not going to be a Docker volume.

Returns (str): Location of the trained model.
"""
self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))

Should we move the output folder creation inside the _create_tmp_folder?

process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
exit_code = None
while exit_code is None:
    stdout = process.stdout.readline().decode("utf-8")

Does decode work with both 2.7 and 3.6?
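For reference, both Python 2.7 (where str is the bytes type) and 3.x (bytes proper) expose a decode method, so the readline().decode("utf-8") call above behaves the same on either:

```python
# bytes.decode exists in Python 2.7 (on str, which is bytes there)
# and in Python 3.x (on bytes), so this line is version-portable.
line = b'training complete\n'.decode('utf-8')
```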

response = {'ResourceConfig': {'InstanceCount': self.train_container.instance_count},
            'TrainingJobStatus': 'Completed',
            'TrainingStartTime': datetime.datetime.now(),
            'TrainingEndTime': datetime.datetime.now(),

This is another thing we need to review soon: the timing and status returned should reflect the actual Docker container execution.

@mvsusp commented Apr 2, 2018

We need tests for the following scenarios:

  • Async training
  • Attachment
  • Serving from a model in S3
  • Serving and training errors

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018