Add Local Mode support #115
Conversation
When passing "local" as the instance type for any estimator, training and deployment happen locally. Similarly, using "local_gpu" will use nvidia-docker-compose and work for GPU training.
Codecov Report
@@ Coverage Diff @@
## master #115 +/- ##
=========================================
- Coverage 91.42% 90.4% -1.03%
=========================================
Files 34 36 +2
Lines 2064 2407 +343
=========================================
+ Hits 1887 2176 +289
- Misses 177 231 +54
Continue to review full report at Codecov.
Suggest adding a functional test
What's the story with Windows support?
src/sagemaker/fw_utils.py
Outdated
if not instance_type.startswith('ml.'):
    # Handle Local Mode
    if instance_type.startswith('local'):
        if instance_type == 'local':
device_type = 'cpu' if instance_type == 'local' else 'gpu'
you already know it's either 'local' or 'local_gpu'
src/sagemaker/image.py
Outdated
logger.setLevel(logging.WARNING)


class SageMakerContainer(object):
This should have an API doc
src/sagemaker/image.py
Outdated
class SageMakerContainer(object):

    def __init__(self, instance_type, instance_count, image, sagemaker_session=None):
        from sagemaker.local_session import LocalSession
This should have an API doc
src/sagemaker/image.py
Outdated
# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']
    key = uri[len(bucket_name_with_prefix):]
It would be more robust to use urlparse here
src/sagemaker/image.py
Outdated
""" | ||
self.container_root = _create_tmp_folder() | ||
|
||
data_dir = tempfile.mkdtemp() |
Problem here is that a customer might have different volumes. They should be able to specify a root for their data downloading.
For example - at home I have an SSD and a bunch of magnetic drives. This would almost certainly create a directory on my SSD, which I wouldn't want.
This may not be cleaned up by the OS or the Python interpreter, and it may be large. This should be explicitly cleaned up after training.
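The explicit cleanup the reviewer asks for could look like this sketch, assuming the download happens inside the guarded block:

```python
import os
import shutil
import tempfile

data_dir = tempfile.mkdtemp()
try:
    # ... download channel data here and run training against it ...
    pass
finally:
    # Downloaded training data can be large, and neither the OS nor
    # the Python interpreter is guaranteed to reclaim it, so remove
    # it explicitly once training finishes.
    shutil.rmtree(data_dir, ignore_errors=True)
```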
src/sagemaker/image.py
Outdated
return os.path.abspath(dir)


def _prepare_config_file_directory(root, host):
This creates the config directories. Can methods/functions in this file that do similar things be named in a similar way? E.g. if you are creating something, call it create_config_file_directory.
src/sagemaker/local_session.py
Outdated
logger.setLevel(logging.WARNING)


class LocalSagemakerClient(object):
Add API docs. All public members should have API docs; I won't repeat this per member.
src/sagemaker/local_session.py
Outdated
http = urllib3.PoolManager()
while True:
    i += 1
remove this whitespace
src/sagemaker/local_session.py
Outdated
return {'Body': r, 'ContentType': Accept}


class LocalSession(Session):
Add API docs
src/sagemaker/session.py
Outdated
@@ -756,6 +756,32 @@ def __init__(self, s3_data, distribution='FullyReplicated', compression=None,
        self.config['RecordWrapperType'] = record_wrapping


class Inputs(object):
Add API docs
I have just posted an update. I have added a configuration file at $HOME/.sagemaker/config.yaml:

$ cat ~/.sagemaker/config.yaml

I will add docs regarding this to the README. It is completely optional, and if it doesn't exist nothing will happen. I improved the cleanup of the containers and in general got rid of some methods that didn't make sense. Now we have support to specify a root for all the container data.
Not all comments need to be addressed now.
Add a comment saying Windows support is experimental.
If not before launch, then immediately after:
- Add more functional tests that do training / serving concurrently - i.e. serve multiple things and train multiple things simultaneously.
- Add functional tests for training / serving failures.
- Write documentation - put this in readme for now, but this should be broken out when the documentation is refactored.
src/sagemaker/image.py
Outdated
# set the local config. This is optional and will use reasonable defaults
# if not present.
self.local_config = None
if self.sagemaker_session.config and 'local'in self.sagemaker_session.config:
add a space after 'local'
src/sagemaker/image.py
Outdated
hyperparameters (dict): The HyperParameters for the training job.

Returns (str): Location of the trained model.
remove this whitespace
src/sagemaker/image.py
Outdated
# mount the local directory to the container. For S3 Data we will download the S3 data
# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']
Discussed offline - we'll revisit this later.
src/sagemaker/image.py
Outdated
for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))
This goes against the semantics of ShardedByS3Key - so we shouldn't support that option in local mode OR this should be redesigned.
src/sagemaker/image.py
Outdated
for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))
In future, can we download S3 data straight into this folder to avoid copying? Can you create an issue for this?
Done.
src/sagemaker/image.py
Outdated
    return s3_model_artifacts

def write_config_files(self, host, hyperparameters, input_data_config):
This should be a private method - it's unlikely a client is going to have a reason to call this. Also - the need to pass in a host argument is difficult. It can't be called at any time outside of a train or serve command - so make it private.
I agree with this; the only reason I made this public was so I could have unit tests for it.
You can still test it. A leading underscore is just a convention in Python.
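To illustrate the point: a leading underscore does not stop a unit test from calling the method. A simplified stand-in for the container class (not the PR's real implementation):

```python
class _SageMakerContainer(object):
    # Simplified stand-in for the PR's container class.
    def _write_config_files(self, host, hyperparameters, input_data_config):
        return {'host': host,
                'hyperparameters': hyperparameters,
                'channels': input_data_config}

# Tests can still reach the underscored method directly:
container = _SageMakerContainer()
config = container._write_config_files('algo-1', {'lr': '0.1'}, [])
```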
src/sagemaker/image.py
Outdated
    return content

def _compose(self, detached=False):
Can this be named _build_compose_command?
src/sagemaker/local_session.py
Outdated
self.role_arn = None
self.created_endpoint = False

def create_training_job(self, TrainingJobName, AlgorithmSpecification, RoleArn, InputDataConfig, OutputDataConfig,
We need to decide if we want to support multiple training jobs or not.
If we expect customers to only do local training via estimators, then we don't. If we expect customers to use a LocalSession to do training, then we do. Create an issue for this, should frame the ambiguity.
The same applies to the hosting methods as well, for the different hosting entities - endpoint, model, endpoint config. Create an issue for this, again framing ambiguity.
Even if we only support one endpoint per session (as it is now), we need to be able to run multiple endpoints in different sessions concurrently. This means we need to be flexible with ports in a way that we aren't now (I think). Create an issue for this.
You are right; the port as of now is configurable, but it's not flexible at runtime. This is one of the things I considered improving, as it should be fairly transparent to the users.
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()
So - we've made changes to support estimators, but what about models? If I have existing (local) model data, I should be able to deploy locally without training. Create an issue for this. This may already be supported - but I don't see tests for it.
I should be able to train two containers simultaneously on the same machine. I don't see a test for this. In future - can we add a test for that or explicitly forbid it. Create an issue for this.
On the first point, it is not entirely possible yet.
It is possible to train 2 containers simultaneously as they don't perform any action that is mutually exclusive. The exception to this would be training with GPU as there is a single hardware bottleneck.
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
Create an issue to support distributed local gpu in future, but that's going to be tough to do if the local machine only has a single gpu.
Also got rid of some of the docker cleanup stuff.
return volumes

def _cleanup(self):
    _check_output('docker network prune -f')
In future - let's keep track of the networks created and just remove those.
Let's also open an issue for billable seconds - should be None
We need integration tests for this feature.
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()

        if self.train_instance_type in ('local', 'local_gpu'):
I believe we should have an enum containing the available instance types and a description of them. What do you think?
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
You should use constants or the suggested enum instead of repeating the string here.
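The suggested constants could look like this sketch; the class name and comments are illustrative, not the PR's actual code:

```python
class LocalInstanceType(object):
    """Hypothetical constants for the local-mode instance types the
    PR currently repeats as string literals."""
    LOCAL = 'local'          # CPU training and serving in local Docker
    LOCAL_GPU = 'local_gpu'  # GPU training via nvidia-docker-compose

    ALL = (LOCAL, LOCAL_GPU)
```

Membership checks then become `instance_type in LocalInstanceType.ALL`, and a rename touches one place.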
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
        raise RuntimeError("Distributed Training in Local GPU is not supported")
Do we really want to throw an exception here? A more permissive alternative is to emit a warning educating the customer about which scenarios do not work and why.
    self.sagemaker_session = LocalSession()
else:
    self.local_mode = False
Do we need self.local_mode, given that the instance type determines it? If we do keep it, let's rename it to _local_mode.
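One way to avoid the separate flag entirely is to derive it, as in this minimal sketch (not the SDK's real Estimator):

```python
class Estimator(object):
    """Minimal sketch, not the SDK's real Estimator class."""

    def __init__(self, train_instance_type):
        self.train_instance_type = train_instance_type

    @property
    def _local_mode(self):
        # Computed on demand from the instance type, so no separate
        # boolean can drift out of sync with it.
        return self.train_instance_type in ('local', 'local_gpu')
```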
logger.setLevel(logging.WARNING)


class _SageMakerContainer(object):
Should this class just be named _Container?
    'command': command
}

serving_port = 8080 if self.local_config is None else self.local_config.get('serving_port', 8080)
You can rewrite this as:
serving_port = self.local_config.get('serving_port', 8080) if self.local_config else 8080
self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))

data_dir = self._create_tmp_folder()
data_dir can be a normal tmp directory since it is not going to be a Docker volume.
Returns (str): Location of the trained model.
"""
self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))
Should we move the output folder creation inside _create_tmp_folder?
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
exit_code = None
while exit_code is None:
    stdout = process.stdout.readline().decode("utf-8")
Does decode work with both 2.7 and 3.6?
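For what it's worth, `decode` is available on Python 2.7 byte strings and on Python 3 `bytes`, so the `readline().decode("utf-8")` pattern behaves the same on both. A quick check:

```python
# bytes.decode exists on str in Python 2.7 and on bytes in Python 3,
# so readline().decode("utf-8") works under either interpreter.
line = b"algo-1 | epoch 1 complete\n"
decoded = line.decode("utf-8")
```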
response = {'ResourceConfig': {'InstanceCount': self.train_container.instance_count},
            'TrainingJobStatus': 'Completed',
            'TrainingStartTime': datetime.datetime.now(),
            'TrainingEndTime': datetime.datetime.now(),
This is another thing that we need to review soon. The timing and status returned should reflect the actual Docker container execution.
We need tests for the following scenarios:
Arpin r byo dkr restart
Local Mode also works for hosting; inference calls can be done locally.