Add Local Mode support #115
Conversation
When passing "local" as the instance type for any estimator, training and deployment happen locally. Similarly, using "local_gpu" will use nvidia-docker-compose and work for GPU training.
Codecov Report
@@ Coverage Diff @@
## master #115 +/- ##
=========================================
- Coverage 91.42% 90.4% -1.03%
=========================================
Files 34 36 +2
Lines 2064 2407 +343
=========================================
+ Hits 1887 2176 +289
- Misses 177 231 +54
Continue to review full report at Codecov.
Suggest adding a functional test
What's the story with Windows support?
src/sagemaker/fw_utils.py
Outdated
if not instance_type.startswith('ml.'):
    # Handle Local Mode
    if instance_type.startswith('local'):
        if instance_type == 'local':
device_type = 'cpu' if instance_type == 'local' else 'gpu'
you already know it's either 'local' or 'local_gpu'
src/sagemaker/image.py
Outdated
logger.setLevel(logging.WARNING)


class SageMakerContainer(object):
This should have an API doc
src/sagemaker/image.py
Outdated
class SageMakerContainer(object):

    def __init__(self, instance_type, instance_count, image, sagemaker_session=None):
        from sagemaker.local_session import LocalSession
This should have an API doc
src/sagemaker/image.py
Outdated
# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']
    key = uri[len(bucket_name_with_prefix):]
It would be more robust to use urlparse here
src/sagemaker/image.py
Outdated
""" | ||
self.container_root = _create_tmp_folder() | ||
|
||
data_dir = tempfile.mkdtemp() |
Problem here is that a customer might have different volumes. They should be able to specify a root for their data downloading.
For example - at home I have an SSD and a bunch of magnetic drives. This would almost certainly create a directory on my SSD, which I wouldn't want.
This may not be cleaned up by the OS or the Python interpreter, and it may be large. This should be explicitly cleaned up after training.
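The explicit cleanup the reviewer asks for could look like this sketch, assuming the download happens inside the guarded block:

```python
import os
import shutil
import tempfile

data_dir = tempfile.mkdtemp()
try:
    # ... download channel data here and run training against it ...
    pass
finally:
    # Downloaded training data can be large, and neither the OS nor
    # the Python interpreter is guaranteed to reclaim it, so remove
    # it explicitly once training finishes.
    shutil.rmtree(data_dir, ignore_errors=True)
```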
src/sagemaker/image.py
Outdated
return os.path.abspath(dir)


def _prepare_config_file_directory(root, host):
This creates the config directories. Can methods/functions in this file that do similar things be named in a similar way? E.g. if you are creating something, call it create_config_file_directory.
src/sagemaker/local_session.py
Outdated
logger.setLevel(logging.WARNING)


class LocalSagemakerClient(object):
Add API docs. All public members should have API docs; I won't repeat this per member.
src/sagemaker/local_session.py
Outdated
http = urllib3.PoolManager()
while True:
    i += 1
remove this whitespace
src/sagemaker/local_session.py
Outdated
return {'Body': r, 'ContentType': Accept}


class LocalSession(Session):
Add API docs
src/sagemaker/session.py
Outdated
@@ -756,6 +756,32 @@ def __init__(self, s3_data, distribution='FullyReplicated', compression=None,
        self.config['RecordWrapperType'] = record_wrapping


class Inputs(object):
Add API docs
I have just posted an update. I have added a configuration file at $HOME/.sagemaker/config.yaml:

$ cat ~/.sagemaker/config.yaml

I will add docs regarding this to the README. It is completely optional, and if it doesn't exist nothing will happen. I improved the cleanup of the containers and in general got rid of some methods that didn't make sense. Now we have support to specify a root for all the container data.
Not all comments need to be addressed now.
Add a comment saying Windows support is experimental.
If not before launch, then immediately after:
- Add more functional tests that do training / serving concurrently - i.e. serve multiple things and train multiple things simultaneously.
- Add functional tests for training / serving failures.
- Write documentation - put this in readme for now, but this should be broken out when the documentation is refactored.
src/sagemaker/image.py
Outdated
# set the local config. This is optional and will use reasonable defaults
# if not present.
self.local_config = None
if self.sagemaker_session.config and 'local'in self.sagemaker_session.config:
add a space after 'local'
src/sagemaker/image.py
Outdated
hyperparameters (dict): The HyperParameters for the training job.

Returns (str): Location of the trained model.
remove this whitespace
src/sagemaker/image.py
Outdated
# mount the local directory to the container. For S3 Data we will download the S3 data
# first.
for channel in input_data_config:
    uri = channel['DataSource']['S3DataSource']['S3Uri']
Discussed offline - we'll revisit this later.
src/sagemaker/image.py
Outdated
for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))
This goes against the semantics of ShardedByS3Key - so we shouldn't support that option in local mode OR this should be redesigned.
src/sagemaker/image.py
Outdated
for host in self.hosts:
    _create_config_file_directories(self.container_root, host)
    self.write_config_files(host, hyperparameters, input_data_config)
    shutil.copytree(data_dir, os.path.join(self.container_root, host, 'input', 'data'))
In future, can we download S3 data straight into this folder to avoid copying? Can you create an issue for this?
Done.
src/sagemaker/image.py
Outdated
    return s3_model_artifacts

def write_config_files(self, host, hyperparameters, input_data_config):
This should be a private method - it's unlikely a client is going to have a reason to call this. Also - the need to pass in a host argument is difficult. It can't be called at any time outside of a train or serve command - so make it private.
I agree with this; the only reason I made this public was so I could have unit tests for it.
You can still test it. A leading underscore is just a convention in Python.
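To illustrate the point: a leading underscore does not stop a unit test from calling the method. A simplified stand-in for the container class (not the PR's real implementation):

```python
class _SageMakerContainer(object):
    # Simplified stand-in for the PR's container class.
    def _write_config_files(self, host, hyperparameters, input_data_config):
        return {'host': host,
                'hyperparameters': hyperparameters,
                'channels': input_data_config}

# Tests can still reach the underscored method directly:
container = _SageMakerContainer()
config = container._write_config_files('algo-1', {'lr': '0.1'}, [])
```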
src/sagemaker/image.py
Outdated
    return content

def _compose(self, detached=False):
Can this be named _build_compose_command?
src/sagemaker/local_session.py
Outdated
self.role_arn = None
self.created_endpoint = False

def create_training_job(self, TrainingJobName, AlgorithmSpecification, RoleArn, InputDataConfig, OutputDataConfig,
We need to decide if we want to support multiple training jobs or not.
If we expect customers to only do local training via estimators, then we don't. If we expect customers to use a LocalSession to do training, then we do. Create an issue for this, should frame the ambiguity.
The same applies to the hosting methods as well, for the different hosting entities - endpoint, model, endpoint config. Create an issue for this, again framing ambiguity.
Even if we only support one endpoint per session (as it is now), we need to be able to run multiple endpoints in different sessions concurrently. This means we need to be flexible with ports in a way that we aren't now (I think). Create an issue for this.
You are right; the port as of now is configurable, but it's not flexible at runtime. This is one of the things I considered improving, as it should be fairly transparent to the users.
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()
So - we've made changes to support estimators, but what about models? If I have existing (local) model data, I should be able to deploy locally without training. Create an issue for this. This may already be supported - but I don't see tests for it.
I should be able to train two containers simultaneously on the same machine. I don't see a test for this. In future - can we add a test for that or explicitly forbid it. Create an issue for this.
On the first point, it is not entirely possible yet.
It is possible to train 2 containers simultaneously as they don't perform any action that is mutually exclusive. The exception to this would be training with GPU as there is a single hardware bottleneck.
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
Create an issue to support distributed local gpu in future, but that's going to be tough to do if the local machine only has a single gpu.
Also got rid of some of the docker cleanup stuff.
return volumes

def _cleanup(self):
    _check_output('docker network prune -f')
In future - let's keep track of the networks created and just remove those.
Let's also open an issue for billable seconds - should be None
We need integration tests for this feature.
@@ -78,7 +79,17 @@ def __init__(self, role, train_instance_count, train_instance_type,
        self.train_volume_size = train_volume_size
        self.train_max_run = train_max_run
        self.input_mode = input_mode
        self.sagemaker_session = sagemaker_session or Session()

        if self.train_instance_type in ('local', 'local_gpu'):
I believe we should have an enum containing the available instance types and a description of them. What do you think?
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
You should use constants or the suggested enum instead of repeating the string here.
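The suggested constants could look like this sketch; the class name and comments are illustrative, not the PR's actual code:

```python
class LocalInstanceType(object):
    """Hypothetical constants for the local-mode instance types the
    PR currently repeats as string literals."""
    LOCAL = 'local'          # CPU training and serving in local Docker
    LOCAL_GPU = 'local_gpu'  # GPU training via nvidia-docker-compose

    ALL = (LOCAL, LOCAL_GPU)
```

Membership checks then become `instance_type in LocalInstanceType.ALL`, and a rename touches one place.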
if self.train_instance_type in ('local', 'local_gpu'):
    self.local_mode = True
    if self.train_instance_type == 'local_gpu' and self.train_instance_count > 1:
        raise RuntimeError("Distributed Training in Local GPU is not supported")
Do we really want to throw an exception here? A more permissive alternative is to emit a warning educating the customer about which scenarios do not work and why.
    self.sagemaker_session = LocalSession()
else:
    self.local_mode = False
Do we need self.local_mode, given that the instance type determines it? If we do keep it, let's rename it to _local_mode.
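One way to avoid the separate flag entirely is to derive it, as in this minimal sketch (not the SDK's real Estimator):

```python
class Estimator(object):
    """Minimal sketch, not the SDK's real Estimator class."""

    def __init__(self, train_instance_type):
        self.train_instance_type = train_instance_type

    @property
    def _local_mode(self):
        # Computed on demand from the instance type, so no separate
        # boolean can drift out of sync with it.
        return self.train_instance_type in ('local', 'local_gpu')
```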
logger.setLevel(logging.WARNING)


class _SageMakerContainer(object):
Should this class just be named _Container?
    'command': command
}

serving_port = 8080 if self.local_config is None else self.local_config.get('serving_port', 8080)
You can rewrite this as:
serving_port = self.local_config.get('serving_port', 8080) if self.local_config else 8080
self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))

data_dir = self._create_tmp_folder()
data_dir can be a normal tmp directory since it is not going to be a Docker volume.
Returns (str): Location of the trained model.
"""
self.container_root = self._create_tmp_folder()
os.mkdir(os.path.join(self.container_root, 'output'))
Should we move the output folder creation inside _create_tmp_folder?
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
exit_code = None
while exit_code is None:
    stdout = process.stdout.readline().decode("utf-8")
Does decode work with both 2.7 and 3.6?
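For what it's worth, `decode` is available on Python 2.7 byte strings and on Python 3 `bytes`, so the `readline().decode("utf-8")` pattern behaves the same on both. A quick check:

```python
# bytes.decode exists on str in Python 2.7 and on bytes in Python 3,
# so readline().decode("utf-8") works under either interpreter.
line = b"algo-1 | epoch 1 complete\n"
decoded = line.decode("utf-8")
```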
response = {'ResourceConfig': {'InstanceCount': self.train_container.instance_count},
            'TrainingJobStatus': 'Completed',
            'TrainingStartTime': datetime.datetime.now(),
            'TrainingEndTime': datetime.datetime.now(),
This is another thing that we need to review soon. The timing and status returned should reflect the actual Docker container execution.
We need tests for the following scenarios:
Arpin r byo dkr restart
Local Mode also works for hosting; inference calls can be done locally.