Let Framework models reuse code uploaded by Framework estimators #230


Merged
merged 4 commits into from
Jun 21, 2018

Conversation

humanzz
Contributor

@humanzz humanzz commented Jun 14, 2018

Issue #, if available: #226

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-io

codecov-io commented Jun 14, 2018

Codecov Report

Merging #230 into master will decrease coverage by 0.24%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #230      +/-   ##
==========================================
- Coverage   92.53%   92.29%   -0.25%     
==========================================
  Files          49       49              
  Lines        3257     3259       +2     
==========================================
- Hits         3014     3008       -6     
- Misses        243      251       +8
Impacted Files Coverage Δ
src/sagemaker/model.py 95.45% <ø> (ø) ⬆️
src/sagemaker/tensorflow/estimator.py 96.24% <100%> (ø) ⬆️
src/sagemaker/estimator.py 85.77% <100%> (+0.12%) ⬆️
src/sagemaker/chainer/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/pytorch/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/mxnet/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/local/image.py 86.43% <0%> (-2.36%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 92eb47d...a7c4ca6.

laurenyu
laurenyu previously approved these changes Jun 20, 2018
Contributor

@laurenyu laurenyu left a comment


nice! thanks for submitting this!

@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu thanks for approving this. I was trying to be helpful by updating the branch for merging, but GitHub seems to think I dismissed your previous approval. TeamCity also didn't seem happy; I'm not sure whether that was because of the merge failure or something else.
Let me know if I can do anything to help this get merged soon.

@laurenyu
Contributor

@humanzz GitHub dismisses approval whenever there's a new commit.

The TeamCity run failed in the local mode tests: when running in local mode, no code is uploaded, so `self.uploaded_code` doesn't exist:

======= Failed test run #2 ==========
tests/integ/test_local_mode.py:336 (test_mxnet_local_data_local_script)
def test_mxnet_local_data_local_script():
        local_mode_lock_fd = open(LOCK_PATH, 'w')
        local_mode_lock = local_mode_lock_fd.fileno()
    
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')
    
        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='local',
                   sagemaker_session=LocalNoS3Session())
    
        train_input = 'file://' + os.path.join(data_path, 'train')
        test_input = 'file://' + os.path.join(data_path, 'test')
    
        mx.fit({'train': train_input, 'test': test_input})
        endpoint_name = mx.latest_training_job.name
        try:
            # Since Local Mode uses the same port for serving, we need a lock in order
            # to allow concurrent test execution. The serving test is really fast so it still
            # makes sense to allow this behavior.
            fcntl.lockf(local_mode_lock, fcntl.LOCK_EX)
>           predictor = mx.deploy(1, 'local', endpoint_name=endpoint_name)

tests/integ/test_local_mode.py:358: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.mxnet.estimator.MXNet object at 0x7f16023b7940>
initial_instance_count = 1, instance_type = 'local'
endpoint_name = 'sagemaker-mxnet-2018-06-20-01-43-06-918', kwargs = {}

    def deploy(self, initial_instance_count, instance_type, endpoint_name=None, **kwargs):
        """Deploy the trained model to an Amazon SageMaker endpoint and return a ``sagemaker.RealTimePredictor`` object.
    
            More information:
            http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
    
            Args:
                initial_instance_count (int): Minimum number of EC2 instances to deploy to an endpoint for prediction.
                instance_type (str): Type of EC2 instance to deploy to an endpoint for prediction,
                    for example, 'ml.c4.xlarge'.
                endpoint_name (str): Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of
                    the training job is used.
                **kwargs: Passed to invocation of ``create_model()``. Implementations may customize
                    ``create_model()`` to accept ``**kwargs`` to customize model creation during deploy.
                    For more, see the implementation docs.
    
            Returns:
                sagemaker.predictor.RealTimePredictor: A predictor that provides a ``predict()`` method,
                    which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.
            """
        if not self.latest_training_job:
            raise RuntimeError('Estimator has not been fit yet.')
        endpoint_name = endpoint_name or self.latest_training_job.name
        self.deploy_instance_type = instance_type
>       return self.create_model(**kwargs).deploy(
            instance_type=instance_type,
            initial_instance_count=initial_instance_count,
            endpoint_name=endpoint_name)

.tox/py35/lib/python3.5/site-packages/sagemaker/estimator.py:259: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.mxnet.estimator.MXNet object at 0x7f16023b7940>
model_server_workers = None

    def create_model(self, model_server_workers=None):
        """Create a SageMaker ``MXNetModel`` object that can be deployed to an ``Endpoint``.
    
            Args:
                model_server_workers (int): Optional. The number of worker processes used by the inference server.
                    If None, server will use one worker per vCPU.
    
            Returns:
                sagemaker.mxnet.model.MXNetModel: A SageMaker ``MXNetModel`` object.
                    See :func:`~sagemaker.mxnet.model.MXNetModel` for full details.
            """
>       return MXNetModel(self.model_data, self.role, self.entry_point, source_dir=self.uploaded_code.s3_prefix,
                          enable_cloudwatch_metrics=self.enable_cloudwatch_metrics, name=self._current_job_name,
                          container_log_level=self.container_log_level, code_location=self.code_location,
                          py_version=self.py_version, framework_version=self.framework_version,
                          model_server_workers=model_server_workers, sagemaker_session=self.sagemaker_session)
E       AttributeError: 'MXNet' object has no attribute 'uploaded_code'

.tox/py35/lib/python3.5/site-packages/sagemaker/mxnet/estimator.py:85: AttributeError

During handling of the above exception, another exception occurred:

    def test_mxnet_local_data_local_script():
        local_mode_lock_fd = open(LOCK_PATH, 'w')
        local_mode_lock = local_mode_lock_fd.fileno()
    
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')
    
        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='local',
                   sagemaker_session=LocalNoS3Session())
    
        train_input = 'file://' + os.path.join(data_path, 'train')
        test_input = 'file://' + os.path.join(data_path, 'test')
    
        mx.fit({'train': train_input, 'test': test_input})
        endpoint_name = mx.latest_training_job.name
        try:
            # Since Local Mode uses the same port for serving, we need a lock in order
            # to allow concurrent test execution. The serving test is really fast so it still
            # makes sense to allow this behavior.
            fcntl.lockf(local_mode_lock, fcntl.LOCK_EX)
            predictor = mx.deploy(1, 'local', endpoint_name=endpoint_name)
            data = numpy.zeros(shape=(1, 1, 28, 28))
            predictor.predict(data)
        finally:
>           mx.delete_endpoint()

tests/integ/test_local_mode.py:362: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py35/lib/python3.5/site-packages/sagemaker/estimator.py:318: in delete_endpoint
    self.sagemaker_session.delete_endpoint(self.latest_training_job.name)
.tox/py35/lib/python3.5/site-packages/sagemaker/session.py:489: in delete_endpoint
    self.sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.local.local_session.LocalSagemakerClient object at 0x7f16023b7828>
EndpointName = 'sagemaker-mxnet-2018-06-20-01-43-06-918'

    def delete_endpoint(self, EndpointName):
>       self.serve_container.stop_serving()
E       AttributeError: 'NoneType' object has no attribute 'stop_serving'

.tox/py35/lib/python3.5/site-packages/sagemaker/local/local_session.py:142: AttributeError
------- Stdout: -------
Attaching to tmpqvn42alz_algo-1-25Q2V_1
Creating tmpqvn42alz_algo-1-25Q2V_1 ... done
algo-1-25Q2V_1  | 2018-06-20 01:43:08,175 INFO - root - running container entrypoint
algo-1-25Q2V_1  | 2018-06-20 01:43:08,175 INFO - root - starting train task
algo-1-25Q2V_1  | 2018-06-20 01:43:08,181 INFO - container_support.training - Training starting
algo-1-25Q2V_1  | 2018-06-20 01:43:08,576 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {u'test': {u'ContentType': u'application/octet-stream'}, u'train': {u'ContentType': u'application/octet-stream'}}, '_ps_verbose': 0, 'resource_config': {u'current_host': u'algo-1-25Q2V', u'hosts': [u'algo-1-25Q2V']}, 'user_script_name': u'mnist.py', 'input_config_dir': '/opt/ml/input/config', 'channel_dirs': {u'test': u'/opt/ml/input/data/test', u'train': u'/opt/ml/input/data/train'}, 'code_dir': '/opt/ml/code', 'output_data_dir': '/opt/ml/output/data/', 'output_dir': '/opt/ml/output', 'model_dir': '/opt/ml/model', 'hyperparameters': {u'sagemaker_program': u'mnist.py', u'sagemaker_submit_directory': u'file:///home/ec2-user/Agent/work/9caddf8323cb3b11/tests/integ/../data/mxnet_mnist', u'sagemaker_region': u'us-west-2', u'sagemaker_enable_cloudwatch_metrics': False, u'sagemaker_job_name': u'sagemaker-mxnet-2018-06-20-01-43-06-918', u'sagemaker_container_log_level': 20}, 'hosts': [u'algo-1-25Q2V'], 'job_name': None, '_ps_port': 8000, 'user_script_archive': u'file:///home/ec2-user/Agent/work/9caddf8323cb3b11/tests/integ/../data/mxnet_mnist', '_scheduler_host': u'algo-1-25Q2V', 'sagemaker_region': u'us-west-2', '_scheduler_ip': '172.19.0.2', 'input_dir': '/opt/ml/input', 'user_requirements_file': None, 'current_host': u'algo-1-25Q2V', 'container_log_level': 20, 'available_cpus': 4, 'base_dir': '/opt/ml'}
algo-1-25Q2V_1  | 2018-06-20 01:43:08,576 INFO - mxnet_container.train - Starting distributed training task
algo-1-25Q2V_1  | 2018-06-20 01:43:10,491 INFO - root - Epoch[0] Batch [100]    Speed: 16322.96 samples/sec    accuracy=0.113465
algo-1-25Q2V_1  | 2018-06-20 01:43:11,262 INFO - root - Epoch[0] Batch [200]    Speed: 12979.41 samples/sec    accuracy=0.111500
algo-1-25Q2V_1  | 2018-06-20 01:43:12,275 INFO - root - Epoch[0] Batch [300]    Speed: 9874.14 samples/sec    accuracy=0.112700
algo-1-25Q2V_1  | 2018-06-20 01:43:13,165 INFO - root - Epoch[0] Batch [400]    Speed: 11237.80 samples/sec    accuracy=0.115000
algo-1-25Q2V_1  | 2018-06-20 01:43:13,610 INFO - root - Epoch[0] Batch [500]    Speed: 22482.83 samples/sec    accuracy=0.131300
algo-1-25Q2V_1  | 2018-06-20 01:43:13,902 INFO - root - Epoch[0] Train-accuracy=0.219394
algo-1-25Q2V_1  | 2018-06-20 01:43:13,902 INFO - root - Epoch[0] Time cost=4.066
algo-1-25Q2V_1  | 2018-06-20 01:43:14,088 INFO - root - Epoch[0] Validation-accuracy=0.318600
tmpqvn42alz_algo-1-25Q2V_1 exited with code 0
Aborting on container exit...
===== Job Complete =====

------- Stderr: -------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): iam.amazonaws.com
INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-06-20-01-43-06-918
INFO:sagemaker:Deleting endpoint with name: sagemaker-mxnet-2018-06-20-01-43-06-918

That was an oversight on my part as well. Can you add some logic to handle the local mode case?

- This addresses problem 1 in #226
- This relies on tar_and_upload_dir's behaviour when given an S3 path as the directory
- This updates MXNet, TensorFlow and Chainer frameworks
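The S3-path behaviour the commits above rely on could look roughly like this (a minimal sketch with illustrative names, not the SDK's actual implementation):

```python
from collections import namedtuple

# Hypothetical stand-in for the SDK's internal uploaded-code record.
UploadedCode = namedtuple("UploadedCode", ["s3_prefix", "script_name"])


def tar_and_upload_dir(session, bucket, s3_key_prefix, script, directory):
    """If `directory` is already an S3 URI, reuse it instead of re-packaging
    and re-uploading the code -- the behaviour this PR relies on."""
    if directory and directory.lower().startswith("s3://"):
        return UploadedCode(s3_prefix=directory, script_name=script)
    # Otherwise the real SDK would tar the local directory and upload it to S3;
    # that path is omitted from this sketch.
    raise NotImplementedError("local-directory upload omitted in this sketch")
```

With this behaviour, a model can pass the estimator's already-uploaded S3 prefix as `directory` and no second upload happens.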
@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu I've pushed a change that picks the correct value for source_dir based on whether the SageMaker session is a local one. I think this should work, but since I can't run the integ tests I'm not 100% sure. I'm also not sure about the second error; I'm hoping the main fix will make it go away.

Which brings me to a question: is there any way for me to run the integ tests, or at least view the TeamCity results, without forcing you to paste the details here just to tell me it's not working?
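The fix described above could be sketched like this (illustrative class and attribute names, assuming a session object that exposes a `local_mode` flag; this is not the SDK's actual code):

```python
class FrameworkEstimator:
    """Minimal sketch of an estimator choosing the right source_dir."""

    def __init__(self, source_dir, sagemaker_session, uploaded_code=None):
        self.source_dir = source_dir
        self.sagemaker_session = sagemaker_session
        self.uploaded_code = uploaded_code  # stays None in local mode

    def _model_source_dir(self):
        # Local mode never uploads code to S3, so fall back to the local
        # source_dir; otherwise reuse the S3 prefix already uploaded by fit().
        if self.sagemaker_session.local_mode:
            return self.source_dir
        return self.uploaded_code.s3_prefix


class LocalSession:
    local_mode = True


class Session:
    local_mode = False
```

`create_model()` would then pass `self._model_source_dir()` instead of `self.uploaded_code.s3_prefix`, avoiding the `AttributeError` seen in the local mode tests.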

@@ -579,6 +579,14 @@ def _stage_user_code_in_s3(self):
script=self.entry_point,
directory=self.source_dir)

def _model_source_dir(self):
""" Gets the appropriate value to pass as source_dir to model constructor on deploying
Contributor


remove the space after the """

Remove unnecessary comment space
@laurenyu
Contributor

you can run the integ tests locally as long as you have AWS credentials set up for your own account - there's some more detail here: https://github.com/aws/sagemaker-python-sdk#getting-sagemaker-python-sdk

laurenyu
laurenyu previously approved these changes Jun 20, 2018
@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu Travis CI seems to be stuck and is blocking TeamCity. It'd be great if you could force it, as I can't.

@laurenyu
Contributor

maybe something was just hanging for a bit? I didn't touch Travis/TC but it seems to have worked

CHANGELOG.rst Outdated
@@ -10,6 +10,7 @@ CHANGELOG
* feature: Allow Local Serving of Models in S3
* enhancement: Allow option for ``HyperparameterTuner`` to not include estimator metadata in job
* bug-fix: Estimators: Join tensorboard thread after fitting
* enhancement: Let Framework models reuse code uploaded by Framework estimators
Contributor


actually, sorry to nitpick - this was just sort of unfortunate timing - but we just released v1.5.0. could you make a new version entry in the changelog for this?

@humanzz
Contributor Author

humanzz commented Jun 21, 2018

@laurenyu unfortunate, but fine. I've updated the CHANGELOG and added support for the newly released PyTorch framework as well.

Let's hope we can merge this in before any other major updates happen :)

I hope you also get a chance to look at the related PR #228, as it hasn't gotten anyone's attention yet. Together with it, we can close #226.

- Update CHANGELOG entry to be under 1.5.1dev
- Handle the newly added PyTorch framework
@humanzz
Contributor Author

humanzz commented Jun 21, 2018

I've also sent a PR at #246 to slightly improve the README's instructions for running tests. It's still not specific about what permissions the credentials and role need for the tests to run, but it's at least better than not mentioning these prerequisites at all.

@laurenyu laurenyu merged commit 3feb2eb into aws:master Jun 21, 2018
@humanzz humanzz deleted the model-no-duplicate-code-upload branch June 21, 2018 22:46
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018
nadiaya pushed a commit to nadiaya/sagemaker-python-sdk that referenced this pull request Nov 25, 2019
This change adds import statements in the __init__ files of the `sagemaker` package and the `sklearn` package to make it easier for users to import Processor classes.
knakad pushed a commit to knakad/sagemaker-python-sdk that referenced this pull request Dec 4, 2019
knakad pushed a commit that referenced this pull request Dec 4, 2019