Let Framework models reuse code uploaded by Framework estimators #230


Merged
merged 4 commits into from
Jun 21, 2018

Conversation

humanzz
Contributor

@humanzz humanzz commented Jun 14, 2018

Issue #, if available: #226

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-io

codecov-io commented Jun 14, 2018

Codecov Report

Merging #230 into master will decrease coverage by 0.24%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #230      +/-   ##
==========================================
- Coverage   92.53%   92.29%   -0.25%     
==========================================
  Files          49       49              
  Lines        3257     3259       +2     
==========================================
- Hits         3014     3008       -6     
- Misses        243      251       +8
Impacted Files Coverage Δ
src/sagemaker/model.py 95.45% <ø> (ø) ⬆️
src/sagemaker/tensorflow/estimator.py 96.24% <100%> (ø) ⬆️
src/sagemaker/estimator.py 85.77% <100%> (+0.12%) ⬆️
src/sagemaker/chainer/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/pytorch/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/mxnet/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/local/image.py 86.43% <0%> (-2.36%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 92eb47d...a7c4ca6.

laurenyu
laurenyu previously approved these changes Jun 20, 2018
Contributor

@laurenyu laurenyu left a comment


nice! thanks for submitting this!

@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu thanks for approving this. I was trying to be helpful by updating the branch for merging, but GitHub seems to think I dismissed your previous approval. TeamCity also didn't seem happy; I'm not sure whether that was because of the merge failure or something else.
Let me know if I can do anything to help this get merged soon.

@laurenyu
Contributor

@humanzz GitHub dismisses approval whenever there's a new commit.

The TeamCity run failed in the local mode tests: when running in local mode, no code is uploaded, so `self.uploaded_code` doesn't exist:

======= Failed test run #2 ==========
tests/integ/test_local_mode.py:336 (test_mxnet_local_data_local_script)
def test_mxnet_local_data_local_script():
        local_mode_lock_fd = open(LOCK_PATH, 'w')
        local_mode_lock = local_mode_lock_fd.fileno()
    
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')
    
        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='local',
                   sagemaker_session=LocalNoS3Session())
    
        train_input = 'file://' + os.path.join(data_path, 'train')
        test_input = 'file://' + os.path.join(data_path, 'test')
    
        mx.fit({'train': train_input, 'test': test_input})
        endpoint_name = mx.latest_training_job.name
        try:
            # Since Local Mode uses the same port for serving, we need a lock in order
            # to allow concurrent test execution. The serving test is really fast so it still
            # makes sense to allow this behavior.
            fcntl.lockf(local_mode_lock, fcntl.LOCK_EX)
>           predictor = mx.deploy(1, 'local', endpoint_name=endpoint_name)

tests/integ/test_local_mode.py:358: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.mxnet.estimator.MXNet object at 0x7f16023b7940>
initial_instance_count = 1, instance_type = 'local'
endpoint_name = 'sagemaker-mxnet-2018-06-20-01-43-06-918', kwargs = {}

    def deploy(self, initial_instance_count, instance_type, endpoint_name=None, **kwargs):
        """Deploy the trained model to an Amazon SageMaker endpoint and return a ``sagemaker.RealTimePredictor`` object.
    
            More information:
            http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
    
            Args:
                initial_instance_count (int): Minimum number of EC2 instances to deploy to an endpoint for prediction.
                instance_type (str): Type of EC2 instance to deploy to an endpoint for prediction,
                    for example, 'ml.c4.xlarge'.
                endpoint_name (str): Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of
                    the training job is used.
                **kwargs: Passed to invocation of ``create_model()``. Implementations may customize
                    ``create_model()`` to accept ``**kwargs`` to customize model creation during deploy.
                    For more, see the implementation docs.
    
            Returns:
                sagemaker.predictor.RealTimePredictor: A predictor that provides a ``predict()`` method,
                    which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.
            """
        if not self.latest_training_job:
            raise RuntimeError('Estimator has not been fit yet.')
        endpoint_name = endpoint_name or self.latest_training_job.name
        self.deploy_instance_type = instance_type
>       return self.create_model(**kwargs).deploy(
            instance_type=instance_type,
            initial_instance_count=initial_instance_count,
            endpoint_name=endpoint_name)

.tox/py35/lib/python3.5/site-packages/sagemaker/estimator.py:259: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.mxnet.estimator.MXNet object at 0x7f16023b7940>
model_server_workers = None

    def create_model(self, model_server_workers=None):
        """Create a SageMaker ``MXNetModel`` object that can be deployed to an ``Endpoint``.
    
            Args:
                model_server_workers (int): Optional. The number of worker processes used by the inference server.
                    If None, server will use one worker per vCPU.
    
            Returns:
                sagemaker.mxnet.model.MXNetModel: A SageMaker ``MXNetModel`` object.
                    See :func:`~sagemaker.mxnet.model.MXNetModel` for full details.
            """
>       return MXNetModel(self.model_data, self.role, self.entry_point, source_dir=self.uploaded_code.s3_prefix,
                          enable_cloudwatch_metrics=self.enable_cloudwatch_metrics, name=self._current_job_name,
                          container_log_level=self.container_log_level, code_location=self.code_location,
                          py_version=self.py_version, framework_version=self.framework_version,
                          model_server_workers=model_server_workers, sagemaker_session=self.sagemaker_session)
E       AttributeError: 'MXNet' object has no attribute 'uploaded_code'

.tox/py35/lib/python3.5/site-packages/sagemaker/mxnet/estimator.py:85: AttributeError

During handling of the above exception, another exception occurred:

    def test_mxnet_local_data_local_script():
        local_mode_lock_fd = open(LOCK_PATH, 'w')
        local_mode_lock = local_mode_lock_fd.fileno()
    
        script_path = os.path.join(DATA_DIR, 'mxnet_mnist', 'mnist.py')
        data_path = os.path.join(DATA_DIR, 'mxnet_mnist')
    
        mx = MXNet(entry_point=script_path, role='SageMakerRole',
                   train_instance_count=1, train_instance_type='local',
                   sagemaker_session=LocalNoS3Session())
    
        train_input = 'file://' + os.path.join(data_path, 'train')
        test_input = 'file://' + os.path.join(data_path, 'test')
    
        mx.fit({'train': train_input, 'test': test_input})
        endpoint_name = mx.latest_training_job.name
        try:
            # Since Local Mode uses the same port for serving, we need a lock in order
            # to allow concurrent test execution. The serving test is really fast so it still
            # makes sense to allow this behavior.
            fcntl.lockf(local_mode_lock, fcntl.LOCK_EX)
            predictor = mx.deploy(1, 'local', endpoint_name=endpoint_name)
            data = numpy.zeros(shape=(1, 1, 28, 28))
            predictor.predict(data)
        finally:
>           mx.delete_endpoint()

tests/integ/test_local_mode.py:362: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py35/lib/python3.5/site-packages/sagemaker/estimator.py:318: in delete_endpoint
    self.sagemaker_session.delete_endpoint(self.latest_training_job.name)
.tox/py35/lib/python3.5/site-packages/sagemaker/session.py:489: in delete_endpoint
    self.sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sagemaker.local.local_session.LocalSagemakerClient object at 0x7f16023b7828>
EndpointName = 'sagemaker-mxnet-2018-06-20-01-43-06-918'

    def delete_endpoint(self, EndpointName):
>       self.serve_container.stop_serving()
E       AttributeError: 'NoneType' object has no attribute 'stop_serving'

.tox/py35/lib/python3.5/site-packages/sagemaker/local/local_session.py:142: AttributeError
------- Stdout: -------
Attaching to tmpqvn42alz_algo-1-25Q2V_1
Creating tmpqvn42alz_algo-1-25Q2V_1 ... done
algo-1-25Q2V_1  | 2018-06-20 01:43:08,175 INFO - root - running container entrypoint
algo-1-25Q2V_1  | 2018-06-20 01:43:08,175 INFO - root - starting train task
algo-1-25Q2V_1  | 2018-06-20 01:43:08,181 INFO - container_support.training - Training starting
algo-1-25Q2V_1  | 2018-06-20 01:43:08,576 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {u'test': {u'ContentType': u'application/octet-stream'}, u'train': {u'ContentType': u'application/octet-stream'}}, '_ps_verbose': 0, 'resource_config': {u'current_host': u'algo-1-25Q2V', u'hosts': [u'algo-1-25Q2V']}, 'user_script_name': u'mnist.py', 'input_config_dir': '/opt/ml/input/config', 'channel_dirs': {u'test': u'/opt/ml/input/data/test', u'train': u'/opt/ml/input/data/train'}, 'code_dir': '/opt/ml/code', 'output_data_dir': '/opt/ml/output/data/', 'output_dir': '/opt/ml/output', 'model_dir': '/opt/ml/model', 'hyperparameters': {u'sagemaker_program': u'mnist.py', u'sagemaker_submit_directory': u'file:///home/ec2-user/Agent/work/9caddf8323cb3b11/tests/integ/../data/mxnet_mnist', u'sagemaker_region': u'us-west-2', u'sagemaker_enable_cloudwatch_metrics': False, u'sagemaker_job_name': u'sagemaker-mxnet-2018-06-20-01-43-06-918', u'sagemaker_container_log_level': 20}, 'hosts': [u'algo-1-25Q2V'], 'job_name': None, '_ps_port': 8000, 'user_script_archive': u'file:///home/ec2-user/Agent/work/9caddf8323cb3b11/tests/integ/../data/mxnet_mnist', '_scheduler_host': u'algo-1-25Q2V', 'sagemaker_region': u'us-west-2', '_scheduler_ip': '172.19.0.2', 'input_dir': '/opt/ml/input', 'user_requirements_file': None, 'current_host': u'algo-1-25Q2V', 'container_log_level': 20, 'available_cpus': 4, 'base_dir': '/opt/ml'}
algo-1-25Q2V_1  | 2018-06-20 01:43:08,576 INFO - mxnet_container.train - Starting distributed training task
algo-1-25Q2V_1  | 2018-06-20 01:43:10,491 INFO - root - Epoch[0] Batch [100]    Speed: 16322.96 samples/sec    accuracy=0.113465
algo-1-25Q2V_1  | 2018-06-20 01:43:11,262 INFO - root - Epoch[0] Batch [200]    Speed: 12979.41 samples/sec    accuracy=0.111500
algo-1-25Q2V_1  | 2018-06-20 01:43:12,275 INFO - root - Epoch[0] Batch [300]    Speed: 9874.14 samples/sec    accuracy=0.112700
algo-1-25Q2V_1  | 2018-06-20 01:43:13,165 INFO - root - Epoch[0] Batch [400]    Speed: 11237.80 samples/sec    accuracy=0.115000
algo-1-25Q2V_1  | 2018-06-20 01:43:13,610 INFO - root - Epoch[0] Batch [500]    Speed: 22482.83 samples/sec    accuracy=0.131300
algo-1-25Q2V_1  | 2018-06-20 01:43:13,902 INFO - root - Epoch[0] Train-accuracy=0.219394
algo-1-25Q2V_1  | 2018-06-20 01:43:13,902 INFO - root - Epoch[0] Time cost=4.066
algo-1-25Q2V_1  | 2018-06-20 01:43:14,088 INFO - root - Epoch[0] Validation-accuracy=0.318600
tmpqvn42alz_algo-1-25Q2V_1 exited with code 0
Aborting on container exit...
===== Job Complete =====

------- Stderr: -------
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): iam.amazonaws.com
INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-06-20-01-43-06-918
INFO:sagemaker:Deleting endpoint with name: sagemaker-mxnet-2018-06-20-01-43-06-918

That was an oversight on my part as well. Can you add some logic to handle the local mode case?

- This addresses problem 1 in #226
- This relies on tar_and_upload_dir's behaviour when given an S3 path as the directory
- This updates MXNet, TensorFlow and Chainer frameworks
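The S3-path behaviour the commits above rely on could look roughly like this (a minimal sketch with illustrative names, not the SDK's actual implementation):

```python
from collections import namedtuple

# Hypothetical stand-in for the SDK's internal uploaded-code record.
UploadedCode = namedtuple("UploadedCode", ["s3_prefix", "script_name"])


def tar_and_upload_dir(session, bucket, s3_key_prefix, script, directory):
    """If `directory` is already an S3 URI, reuse it instead of re-packaging
    and re-uploading the code -- the behaviour this PR relies on."""
    if directory and directory.lower().startswith("s3://"):
        return UploadedCode(s3_prefix=directory, script_name=script)
    # Otherwise the real SDK would tar the local directory and upload it to S3;
    # that path is omitted from this sketch.
    raise NotImplementedError("local-directory upload omitted in this sketch")
```

With this behaviour, a model can pass the estimator's already-uploaded S3 prefix as `directory` and no second upload happens.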
@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu I've pushed a change that picks the correct value for source_dir based on whether the SageMaker session is a local one. I think this should work, but since I can't run the integ tests I'm not 100% sure. I'm also not sure about the second error; I'm hoping the main fix will make it go away.

Which brings me to a question: is there any way for me to run the integ tests, or at least view the TeamCity results, without forcing you to paste the details here just to tell me it's not working?
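The fix described above could be sketched like this (illustrative class and attribute names, assuming a session object that exposes a `local_mode` flag; this is not the SDK's actual code):

```python
class FrameworkEstimator:
    """Minimal sketch of an estimator choosing the right source_dir."""

    def __init__(self, source_dir, sagemaker_session, uploaded_code=None):
        self.source_dir = source_dir
        self.sagemaker_session = sagemaker_session
        self.uploaded_code = uploaded_code  # stays None in local mode

    def _model_source_dir(self):
        # Local mode never uploads code to S3, so fall back to the local
        # source_dir; otherwise reuse the S3 prefix already uploaded by fit().
        if self.sagemaker_session.local_mode:
            return self.source_dir
        return self.uploaded_code.s3_prefix


class LocalSession:
    local_mode = True


class Session:
    local_mode = False
```

`create_model()` would then pass `self._model_source_dir()` instead of `self.uploaded_code.s3_prefix`, avoiding the `AttributeError` seen in the local mode tests.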

@@ -579,6 +579,14 @@ def _stage_user_code_in_s3(self):
script=self.entry_point,
directory=self.source_dir)

def _model_source_dir(self):
""" Gets the appropriate value to pass as source_dir to model constructor on deploying
Contributor


remove the space after the """

Remove unnecessary comment space
@laurenyu
Contributor

you can run the integ tests locally as long as you have AWS credentials set up for your own account - there's some more detail here: https://github.com/aws/sagemaker-python-sdk#getting-sagemaker-python-sdk

laurenyu
laurenyu previously approved these changes Jun 20, 2018
@humanzz
Contributor Author

humanzz commented Jun 20, 2018

@laurenyu Travis CI seems to be stuck and is blocking TeamCity. It'd be great if you could force it, as I can't.

@laurenyu
Contributor

maybe something was just hanging for a bit? I didn't touch Travis/TC but it seems to have worked

CHANGELOG.rst Outdated
@@ -10,6 +10,7 @@ CHANGELOG
* feature: Allow Local Serving of Models in S3
* enhancement: Allow option for ``HyperparameterTuner`` to not include estimator metadata in job
* bug-fix: Estimators: Join tensorboard thread after fitting
* enhancement: Let Framework models reuse code uploaded by Framework estimators
Contributor


actually, sorry to nitpick - this was just sort of unfortunate timing - but we just released v1.5.0. could you make a new version entry in the changelog for this?

@humanzz
Contributor Author

humanzz commented Jun 21, 2018

@laurenyu unfortunate, but fine. I've updated the CHANGELOG and added support for the newly released PyTorch framework as well.

Let's hope we can merge this in before any other major updates happen :)

I hope you also get a chance to look at the related PR #228, as it hasn't gotten anyone's attention yet. Together with it, we can close #226.

- Update CHANGELOG entry to be under 1.5.1dev
- Handle the newly added PyTorch framework
@humanzz
Contributor Author

humanzz commented Jun 21, 2018

I've also sent a PR at #246 to slightly improve the README's instructions for running tests. It's still not specific about what permissions the credentials and role need for the tests to run, but it's at least better than not mentioning these prerequisites at all.

@laurenyu laurenyu merged commit 3feb2eb into aws:master Jun 21, 2018
@humanzz humanzz deleted the model-no-duplicate-code-upload branch June 21, 2018 22:46
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018
nadiaya pushed a commit to nadiaya/sagemaker-python-sdk that referenced this pull request Nov 25, 2019
This change adds import statements in the __init__ files of the `sagemaker` package and the `sklearn` package to make it easier for users to import Processor classes.
knakad pushed a commit to knakad/sagemaker-python-sdk that referenced this pull request Dec 4, 2019
knakad pushed a commit that referenced this pull request Dec 4, 2019