This repository was archived by the owner on May 23, 2024. It is now read-only.

Add EI support to TFS container. #10

Merged (6 commits) on Mar 4, 2019
Changes from 4 commits
17 changes: 14 additions & 3 deletions README.md
@@ -50,13 +50,14 @@ The Docker images are built from the Dockerfiles in
[docker/](https://github.com/aws/sagemaker-tensorflow-serving-container/tree/master/docker).

The Dockerfiles are grouped based on the version of TensorFlow Serving they support. Each supported
-processor type (e.g. "cpu", "gpu") has a different Dockerfile in each group.
+processor type (e.g. "cpu", "gpu", "ei") has a different Dockerfile in each group.

To build an image, run the `./scripts/build.sh` script:

```bash
./scripts/build.sh --version 1.11 --arch cpu
./scripts/build.sh --version 1.11 --arch gpu
./scripts/build.sh --version 1.11 --arch ei
```


@@ -67,6 +68,7 @@ in SageMaker, you need to publish it to an ECR repository in your account. The
```bash
./scripts/publish.sh --version 1.11 --arch cpu
./scripts/publish.sh --version 1.11 --arch gpu
./scripts/publish.sh --version 1.11 --arch ei
```

Note: this will publish to ECR in your default region. Use the `--region` argument to
@@ -80,8 +82,8 @@ GPU images) will work for this, or you can use the provided `start.sh`
and `stop.sh` scripts:

```bash
-./scripts/start.sh [--version x.xx] [--arch cpu|gpu|...]
-./scripts/stop.sh [--version x.xx] [--arch cpu|gpu|...]
+./scripts/start.sh [--version x.xx] [--arch cpu|gpu|ei|...]
+./scripts/stop.sh [--version x.xx] [--arch cpu|gpu|ei|...]
```

When the container is running, you can send test requests to it using any HTTP client. Here's
@@ -106,6 +108,15 @@ checkers using `tox`:
tox
```

To test against Elastic Inference with an accelerator, you will need an AWS account. Publish your built image to an ECR repository and run the following command:

**Reviewer:** can you provide an example of the command needed to run the test?

**Reviewer:** "To test against Elastic Inference, you will..."

**Contributor Author:** I'll update these as well as the above comments.


```bash
pytest test/functional/test_elastic_inference.py --aws-id <aws_account> \
    --docker-base-name <ECR_repository_name> \
    --instance-type <instance_type> \
    --accelerator-type <accelerator_type> \
    --tag <image_tag>
```
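As the reviewer requested, a filled-in invocation might look like this (the account ID, repository name, instance type, accelerator type, and tag below are illustrative placeholders, not values from this PR):

```bash
pytest test/functional/test_elastic_inference.py --aws-id 123456789012 \
    --docker-base-name sagemaker-tensorflow-serving \
    --instance-type ml.m4.xlarge \
    --accelerator-type ml.eia1.medium \
    --tag 1.12.0-ei
```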


## Contributing

Please read [CONTRIBUTING.md](https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/CONTRIBUTING.md)
25 changes: 25 additions & 0 deletions docker/Dockerfile.ei
@@ -0,0 +1,25 @@
FROM ubuntu:16.04
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

ARG TFS_SHORT_VERSION

COPY AmazonEI_Tensorflow_Serving_v${TFS_SHORT_VERSION}_v1 /usr/bin/tensorflow_model_server

# downloaded 1.12 version is not executable
RUN chmod +x /usr/bin/tensorflow_model_server

# nginx + njs
RUN \
apt-get update && \
apt-get -y install --no-install-recommends curl && \
curl -s http://nginx.org/keys/nginx_signing.key | apt-key add - && \
echo 'deb http://nginx.org/packages/ubuntu/ xenial nginx' >> /etc/apt/sources.list && \
apt-get update && \
apt-get -y install --no-install-recommends nginx nginx-module-njs python3 python3-pip && \
apt-get clean

COPY ./ /
RUN rm AmazonEI_Tensorflow_Serving_v${TFS_SHORT_VERSION}_v1

ENV SAGEMAKER_TFS_VERSION "${TFS_SHORT_VERSION}"
ENV PATH "$PATH:/sagemaker"
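For reference, building this Dockerfile by hand might look like the following. This is a sketch under assumptions: `scripts/build.sh` normally downloads the EI serving binary into the build context and supplies the build arg, and the image tag shown is illustrative.

```bash
# Assumes the AmazonEI_Tensorflow_Serving_v1.12_v1 binary has already been
# downloaded into the build context (scripts/build.sh does this for 'ei'):
docker build -f docker/Dockerfile.ei --build-arg TFS_SHORT_VERSION=1.12 \
    -t sagemaker-tensorflow-serving:1.12.0-ei .
```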
4 changes: 4 additions & 0 deletions scripts/build.sh
@@ -8,6 +8,10 @@ source scripts/shared.sh

parse_std_args "$@"

if [ $arch = 'ei' ]; then
get_tfs_executable
fi

echo "pulling previous image for layer cache... "
$(aws ecr get-login --no-include-email --registry-id $aws_account) &>/dev/null || echo 'warning: ecr login failed'
docker pull $aws_account.dkr.ecr.$aws_region.amazonaws.com/sagemaker-tensorflow-serving:$full_version-$arch &>/dev/null || echo 'warning: pull failed'
19 changes: 17 additions & 2 deletions scripts/shared.sh
@@ -4,7 +4,7 @@

function error() {
>&2 echo $1
->&2 echo "usage: $0 [--version <major-version>] [--arch (cpu*|gpu)] [--region <aws-region>]"
+>&2 echo "usage: $0 [--version <major-version>] [--arch (cpu*|gpu|ei)] [--region <aws-region>]"
exit 1
}

@@ -28,6 +28,21 @@ function get_aws_account() {
aws sts get-caller-identity --query 'Account' --output text
}

function get_tfs_executable() {

**Reviewer:** Should this function utilize the parsed args below for the TF version?

It also looks like the naming scheme isn't consistent for the zip file for 1.11 and 1.12 already, which might require updating this file often.

**Contributor Author:** It only utilizes $version here, because the naming for v1.11 and v1.12 is:

v1.11 -> Ubuntu -> Ubuntu.zip: Ubuntu/executable
v1.12 -> Ubuntu -> tfs_ei_v1_12_ubuntu.zip: v1_12_Ubuntu_2/executable

Even v1.12's zip name and unzipped directory name are different, so I believe we will need to update this file in the future.

**@ChoiByungWook (Feb 28, 2019):** It seems like in S3 it is always going to follow the pattern of s3://amazonei-tensorflow/Tensorflow\ Serving/{version}/Ubuntu.

We can do a discovery on the item, for example:

    aws s3 ls s3://amazonei-tensorflow/Tensorflow\ Serving/v1.11/Ubuntu/ | awk '{print $4}'

which should return Ubuntu.zip. Afterwards, I think we can probably also discover the unzipped name as well. For example:

    find . -name AmazonEI_TensorFlow_Serving_v${version}_v1* -exec mv {} container/ \;

There are probably better commands than the ones I used above in my examples.

# default to v1.12 in accordance with defaults below
s3_object='tfs_ei_v1_12_ubuntu'
unzipped='v1_12_Ubuntu'

if [ ${version} = '1.11' ]; then
s3_object='Ubuntu'
unzipped='Ubuntu'
fi

aws s3 cp 's3://amazonei-tensorflow/Tensorflow Serving/v'${version}'/Ubuntu/'${s3_object}'.zip' .
unzip ${s3_object} && mv ${unzipped}/AmazonEI_Tensorflow_Serving_v${version}_v1 container/
rm ${s3_object}.zip && rm -rf ${unzipped}
}
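The version-to-name mapping in the function above could also be factored into its own helper, which makes the per-version special cases easier to extend. A sketch, with `tfs_ei_names` as a hypothetical function name; the 1.11 and 1.12 entries mirror the branch above:

```bash
# Hypothetical helper: map a TFS version to the S3 zip object name and
# the directory it unzips to. New versions need their own entries, since
# the published names are not consistent across releases.
tfs_ei_names() {
    case "$1" in
        1.11) s3_object='Ubuntu';              unzipped='Ubuntu' ;;
        1.12) s3_object='tfs_ei_v1_12_ubuntu'; unzipped='v1_12_Ubuntu' ;;
        *)    echo "unknown EI version: $1" >&2; return 1 ;;
    esac
}
```

`get_tfs_executable` could then call `tfs_ei_names "$version"` and use `$s3_object` and `$unzipped` as before.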

function parse_std_args() {
# defaults
arch='cpu'
@@ -63,7 +78,7 @@
done

[[ -z "${version// }" ]] && error 'missing version'
-[[ "$arch" =~ ^(cpu|gpu)$ ]] || error "invalid arch: $arch"
+[[ "$arch" =~ ^(cpu|gpu|ei)$ ]] || error "invalid arch: $arch"
[[ -z "${aws_region// }" ]] && error 'missing aws region'

full_version=$(get_full_version $version)
1 change: 0 additions & 1 deletion test/conftest.py
@@ -10,4 +10,3 @@
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

100 changes: 100 additions & 0 deletions test/functional/conftest.py
@@ -0,0 +1,100 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

**Reviewer:** @jesterhazy, do you care about this being under functional or integration/sagemaker? I think most of our framework containers are under integration/sagemaker, except for the sagemaker-tensorflow-container repository.

**Contributor:** yes, it seems a bit funky to have both. the stuff in test/integration are local docker container tests, so renaming test/functional to test/sagemaker or renaming everything to test/integration/{local, sagemaker} would fit the convention better.

**Contributor Author:** I'll rename it to test/sagemaker.

#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

import logging

import boto3
import pytest
from sagemaker import Session
from sagemaker.tensorflow import TensorFlow

logger = logging.getLogger(__name__)
logging.getLogger('boto').setLevel(logging.INFO)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('factory.py').setLevel(logging.INFO)
logging.getLogger('auth.py').setLevel(logging.INFO)
logging.getLogger('connectionpool.py').setLevel(logging.INFO)


def pytest_addoption(parser):
parser.addoption('--aws-id')
parser.addoption('--docker-base-name', default='sagemaker-tensorflow-serving')
parser.addoption('--instance-type')
parser.addoption('--accelerator-type', default=None)
parser.addoption('--region', default='us-west-2')
parser.addoption('--framework-version', default=TensorFlow.LATEST_VERSION)
parser.addoption('--processor', default='cpu', choices=['gpu', 'cpu'])
parser.addoption('--tag')


@pytest.fixture(scope='session')
def aws_id(request):
return request.config.getoption('--aws-id')


@pytest.fixture(scope='session')
def docker_base_name(request):
return request.config.getoption('--docker-base-name')


@pytest.fixture(scope='session')
def instance_type(request):
return request.config.getoption('--instance-type')


@pytest.fixture(scope='session')
def accelerator_type(request):
return request.config.getoption('--accelerator-type')


@pytest.fixture(scope='session')
def region(request):
return request.config.getoption('--region')


@pytest.fixture(scope='session')
def framework_version(request):
return request.config.getoption('--framework-version')


@pytest.fixture(scope='session')
def processor(request):
return request.config.getoption('--processor')


@pytest.fixture(scope='session')
def tag(request, framework_version, processor):
provided_tag = request.config.getoption('--tag')
default_tag = '{}-{}-py2'.format(framework_version, processor)
return provided_tag if provided_tag is not None else default_tag


@pytest.fixture(scope='session')
def docker_registry(aws_id, region):
return '{}.dkr.ecr.{}.amazonaws.com'.format(aws_id, region)


@pytest.fixture(scope='module')
def docker_image(docker_base_name, tag):
return '{}:{}'.format(docker_base_name, tag)


@pytest.fixture(scope='module')
def docker_image_uri(docker_registry, docker_image):
uri = '{}/{}'.format(docker_registry, docker_image)
return uri


@pytest.fixture(scope='session')
def sagemaker_session(region):
return Session(boto_session=boto3.Session(region_name=region))
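The `tag` fixture above falls back to a default of the form `{framework_version}-{processor}-py2` when `--tag` is not passed. A small sketch of that fallback logic, with example values (not values taken from a real run):

```python
# Mirrors the tag fixture's fallback: use --tag when given, else build a
# default from the framework version and processor.
framework_version = '1.12.0'
processor = 'cpu'
provided_tag = None  # i.e. --tag was not passed

default_tag = '{}-{}-py2'.format(framework_version, processor)
tag = provided_tag if provided_tag is not None else default_tag
print(tag)  # 1.12.0-cpu-py2
```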
68 changes: 68 additions & 0 deletions test/functional/test_elastic_inference.py
@@ -0,0 +1,68 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

import logging
import numpy as np

import pytest
from sagemaker.tensorflow.serving import Model
from sagemaker.utils import sagemaker_timestamp

EI_SUPPORTED_REGIONS = ['us-east-1', 'us-east-2', 'us-west-2', 'eu-west-1', 'ap-northeast-1', 'ap-northeast-2']

logger = logging.getLogger(__name__)
logging.getLogger('boto3').setLevel(logging.INFO)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('factory.py').setLevel(logging.INFO)
logging.getLogger('auth.py').setLevel(logging.INFO)
logging.getLogger('connectionpool.py').setLevel(logging.INFO)
logging.getLogger('session.py').setLevel(logging.DEBUG)
logging.getLogger('sagemaker').setLevel(logging.DEBUG)


@pytest.fixture(autouse=True)
def skip_if_no_accelerator(accelerator_type):
if accelerator_type is None:
pytest.skip('Skipping because accelerator type was not provided')


@pytest.fixture(autouse=True)
def skip_if_non_supported_ei_region(region):
if region not in EI_SUPPORTED_REGIONS:
pytest.skip('EI is not supported in {}'.format(region))


@pytest.fixture
def pretrained_model_data(region):
return 's3://sagemaker-sample-data-{}/tensorflow/model/resnet/resnet_50_v2_fp32_NCHW.tar.gz'.format(region)


@pytest.mark.skip_if_non_supported_ei_region
@pytest.mark.skip_if_no_accelerator
def test_deploy_elastic_inference_with_pretrained_model(pretrained_model_data, docker_image_uri, sagemaker_session, instance_type, accelerator_type):
**Contributor:**

- endpoint needs to be removed at end of test
- not excited about taking a sagemaker sdk dependency for this. can we change this to use boto3 only? it shouldn't be too complicated since the model is already in s3

**Contributor Author:** ohhh, gotcha!
endpoint_name = 'test-tf-ei-deploy-model-{}'.format(sagemaker_timestamp())

tensorflow_model = Model(model_data=pretrained_model_data,
role='SageMakerRole',
image=docker_image_uri,
sagemaker_session=sagemaker_session)

logger.info('deploying model to endpoint: {}'.format(endpoint_name))
predictor = tensorflow_model.deploy(initial_instance_count=1,
instance_type=instance_type,
accelerator_type=accelerator_type,
endpoint_name=endpoint_name)

input_data = np.random.rand(1, 1, 3, 3)
result = predictor.predict(input_data)
assert result
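The reviewer noted that the endpoint needs to be removed at the end of the test. One way to guarantee that is a try/finally around the predict-and-assert step. The sketch below shows the shape of that fix with the deploy, predict, and teardown steps passed in as callables; `predict_with_cleanup` is a hypothetical helper, not part of this PR:

```python
def predict_with_cleanup(deploy, predict, delete_endpoint):
    """Run a prediction and always tear the endpoint down afterwards."""
    predictor = deploy()
    try:
        return predict(predictor)
    finally:
        # Runs even if predict() raises, so the endpoint never leaks.
        delete_endpoint(predictor)
```

In the test above, `deploy` would wrap the `tensorflow_model.deploy(...)` call and `delete_endpoint` would invoke the session's endpoint-deletion call with `endpoint_name`; the exact wiring depends on the SDK version in use.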
2 changes: 1 addition & 1 deletion tox.ini
@@ -43,7 +43,7 @@ require-code = True
# Can be used to specify which tests to run, e.g.: tox -- -s
basepython = python3
commands =
-python -m pytest {posargs}
+python -m pytest {posargs} --ignore=test/functional
deps =
pytest
requests