This repository was archived by the owner on May 23, 2024. It is now read-only.

Commit db12d18

Add EI support to TFS container. (#10)

* Introduce Dockerfile.ei to support EI and modify scripts.
* Incorporate build script.
* Add functional test and update README.
* Only get executable when --arch is ei.
* Update integ test using boto3, update shared.sh to discover file names.
* Update README example, refactor test.
1 parent a92b896 commit db12d18

File tree

8 files changed: +288 −7 lines changed

README.md

Lines changed: 21 additions & 3 deletions

@@ -50,13 +50,14 @@ The Docker images are built from the Dockerfiles in
 [docker/](https://github.com/aws/sagemaker-tensorflow-serving-container/tree/master/docker).

 The Dockerfiles are grouped based on the version of TensorFlow Serving they support. Each supported
-processor type (e.g. "cpu", "gpu") has a different Dockerfile in each group.
+processor type (e.g. "cpu", "gpu", "ei") has a different Dockerfile in each group.

 To build an image, run the `./scripts/build.sh` script:

 ```bash
 ./scripts/build.sh --version 1.11 --arch cpu
 ./scripts/build.sh --version 1.11 --arch gpu
+./scripts/build.sh --version 1.11 --arch ei
 ```

@@ -67,6 +68,7 @@ in SageMaker, you need to publish it to an ECR repository in your account. The
 ```bash
 ./scripts/publish.sh --version 1.11 --arch cpu
 ./scripts/publish.sh --version 1.11 --arch gpu
+./scripts/publish.sh --version 1.11 --arch ei
 ```

 Note: this will publish to ECR in your default region. Use the `--region` argument to

@@ -80,8 +82,8 @@ GPU images) will work for this, or you can use the provided `start.sh`
 and `stop.sh` scripts:

 ```bash
-./scripts/start.sh [--version x.xx] [--arch cpu|gpu|...]
-./scripts/stop.sh [--version x.xx] [--arch cpu|gpu|...]
+./scripts/start.sh [--version x.xx] [--arch cpu|gpu|ei|...]
+./scripts/stop.sh [--version x.xx] [--arch cpu|gpu|ei|...]
 ```

 When the container is running, you can send test requests to it using any HTTP client. Here's

@@ -106,6 +108,22 @@ checkers using `tox`:
 tox
 ```

+To test against Elastic Inference with an accelerator, you will need an AWS account. Publish your built image to an ECR repository and run the following command:
+
+    pytest test/sagemaker/test_elastic_inference.py --aws-id <aws_account> \
+                                                    --docker-base-name <ECR_repository_name> \
+                                                    --instance-type <instance_type> \
+                                                    --accelerator-type <accelerator_type> \
+                                                    --tag <image_tag>
+
+For example:
+
+    pytest test/sagemaker/test_elastic_inference.py --aws-id 0123456789012 \
+                                                    --docker-base-name sagemaker-tensorflow-serving \
+                                                    --instance-type ml.m4.xlarge \
+                                                    --accelerator-type ml.eia1.large \
+                                                    --tag 1.12.0-ei
+
 ## Contributing

 Please read [CONTRIBUTING.md](https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/CONTRIBUTING.md)

docker/Dockerfile.ei

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+FROM ubuntu:16.04
+LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
+
+ARG TFS_SHORT_VERSION
+
+COPY AmazonEI_TensorFlow_Serving_v${TFS_SHORT_VERSION}_v1 /usr/bin/tensorflow_model_server
+
+# downloaded 1.12 version is not executable
+RUN chmod +x /usr/bin/tensorflow_model_server
+
+# nginx + njs
+RUN \
+    apt-get update && \
+    apt-get -y install --no-install-recommends curl && \
+    curl -s http://nginx.org/keys/nginx_signing.key | apt-key add - && \
+    echo 'deb http://nginx.org/packages/ubuntu/ xenial nginx' >> /etc/apt/sources.list && \
+    apt-get update && \
+    apt-get -y install --no-install-recommends nginx nginx-module-njs python3 python3-pip && \
+    apt-get clean
+
+COPY ./ /
+RUN rm AmazonEI_TensorFlow_Serving_v${TFS_SHORT_VERSION}_v1
+
+ENV SAGEMAKER_TFS_VERSION "${TFS_SHORT_VERSION}"
+ENV PATH "$PATH:/sagemaker"
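The Dockerfile above expects a version-stamped EI ModelServer binary to be present in the build context. A minimal sketch of how that expected filename is derived from the `TFS_SHORT_VERSION` build arg (the helper name is ours; only the naming pattern comes from the `COPY`/`RUN` lines above):

```python
def ei_executable_name(tfs_short_version):
    # Pattern taken from Dockerfile.ei: the EI TensorFlow Serving binary
    # is named AmazonEI_TensorFlow_Serving_v<version>_v1.
    return "AmazonEI_TensorFlow_Serving_v{}_v1".format(tfs_short_version)

print(ei_executable_name("1.12"))
```

Passing `--build-arg TFS_SHORT_VERSION=1.12` would therefore make the `COPY` line look for `AmazonEI_TensorFlow_Serving_v1.12_v1`.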

scripts/build.sh

Lines changed: 4 additions & 0 deletions

@@ -8,6 +8,10 @@ source scripts/shared.sh

 parse_std_args "$@"

+if [ $arch = 'ei' ]; then
+    get_tfs_executable
+fi
+
 echo "pulling previous image for layer cache... "
 $(aws ecr get-login --no-include-email --registry-id $aws_account) &>/dev/null || echo 'warning: ecr login failed'
 docker pull $aws_account.dkr.ecr.$aws_region.amazonaws.com/sagemaker-tensorflow-serving:$full_version-$arch &>/dev/null || echo 'warning: pull failed'
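The `docker pull` in `build.sh` interpolates account, region, version, and arch into an ECR image URI to warm the layer cache. A sketch of that composition, with all values hypothetical:

```python
def ecr_image_uri(aws_account, aws_region, full_version, arch,
                  repo="sagemaker-tensorflow-serving"):
    # Mirrors the URI interpolated into the `docker pull` line above.
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}-{}".format(
        aws_account, aws_region, repo, full_version, arch)

print(ecr_image_uri("123456789012", "us-west-2", "1.12.0", "ei"))
```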

scripts/shared.sh

Lines changed: 13 additions & 2 deletions

@@ -4,7 +4,7 @@

 function error() {
     >&2 echo $1
-    >&2 echo "usage: $0 [--version <major-version>] [--arch (cpu*|gpu)] [--region <aws-region>]"
+    >&2 echo "usage: $0 [--version <major-version>] [--arch (cpu*|gpu|ei)] [--region <aws-region>]"
     exit 1
 }

@@ -28,6 +28,17 @@ function get_aws_account() {
     aws sts get-caller-identity --query 'Account' --output text
 }

+function get_tfs_executable() {
+    zip_file=$(aws s3 ls 's3://amazonei-tensorflow/Tensorflow Serving/v'${version}'/Ubuntu/' | awk '{print $4}')
+    aws s3 cp 's3://amazonei-tensorflow/Tensorflow Serving/v'${version}'/Ubuntu/'${zip_file} .
+
+    mkdir exec_dir
+    unzip ${zip_file} -d exec_dir
+
+    find . -name AmazonEI_TensorFlow_Serving_v${version}_v1* -exec mv {} container/ \;
+    rm ${zip_file} && rm -rf exec_dir
+}
+
 function parse_std_args() {
     # defaults
     arch='cpu'

@@ -63,7 +74,7 @@ function parse_std_args() {
     done

     [[ -z "${version// }" ]] && error 'missing version'
-    [[ "$arch" =~ ^(cpu|gpu)$ ]] || error "invalid arch: $arch"
+    [[ "$arch" =~ ^(cpu|gpu|ei)$ ]] || error "invalid arch: $arch"
     [[ -z "${aws_region// }" ]] && error 'missing aws region'

     full_version=$(get_full_version $version)
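`get_tfs_executable` discovers the zip's name rather than hard-coding it: `aws s3 ls` prints one line per object (date, time, size, key), and `awk '{print $4}'` keeps the fourth field, the key. A Python sketch of the same field extraction, using a made-up listing line for illustration:

```python
def zip_name_from_s3_ls(ls_line):
    # `aws s3 ls` output is: <date> <time> <size> <key>;
    # awk '{print $4}' is equivalent to taking the fourth field.
    return ls_line.split()[3]

# Hypothetical listing line; the real size and timestamp will differ.
sample = "2018-11-28 12:00:00  123456789 AmazonEI_TensorFlow_Serving_v1.12_v1.zip"
print(zip_name_from_s3_ls(sample))
```

Note this relies on the object key itself containing no spaces, which holds for the version-stamped zip names used here.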

test/conftest.py

Lines changed: 0 additions & 1 deletion

@@ -10,4 +10,3 @@
 # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.
-
test/sagemaker/conftest.py

Lines changed: 96 additions & 0 deletions

@@ -0,0 +1,96 @@
+# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+
+import logging
+
+import boto3
+import pytest
+from sagemaker import Session
+from sagemaker.tensorflow import TensorFlow
+
+logger = logging.getLogger(__name__)
+logging.getLogger('boto').setLevel(logging.INFO)
+logging.getLogger('botocore').setLevel(logging.INFO)
+logging.getLogger('factory.py').setLevel(logging.INFO)
+logging.getLogger('auth.py').setLevel(logging.INFO)
+logging.getLogger('connectionpool.py').setLevel(logging.INFO)
+
+
+def pytest_addoption(parser):
+    parser.addoption('--aws-id')
+    parser.addoption('--docker-base-name', default='sagemaker-tensorflow-serving')
+    parser.addoption('--instance-type')
+    parser.addoption('--accelerator-type', default=None)
+    parser.addoption('--region', default='us-west-2')
+    parser.addoption('--framework-version', default=TensorFlow.LATEST_VERSION)
+    parser.addoption('--processor', default='cpu', choices=['gpu', 'cpu'])
+    parser.addoption('--tag')
+
+
+@pytest.fixture(scope='session')
+def aws_id(request):
+    return request.config.getoption('--aws-id')
+
+
+@pytest.fixture(scope='session')
+def docker_base_name(request):
+    return request.config.getoption('--docker-base-name')
+
+
+@pytest.fixture(scope='session')
+def instance_type(request):
+    return request.config.getoption('--instance-type')
+
+
+@pytest.fixture(scope='session')
+def accelerator_type(request):
+    return request.config.getoption('--accelerator-type')
+
+
+@pytest.fixture(scope='session')
+def region(request):
+    return request.config.getoption('--region')
+
+
+@pytest.fixture(scope='session')
+def framework_version(request):
+    return request.config.getoption('--framework-version')
+
+
+@pytest.fixture(scope='session')
+def processor(request):
+    return request.config.getoption('--processor')
+
+
+@pytest.fixture(scope='session')
+def tag(request, framework_version, processor):
+    provided_tag = request.config.getoption('--tag')
+    default_tag = '{}-{}'.format(framework_version, processor)
+    return provided_tag if provided_tag is not None else default_tag
+
+
+@pytest.fixture(scope='session')
+def docker_registry(aws_id, region):
+    return '{}.dkr.ecr.{}.amazonaws.com'.format(aws_id, region)
+
+
+@pytest.fixture(scope='module')
+def docker_image(docker_base_name, tag):
+    return '{}:{}'.format(docker_base_name, tag)
+
+
+@pytest.fixture(scope='module')
+def docker_image_uri(docker_registry, docker_image):
+    uri = '{}/{}'.format(docker_registry, docker_image)
+    return uri
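The `tag`, `docker_registry`, and `docker_image_uri` fixtures compose the final image URI in stages. A self-contained sketch of that logic outside pytest (function names are ours; the string formats match the fixtures):

```python
def resolve_tag(provided_tag, framework_version, processor):
    # An explicit --tag wins; otherwise default to '<framework_version>-<processor>'.
    default_tag = '{}-{}'.format(framework_version, processor)
    return provided_tag if provided_tag is not None else default_tag

def image_uri(aws_id, region, docker_base_name, tag):
    # registry -> image -> full URI, as in the three fixtures above.
    registry = '{}.dkr.ecr.{}.amazonaws.com'.format(aws_id, region)
    return '{}/{}:{}'.format(registry, docker_base_name, tag)

print(resolve_tag(None, '1.12.0', 'cpu'))
print(image_uri('123456789012', 'us-west-2', 'sagemaker-tensorflow-serving', '1.12.0-ei'))
```

This is why the EI test run passes `--tag 1.12.0-ei` explicitly: the default tag would otherwise end in `-cpu` or `-gpu`, since `--processor` only accepts those two choices.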
test/sagemaker/test_elastic_inference.py

Lines changed: 128 additions & 0 deletions

@@ -0,0 +1,128 @@
+# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+import io
+import json
+import logging
+import time
+
+import boto3
+import numpy as np
+
+import pytest
+
+EI_SUPPORTED_REGIONS = ['us-east-1', 'us-east-2', 'us-west-2', 'eu-west-1', 'ap-northeast-1', 'ap-northeast-2']
+
+logger = logging.getLogger(__name__)
+logging.getLogger('boto3').setLevel(logging.INFO)
+logging.getLogger('botocore').setLevel(logging.INFO)
+logging.getLogger('factory.py').setLevel(logging.INFO)
+logging.getLogger('auth.py').setLevel(logging.INFO)
+logging.getLogger('connectionpool.py').setLevel(logging.INFO)
+logging.getLogger('session.py').setLevel(logging.DEBUG)
+logging.getLogger('functional').setLevel(logging.DEBUG)
+
+
+@pytest.fixture(autouse=True)
+def skip_if_no_accelerator(accelerator_type):
+    if accelerator_type is None:
+        pytest.skip('Skipping because accelerator type was not provided')
+
+
+@pytest.fixture(autouse=True)
+def skip_if_non_supported_ei_region(region):
+    if region not in EI_SUPPORTED_REGIONS:
+        pytest.skip('EI is not supported in {}'.format(region))
+
+
+@pytest.fixture
+def pretrained_model_data(region):
+    return 's3://sagemaker-sample-data-{}/tensorflow/model/resnet/resnet_50_v2_fp32_NCHW.tar.gz'.format(region)
+
+
+def _timestamp():
+    return time.strftime("%Y-%m-%d-%H-%M-%S")
+
+
+def _execution_role(session):
+    return session.resource('iam').Role('SageMakerRole').arn
+
+
+def _production_variants(model_name, instance_type, accelerator_type):
+    production_variants = [{
+        'VariantName': 'AllTraffic',
+        'ModelName': model_name,
+        'InitialInstanceCount': 1,
+        'InstanceType': instance_type,
+        'AcceleratorType': accelerator_type
+    }]
+    return production_variants
+
+
+def _create_model(session, client, docker_image_uri, pretrained_model_data):
+    model_name = 'test-tfs-ei-model-{}'.format(_timestamp())
+    client.create_model(ModelName=model_name,
+                        ExecutionRoleArn=_execution_role(session),
+                        PrimaryContainer={
+                            'Image': docker_image_uri,
+                            'ModelDataUrl': pretrained_model_data
+                        })
+
+
+def _create_endpoint(client, endpoint_config_name, endpoint_name, model_name, instance_type, accelerator_type):
+    client.create_endpoint_config(EndpointConfigName=endpoint_config_name,
+                                  ProductionVariants=_production_variants(model_name, instance_type, accelerator_type))
+
+    client.create_endpoint(EndpointName=endpoint_name,
+                           EndpointConfigName=endpoint_config_name)
+
+    logger.info('deploying model to endpoint: {}'.format(endpoint_name))
+
+    try:
+        client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
+    finally:
+        status = client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
+        if status != 'InService':
+            logger.error('failed to create endpoint: {}'.format(endpoint_name))
+            raise Exception('Failed to create endpoint.')
+
+
+@pytest.mark.skip_if_non_supported_ei_region
+@pytest.mark.skip_if_no_accelerator
+def test_deploy_elastic_inference_with_pretrained_model(pretrained_model_data,
+                                                        docker_image_uri,
+                                                        instance_type,
+                                                        accelerator_type):
+    endpoint_name = 'test-tfs-ei-deploy-model-{}'.format(_timestamp())
+    endpoint_config_name = 'test-tfs-endpoint-config-{}'.format(_timestamp())
+    model_name = 'test-tfs-ei-model-{}'.format(_timestamp())
+
+    session = boto3.Session()
+    client = session.client('sagemaker')
+    runtime_client = session.client('runtime.sagemaker')
+
+    _create_model(session, client, docker_image_uri, pretrained_model_data)
+    _create_endpoint(client, endpoint_config_name, endpoint_name, model_name, instance_type, accelerator_type)
+
+    input_data = {'instances': np.random.rand(1, 1, 3, 3).tolist()}
+
+    try:
+        response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
+                                                  ContentType='application/json',
+                                                  Body=json.dumps(input_data))
+        result = json.loads(response['Body'].read().decode())
+        assert result['predictions'] is not None
+    finally:
+        logger.info('deleting endpoint, endpoint config and model.')
+        client.delete_endpoint(EndpointName=endpoint_name)
+        client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
+        client.delete_model(ModelName=model_name)
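The test's request body wraps a random NCHW tensor in TensorFlow Serving's REST `instances` format before JSON-encoding it. A standalone sketch of that payload construction, mirroring the lines in the test:

```python
import json
import numpy as np

# One 1x3x3 sample in NCHW layout, as in the test's call to
# np.random.rand(1, 1, 3, 3); .tolist() makes it JSON-serializable.
input_data = {'instances': np.random.rand(1, 1, 3, 3).tolist()}
body = json.dumps(input_data)

# Round-trip to confirm the nested list shape survives serialization.
parsed = json.loads(body)
print(len(parsed['instances']))
```

The outer list length is the batch size, so the model sees a single instance per request here.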

tox.ini

Lines changed: 1 addition & 1 deletion

@@ -43,7 +43,7 @@ require-code = True
 # Can be used to specify which tests to run, e.g.: tox -- -s
 basepython = python3
 commands =
-    python -m pytest {posargs}
+    python -m pytest {posargs} --ignore=test/sagemaker
 deps =
     pytest
     requests
