
Commit a4b647b

Author: Jonathan Esterhazy
import project files

1 parent 5f5871a

35 files changed: +1180 -3 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@

log.txt

NOTICE

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-Sagemaker Tfs Container
+Sagemaker TensorFlow Serving Container
 Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

README.md

Lines changed: 110 additions & 2 deletions
@@ -1,7 +1,115 @@
-## Sagemaker Tfs Container
-A TensorFlow Serving solution for use in SageMaker.

# SageMaker TensorFlow Serving Container

SageMaker TensorFlow Serving Container is an open source project that builds
Docker images for running TensorFlow Serving on
[Amazon SageMaker](https://aws.amazon.com/documentation/sagemaker/).

This documentation covers building and testing these Docker images.

For information about using TensorFlow Serving on SageMaker, see:
[Deploying to TensorFlow Serving Endpoints](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst)
in the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) documentation.

For notebook examples, see: [Amazon SageMaker Examples](https://github.com/awslabs/amazon-sagemaker-examples).

## Table of Contents

1. [Getting Started](#getting-started)
2. [Building your image](#building-your-image)
3. [Running the tests](#running-the-tests)

## Getting Started

### Prerequisites

Make sure you have installed all of the following prerequisites on your
development machine:

- [Docker](https://www.docker.com/)
- [AWS CLI](https://aws.amazon.com/cli/)

For testing, you will also need:

- [Python 3.5+](https://www.python.org/)
- [pytest](https://docs.pytest.org/en/latest/)
- The Python [requests](http://docs.python-requests.org/en/master/) library

To test GPU images locally, you will also need:

- [nvidia-docker](https://github.com/NVIDIA/nvidia-docker)

**Note:** Some of the build and test scripts interact with resources in your AWS account. Be sure to
set your default AWS credentials and region using `aws configure` before using these scripts.
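
For example, to set up your credentials and then verify what the scripts will use (both commands are part of the AWS CLI):

```bash
# set credentials and a default region interactively
aws configure

# verify which credentials and region will be used
aws configure list
```
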
## Building your image

Amazon SageMaker uses Docker containers to run all training jobs and inference endpoints.

The Docker images are built from the Dockerfiles in
[docker/](https://github.com/aws/sagemaker-tensorflow-serving-container/tree/master/docker).

The Dockerfiles are grouped based on the version of TensorFlow Serving they support. Each supported
processor type (e.g. "cpu", "gpu") has a different Dockerfile in each group.
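
For example, a version group might be organized like this (illustrative only; the actual file names are in the `docker/` directory):

```bash
# hypothetical layout -- check the repository for the actual structure
ls docker/1.11/
# Dockerfile.cpu  Dockerfile.gpu
```
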
To build an image, run the `./scripts/build.sh` script:

```bash
./scripts/build.sh --version 1.11 --arch cpu
./scripts/build.sh --version 1.11 --arch gpu
```

If you are testing locally, building the image is enough. But if you want to use your updated image
in SageMaker, you need to publish it to an ECR repository in your account. The
`./scripts/publish.sh` script makes that easy:

```bash
./scripts/publish.sh --version 1.11 --arch cpu
./scripts/publish.sh --version 1.11 --arch gpu
```

**Note:** this will publish to ECR in your default region. Use the `--region` argument to
specify a different region.
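
For example (the region shown is illustrative):

```bash
# publish to a specific region instead of your AWS CLI default
./scripts/publish.sh --version 1.11 --arch cpu --region us-west-2
```
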
### Running your image in local Docker

You can also run your container locally in Docker to test different models and send
inference requests by hand. Standard `docker run` commands (or `nvidia-docker run` for
GPU images) will work for this, or you can use the provided `start.sh`
and `stop.sh` scripts:

```bash
./scripts/start.sh [--version x.xx] [--arch cpu|gpu|...]
./scripts/stop.sh [--version x.xx] [--arch cpu|gpu|...]
```
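
If you prefer a plain `docker run`, here is a minimal sketch; the image tag and host model directory are assumptions about your local build and test data, while `/opt/ml/model` and port 8080 are the container's defaults:

```bash
docker run --rm -p 8080:8080 \
    -v $PWD/test/resources/models:/opt/ml/model:ro \
    sagemaker-tensorflow-serving:1.11-cpu
```
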
When the container is running, you can send test requests to it using any HTTP client. Here's
an example using the `curl` command:

```bash
curl -X POST --data-binary @test/resources/inputs/test.json \
     -H 'Content-Type: application/json' \
     -H 'X-Amzn-SageMaker-Custom-Attributes: tfs-model-name=half_plus_three' \
     http://localhost:8080/invocations
```

Additional `curl` examples can be found in `./scripts/curl.sh`.
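
You can also check that the container is healthy using the `/ping` endpoint defined in the nginx configuration (shown later in this commit):

```bash
curl http://localhost:8080/ping
```
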
## Running the tests

The package includes some automated unit and integration tests. These tests use Docker to run
your image locally, and do not access resources in AWS. You can run them using `pytest`:

```bash
pytest ./test
```
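
Standard `pytest` selection options work here as well; for example, to run only tests whose names match a keyword (the keyword is illustrative):

```bash
pytest ./test -k 'gpu'
```
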
## Contributing

Please read [CONTRIBUTING.md](https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/CONTRIBUTING.md)
for details on our code of conduct, and the process for submitting pull requests to us.

## License

This library is licensed under the Apache 2.0 License.
container/sagemaker/nginx.conf.template

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
load_module modules/ngx_http_js_module.so;

worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log /dev/stderr %NGINX_LOG_LEVEL%;

worker_rlimit_nofile 4096;

events {
    worker_connections 2048;
}

http {
    include /etc/nginx/mime.types;
    default_type application/json;
    access_log /dev/stdout combined;
    js_include tensorflow-serving.js;

    upstream tfs_upstream {
        server localhost:%TFS_REST_PORT%;
    }

    server {
        listen %NGINX_HTTP_PORT% deferred;
        client_max_body_size 0;
        client_body_buffer_size 100m;
        subrequest_output_buffer_size 100m;

        set $default_tfs_model %TFS_DEFAULT_MODEL_NAME%;

        location /tfs {
            rewrite ^/tfs/(.*) /$1 break;
            proxy_redirect off;
            proxy_pass_request_headers off;
            proxy_set_header Content-Type 'application/json';
            proxy_set_header Accept 'application/json';
            proxy_pass http://tfs_upstream;
        }

        location /ping {
            js_content ping;
        }

        location /invocations {
            js_content invocations;
        }

        location / {
            return 404 '{"error": "Not Found"}';
        }

        keepalive_timeout 3;
    }
}
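
The `%NAME%` placeholders in this template are filled in at container startup by the service manager in `serve.py` (below), which writes the rendered result to `/sagemaker/nginx.conf`. One way to inspect the rendered config in a running container (the container name is illustrative):

```bash
docker exec my-tfs-container cat /sagemaker/nginx.conf
```
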

container/sagemaker/serve

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@

#!/bin/bash

python3 /sagemaker/serve.py

container/sagemaker/serve.py

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@

# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

import logging
import os
import re
import signal
import subprocess

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


class ServiceManager(object):
    def __init__(self):
        self._state = 'initializing'
        self._nginx = None
        self._tfs = None
        self._nginx_http_port = os.environ.get('SAGEMAKER_BIND_TO_PORT', '8080')
        self._nginx_loglevel = os.environ.get('SAGEMAKER_TFS_NGINX_LOGLEVEL', 'error')

        self._tfs_default_model_name = os.environ.get('SAGEMAKER_TFS_DEFAULT_MODEL_NAME', None)

        if 'SAGEMAKER_SAFE_PORT_RANGE' in os.environ:
            port_range = os.environ['SAGEMAKER_SAFE_PORT_RANGE']
            parts = port_range.split('-')
            low = int(parts[0])
            hi = int(parts[1])
            if low + 1 > hi:
                raise ValueError('not enough ports available in SAGEMAKER_SAFE_PORT_RANGE ({})'
                                 .format(port_range))
            self._tfs_grpc_port = str(low)
            self._tfs_rest_port = str(low + 1)
        else:
            # just use the standard default ports
            self._tfs_grpc_port = '9000'
            self._tfs_rest_port = '8501'

    def _create_tfs_config(self):
        models = self._find_models()

        if not models:
            raise ValueError('no SavedModel bundles found!')

        if self._tfs_default_model_name is None:
            self._tfs_default_model_name = os.path.basename(models[0])
            log.info('using default model name: {}'.format(self._tfs_default_model_name))

        # config (may) include duplicate 'config' keys, so we can't just dump a dict
        config = 'model_config_list: {\n'
        for m in models:
            config += '  config: {\n'
            config += '    name: "{}",\n'.format(os.path.basename(m))
            config += '    base_path: "{}",\n'.format(m)
            config += '    model_platform: "tensorflow"\n'
            config += '  },\n'
        config += '}\n'

        log.info('tensorflow serving model config: \n%s\n', config)

        with open('/sagemaker/model-config.cfg', 'w') as f:
            f.write(config)

    def _find_models(self):
        # a model directory contains one or more numeric version
        # subdirectories, each holding a saved_model.pb file
        base_path = '/opt/ml/model'
        models = []
        for f in self._find_saved_model_files(base_path):
            parts = f.split('/')
            if len(parts) >= 6 and re.match(r'^\d+$', parts[-2]):
                model_path = '/'.join(parts[0:-2])
                if model_path not in models:
                    models.append(model_path)
        return models

    def _find_saved_model_files(self, path):
        for e in os.scandir(path):
            if e.is_dir():
                yield from self._find_saved_model_files(os.path.join(path, e.name))
            else:
                if e.name == 'saved_model.pb':
                    yield os.path.join(path, e.name)

    def _create_nginx_config(self):
        template = self._read_nginx_template()
        pattern = re.compile(r'%(\w+)%')
        template_values = {
            'TFS_REST_PORT': self._tfs_rest_port,
            'TFS_DEFAULT_MODEL_NAME': self._tfs_default_model_name,
            'NGINX_HTTP_PORT': self._nginx_http_port,
            'NGINX_LOG_LEVEL': self._nginx_loglevel
        }

        config = pattern.sub(lambda x: template_values[x.group(1)], template)
        log.info('nginx config: \n%s\n', config)

        with open('/sagemaker/nginx.conf', 'w') as f:
            f.write(config)

    def _read_nginx_template(self):
        with open('/sagemaker/nginx.conf.template', 'r') as f:
            template = f.read()
            if not template:
                raise ValueError('failed to read nginx.conf.template')

            return template

    def _start_tfs(self):
        tfs_config_path = '/sagemaker/model-config.cfg'
        cmd = "tensorflow_model_server --port={} --rest_api_port={} --model_config_file={}".format(
            self._tfs_grpc_port, self._tfs_rest_port, tfs_config_path)
        log.info('tensorflow serving command: {}'.format(cmd))
        p = subprocess.Popen(cmd.split())
        log.info('started tensorflow serving (pid: %d)', p.pid)
        self._tfs = p

    def _start_nginx(self):
        p = subprocess.Popen('/usr/sbin/nginx -c /sagemaker/nginx.conf'.split())
        log.info('started nginx (pid: %d)', p.pid)
        self._nginx = p

    def _stop(self, *args):
        self._state = 'stopping'
        log.info('stopping services')
        try:
            os.kill(self._nginx.pid, signal.SIGQUIT)
        except OSError:
            pass
        try:
            os.kill(self._tfs.pid, signal.SIGTERM)
        except OSError:
            pass

        self._state = 'stopped'
        log.info('stopped')

    def start(self):
        log.info('starting services')
        self._state = 'starting'
        signal.signal(signal.SIGTERM, self._stop)

        # TODO set env vars for ports etc
        self._create_tfs_config()
        self._create_nginx_config()

        self._start_tfs()
        self._start_nginx()
        self._state = 'started'

        # restart either child process if it exits unexpectedly; the SIGTERM
        # handler (self._stop) changes _state so the loop exits cleanly
        while True:
            pid, status = os.wait()

            if self._state != 'started':
                break

            if pid == self._nginx.pid:
                log.warning('unexpected nginx exit (status: {}). restarting.'.format(status))
                self._start_nginx()

            elif pid == self._tfs.pid:
                log.warning(
                    'unexpected tensorflow serving exit (status: {}). restarting.'.format(status))
                self._start_tfs()

        self._stop()


if __name__ == '__main__':
    ServiceManager().start()
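
All of `ServiceManager`'s configuration comes from environment variables, so its behavior can be adjusted when the container is started. A sketch of overriding the defaults (the image tag and host model directory are assumptions; the environment variable names come from the code above):

```bash
docker run --rm -p 9000:9000 \
    -e SAGEMAKER_BIND_TO_PORT=9000 \
    -e SAGEMAKER_TFS_NGINX_LOGLEVEL=info \
    -e SAGEMAKER_TFS_DEFAULT_MODEL_NAME=half_plus_three \
    -v $PWD/models:/opt/ml/model:ro \
    sagemaker-tensorflow-serving:1.11-cpu
```
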
