This repository was archived by the owner on May 23, 2024. It is now read-only.

Add EI support to TFS container. #10

Merged on Mar 4, 2019 (6 commits)

Conversation

@chuyang-deng (Contributor) commented Feb 25, 2019

Issue #, if available:
aws/amazon-sagemaker-examples#613
aws/sagemaker-python-sdk@4732545#r32336726

Description of changes:
Introduce a Dockerfile.ei to build the TFS container with EI support (a rough sketch follows at the end of this description).
Modify the scripts that build and run Dockerfile.ei:

  • build.sh: log in to the account hosting the cpu image, since Dockerfile.ei builds on top of Dockerfile.cpu; also add a build argument to pass in the Amazon EI model server binary
  • shared.sh: add argument parsing for the model variable

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
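A rough sketch of Dockerfile.ei, assembled from the snippets quoted later in this review (illustrative only; the hard-coded base image shown here is what reviewers later asked to change, since the EI image must layer on the image built in the current run rather than a released one):

# Dockerfile.ei (sketch only)
ARG TFS_VERSION

# base on the CPU serving image; the exact reference was revised during review
FROM 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-serving:${TFS_VERSION}-cpu

# name of the Amazon EI model server binary placed in the build context
ARG TENSORFLOW_MODEL_SERVER

# replace the stock model server with the EI-enabled binary
COPY $TENSORFLOW_MODEL_SERVER /usr/bin/tensorflow_model_server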

@jesterhazy (Contributor)

AWS CodeBuild CI Report

  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ChoiByungWook left a comment:

Can we add an EI-specific integ test for this?

In addition, we should update the docs: https://github.com/aws/sagemaker-tensorflow-serving-container#building-your-image.

# Validate the argument is specified
RUN test $TENSORFLOW_MODEL_SERVER || exit 1

COPY ./$TENSORFLOW_MODEL_SERVER /usr/bin/tensorflow_model_server


Is the ./ needed?

@@ -32,6 +32,7 @@ function parse_std_args() {
# defaults
arch='cpu'
version='1.12.0'
model='AmazonEI_TensorFlow_Serving_v1.12_v1'


The file hosted in S3 is a zip, where does that get handled?

Contributor

also 'model' isn't a good name

Contributor

+1 the s3 fetch is missing, and there should be updates to the build.sh as well (or nothing is getting built).

also, if the S3 fetch normalized the file name when it downloaded, then there would likely be no need to pass the filename into the dockerfile as an ARG.

scripts/build.sh Outdated
# logout 520713654638 account after building
if [ "$arch" = "ei" ]; then
docker logout https://520713654638.dkr.ecr.$aws_region.amazonaws.com &>/dev/null
fi
Contributor

add nl @ eof

scripts/build.sh Outdated
# Dockerfile.ei needs to login 520713654638 to pull the cpu image
if [ "$arch" = "ei" ]; then
echo "pulling cpu image..."
$(aws ecr get-login --no-include-email --registry-id 520713654638) &>/dev/null || echo 'warning: ecr login failed'
Contributor

no, do not log in to the prod repo. this breaks the PR and release build flow.

@@ -0,0 +1,10 @@
ARG TFS_VERSION

FROM 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-serving:${TFS_VERSION}-cpu
Contributor

no, do not use the prod repo! this breaks the PR and release build flow. the image would be built from the previously released production image instead of the current image, and we would ship a stale container!
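One hedged way to address this (an illustration only, not the merged fix; the argument name TFS_BASE_IMAGE and the local tag are assumptions):

# take the whole base image reference as a build argument instead of hardcoding the prod account
ARG TFS_BASE_IMAGE
FROM ${TFS_BASE_IMAGE}

In build.sh, the build would then pass the image produced earlier in the same run, e.g. --build-arg TFS_BASE_IMAGE=sagemaker-tensorflow-serving:$full_version-cpu (the local tag name is an assumption).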

@@ -55,6 +56,11 @@ function parse_std_args() {
shift
shift
;;
-m|--model)
Contributor

model isn't a good name for something that points at a tfs executable

ARG TENSORFLOW_MODEL_SERVER

# Validate the argument is specified
RUN test $TENSORFLOW_MODEL_SERVER || exit 1
Contributor

i don't think this check is needed. docker build should fail by itself if you don't specify it, because the arg declaration has no default

scripts/build.sh Outdated

echo "building image... "
docker build \
--cache-from $aws_account.dkr.ecr.$aws_region.amazonaws.com/sagemaker-tensorflow-serving:$full_version-$arch \
--build-arg TFS_VERSION=$full_version \
--build-arg TFS_SHORT_VERSION=$short_version \
--build-arg TENSORFLOW_MODEL_SERVER=$model \
Contributor

consider EI_EXECUTABLE or TFS_EI_EXECUTABLE instead
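A hedged sketch of the rename (the flag name -e|--tfs-ei-executable and the variable name are assumptions, following the existing case/esac pattern in parse_std_args):

# shared.sh: parse the path of the EI model server executable instead of "model"
-e|--tfs-ei-executable)
    tfs_ei_executable="$2"
    shift
    shift
    ;;

The corresponding build.sh line would then become --build-arg TFS_EI_EXECUTABLE=$tfs_ei_executable.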

@jesterhazy (Contributor)

AWS CodeBuild CI Report

  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@chuyang-deng (Contributor, Author)

I've updated the scripts to handle the zip files from the S3 bucket, changed the EI Dockerfile to avoid using the prod account, and dropped unnecessary arguments. I also updated the README and added a functional test.

@jesterhazy (Contributor)

AWS CodeBuild CI Report

  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository


@pytest.mark.skip_if_non_supported_ei_region
@pytest.mark.skip_if_no_accelerator
def test_deploy_elastic_inference_with_pretrained_model(pretrained_model_data, docker_image_uri, sagemaker_session, instance_type, accelerator_type):
Contributor

  • endpoint needs to be removed at the end of the test
  • not excited about taking a sagemaker sdk dependency for this. can we change this to use boto3 only (see the sketch below)? it shouldn't be too complicated since the model is already in s3
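A hedged sketch of the boto3-only approach suggested above (the helper structure, resource names, and payload are assumptions, not the merged test):

import boto3

def _create_and_invoke(image_uri, model_data_url, role_arn, instance_type, accelerator_type, name):
    sm = boto3.client('sagemaker')
    runtime = boto3.client('sagemaker-runtime')
    # model, endpoint config (with an EI accelerator attached) and endpoint, all via plain boto3
    sm.create_model(ModelName=name, ExecutionRoleArn=role_arn,
                    PrimaryContainer={'Image': image_uri, 'ModelDataUrl': model_data_url})
    sm.create_endpoint_config(EndpointConfigName=name, ProductionVariants=[{
        'VariantName': 'AllTraffic', 'ModelName': name, 'InitialInstanceCount': 1,
        'InstanceType': instance_type, 'AcceleratorType': accelerator_type}])
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name)
    sm.get_waiter('endpoint_in_service').wait(EndpointName=name)
    # invoke once to verify the container serves predictions
    return runtime.invoke_endpoint(EndpointName=name, ContentType='application/json',
                                   Body='{"instances": [1.0, 2.0, 5.0]}')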

Contributor Author

ohhh, gotcha!

@ChoiByungWook left a comment:

Good job!

@@ -28,6 +28,21 @@ function get_aws_account() {
aws sts get-caller-identity --query 'Account' --output text
}

function get_tfs_executable() {


Should this function utilize the parsed args below for the TF version?

It also looks like the naming scheme isn't consistent for the zip file for 1.11 and 1.12 already, which might require updating this file often.

Contributor Author

It only utilizes $version here, because the naming for v1.11 and v1.12 differs:
v1.11 -> Ubuntu -> Ubuntu.zip: Ubuntu/executable
v1.12 -> Ubuntu -> tfs_ei_v1_12_ubuntu.zip: v1_12_Ubuntu_2/executable

Even for v1.12, the zip name and the unzipped directory name are different, so I believe we will need to update this file in the future.

@ChoiByungWook (Feb 28, 2019):

It seems like in S3 it is always going to follow the pattern of: s3://amazonei-tensorflow/Tensorflow\ Serving/{version}/Ubuntu.

We can do a discovery on the item for example with:
aws s3 ls s3://amazonei-tensorflow/Tensorflow\ Serving/v1.11/Ubuntu/ | awk '{print $4}'
which should return Ubuntu.zip

Afterwards, I think we can probably also discover the unzipped name as well. For example:
find . -name AmazonEI_TensorFlow_Serving_v${version}_v1* -exec mv {} container/ \;

There are probably better commands than the ones I used above in my examples.
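Putting the two discovery steps above together, a hedged sketch of what get_tfs_executable could look like (the key layout and the container/ destination follow the examples in this thread; everything else is an assumption):

function get_tfs_executable() {
    # discover the zip name for this version, then download and unpack it
    zip_file=$(aws s3 ls "s3://amazonei-tensorflow/Tensorflow Serving/v${version}/Ubuntu/" | awk '{print $4}')
    aws s3 cp "s3://amazonei-tensorflow/Tensorflow Serving/v${version}/Ubuntu/${zip_file}" /tmp/
    unzip -o "/tmp/${zip_file}" -d /tmp/tfs_ei
    # discover the executable inside the zip and move it into the build context
    find /tmp/tfs_ei -type f -name 'AmazonEI_TensorFlow_Serving*' -exec mv {} container/ \;
}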

@@ -0,0 +1,100 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.


@jesterhazy, do you care about this being under functional or integration/sagemaker?

I think most of our framework containers are under integration/sagemaker, except for the sagemaker-tensorflow-container repository.

Contributor

yes, it seems a bit funky to have both. the stuff in test/integration are local docker container tests, so renaming test/functional to test/sagemaker or renaming everything to test/integration/{local, sagemaker} would fit the convention better

Contributor Author

I'll rename it to test/sagemaker



def pytest_addoption(parser):
parser.addoption('--aws-id')
parser.addoption('--docker-base-name', default='functional-tensorflow-serving')


is functional meant to correspond to test/functional?

Contributor Author

It's meant to be sagemaker-tensorflow-serving ... I just accidentally changed it to 'functional' for some reason I don't remember.

@pytest.fixture(scope='session')
def tag(request, framework_version, processor):
provided_tag = request.config.getoption('--tag')
default_tag = '{}-{}-py2'.format(framework_version, processor)


do we ever specify python within the tag for this specific container?

Contributor

nope

EndpointConfigName=endpoint_config_name)

try:
client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)


You don't have to do this; however, I can foresee us reusing this logic, so maybe it would be better to refactor it into another file?
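For example, a minimal sketch of such a shared helper (the module and function names are assumptions):

# test/sagemaker/util.py (hypothetical location)
def wait_for_endpoint(client, endpoint_name):
    # block until the endpoint reaches InService, reusable across tests
    client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)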

README.md Outdated
@@ -106,6 +108,15 @@ checkers using `tox`:
tox
```

To test Elastic Inference with Accelerator, you will need an AWS account, publish your built image to ECR repository and run the following command:


can you provide an example of the command needed to run the test?
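For illustration only (the --aws-id, --docker-base-name, and --tag options come from the conftest quoted in this review; the test path and the option values shown here are assumptions, not the command that ended up in the README):

pytest test/sagemaker \
    --aws-id 123456789012 \
    --docker-base-name sagemaker-tensorflow-serving \
    --tag 1.12.0-ei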


"To test against Elastic Inference, you will..."

Contributor Author

I'll update these as well as the above comments.

@jesterhazy (Contributor)

AWS CodeBuild CI Report

  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

result = json.loads(response['Body'].read().decode())
assert result['predictions'] is not None

client.delete_endpoint(EndpointName=endpoint_name)
Contributor

delete_endpoint should clear the endpoint_config too
it also needs to be in a finally block so the delete happens even if the test fails (see the sketch below)
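A hedged sketch of that cleanup (variable names partly follow the snippets quoted above; runtime_client and input_data are assumptions):

try:
    client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                              ContentType='application/json',
                                              Body=input_data)
    result = json.loads(response['Body'].read().decode())
    assert result['predictions'] is not None
finally:
    # clean up even when the assertions above fail
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)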

@jesterhazy (Contributor)

AWS CodeBuild CI Report

  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ChoiByungWook left a comment:

LGTM!

Let's wait on approval from @jesterhazy



def _execution_role(session):
return session.resource('iam').Role('SageMakerRole').arn


should we consider adding an argparser for the role?
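A minimal sketch of that, assuming a --role-arn option (the option and fixture names are assumptions; in practice the addoption line would go into the existing pytest_addoption):

import boto3
import pytest

def pytest_addoption(parser):
    parser.addoption('--role-arn', default=None)

@pytest.fixture(scope='session')
def role_arn(request):
    provided = request.config.getoption('--role-arn')
    # fall back to the existing SageMakerRole lookup when no ARN is passed on the command line
    return provided or boto3.Session().resource('iam').Role('SageMakerRole').arn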

@@ -0,0 +1,96 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Contributor

should this be under test/integration/sagemaker to match the other repos?

Contributor Author

renaming test/functional to test/sagemaker or renaming everything to test/integration/{local, sagemaker} would fit the convention better

I went with the option that involves fewer changes, but I'm also okay with putting everything under test/integration/... 😺

Contributor

if you do move it, do it in a different PR haha

@jesterhazy (Contributor) left a comment:

this looks great!
