Skip to content

feature: client cache for jumpstart models #2756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

evakravi
Copy link
Member

@evakravi evakravi commented Nov 10, 2021

Issue #, if available:

Description of changes:
This PR introduces a client-side cache for JumpStart models. The cache stores manifests and test specs associated with JumpStart models. In addition, this PR implements a low-level LRU cache class (with expiring elements). Currently, JumpStart SDK content is not launched in any region, so this code can only be used by plugging the cache to a pre-defined bucket (no default bucket).

In order to query models, the API for the cache requires a model id and semantic version. The semantic version relies on the semantic version library. This library allows us to perform version comparison and handle wildcard versions (e.g. 1.2.*).

The low-level cache class has been created in sagemaker.utilities.cache module. JumpStart-specific code is in sagemaker.jumpstart module.

The cache can be used as follows:

>> cache = JumpStartModelsCache(bucket="some_bucket")
>> header = cache.get_header(model_id="tensorflow-ic-imagenet-inception-v3-classification-4")
>> print(header.to_json())
{
           "model_id": "tensorflow-ic-imagenet-inception-v3-classification-4",
           "version": "2.0.0",
           "min_version": "2.49.0",
           "spec_key": "community_models_specs/tensorflow-ic-imagenet-inception-v3-classification-4/specs_v2.0.0.json",
}

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backword compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 3712df6
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 3712df6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 3712df6
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@codecov-commenter
Copy link

codecov-commenter commented Nov 11, 2021

Codecov Report

Merging #2756 (8f8b1aa) into master-jumpstart (0a8395a) will increase coverage by 0.14%.
The diff coverage is 95.66%.

Impacted file tree graph

@@                 Coverage Diff                  @@
##           master-jumpstart    #2756      +/-   ##
====================================================
+ Coverage             88.61%   88.76%   +0.14%     
====================================================
  Files                   173      179       +6     
  Lines                 15410    15733     +323     
====================================================
+ Hits                  13656    13965     +309     
- Misses                 1754     1768      +14     
Impacted Files Coverage Δ
src/sagemaker/jumpstart/types.py 89.79% <89.79%> (ø)
src/sagemaker/jumpstart/cache.py 95.87% <95.87%> (ø)
src/sagemaker/jumpstart/constants.py 100.00% <100.00%> (ø)
src/sagemaker/jumpstart/parameters.py 100.00% <100.00%> (ø)
src/sagemaker/jumpstart/utils.py 100.00% <100.00%> (ø)
src/sagemaker/utilities/cache.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a8395a...8f8b1aa. Read the comment docs.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: d08fa11
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 37a502e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@JGuinegagne JGuinegagne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but could you:

  • add unit tests for JumpStartModelsCache
  • test various failure cases (semantic raises an exception, s3:getObject raises a client error/ regular exception, VPC error etc)

from sagemaker.utilities.cache import LRUCache
import boto3
import json
import semantic_version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you please add the dependency imports at the top of the module, before the local imports?

DEFAULT_MAX_SEMANTIC_VERSION_CACHE_ITEMS = 20
DEFAULT_SEMANTIC_VERSION_CACHE_EXPIRATION_TIME = datetime.timedelta(hours=6)

DEFAULT_MANIFEST_FILE_S3_KEY = "models_manifest.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please move all numeric constants to a parameters.py and the key to a config.py or constants.py module?


Args:
region (Optional[str]): AWS region to associate with cache. Default: region associated
with botocore session.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: with boto3 session.

Args:
region (Optional[str]): AWS region to associate with cache. Default: region associated
with botocore session.
max_s3_cache_items (Optional[int]): Maximum number of files to store in s3 cache. Default: 20.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "file" is confusing since this is an in-memory cache. Refer to items?

semantic version cache. Default: 20.
semantic_version_cache_expiration_time (Optional[datetime.timedelta]): Maximum time to hold
items in semantic version cache before invalidation. Default: 6 hours.
bucket (Optional[str]): S3 bucket to associate with cache. Default: JumpStart-hosted content
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean S3 bucket name, correct? please specify, it could be an arn for example.

element_age = curr_time - element.creation_time
if element_age > self._expiration_time:
if fail_on_old_value:
raise KeyError(f"{key} is old! Created at {element.creation_time}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: f"{key} is beyond allowed time {self._expiration_time}. Created at {element.creation_time}."

"""
self._max_cache_items = max_cache_items
self._lru_cache: collections.OrderedDict = collections.OrderedDict()
self._expiration_time = expiration_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_expiration_time is a bit misleading because it could be interpreted as an absolute time, whereas it is a time delta. I am not sure what to replace it with though, maybe horizon ?

retrieval_function=retrieval_function,
)

curr_time = datetime.datetime.now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for date test, try to avoid .now() or any relative date. Instead pin the date and timezone. Otherwise, you can get your test randomly failing one day, for example in a pipeline. This usually happens over gap years/daylight saving day or other edge cases.


my_cache._retrieval_function.reset_mock()
my_cache.get(0)
my_cache._retrieval_function.assert_called()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice tests!

optional: would you like to check what it was called with?

my_cache.clear()
assert len(my_cache) == 0
with pytest.raises(KeyError):
my_cache.get(1, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a few more test cases:

  • retrieve_function is None
  • retrieve_function does not have the right signature
  • retrieve_function raises an Exception
  • retrieve_function raises a botocore.exceptions.ClientError
  • retrieve_function raises a VPC-related error (get ETIMEDOUT)

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: d7ff8f8
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 7511bb0
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@evakravi evakravi marked this pull request as ready for review November 15, 2021 18:12
Copy link
Contributor

@JGuinegagne JGuinegagne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two remaining comments:

  • revisit etag logic
  • unit tests for JumpStart cache

The other comments are nit.

if value is not None and etag == value.md5_hash:
return value
response = self._s3_client.get_object(Bucket=self.s3_bucket_name, Key=s3_key)
formatted_body = json.loads(response["Body"].read().decode("utf-8"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several issues here:

  • if value is None, you are making an unnecessary http call
  • if value is not None but the etag are different, you risk caching the wrong etag

Does this work?

if value is not None:
    etag = self._s3_client.head_object(Bucket=self.s3_bucket_name, Key=s3_key)["ETag"]
    if  etag == value.md5_hash:
         return value

response = self._s3_client.get_object(Bucket=self.s3_bucket_name, Key=s3_key)
formatted_body = json.loads(response["Body"].read().decode("utf-8"))
etag = response["ETag"]
return JumpStartCachedS3ContentValue(
    formatted_file_content=utils.get_formatted_manifest(formatted_body),
    md5_hash=etag,
)

While we are at it, can you unit test this behavior please?

return self._get_header_impl(model_id, 0, semantic_version_str)

def _get_header_impl(
self, model_id: str, attempt: int, semantic_version_str: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: how about setting a default for attempt at 0?

rejected.

Raises:
RuntimeError: If the SageMaker version is not readable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: you can also mention the exception raised by semantic_version.Version if you pass it an invalid version, for example semantic_version.Version("..")

def _get_item(self, key: KeyType, fail_on_old_value: bool) -> ValType:
"""Returns value from cache corresponding to key.

If ``fail_on_old_value``, a KeyError is thrown instead of a new value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit^2: python linguo calls for a KeyError is raised

retrieval_function=retrieval_function,
)

curr_time = datetime.datetime.fromtimestamp(1636730651.079551)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: calling this curr_time made me to a double-take. How about mock_curr_time ?

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 04143a6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

"head_object", service_error_code="EndpointConnectionError"
)
with pytest.raises(botocore.exceptions.ClientError):
cache3.get_header(model_id="pytorch-ic-imagenet-inception-v3-classification-4")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: we may want to create a class that extends unittest.caseCase and break down these tests into several class methods.

@shreyapandit shreyapandit changed the base branch from master to master-jumpstart November 17, 2021 23:04
@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 02b1a97
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 6604acc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 6604acc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: e94a87b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: b8e130c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: b8e130c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: e94a87b
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@@ -27,3 +27,5 @@
JUMPSTART_DEFAULT_REGION_NAME = boto3.session.Session().region_name

JUMPSTART_DEFAULT_MANIFEST_FILE_S3_KEY = "models_manifest.json"

PARSED_SAGEMAKER_VERSION = "" # this gets set by sagemaker.jumpstart.utils.get_sagemaker_version()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is a bit misleading: this isn't a constant but a state variable.

how about moving this to the jumpstart/utils.py and prefix it with an underline so that mypy picks it up as an invalid import if it's used elsewhere?

_sagemaker_version = ""

def get_sagemaker_version() -> str:
    if _sagemaker_version == "":
        _sagemaker_version = parse_sagemaker_version()
    return _sagemaker_version

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 3bde038
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 85c65d5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 917ea82
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 917ea82
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 8f8b1aa
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@@ -44,6 +44,7 @@ def read_version():
"packaging>=20.0",
"pandas",
"pathos",
"semantic-version",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add extra package? Whats the use of this?

Copy link
Member

@mufaddal-rohawala mufaddal-rohawala Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evakravi There are reservations to add new dependencies in SDK due to some inherrant constraints. Is there a possibility to use some other alternatives currently existing in SDK for versioning? Can we use packaging.version here.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 659c90b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 659c90b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 659c90b
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 659c90b
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 659c90b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@shreyapandit shreyapandit changed the title feat: client cache for jumpstart models feature: client cache for jumpstart models Dec 7, 2021
@shreyapandit
Copy link
Contributor

Merging this with a note:

We would like to use the packaging module for doing versioning checks since that is a required dependency that is already installed, and it makes sense to reuse that instead of installing a separate package. It offers similar APIs to the ones you use for the versioning comparisons. Please address that in the next PR you make to the master-jumpstart branch, thanks!

@shreyapandit shreyapandit merged commit b09793a into aws:master-jumpstart Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants