-
Notifications
You must be signed in to change notification settings - Fork 1.2k
doc: Support for generation of Jumpstart model table on build #2888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: Support for generation of Jumpstart model table on build #2888
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
doc/conf.py
Outdated
@@ -15,6 +15,10 @@ | |||
|
|||
import pkg_resources | |||
from datetime import datetime | |||
import sys | |||
import os | |||
sys.path.append(os.path.join(os.path.dirname(__file__), '/doc/')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sphinx
has issues importing modules from the file system. You have to explicitly define where it should look sphinx-doc/sphinx#2390
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: a fast follow, could you add unit tests for doc/jumpstart/utils.py
?
doc/jumpstart/utils.py
Outdated
def get_jumpstart_sdk_manifest(): | ||
s3_client = boto3.client("s3") | ||
studio_metadata = ( | ||
s3_client.get_object(Bucket=JUMPSTART_BUCKET, Key=SDK_MANIFEST_FILE)["Body"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a text-book case of:
- [PM/customer]: "it doesn't work"
- [engineer]: "it works in my local environment"
Taking a step back: boto3
relies on botocore
to make the http calls to the AWS service endpoints. botocore
automatically adds identification headers to these requests. To populate these headers, botocore
needs to access AWS credentials, generally from environment variables. Then the AWS services will use these headers to (1) authenticate and (2) authorize, ie check the IAM permissions the principals authenticated in (1) has against the permissions required by the request being made.
So here, you are implicitly making the assumption that (a) the execution environment of readthedocs
where this code will be executed has AWS credentials, and (b) that those AWS credentials have S3:GetObject
permission. If (a) is false, then the authentication will fail, most likely botocore
will raise an exception even before sending the request. If (b) is false, the authorization will fail, and you will get a 403 error.
Luckily, our distribution buckets are world-readable, and for S3
you can use the public URL of the object, instead of going through the service endpoint. You could use wget, but the Pythonic way of doing this is to rely on a fetching library, most likely urllib
. It would look something like:
from urllib import request
url = 'https://jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com/models_manifest.json'
with request.openurl(url) as f:
models_manifest = f.read().decode("utf-8")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, this makes plenty of sense. Unfortunately there's not a great way to test in the production environment but I should've caught this.
doc/jumpstart/utils.py
Outdated
|
||
def get_jumpstart_sdk_manifest(): | ||
s3_client = boto3.client("s3") | ||
studio_metadata = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well this is a fine misnomer, since this file is not the models metadata used in Studio.
Could you rename models_manifest
?
doc/jumpstart/utils.py
Outdated
@@ -0,0 +1,77 @@ | |||
from __future__ import absolute_import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Do we need the legal comment at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably
doc/jumpstart/utils.py
Outdated
|
||
f = open("jumpstart/jumpstart.rst", "w") | ||
|
||
f.write("==================================\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we write to the file each time, or can we instead build a list of strings, and do something like
lines_of_file = []
lines_of_file.append("==================================")
lines_of_file.append("JumpStart Available Model Table")
...
...
...
file_content = '\n'.join(lines_of_file)
f.write(file_content)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be necessary, the http calls to download the metadata are probably orders of magnitude longer to execute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is personal preference- I'm not positive what the pythonic best practice here would be
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking comments
import json | ||
from packaging.version import Version | ||
|
||
JUMPSTART_REGION = "eu-west-2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: if you have time, please add a comment as to why we use DUB
so that someone doesn't just change it back to us-west-2
. Also, could you rename JUMPSTART_METADATA_REGION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
doc/doc_utils/jumpstart_doc_utils.py
Outdated
|
||
JUMPSTART_REGION = "eu-west-2" | ||
SDK_MANIFEST_FILE = "models_manifest.json" | ||
JUMPSTART_BUCKET_BASE_URL = 'https://jumpstart-cache-prod-{}.s3.{}.amazonaws.com'.format(JUMPSTART_REGION, JUMPSTART_REGION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: f-string please
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Co-authored-by: Mufaddal Rohawala <[email protected]>
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Co-authored-by: Mufaddal Rohawala <[email protected]> Co-authored-by: Basil Beirouti <[email protected]> Co-authored-by: Payton Staub <[email protected]> Co-authored-by: Ahsan Khan <[email protected]> Co-authored-by: Shreya Pandit <[email protected]> Co-authored-by: Mufaddal Rohawala <[email protected]> Co-authored-by: Basil Beirouti <[email protected]> Co-authored-by: Payton Staub <[email protected]> Co-authored-by: Mohamed Ali Jamaoui <[email protected]> Co-authored-by: ci <ci> Co-authored-by: Jeniya Tabassum <[email protected]> Co-authored-by: sreedes <[email protected]> Co-authored-by: Navin Soni <[email protected]> Co-authored-by: Miyoung <[email protected]> Co-authored-by: Ameen Khan <[email protected]> Co-authored-by: Zhankui Lu <[email protected]> Co-authored-by: Xiaoguang Chen <[email protected]> Co-authored-by: Jonathan Guinegagne <[email protected]> Co-authored-by: Zhankui Lu <[email protected]> Co-authored-by: Yifei Zhu <[email protected]> Co-authored-by: Qingzi-Lan <[email protected]> Co-authored-by: Navin Soni <[email protected]> Co-authored-by: Xinghan Chen <[email protected]> Co-authored-by: Ivy Bazan <[email protected]> Co-authored-by: IvyBazan <[email protected]>
4fd04f1
to
e3398d9
Compare
…o feat/jumpstart-docs-model-table
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
96c01d7
to
c437191
Compare
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
if model["model_id"] not in sdk_manifest_top_versions_for_models: | ||
sdk_manifest_top_versions_for_models[model["model_id"]] = model | ||
else: | ||
if Version( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can use elif
sdk_manifest = get_jumpstart_sdk_manifest() | ||
sdk_manifest_top_versions_for_models = {} | ||
|
||
for model in sdk_manifest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future improvement (non-blocking):
Do we know how much time is added to read and update read the docs for all the jumpstart models (I heard no of models are 300+)?
If it takes a significant time, alternatively we could have checked for jumpstart manifest hash (md5 or something) and compared against one stored in sagemaker defalt bucket and only update if hash mismatches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen this take a max of 5 minutes as it stands. I agree we should definitely look to optimize this call, but we wanted to get P0
into production ASAP so customers could access the model table. We can discuss the implementation further at our next sync.
Issue #, if available:
Description of changes:
This PR augments the
sphinx
build process to add a script that generates an up-to-date table of available JumpStartmodel_id
s in documentation. This is done by adding a setup function todocs/conf.py
. This script will run each time the documentation is built and will pull the current models from JumpStarts S3 bucket (which is publicly readable).Testing done:
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
unique_name_from_base
to create resource names in integ tests (if appropriate)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.