Skip to content

fix: disable profiler by default for regions not support it #2134

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 27, 2021

Conversation

guoren66
Copy link
Contributor

@guoren66 guoren66 commented Feb 11, 2021

In China regions (cn-north-1/cn-northwest-1), if sdk is updated from 1 to latest one and start a training job, an error occurrs:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: ProfilerConfig is currently not supported in cn-northwest-1.

Many customers have reported this error (No direct solution to it in the error log) and we have to help resolve it one by one, and this also brings potential customer's churn if we haven't covered them yet.

Description of changes:
add logic of judging whether region supports profiler, if not, set 'disable_profiler' to True.

Testing done:
Unit Testing
Integ Testing

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: eb52ef5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: b3846a9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: eb52ef5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: b3846a9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: eb52ef5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: b3846a9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: b3846a9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@@ -349,6 +349,9 @@ def __init__(
self.profiler_config = profiler_config
self.disable_profiler = disable_profiler

if not _region_supports_profiler(self.sagemaker_session.boto_region_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think a test exercising this branch is warranted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have been passed in Unit and integ test (mxnet using default profiler confit in test profiler) from my side. Do you think it's necessary to test the branch in environment of China regions?

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 7a91c3f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 7a91c3f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 7a91c3f
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 7a91c3f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@guoren66
Copy link
Contributor Author

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 7a91c3f
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@metrizable can you help check the failed test of './amazon-sagemaker-examples/sagemaker-python-sdk/mxnet_onnx_eia/mxnet_onnx_eia.ipynb' in sagemaker-python-sdk-notebook-tests? (container not passing health check), as this PR is not for deployment.
same issue is also found in this PR #2133. Maybe it's a general one.

I have tested 'amazon-sagemaker-examples/sagemaker-python-sdk/mxnet_onnx_eia/mxnet_onnx_eia.ipynb' on 2.24.5 in us-west-2, it's ok.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 3989034
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@ChoiByungWook ChoiByungWook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?

I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 3989034
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 3989034
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 3989034
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 3989034
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@guoren66
Copy link
Contributor Author

guoren66 commented Feb 25, 2021

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?

I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

Thanks for the concern.

I have received more than 10 cases from customers in China and the error log cannot directly show them how to handle this fail. I also worry about those who are new to SageMaker and we haven't touched directly in China and encounter this fail when start a very simple training job.(this may lead to user churn) It's better to change this parameter to 'True' for the region not supporting Profiler.

I also agree with your concern about those who set this to 'False' but not enabled. So I recommend that we can add a print log inside this 'if' logic, to tell the user that this region is not supporting Profiler and we automatically set this to ‘True’, to eliminate their confusion.

@ChoiByungWook
Copy link
Contributor

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?
I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

Thanks for the concern.

I have received more than 10 cases from customers in China and the error log cannot directly show them how to handle this fail. I also worry about those who are new to SageMaker and we haven't touched directly in China and encounter this fail when start a very simple training job.(this may lead to user churn) It's better to change this parameter to 'False' for the region not supporting Profiler.

I also agree with your concern about those who set this to 'True' but not enabled. So I recommend that we can add a print log inside this 'if' logic, to tell the user that this region is not supporting Profiler and we automatically set this to ‘False’, to eliminate their confusion.

Is profiler on by default for all regions or is it because users are attempting to use it in an unsupported region?

@guoren66
Copy link
Contributor Author

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?
I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

Thanks for the concern.
I have received more than 10 cases from customers in China and the error log cannot directly show them how to handle this fail. I also worry about those who are new to SageMaker and we haven't touched directly in China and encounter this fail when start a very simple training job.(this may lead to user churn) It's better to change this parameter to 'True' for the region not supporting Profiler.
I also agree with your concern about those who set this to 'False' but not enabled. So I recommend that we can add a print log inside this 'if' logic, to tell the user that this region is not supporting Profiler and we automatically set this to ‘True’, to eliminate their confusion.

Is profiler on by default for all regions or is it because users are attempting to use it in an unsupported region?

(I correct my above comment for the value of this parameter)
Yes. The root cause is profiler is on by default for all the regions. (the 'disable_profiler' is False by default). So this PR is for changing this parameter to 'True' when in unsupported region. I recommend I can add another print in this 'if' logic to tell user the profiler is off automatically even though you configure it to 'False' manually and explicitly.

@ChoiByungWook
Copy link
Contributor

ChoiByungWook commented Feb 26, 2021

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?
I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

Thanks for the concern.
I have received more than 10 cases from customers in China and the error log cannot directly show them how to handle this fail. I also worry about those who are new to SageMaker and we haven't touched directly in China and encounter this fail when start a very simple training job.(this may lead to user churn) It's better to change this parameter to 'True' for the region not supporting Profiler.
I also agree with your concern about those who set this to 'False' but not enabled. So I recommend that we can add a print log inside this 'if' logic, to tell the user that this region is not supporting Profiler and we automatically set this to ‘True’, to eliminate their confusion.

Is profiler on by default for all regions or is it because users are attempting to use it in an unsupported region?

(I correct my above comment for the value of this parameter)
Yes. The root cause is profiler is on by default for all the regions. (the 'disable_profiler' is False by default). So this PR is for changing this parameter to 'True' when in unsupported region. I recommend I can add another print in this 'if' logic to tell user the profiler is off automatically even though you configure it to 'False' manually and explicitly.

Well... since it is on by default, I don't think you can discern easily whether or not a user wants to enable this feature in an unsupported region in the current if statement. You would have to do it where the defaults happen.

At least how I understand it is, if we add print statement in here, everyone in an unsupported region will get this print region, even if they aren't aware of this feature.

If the feature doesn't exist in that region, then disabling it by default seems fine. I am wondering why we enabled this feature by default and didn't make it an opt-in feature at release.

@guoren66
Copy link
Contributor Author

Would it be better to throw an exception on the client side if the user attempts to use the profiler explicitly in an unsupported region?
I would rather the user's job request fail fast if they attempt to use a feature in an unsupported region, instead of defaulting everyone off and having the job succeed. The user may get confused about why the profiler wasn't enabled in their job, even though they set the flag.

Thanks for the concern.
I have received more than 10 cases from customers in China and the error log cannot directly show them how to handle this fail. I also worry about those who are new to SageMaker and we haven't touched directly in China and encounter this fail when start a very simple training job.(this may lead to user churn) It's better to change this parameter to 'True' for the region not supporting Profiler.
I also agree with your concern about those who set this to 'False' but not enabled. So I recommend that we can add a print log inside this 'if' logic, to tell the user that this region is not supporting Profiler and we automatically set this to ‘True’, to eliminate their confusion.

Is profiler on by default for all regions or is it because users are attempting to use it in an unsupported region?

(I correct my above comment for the value of this parameter)
Yes. The root cause is profiler is on by default for all the regions. (the 'disable_profiler' is False by default). So this PR is for changing this parameter to 'True' when in unsupported region. I recommend I can add another print in this 'if' logic to tell user the profiler is off automatically even though you configure it to 'False' manually and explicitly.

Well... since it is on by default, I don't think you can discern easily whether or not a user wants to enable this feature in an unsupported region in the current if statement. You would have to do it where the defaults happen.

At least how I understand it is, if we add print statement in here, everyone in an unsupported region will get this print region, even if they aren't aware of this feature.

If the feature doesn't exist in that region, then disabling it by default seems fine. I am wondering why we enabled this feature by default and didn't make it an opt-in feature at release.

I totally agree with you about why we enable this feature by default. I don't know the specific reason maybe have some other considerations. So in this PR I propose the commits after the default setting for unsupported regions, instead of changing the default parameter from 'False' to 'None' in the init function.

Changing the default opt is an important issue I don't think we can do it in this PR.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 6e884b4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 6e884b4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 6e884b4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 6e884b4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ChoiByungWook ChoiByungWook merged commit 2904cd3 into aws:master Feb 27, 2021
@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 6e884b4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants