avoid issuing more than one stop job requests #1791

Merged: 1 commit, Nov 25, 2020
@@ -46,15 +46,21 @@
"\n",
"* In your AWS console, go to Lambda Management Console,\n",
"* Create a new function by hitting Create Function,\n",
-"* Choose the language as Python 3.7 and put in the following sample code for stopping the training job if one of the Rule statuses is `\"IssuesFound\"`:\n",
+"* Choose the language as Python 3.7 (or higher) and put in the following sample code for stopping the training job if one of the Rule statuses is `\"IssuesFound\"`:\n",
"\n",
"```python\n",
"import json\n",
"import boto3\n",
"import logging\n",
"\n",
"logger = logging.getLogger()\n",
"logger.setLevel(logging.INFO)\n",
"\n",
"\n",
"def lambda_handler(event, context):\n",
" training_job_name = event.get(\"detail\").get(\"TrainingJobName\")\n",
" logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')\n",
"\n",
" eval_statuses = event.get(\"detail\").get(\"DebugRuleEvaluationStatuses\", None)\n",
"\n",
" if eval_statuses is None or len(eval_statuses) == 0:\n",
@@ -64,15 +70,25 @@
" 'body': json.dumps('Nothing to do')\n",
" }\n",
"\n",
" # should only attempt stopping jobs with InProgress status\n",
" training_job_status = event.get(\"detail\").get(\"TrainingJobStatus\", None)\n",
" if training_job_status != 'InProgress':\n",
" logging.debug(f\"Current Training job status({training_job_status}) is not 'InProgress'. Exiting\")\n",
" return {\n",
" 'statusCode': 200,\n",
" 'body': json.dumps('Nothing to do')\n",
" }\n",
"\n",
" client = boto3.client('sagemaker')\n",
"\n",
" for status in eval_statuses:\n",
" logging.info(status.get(\"RuleEvaluationStatus\") + ', RuleEvaluationStatus=' + str(status))\n",
" if status.get(\"RuleEvaluationStatus\") == \"IssuesFound\":\n",
" secondary_status = event.get(\"detail\").get(\"SecondaryStatus\", None)\n",
" logging.info(\n",
-" 'Evaluation of rule configuration {} resulted in \"IssuesFound\". '\n",
-" 'Attempting to stop training job {}'.format(\n",
-" status.get(\"RuleConfigurationName\"), training_job_name\n",
-" )\n",
+" f'About to stop training job, since evaluation of rule configuration {status.get(\"RuleConfigurationName\")} resulted in \"IssuesFound\". ' +\n",
+" f'\\ntraining job \"{training_job_name}\" status is \"{training_job_status}\", secondary status is \"{secondary_status}\"' +\n",
+" f'\\nAttempting to stop training job \"{training_job_name}\"'\n",
" )\n",
" try:\n",
" client.stop_training_job(\n",
@@ -90,6 +106,7 @@
"```\n",
"* Create a new execution role for the Lambda, and\n",
"* In your IAM console, search for the role and attach \"AmazonSageMakerFullAccess\" policy to the role. This is needed for the code in your Lambda function to stop the training job.\n",
"* Basic settings > set Timeout to 30 seconds instead of 3 seconds. \n",
"\n",
"#### Create a CloudWatch Rule\n",
"\n",
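The status guard this change adds to the Lambda handler can be read as one small decision: send a stop request only when the job is still `InProgress` and at least one rule reports `"IssuesFound"`. Below is a minimal sketch of that decision as a pure function, assuming the event fields read in the diff (`detail`, `TrainingJobStatus`, `DebugRuleEvaluationStatuses`, `RuleEvaluationStatus`). The helper name `should_stop_training_job` and the sample events are hypothetical and are not part of the PR.

```python
def should_stop_training_job(event):
    """Decide whether a training job event warrants a stop request.

    Hypothetical helper, not part of the PR; it only mirrors the checks
    shown in the diff so they can be exercised without AWS access.
    """
    detail = event.get("detail") or {}

    # A job that has already left InProgress (for example Stopping or
    # Stopped) does not need another stop request.
    if detail.get("TrainingJobStatus") != "InProgress":
        return False

    # A missing or empty status list means there is nothing to act on.
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    return any(
        status.get("RuleEvaluationStatus") == "IssuesFound" for status in statuses
    )


# Made-up event payloads, for illustration only.
in_progress_event = {
    "detail": {
        "TrainingJobName": "example-job",
        "TrainingJobStatus": "InProgress",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "example-rule", "RuleEvaluationStatus": "IssuesFound"}
        ],
    }
}
stopping_event = {
    "detail": {
        "TrainingJobName": "example-job",
        "TrainingJobStatus": "Stopping",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "example-rule", "RuleEvaluationStatus": "IssuesFound"}
        ],
    }
}

print(should_stop_training_job(in_progress_event))  # True
print(should_stop_training_job(stopping_event))     # False
```

Keeping the check separate from the `boto3` call is only one possible arrangement. It makes the "at most one stop request" behaviour easy to test locally, since follow-up events that arrive after the job has moved to `Stopping` fall through without touching the SageMaker client.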