Commit aa22e25

avoid issuing more than one stop job requests (#1791)
- avoid issuing more than one stop job request (stop only if the job status is InProgress).
- fix logging to write to the Lambda logs.
- ask the user to configure a 30-second timeout instead of 3 seconds (the function was seen timing out).
1 parent de1412c commit aa22e25

1 file changed

+22
-5
lines changed

sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.ipynb

Lines changed: 22 additions & 5 deletions
@@ -46,15 +46,21 @@
 "\n",
 "* In your AWS console, go to Lambda Management Console,\n",
 "* Create a new function by hitting Create Function,\n",
-"* Choose the language as Python 3.7 and put in the following sample code for stopping the training job if one of the Rule statuses is `\"IssuesFound\"`:\n",
+"* Choose the language as Python 3.7 (or higher) and put in the following sample code for stopping the training job if one of the Rule statuses is `\"IssuesFound\"`:\n",
 "\n",
 "```python\n",
 "import json\n",
 "import boto3\n",
 "import logging\n",
 "\n",
+"logger = logging.getLogger()\n",
+"logger.setLevel(logging.INFO)\n",
+"\n",
+"\n",
 "def lambda_handler(event, context):\n",
 "    training_job_name = event.get(\"detail\").get(\"TrainingJobName\")\n",
+"    logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')\n",
+"\n",
 "    eval_statuses = event.get(\"detail\").get(\"DebugRuleEvaluationStatuses\", None)\n",
 "\n",
 "    if eval_statuses is None or len(eval_statuses) == 0:\n",
@@ -64,15 +70,25 @@
 "            'body': json.dumps('Nothing to do')\n",
 "        }\n",
 "\n",
+"    # should only attempt stopping jobs with InProgress status\n",
+"    training_job_status = event.get(\"detail\").get(\"TrainingJobStatus\", None)\n",
+"    if training_job_status != 'InProgress':\n",
+"        logging.debug(f\"Current Training job status({training_job_status}) is not 'InProgress'. Exiting\")\n",
+"        return {\n",
+"            'statusCode': 200,\n",
+"            'body': json.dumps('Nothing to do')\n",
+"        }\n",
+"\n",
 "    client = boto3.client('sagemaker')\n",
 "\n",
 "    for status in eval_statuses:\n",
+"        logging.info(status.get(\"RuleEvaluationStatus\") + ', RuleEvaluationStatus=' + str(status))\n",
 "        if status.get(\"RuleEvaluationStatus\") == \"IssuesFound\":\n",
+"            secondary_status = event.get(\"detail\").get(\"SecondaryStatus\", None)\n",
 "            logging.info(\n",
-"                'Evaluation of rule configuration {} resulted in \"IssuesFound\". '\n",
-"                'Attempting to stop training job {}'.format(\n",
-"                    status.get(\"RuleConfigurationName\"), training_job_name\n",
-"                )\n",
+"                f'About to stop training job, since evaluation of rule configuration {status.get(\"RuleConfigurationName\")} resulted in \"IssuesFound\". ' +\n",
+"                f'\\ntraining job \"{training_job_name}\" status is \"{training_job_status}\", secondary status is \"{secondary_status}\"' +\n",
+"                f'\\nAttempting to stop training job \"{training_job_name}\"'\n",
 "            )\n",
 "            try:\n",
 "                client.stop_training_job(\n",
@@ -90,6 +106,7 @@
 "```\n",
 "* Create a new execution role for the Lambda, and\n",
 "* In your IAM console, search for the role and attach \"AmazonSageMakerFullAccess\" policy to the role. This is needed for the code in your Lambda function to stop the training job.\n",
+"* Basic settings > set Timeout to 30 seconds instead of 3 seconds. \n",
 "\n",
 "#### Create a CloudWatch Rule\n",
 "\n",

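The guard this commit adds (stop the job only while its status is `InProgress`) can be sketched on its own, outside the notebook. This is a minimal sketch, not the notebook's code. The helper `rules_with_issues` and the extra `client` parameter are hypothetical additions for illustration and testing, and issuing at most one stop request per invocation is an assumption about the intended behavior. The event field names and the `stop_training_job` call are the ones used in the diff above.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def rules_with_issues(detail):
    """Hypothetical helper: names of rule configurations that reported "IssuesFound"."""
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    return [
        status.get("RuleConfigurationName")
        for status in statuses
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


def lambda_handler(event, context, client=None):
    # `client` is a hypothetical optional parameter so the logic can be
    # exercised with a stub; Lambda itself passes only (event, context).
    detail = event.get("detail") or {}
    training_job_name = detail.get("TrainingJobName")
    training_job_status = detail.get("TrainingJobStatus")
    failing_rules = rules_with_issues(detail)

    # Skip when no rule failed or the job is no longer running.
    if not failing_rules or training_job_status != "InProgress":
        logger.info(
            "Nothing to do for training job %s (status %s)",
            training_job_name,
            training_job_status,
        )
        return {"statusCode": 200, "body": json.dumps("Nothing to do")}

    if client is None:
        import boto3  # imported lazily so the guard logic runs without boto3

        client = boto3.client("sagemaker")

    logger.info(
        "Rules %s reported IssuesFound; stopping training job %s",
        failing_rules,
        training_job_name,
    )
    # Assumption: one stop request per invocation, however many rules failed.
    client.stop_training_job(TrainingJobName=training_job_name)
    return {
        "statusCode": 200,
        "body": json.dumps(f"Stop requested for {training_job_name}"),
    }
```

Passing a stub object with a `stop_training_job` method as `client` lets the guard be checked locally without AWS credentials. Events whose `TrainingJobStatus` is anything other than `InProgress` return "Nothing to do" without creating a SageMaker client.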