Commit 3583adb

Merge branch 'master' into airflow_model_export
2 parents 51cd999 + 6412991 commit 3583adb

25 files changed: +647 -112 lines changed

CHANGELOG.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,8 @@ CHANGELOG
1212
* doc-fix: Add estimator base classes to API docs
1313
* feature: HyperparameterTuner: add support for Automatic Model Tuning's Warm Start Jobs
1414
* feature: HyperparameterTuner: Make input channels optional
15+
* feature: Add support for Chainer 5.0
16+
* feature: Estimator: add support for MetricDefinitions
1517

1618
1.14.2
1719
======
@@ -24,6 +26,7 @@ CHANGELOG
2426
* build: added pylint
2527
* build: upgrade docker-compose to 1.23
2628
* enhancement: Frameworks: update warning for not setting framework_version as we aren't planning a breaking change anymore
29+
* feature: Estimator: add script mode and Python 3 support for TensorFlow
2730
* enhancement: Session: remove hardcoded 'training' from job status error message
2831
* bug-fix: Updated Cloudwatch namespace for metrics in TrainingJobsAnalytics
2932
* bug-fix: Changes to use correct s3 bucket and time range for dataframes in TrainingJobAnalytics.

README.rst

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -170,6 +170,25 @@ Here is an end to end example of how to use a SageMaker Estimator:
170170
# Tears down the SageMaker endpoint
171171
mxnet_estimator.delete_endpoint()
172172
173+
Training Metrics
174+
~~~~~~~~~~~~~~~~
175+
The SageMaker Python SDK allows you to specify a name and a regular expression for metrics you want to track for training.
176+
The regular expression (regex) is matched against the training algorithm's logs, and the value it captures is recorded as the metric.
177+
Here is an example of how to define metrics:
178+
179+
.. code:: python
180+
181+
# Configure a BYO Estimator with metric definitions (no training happens yet)
182+
byo_estimator = Estimator(image_name=image_name,
183+
role='SageMakerRole', train_instance_count=1,
184+
train_instance_type='ml.c4.xlarge',
185+
sagemaker_session=sagemaker_session,
186+
metric_definitions=[{'Name': 'test:msd', 'Regex': r'#quality_metric: host=\S+, test msd <loss>=(\S+)'},
187+
{'Name': 'test:ssd', 'Regex': r'#quality_metric: host=\S+, test ssd <loss>=(\S+)'}])
188+
189+
All Amazon SageMaker algorithms come with built-in support for metrics.
190+
You can go to `the AWS documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html>`__ for more details about built-in metrics of each Amazon SageMaker algorithm.
191+
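As a minimal sketch of what a metric definition does, the snippet below applies the two regexes from the example above to a made-up log line using Python's standard ``re`` module. The log line and the ``extract_metrics`` helper are assumptions for illustration only; they are not part of the SageMaker Python SDK, and the actual extraction from the training logs is done by the service.

```python
import re

# The same metric definitions as in the example above, written as raw strings.
metric_definitions = [
    {'Name': 'test:msd', 'Regex': r'#quality_metric: host=\S+, test msd <loss>=(\S+)'},
    {'Name': 'test:ssd', 'Regex': r'#quality_metric: host=\S+, test ssd <loss>=(\S+)'},
]

# Hypothetical log line; real algorithm output may look different.
sample_log_line = '#quality_metric: host=algo-1, test msd <loss>=0.0123'


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            # The first capture group holds the metric value.
            found[definition['Name']] = float(match.group(1))
    return found


metrics = extract_metrics(sample_log_line, metric_definitions)
print(metrics)
```

Trying a regex against a few representative log lines this way can catch a pattern that never matches before a training job is started.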
173192
Local Mode
174193
~~~~~~~~~~
175194

@@ -358,7 +377,7 @@ Chainer SageMaker Estimators
358377
359378
By using Chainer SageMaker ``Estimators``, you can train and host Chainer models on Amazon SageMaker.
360379
361-
Supported versions of Chainer: ``4.0.0``, ``4.1.0``.
380+
Supported versions of Chainer: ``4.0.0``, ``4.1.0``, ``5.0.0``.
362381
363382
We recommend that you use the latest supported version, because that's where we focus most of our development efforts.
364383

src/sagemaker/chainer/README.rst

Lines changed: 33 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ Chainer SageMaker Estimators and Models
44

55
With Chainer Estimators, you can train and host Chainer models on Amazon SageMaker.
66

7-
Supported versions of Chainer: ``4.0.0``, ``4.1.0``
7+
Supported versions of Chainer: ``4.0.0``, ``4.1.0``, ``5.0.0``
88

99
You can visit the Chainer repository at https://github.com/chainer/chainer.
1010

@@ -32,7 +32,7 @@ Suppose that you already have an Chainer training script called
3232
role='SageMakerRole',
3333
train_instance_type='ml.p3.2xlarge',
3434
train_instance_count=1,
35-
framework_version='4.1.0')
35+
framework_version='5.0.0')
3636
chainer_estimator.fit('s3://bucket/path/to/training/data')
3737
3838
Where the S3 URL is a path to your training data, within Amazon S3. The constructor keyword arguments define how
@@ -111,7 +111,7 @@ directories ('train' and 'test').
111111
chainer_estimator = Chainer('chainer-train.py',
112112
train_instance_type='ml.p3.2xlarge',
113113
train_instance_count=1,
114-
framework_version='4.1.0',
114+
framework_version='5.0.0',
115115
hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1})
116116
chainer_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
117117
'test': 's3://my-data-bucket/path/to/my/test/data'})
@@ -285,7 +285,7 @@ operation.
285285
chainer_estimator = Chainer(entry_point='train_and_deploy.py',
286286
train_instance_type='ml.p3.2xlarge',
287287
train_instance_count=1,
288-
framework_version='4.1.0')
288+
framework_version='5.0.0')
289289
chainer_estimator.fit('s3://my_bucket/my_training_data/')
290290
291291
# Deploy my estimator to a SageMaker Endpoint and get a Predictor
@@ -631,38 +631,38 @@ This Python version applies to both the Training Job, created by fit, and the En
631631

632632
The Chainer Docker images have the following dependencies installed:
633633

634-
+-----------------------------+-------------+
635-
| Dependencies                | chainer 4.0 |
636-
+-----------------------------+-------------+
637-
| chainer                     | 4.0.0       |
638-
+-----------------------------+-------------+
639-
| chainercv                   | 0.9.0       |
640-
+-----------------------------+-------------+
641-
| chainermn                   | 1.2.0       |
642-
+-----------------------------+-------------+
643-
| CUDA (GPU image only)       | 9.0         |
644-
+-----------------------------+-------------+
645-
| cupy                        | 4.0.0       |
646-
+-----------------------------+-------------+
647-
| matplotlib                  | 2.2.0       |
648-
+-----------------------------+-------------+
649-
| mpi4py                      | 3.0.0       |
650-
+-----------------------------+-------------+
651-
| numpy                       | 1.14.3      |
652-
+-----------------------------+-------------+
653-
| opencv-python               | 3.4.0.12    |
654-
+-----------------------------+-------------+
655-
| Pillow                      | 5.1.0       |
656-
+-----------------------------+-------------+
657-
| Python                      | 2.7 or 3.5  |
658-
+-----------------------------+-------------+
634+
+-----------------------------+-------------+-------------+-------------+
635+
| Dependencies                | chainer 4.0 | chainer 4.1 | chainer 5.0 |
636+
+-----------------------------+-------------+-------------+-------------+
637+
| chainer                     | 4.0.0       | 4.1.0       | 5.0.0       |
638+
+-----------------------------+-------------+-------------+-------------+
639+
| chainercv                   | 0.9.0       | 0.10.0      | 0.10.0      |
640+
+-----------------------------+-------------+-------------+-------------+
641+
| chainermn                   | 1.2.0       | 1.3.0       | N/A         |
642+
+-----------------------------+-------------+-------------+-------------+
643+
| CUDA (GPU image only)       | 9.0         | 9.0         | 9.0         |
644+
+-----------------------------+-------------+-------------+-------------+
645+
| cupy                        | 4.0.0       | 4.1.0       | 5.0.0       |
646+
+-----------------------------+-------------+-------------+-------------+
647+
| matplotlib                  | 2.2.0       | 2.2.0       | 2.2.0       |
648+
+-----------------------------+-------------+-------------+-------------+
649+
| mpi4py                      | 3.0.0       | 3.0.0       | 3.0.0       |
650+
+-----------------------------+-------------+-------------+-------------+
651+
| numpy                       | 1.14.3      | 1.15.3      | 1.15.4      |
652+
+-----------------------------+-------------+-------------+-------------+
653+
| opencv-python               | 3.4.0.12    | 3.4.0.12    | 3.4.0.12    |
654+
+-----------------------------+-------------+-------------+-------------+
655+
| Pillow                      | 5.1.0       | 5.3.0       | 5.3.0       |
656+
+-----------------------------+-------------+-------------+-------------+
657+
| Python                      | 2.7 or 3.5  | 2.7 or 3.5  | 2.7 or 3.5  |
658+
+-----------------------------+-------------+-------------+-------------+
659659

660660
The Docker images extend Ubuntu 16.04.
661661

662-
You can select version of Chainer by passing a framework_version keyword arg to the Chainer Estimator constructor.
663-
Currently supported versions are listed in the above table. You can also set framework_version to only specify major and
664-
minor version, which will cause your training script to be run on the latest supported patch version of that minor
665-
version.
662+
You must select a version of Chainer by passing a ``framework_version`` keyword argument to the Chainer Estimator
663+
constructor. Currently supported versions are listed in the table above. You can also set ``framework_version`` to
664+
specify only the major and minor version, which causes your training script to run on the latest supported patch
665+
version of that minor version.
666666

667667
Alternatively, you can build your own image by following the instructions in the SageMaker Chainer containers
668668
repository, and passing ``image_name`` to the Chainer Estimator constructor.

src/sagemaker/chainer/estimator.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,8 @@ class Chainer(Framework):
3535
_process_slots_per_host = "sagemaker_process_slots_per_host"
3636
_additional_mpi_options = "sagemaker_additional_mpi_options"
3737

38+
LATEST_VERSION = '5.0.0'
39+
3840
def __init__(self, entry_point, use_mpi=None, num_processes=None, process_slots_per_host=None,
3941
additional_mpi_options=None, source_dir=None, hyperparameters=None, py_version='py3',
4042
framework_version=None, image_name=None, **kwargs):
@@ -82,7 +84,7 @@ def __init__(self, entry_point, use_mpi=None, num_processes=None, process_slots_
8284
**kwargs: Additional kwargs passed to the :class:`~sagemaker.estimator.Framework` constructor.
8385
"""
8486
if framework_version is None:
85-
logger.warning(empty_framework_version_warning(CHAINER_VERSION, CHAINER_VERSION))
87+
logger.warning(empty_framework_version_warning(CHAINER_VERSION, self.LATEST_VERSION))
8688
self.framework_version = framework_version or CHAINER_VERSION
8789

8890
super(Chainer, self).__init__(entry_point, source_dir, hyperparameters,

src/sagemaker/estimator.py

Lines changed: 18 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,8 @@ class EstimatorBase(with_metaclass(ABCMeta, object)):
5050
def __init__(self, role, train_instance_count, train_instance_type,
5151
train_volume_size=30, train_volume_kms_key=None, train_max_run=24 * 60 * 60, input_mode='File',
5252
output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None,
53-
subnets=None, security_group_ids=None, model_uri=None, model_channel_name='model'):
53+
subnets=None, security_group_ids=None, model_uri=None, model_channel_name='model',
54+
metric_definitions=None):
5455
"""Initialize an ``EstimatorBase`` instance.
5556
5657
Args:
@@ -97,6 +98,10 @@ def __init__(self, role, train_instance_count, train_instance_type,
9798
9899
More information: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-deserialization
99100
model_channel_name (str): Name of the channel where 'model_uri' will be downloaded (default: 'model').
101+
metric_definitions (list[dict]): A list of dictionaries that defines the metric(s) used to evaluate the
102+
training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for
103+
the regular expression used to extract the metric from the logs. This should be defined only
104+
for jobs that don't use an Amazon algorithm.
100105
"""
101106
self.role = role
102107
self.train_instance_count = train_instance_count
@@ -106,6 +111,7 @@ def __init__(self, role, train_instance_count, train_instance_type,
106111
self.train_max_run = train_max_run
107112
self.input_mode = input_mode
108113
self.tags = tags
114+
self.metric_definitions = metric_definitions
109115
self.model_uri = model_uri
110116
self.model_channel_name = model_channel_name
111117

@@ -330,6 +336,9 @@ def _prepare_init_params_from_job_description(cls, job_details, model_channel_na
330336
init_params['hyperparameters'] = job_details['HyperParameters']
331337
init_params['image'] = job_details['AlgorithmSpecification']['TrainingImage']
332338

339+
if 'MetricDefinitions' in job_details['AlgorithmSpecification']:
340+
init_params['metric_definitions'] = job_details['AlgorithmSpecification']['MetricDefinitions']
341+
333342
subnets, security_group_ids = vpc_utils.from_dict(job_details.get(vpc_utils.VPC_CONFIG_KEY))
334343
if subnets:
335344
init_params['subnets'] = subnets
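The training request built in `session.py` stores the definitions under `AlgorithmSpecification['MetricDefinitions']`, so restoring them from a job description presumably has to read that same key. The sketch below illustrates the lookup on a hand-written dictionary; `metric_definitions_from_job` and the trimmed-down `job_details` are hypothetical and exist only for this example.

```python
def metric_definitions_from_job(job_details):
    """Return the metric definitions recorded on a described training job, or None."""
    algorithm_spec = job_details.get('AlgorithmSpecification', {})
    # Read the same key that the training request writes.
    return algorithm_spec.get('MetricDefinitions')


# Hypothetical, heavily trimmed job description used only for illustration.
job_details = {
    'AlgorithmSpecification': {
        'TrainingImage': 'example-image',
        'MetricDefinitions': [{'Name': 'test:msd', 'Regex': r'test msd <loss>=(\S+)'}],
    }
}

init_params = {}
definitions = metric_definitions_from_job(job_details)
if definitions is not None:
    init_params['metric_definitions'] = definitions

print(init_params)
```

Using one spelling for both the membership test and the lookup avoids a silent no-op (when the test never matches) or a `KeyError` (when the lookup key does not exist).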
@@ -447,7 +456,7 @@ def start_new(cls, estimator, inputs):
447456
job_name=estimator._current_job_name, output_config=config['output_config'],
448457
resource_config=config['resource_config'], vpc_config=config['vpc_config'],
449458
hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
450-
tags=estimator.tags)
459+
tags=estimator.tags, metric_definitions=estimator.metric_definitions)
451460

452461
return cls(estimator.sagemaker_session, estimator._current_job_name)
453462

@@ -472,7 +481,7 @@ def __init__(self, image_name, role, train_instance_count, train_instance_type,
472481
train_volume_size=30, train_volume_kms_key=None, train_max_run=24 * 60 * 60,
473482
input_mode='File', output_path=None, output_kms_key=None, base_job_name=None,
474483
sagemaker_session=None, hyperparameters=None, tags=None, subnets=None, security_group_ids=None,
475-
model_uri=None, model_channel_name='model'):
484+
model_uri=None, model_channel_name='model', metric_definitions=None):
476485
"""Initialize an ``Estimator`` instance.
477486
478487
Args:
@@ -523,14 +532,18 @@ def __init__(self, image_name, role, train_instance_count, train_instance_type,
523532
524533
More information: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-deserialization
525534
model_channel_name (str): Name of the channel where 'model_uri' will be downloaded (default: 'model').
535+
metric_definitions (list[dict]): A list of dictionaries that defines the metric(s) used to evaluate the
536+
training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for
537+
the regular expression used to extract the metric from the logs. This should be defined only
538+
for jobs that don't use an Amazon algorithm.
526539
"""
527540
self.image_name = image_name
528541
self.hyperparam_dict = hyperparameters.copy() if hyperparameters else {}
529542
super(Estimator, self).__init__(role, train_instance_count, train_instance_type,
530543
train_volume_size, train_volume_kms_key, train_max_run, input_mode,
531544
output_path, output_kms_key, base_job_name, sagemaker_session,
532545
tags, subnets, security_group_ids, model_uri=model_uri,
533-
model_channel_name=model_channel_name)
546+
model_channel_name=model_channel_name, metric_definitions=metric_definitions)
534547

535548
def train_image(self):
536549
"""
@@ -616,6 +629,7 @@ class Framework(EstimatorBase):
616629
"""
617630

618631
__framework_name__ = None
632+
LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
619633

620634
def __init__(self, entry_point, source_dir=None, hyperparameters=None, enable_cloudwatch_metrics=False,
621635
container_log_level=logging.INFO, code_location=None, image_name=None, **kwargs):

src/sagemaker/fw_utils.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,9 @@
3131
'If you would like to use version {latest}, ' \
3232
'please add framework_version={latest} to your constructor.'
3333

34+
EMPTY_FRAMEWORK_VERSION_ERROR = 'framework_version is required for script mode estimator. ' \
35+
'Please add framework_version={} to your constructor to avoid this error.'
36+
3437
VALID_PY_VERSIONS = ['py2', 'py3']
3538

3639
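The diff adds the `EMPTY_FRAMEWORK_VERSION_ERROR` template but does not show where it is raised. As a rough sketch of how such a template might be used, the hypothetical `require_framework_version` helper below formats it with a suggested version and raises a `ValueError`. The helper name, the exception type, and the version string `'1.11.0'` are assumptions, not taken from the SDK.

```python
# Template text copied from the diff above.
EMPTY_FRAMEWORK_VERSION_ERROR = (
    'framework_version is required for script mode estimator. '
    'Please add framework_version={} to your constructor to avoid this error.'
)


def require_framework_version(framework_version, suggested_version):
    """Hypothetical check: fail early when no framework_version was supplied."""
    if framework_version is None:
        raise ValueError(EMPTY_FRAMEWORK_VERSION_ERROR.format(suggested_version))
    return framework_version


try:
    require_framework_version(None, '1.11.0')  # '1.11.0' is an illustrative value
except ValueError as error:
    print(error)
```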

src/sagemaker/mxnet/estimator.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,6 @@ class MXNet(Framework):
3030
__framework_name__ = 'mxnet'
3131

3232
_LOWEST_SCRIPT_MODE_VERSION = ['1', '3']
33-
LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
3433
LATEST_VERSION = '1.3'
3534

3635
def __init__(self, entry_point, source_dir=None, hyperparameters=None, py_version='py2',

src/sagemaker/session.py

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -203,7 +203,7 @@ def default_bucket(self):
203203
return self._default_bucket
204204

205205
def train(self, image, input_mode, input_config, role, job_name, output_config,
206-
resource_config, vpc_config, hyperparameters, stop_condition, tags):
206+
resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions):
207207
"""Create an Amazon SageMaker training job.
208208
209209
Args:
@@ -243,6 +243,9 @@ def train(self, image, input_mode, input_config, role, job_name, output_config,
243243
service like ``MaxRuntimeInSeconds``.
244244
tags (list[dict]): List of tags for labeling a training job. For more, see
245245
https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
246+
metric_definitions (list[dict]): A list of dictionaries that defines the metric(s) used to evaluate the
247+
training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for
248+
the regular expression used to extract the metric from the logs.
246249
247250
Returns:
248251
str: ARN of the training job, if it is created.
@@ -263,6 +266,9 @@ def train(self, image, input_mode, input_config, role, job_name, output_config,
263266
if input_config is not None:
264267
train_request['InputDataConfig'] = input_config
265268

269+
if metric_definitions is not None:
270+
train_request['AlgorithmSpecification']['MetricDefinitions'] = metric_definitions
271+
266272
if hyperparameters and len(hyperparameters) > 0:
267273
train_request['HyperParameters'] = hyperparameters
268274
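The hunk above only attaches `MetricDefinitions` to the request when the caller supplied them. A minimal, self-contained sketch of that optional-field pattern follows; `build_algorithm_specification` is a hypothetical helper written for this example and is not a function in `session.py`.

```python
def build_algorithm_specification(image, input_mode, metric_definitions=None):
    """Build the AlgorithmSpecification part of a training request (illustrative only)."""
    spec = {'TrainingImage': image, 'TrainingInputMode': input_mode}
    # Leave the key out entirely when no metric definitions were given.
    if metric_definitions is not None:
        spec['MetricDefinitions'] = metric_definitions
    return spec


definitions = [{'Name': 'test:msd', 'Regex': r'test msd <loss>=(\S+)'}]
without_metrics = build_algorithm_specification('example-image', 'File')
with_metrics = build_algorithm_specification('example-image', 'File', definitions)

print(without_metrics)
print(with_metrics)
```

Omitting the key rather than sending `None` keeps requests for built-in algorithms, which define their own metrics, unchanged.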

@@ -306,7 +312,7 @@ def tune(self, job_name, strategy, objective_type, objective_metric_name,
306312
metric_definitions (list[dict]): A list of dictionaries that defines the metric(s) used to evaluate the
307313
training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for
308314
the regular expression used to extract the metric from the logs. This should be defined only for
309-
hyperparameter tuning jobs that don't use an Amazon algorithm.
315+
jobs that don't use an Amazon algorithm.
310316
role (str): An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs
311317
that create Amazon SageMaker endpoints use this role to access training data and model artifacts.
312318
You must grant sufficient permissions to this role.
