
Commit b33d1a2

laurenyuknakad authored and committed
feature: support training inputs from EFS and FSx (#991)
Amazon SageMaker now supports Amazon EFS and Amazon FSx for Lustre file systems as data sources for training machine learning models.
1 parent ee53bd4 commit b33d1a2

19 files changed: +1657 −169 lines

doc/overview.rst

Lines changed: 108 additions & 0 deletions
@@ -299,6 +299,114 @@ Here are some examples of creating estimators with Git support:
 Git support can be used not only for training jobs, but also for hosting models. The usage is the same as the above,
 and ``git_config`` should be provided when creating model objects, e.g. ``TensorFlowModel``, ``MXNetModel``, ``PyTorchModel``.
+
+Use File Systems as Training Inputs
+-----------------------------------
+Amazon SageMaker supports using Amazon Elastic File System (EFS) and Amazon FSx for Lustre as data sources during training.
+If you want to use these data sources, create a file system (EFS/FSx) and mount it on an Amazon EC2 instance.
+For more information about setting up EFS and FSx, see the following documentation:
+
+- `Using File Systems in Amazon EFS <https://docs.aws.amazon.com/efs/latest/ug/using-fs.html>`__
+- `Getting Started with Amazon FSx for Lustre <https://aws.amazon.com/fsx/lustre/getting-started/>`__
+
+The general experience uses either the ``FileSystemInput`` or ``FileSystemRecordSet`` class, which encapsulates
+all of the arguments that the service requires to use EFS or FSx for Lustre.
+
+Here are examples of how to use Amazon EFS as input for training:
+
+.. code:: python
+
+    # This example shows how to use the FileSystemInput class.
+    # Configure an estimator with subnets and security groups from your VPC. The EFS volume must be in
+    # the same VPC as your Amazon EC2 instance.
+    estimator = TensorFlow(entry_point='tensorflow_mnist/mnist.py',
+                           role='SageMakerRole',
+                           train_instance_count=1,
+                           train_instance_type='ml.c4.xlarge',
+                           subnets=['subnet-1', 'subnet-2'],
+                           security_group_ids=['sg-1'])
+
+    file_system_input = FileSystemInput(file_system_id='fs-1',
+                                        file_system_type='EFS',
+                                        directory_path='tensorflow',
+                                        file_system_access_mode='ro')
+
+    # Start an Amazon SageMaker training job with EFS using the FileSystemInput class
+    estimator.fit(file_system_input)
+
+.. code:: python
+
+    # This example shows how to use the FileSystemRecordSet class.
+    # Configure an estimator with subnets and security groups from your VPC. The EFS volume must be in
+    # the same VPC as your Amazon EC2 instance.
+    kmeans = KMeans(role='SageMakerRole',
+                    train_instance_count=1,
+                    train_instance_type='ml.c4.xlarge',
+                    k=10,
+                    subnets=['subnet-1', 'subnet-2'],
+                    security_group_ids=['sg-1'])
+
+    records = FileSystemRecordSet(file_system_id='fs-1',
+                                  file_system_type='EFS',
+                                  directory_path='kmeans',
+                                  num_records=784,
+                                  feature_dim=784)
+
+    # Start an Amazon SageMaker training job with EFS using the FileSystemRecordSet class
+    kmeans.fit(records)
+
+Here are examples of how to use Amazon FSx for Lustre as input for training:
+
+.. code:: python
+
+    # This example shows how to use the FileSystemInput class.
+    # Configure an estimator with subnets and security groups from your VPC. The VPC must be the
+    # same one you chose for your Amazon EC2 instance.
+    estimator = TensorFlow(entry_point='tensorflow_mnist/mnist.py',
+                           role='SageMakerRole',
+                           train_instance_count=1,
+                           train_instance_type='ml.c4.xlarge',
+                           subnets=['subnet-1', 'subnet-2'],
+                           security_group_ids=['sg-1'])
+
+    file_system_input = FileSystemInput(file_system_id='fs-2',
+                                        file_system_type='FSxLustre',
+                                        directory_path='tensorflow',
+                                        file_system_access_mode='ro')
+
+    # Start an Amazon SageMaker training job with FSx using the FileSystemInput class
+    estimator.fit(file_system_input)
+
+.. code:: python
+
+    # This example shows how to use the FileSystemRecordSet class.
+    # Configure an estimator with subnets and security groups from your VPC. The VPC must be the
+    # same one you chose for your Amazon EC2 instance.
+    kmeans = KMeans(role='SageMakerRole',
+                    train_instance_count=1,
+                    train_instance_type='ml.c4.xlarge',
+                    k=10,
+                    subnets=['subnet-1', 'subnet-2'],
+                    security_group_ids=['sg-1'])
+
+    records = FileSystemRecordSet(file_system_id='fs-2',
+                                  file_system_type='FSxLustre',
+                                  directory_path='kmeans',
+                                  num_records=784,
+                                  feature_dim=784)
+
+    # Start an Amazon SageMaker training job with FSx using the FileSystemRecordSet class
+    kmeans.fit(records)
+
+Data sources from EFS and FSx can also be used for hyperparameter tuning jobs. The usage is the same as above.
+
+A few important notes:
+
+- Local mode is not supported when using EFS or FSx as a data source
+- Pipe mode is not supported when using EFS as a data source
+
 Training Metrics
 ----------------
 The SageMaker Python SDK allows you to specify a name and a regular expression for metrics you want to track for training.

setup.py

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"). You
 # may not use this file except in compliance with the License. A copy of
@@ -34,14 +34,15 @@ def read_version():

 # Declare minimal set for installation
 required_packages = [
-    "boto3>=1.9.169",
+    "boto3>=1.9.213",
     "numpy>=1.9.0",
     "protobuf>=3.1",
     "scipy>=0.19.0",
     "urllib3>=1.21, <1.25",
     "protobuf3-to-dict>=0.1.5",
     "docker-compose>=1.23.0",
     "requests>=2.20.0, <2.21",
+    "fabric>=2.0",
 ]

 # enum is introduced in Python 3.4. Installing enum back port
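The boto3 floor is raised here because older releases predate the file system data source API. As a side note, a minimum-version constraint like ``boto3>=1.9.213`` is a numeric, per-component comparison, not a string comparison. The sketch below is purely illustrative (real installers use full PEP 440 parsing) and only handles simple ``X.Y.Z`` release strings:

```python
# Minimal sketch: how a floor like "boto3>=1.9.213" compares release numbers.
# Illustrative only; pip/setuptools use full PEP 440 version parsing.

def meets_minimum(installed_version, minimum_version):
    """Compare dotted release strings numerically, e.g. '1.10.0' >= '1.9.213'."""
    as_tuple = lambda v: tuple(int(piece) for piece in v.split("."))
    return as_tuple(installed_version) >= as_tuple(minimum_version)

print(meets_minimum("1.9.213", "1.9.213"))  # True: exactly the floor
print(meets_minimum("1.9.169", "1.9.213"))  # False: the old floor no longer suffices
print(meets_minimum("1.10.0", "1.9.213"))   # True: numeric, not lexicographic
```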

src/sagemaker/amazon/amazon_estimator.py

Lines changed: 51 additions & 1 deletion
@@ -1,4 +1,4 @@
-# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License"). You
 # may not use this file except in compliance with the License. A copy of
@@ -23,6 +23,7 @@
 from sagemaker.amazon.hyperparameter import Hyperparameter as hp  # noqa
 from sagemaker.amazon.common import write_numpy_to_dense_tensor
 from sagemaker.estimator import EstimatorBase, _TrainingJob
+from sagemaker.inputs import FileSystemInput
 from sagemaker.model import NEO_IMAGE_ACCOUNT
 from sagemaker.session import s3_input
 from sagemaker.utils import sagemaker_timestamp, get_ecr_image_uri_prefix
@@ -281,6 +282,55 @@ def records_s3_input(self):
         return s3_input(self.s3_data, distribution="ShardedByS3Key", s3_data_type=self.s3_data_type)


+class FileSystemRecordSet(object):
+    """Amazon SageMaker channel configuration for a file system data source
+    for Amazon algorithms.
+    """
+
+    def __init__(
+        self,
+        file_system_id,
+        file_system_type,
+        directory_path,
+        num_records,
+        feature_dim,
+        file_system_access_mode="ro",
+        channel="train",
+    ):
+        """Initialize a ``FileSystemRecordSet`` object.
+
+        Args:
+            file_system_id (str): An Amazon file system ID starting with 'fs-'.
+            file_system_type (str): The type of file system used for the input.
+                Valid values: 'EFS', 'FSxLustre'.
+            directory_path (str): Relative path to the root directory (mount point) in
+                the file system. Reference:
+                https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html and
+                https://docs.aws.amazon.com/efs/latest/ug/wt1-test.html
+            num_records (int): The number of records in the set.
+            feature_dim (int): The dimensionality of "values" arrays in the Record features,
+                and label (if each Record is labeled).
+            file_system_access_mode (str): Permissions for read and write.
+                Valid values: 'ro' or 'rw'. Defaults to 'ro'.
+            channel (str): The SageMaker Training Job channel this RecordSet should be bound to
+        """
+
+        self.file_system_input = FileSystemInput(
+            file_system_id, file_system_type, directory_path, file_system_access_mode
+        )
+        self.feature_dim = feature_dim
+        self.num_records = num_records
+        self.channel = channel
+
+    def __repr__(self):
+        """Return an unambiguous representation of this RecordSet"""
+        return str((FileSystemRecordSet, self.__dict__))
+
+    def data_channel(self):
+        """Return a dictionary to represent the training data in a channel for use with ``fit()``"""
+        return {self.channel: self.file_system_input}
+
+
 def _build_shards(num_shards, array):
     """
     Args:
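The essence of the new class is the one-line ``data_channel()`` mapping. Below is a runnable sketch of that binding; ``_StubFileSystemInput`` and ``_RecordSetSketch`` are stand-ins invented here so the snippet runs without the SDK installed, not SDK classes:

```python
# Runnable sketch (not SDK code) of the channel binding FileSystemRecordSet
# performs: wrap one file system input and key it by a channel name.

class _StubFileSystemInput(object):
    """Stand-in for sagemaker.inputs.FileSystemInput with the same config shape."""
    def __init__(self, file_system_id, file_system_type, directory_path):
        self.config = {
            "DataSource": {
                "FileSystemDataSource": {
                    "FileSystemId": file_system_id,
                    "FileSystemType": file_system_type,
                    "DirectoryPath": directory_path,
                    "FileSystemAccessMode": "ro",
                }
            }
        }


class _RecordSetSketch(object):
    """Mirrors the shape of FileSystemRecordSet: bind one input to one channel."""
    def __init__(self, file_system_input, num_records, feature_dim, channel="train"):
        self.file_system_input = file_system_input
        self.num_records = num_records
        self.feature_dim = feature_dim
        self.channel = channel

    def data_channel(self):
        # Same idea as the real method: a one-entry dict keyed by channel name.
        return {self.channel: self.file_system_input}


records = _RecordSetSketch(_StubFileSystemInput("fs-1", "EFS", "kmeans"), 784, 784)
print(list(records.data_channel()))  # ['train']
```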

src/sagemaker/estimator.py

Lines changed: 9 additions & 9 deletions
@@ -308,21 +308,21 @@ def fit(self, inputs=None, wait=True, logs=True, job_name=None):
             about the training data. This can be one of three types:

             * (str) the S3 location where training data is saved.
-
             * (dict[str, str] or dict[str, sagemaker.session.s3_input]) If using multiple
               channels for training data, you can specify a dict mapping channel names to
               strings or :func:`~sagemaker.session.s3_input` objects.
-
             * (sagemaker.session.s3_input) - channel configuration for S3 data sources that can
               provide additional information as well as the path to the training dataset.
               See :func:`sagemaker.session.s3_input` for full details.
-            wait (bool): Whether the call should wait until the job completes
-                (default: True).
-            logs (bool): Whether to show the logs produced by the job. Only
-                meaningful when wait is True (default: True).
-            job_name (str): Training job name. If not specified, the estimator
-                generates a default job name, based on the training image name
-                and current timestamp.
+            * (sagemaker.session.FileSystemInput) - channel configuration for
+              a file system data source that can provide additional information as well as
+              the path to the training dataset.
+
+            wait (bool): Whether the call should wait until the job completes (default: True).
+            logs (bool): Whether to show the logs produced by the job.
+                Only meaningful when wait is True (default: True).
+            job_name (str): Training job name. If not specified, the estimator generates
+                a default job name, based on the training image name and current timestamp.
         """
         self._prepare_for_training(job_name=job_name)
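The docstring above lists the shapes that ``inputs`` may take. As a purely illustrative sketch (not the SDK's actual dispatch logic, which lives in ``_TrainingJob``; the default channel name "training" is an assumption made here for illustration), the shapes reduce to a single channel-name-to-config mapping:

```python
# Illustrative sketch of the input shapes fit() accepts: an S3 URI string,
# a dict of channel name -> config, or a single channel-config object.
# NOT the SDK's real dispatch; the default channel name is assumed.

def normalize_inputs(inputs, default_channel="training"):
    """Map any accepted `inputs` shape to a {channel_name: config} dict."""
    if isinstance(inputs, str):           # bare S3 URI
        return {default_channel: inputs}
    if isinstance(inputs, dict):          # already channel -> config
        return dict(inputs)
    return {default_channel: inputs}      # s3_input / FileSystemInput-like object

print(normalize_inputs("s3://bucket/prefix"))               # {'training': 's3://bucket/prefix'}
print(sorted(normalize_inputs({"train": "s3://b/t", "test": "s3://b/v"})))  # ['test', 'train']
```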

src/sagemaker/inputs.py

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
+# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+"""Amazon SageMaker channel configurations for S3 data sources and file system data sources"""
+from __future__ import absolute_import, print_function
+
+FILE_SYSTEM_TYPES = ["FSxLustre", "EFS"]
+FILE_SYSTEM_ACCESS_MODES = ["ro", "rw"]
+
+
+class s3_input(object):
+    """Amazon SageMaker channel configurations for S3 data sources.
+
+    Attributes:
+        config (dict[str, dict]): A SageMaker ``DataSource`` referencing
+            a SageMaker ``S3DataSource``.
+    """
+
+    def __init__(
+        self,
+        s3_data,
+        distribution="FullyReplicated",
+        compression=None,
+        content_type=None,
+        record_wrapping=None,
+        s3_data_type="S3Prefix",
+        input_mode=None,
+        attribute_names=None,
+        shuffle_config=None,
+    ):
+        """Create a definition for input data used by a SageMaker training job.
+        See AWS documentation on the ``CreateTrainingJob`` API for more details on the parameters.
+
+        Args:
+            s3_data (str): Defines the location of S3 data to train on.
+            distribution (str): Valid values: 'FullyReplicated', 'ShardedByS3Key'
+                (default: 'FullyReplicated').
+            compression (str): Valid values: 'Gzip', None (default: None). This is used only in
+                Pipe input mode.
+            content_type (str): MIME type of the input data (default: None).
+            record_wrapping (str): Valid values: 'RecordIO' (default: None).
+            s3_data_type (str): Valid values: 'S3Prefix', 'ManifestFile', 'AugmentedManifestFile'.
+                If 'S3Prefix', ``s3_data`` defines a prefix of S3 objects to train on.
+                All objects with S3 keys beginning with ``s3_data`` will be used to train.
+                If 'ManifestFile' or 'AugmentedManifestFile', then ``s3_data`` defines a
+                single S3 manifest file or augmented manifest file (respectively),
+                listing the S3 data to train on. Both the ManifestFile and
+                AugmentedManifestFile formats are described in the SageMaker API documentation:
+                https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
+            input_mode (str): Optional override for this channel's input mode (default: None).
+                By default, channels will use the input mode defined on
+                ``sagemaker.estimator.EstimatorBase.input_mode``, but they will ignore
+                that setting if this parameter is set.
+
+                * None - Amazon SageMaker will use the input mode specified in the ``Estimator``
+                * 'File' - Amazon SageMaker copies the training dataset from the S3 location to
+                  a local directory.
+                * 'Pipe' - Amazon SageMaker streams data directly from S3 to the container via
+                  a Unix named pipe.
+
+            attribute_names (list[str]): A list of one or more attribute names to use that are
+                found in a specified AugmentedManifestFile.
+            shuffle_config (ShuffleConfig): If specified, this configuration enables shuffling on
+                this channel. See the SageMaker API documentation for more info:
+                https://docs.aws.amazon.com/sagemaker/latest/dg/API_ShuffleConfig.html
+        """
+
+        self.config = {
+            "DataSource": {
+                "S3DataSource": {
+                    "S3DataDistributionType": distribution,
+                    "S3DataType": s3_data_type,
+                    "S3Uri": s3_data,
+                }
+            }
+        }
+
+        if compression is not None:
+            self.config["CompressionType"] = compression
+        if content_type is not None:
+            self.config["ContentType"] = content_type
+        if record_wrapping is not None:
+            self.config["RecordWrapperType"] = record_wrapping
+        if input_mode is not None:
+            self.config["InputMode"] = input_mode
+        if attribute_names is not None:
+            self.config["DataSource"]["S3DataSource"]["AttributeNames"] = attribute_names
+        if shuffle_config is not None:
+            self.config["ShuffleConfig"] = {"Seed": shuffle_config.seed}
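The constructor above is essentially a builder for the ``DataSource`` dict sent to ``CreateTrainingJob``, with optional keys added only when set. A small standalone sketch of that shape (a hypothetical helper written for this note, not the SDK class, so it runs without sagemaker installed):

```python
# Runnable sketch of the DataSource dict that s3_input assembles for a plain
# S3Prefix channel. build_s3_channel_config is a hypothetical helper, not SDK API.

def build_s3_channel_config(s3_data, distribution="FullyReplicated",
                            s3_data_type="S3Prefix", content_type=None):
    config = {
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": distribution,
                "S3DataType": s3_data_type,
                "S3Uri": s3_data,
            }
        }
    }
    if content_type is not None:   # optional keys appear only when set
        config["ContentType"] = content_type
    return config

config = build_s3_channel_config("s3://my-bucket/train", content_type="text/csv")
print(config["DataSource"]["S3DataSource"]["S3Uri"])  # s3://my-bucket/train
```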
+
+
+class FileSystemInput(object):
+    """Amazon SageMaker channel configurations for file system data sources.
+
+    Attributes:
+        config (dict[str, dict]): A SageMaker file system ``DataSource``.
+    """
+
+    def __init__(
+        self, file_system_id, file_system_type, directory_path, file_system_access_mode="ro"
+    ):
+        """Create a new file system input used by a SageMaker training job.
+
+        Args:
+            file_system_id (str): An Amazon file system ID starting with 'fs-'.
+            file_system_type (str): The type of file system used for the input.
+                Valid values: 'EFS', 'FSxLustre'.
+            directory_path (str): Relative path to the root directory (mount point) in
+                the file system.
+                Reference: https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html and
+                https://docs.aws.amazon.com/fsx/latest/LustreGuide/mount-fs-auto-mount-onreboot.html
+            file_system_access_mode (str): Permissions for read and write.
+                Valid values: 'ro' or 'rw'. Defaults to 'ro'.
+        """
+
+        if file_system_type not in FILE_SYSTEM_TYPES:
+            raise ValueError(
+                "Unrecognized file system type: %s. Valid values: %s."
+                % (file_system_type, ", ".join(FILE_SYSTEM_TYPES))
+            )
+
+        if file_system_access_mode not in FILE_SYSTEM_ACCESS_MODES:
+            raise ValueError(
+                "Unrecognized file system access mode: %s. Valid values: %s."
+                % (file_system_access_mode, ", ".join(FILE_SYSTEM_ACCESS_MODES))
+            )
+
+        self.config = {
+            "DataSource": {
+                "FileSystemDataSource": {
+                    "FileSystemId": file_system_id,
+                    "FileSystemType": file_system_type,
+                    "DirectoryPath": directory_path,
+                    "FileSystemAccessMode": file_system_access_mode,
+                }
+            }
+        }
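``FileSystemInput`` validates its arguments before building any config, so bad file system types or access modes fail fast in the client rather than at job submission. A runnable sketch of those checks (a stand-in helper written for this note, not SDK code):

```python
# Runnable sketch of FileSystemInput's fail-fast validation: unknown file system
# types or access modes raise ValueError before any request config is built.
# validate_file_system_args is a stand-in helper, not part of the SDK.

FILE_SYSTEM_TYPES = ["FSxLustre", "EFS"]
FILE_SYSTEM_ACCESS_MODES = ["ro", "rw"]

def validate_file_system_args(file_system_type, file_system_access_mode="ro"):
    """Raise ValueError on bad input, mirroring the checks in FileSystemInput."""
    if file_system_type not in FILE_SYSTEM_TYPES:
        raise ValueError("Unrecognized file system type: %s" % file_system_type)
    if file_system_access_mode not in FILE_SYSTEM_ACCESS_MODES:
        raise ValueError("Unrecognized access mode: %s" % file_system_access_mode)

validate_file_system_args("EFS")              # accepted
validate_file_system_args("FSxLustre", "rw")  # accepted
try:
    validate_file_system_args("NFS")          # not a supported type
except ValueError as err:
    print("rejected:", err)
```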
