feature: support training inputs from EFS and FSx #991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Aug 23, 2019
108 changes: 108 additions & 0 deletions doc/overview.rst
@@ -299,6 +299,114 @@ Here are some examples of creating estimators with Git support:
Git support can be used not only for training jobs, but also for hosting models. The usage is the same as the above,
and ``git_config`` should be provided when creating model objects, e.g. ``TensorFlowModel``, ``MXNetModel``, ``PyTorchModel``.

Use File Systems as Training Inputs
-----------------------------------
Amazon SageMaker supports using Amazon Elastic File System (EFS) and FSx for Lustre as data sources for training.
To use these data sources, create a file system (EFS/FSx) and mount it on an Amazon EC2 instance.
For more information about setting up EFS and FSx, see the following documentation:

- `Using File Systems in Amazon EFS <https://docs.aws.amazon.com/efs/latest/ug/using-fs.html>`__
- `Getting Started with Amazon FSx for Lustre <https://aws.amazon.com/fsx/lustre/getting-started/>`__

Training with these data sources uses either the ``FileSystemInput`` or ``FileSystemRecordSet`` class, each of which
encapsulates all of the arguments the service requires to use EFS or FSx for Lustre.
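
The examples below omit their imports; assuming the module layout added in this PR, these would be:

.. code:: python

from sagemaker.inputs import FileSystemInput
from sagemaker.amazon.amazon_estimator import FileSystemRecordSet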

Here are examples of how to use Amazon EFS as input for training:

.. code:: python

# This example shows how to use FileSystemInput class
# Configure an estimator with subnets and security groups from your VPC. The EFS volume must be in
# the same VPC as your Amazon EC2 instance
estimator = TensorFlow(entry_point='tensorflow_mnist/mnist.py',
role='SageMakerRole',
train_instance_count=1,
train_instance_type='ml.c4.xlarge',
subnets=['subnet-1', 'subnet-2'],
security_group_ids=['sg-1'])

file_system_input = FileSystemInput(file_system_id='fs-1',
file_system_type='EFS',
directory_path='tensorflow',
file_system_access_mode='ro')

# Start an Amazon SageMaker training job with EFS using the FileSystemInput class
estimator.fit(file_system_input)

.. code:: python

# This example shows how to use FileSystemRecordSet class
# Configure an estimator with subnets and security groups from your VPC. The EFS volume must be in
# the same VPC as your Amazon EC2 instance
kmeans = KMeans(role='SageMakerRole',
train_instance_count=1,
train_instance_type='ml.c4.xlarge',
k=10,
subnets=['subnet-1', 'subnet-2'],
security_group_ids=['sg-1'])

records = FileSystemRecordSet(file_system_id='fs-1',
file_system_type='EFS',
directory_path='kmeans',
num_records=784,
feature_dim=784)

# Start an Amazon SageMaker training job with EFS using the FileSystemRecordSet class
kmeans.fit(records)

Here are examples of how to use Amazon FSx for Lustre as input for training:

.. code:: python

# This example shows how to use FileSystemInput class
# Configure an estimator with subnets and security groups from your VPC. The FSx file system must be in
# the same VPC as your Amazon EC2 instance
estimator = TensorFlow(entry_point='tensorflow_mnist/mnist.py',
role='SageMakerRole',
train_instance_count=1,
train_instance_type='ml.c4.xlarge',
subnets=['subnet-1', 'subnet-2'],
security_group_ids=['sg-1'])


file_system_input = FileSystemInput(file_system_id='fs-2',
file_system_type='FSxLustre',
directory_path='tensorflow',
file_system_access_mode='ro')

# Start an Amazon SageMaker training job with FSx using the FileSystemInput class
estimator.fit(file_system_input)

.. code:: python

# This example shows how to use FileSystemRecordSet class
# Configure an estimator with subnets and security groups from your VPC. The FSx file system must be in
# the same VPC as your Amazon EC2 instance
kmeans = KMeans(role='SageMakerRole',
train_instance_count=1,
train_instance_type='ml.c4.xlarge',
k=10,
subnets=['subnet-1', 'subnet-2'],
security_group_ids=['sg-1'])

records = FileSystemRecordSet(file_system_id='fs-2',
file_system_type='FSxLustre',
directory_path='kmeans',
num_records=784,
feature_dim=784)

# Start an Amazon SageMaker training job with FSx using the FileSystemRecordSet class
kmeans.fit(records)

EFS and FSx data sources can also be used for hyperparameter tuning jobs; the usage is the same as for training.
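
For example, a minimal tuning sketch (the objective metric name and hyperparameter range here are illustrative
assumptions, reusing the ``estimator`` and ``file_system_input`` from the examples above):

.. code:: python

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(estimator=estimator,
                            objective_metric_name='validation:accuracy',
                            hyperparameter_ranges={'learning_rate': ContinuousParameter(0.01, 0.2)},
                            max_jobs=4,
                            max_parallel_jobs=2)

# File system channels are passed to tuning jobs exactly as they are to training jobs
tuner.fit(file_system_input)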

A few important notes:

- Local mode is not supported when using EFS or FSx as data sources

- Pipe mode is not supported when using EFS as a data source

Training Metrics
----------------
The SageMaker Python SDK allows you to specify a name and a regular expression for metrics you want to track for training.
5 changes: 3 additions & 2 deletions setup.py
@@ -1,4 +1,4 @@
# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
@@ -34,14 +34,15 @@ def read_version():

# Declare minimal set for installation
required_packages = [
"boto3>=1.9.169",
"boto3>=1.9.213",
"numpy>=1.9.0",
"protobuf>=3.1",
"scipy>=0.19.0",
"urllib3>=1.21, <1.25",
"protobuf3-to-dict>=0.1.5",
"docker-compose>=1.23.0",
"requests>=2.20.0, <2.21",
"fabric>=2.0",
]

# enum is introduced in Python 3.4. Installing enum back port
52 changes: 51 additions & 1 deletion src/sagemaker/amazon/amazon_estimator.py
@@ -1,4 +1,4 @@
# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
@@ -23,6 +23,7 @@
from sagemaker.amazon.hyperparameter import Hyperparameter as hp # noqa
from sagemaker.amazon.common import write_numpy_to_dense_tensor
from sagemaker.estimator import EstimatorBase, _TrainingJob
from sagemaker.inputs import FileSystemInput
from sagemaker.model import NEO_IMAGE_ACCOUNT
from sagemaker.session import s3_input
from sagemaker.utils import sagemaker_timestamp, get_ecr_image_uri_prefix
@@ -281,6 +282,55 @@ def records_s3_input(self):
return s3_input(self.s3_data, distribution="ShardedByS3Key", s3_data_type=self.s3_data_type)


class FileSystemRecordSet(object):
"""Amazon SageMaker channel configuration for a file system data source
for Amazon algorithms.
"""

def __init__(
self,
file_system_id,
file_system_type,
directory_path,
num_records,
feature_dim,
file_system_access_mode="ro",
channel="train",
):
"""Initialize a ``FileSystemRecordSet`` object.

Args:
file_system_id (str): An Amazon file system ID starting with 'fs-'.
file_system_type (str): The type of file system used for the input.
Valid values: 'EFS', 'FSxLustre'.
directory_path (str): Relative path to the root directory (mount point) in
the file system. Reference:
https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html and
https://docs.aws.amazon.com/efs/latest/ug/wt1-test.html
num_records (int): The number of records in the set.
feature_dim (int): The dimensionality of "values" arrays in the Record features,
and label (if each Record is labeled).
file_system_access_mode (str): Permissions for read and write.
Valid values: 'ro' or 'rw'. Defaults to 'ro'.
channel (str): The SageMaker Training Job channel this RecordSet should be bound to.
"""

self.file_system_input = FileSystemInput(
file_system_id, file_system_type, directory_path, file_system_access_mode
)
self.feature_dim = feature_dim
self.num_records = num_records
self.channel = channel

def __repr__(self):
"""Return an unambiguous representation of this RecordSet"""
return str((FileSystemRecordSet, self.__dict__))

def data_channel(self):
"""Return a dictionary to represent the training data in a channel for use with ``fit()``"""
return {self.channel: self.file_system_input}
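
A brief usage sketch (hypothetical values): ``data_channel()`` maps the channel name to the underlying
``FileSystemInput``, which is what the Amazon algorithm estimators hand to the training job:

records = FileSystemRecordSet(file_system_id='fs-1',
                              file_system_type='EFS',
                              directory_path='kmeans',
                              num_records=784,
                              feature_dim=784)
records.data_channel()  # {'train': <FileSystemInput object>}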


def _build_shards(num_shards, array):
"""
Args:
18 changes: 9 additions & 9 deletions src/sagemaker/estimator.py
@@ -308,21 +308,21 @@ def fit(self, inputs=None, wait=True, logs=True, job_name=None):
about the training data. This can be one of three types:

* (str) the S3 location where training data is saved.

* (dict[str, str] or dict[str, sagemaker.session.s3_input]) If using multiple
channels for training data, you can specify a dict mapping channel names to
strings or :func:`~sagemaker.session.s3_input` objects.

* (sagemaker.session.s3_input) - channel configuration for S3 data sources that can
provide additional information as well as the path to the training dataset.
See :func:`sagemaker.session.s3_input` for full details.
wait (bool): Whether the call should wait until the job completes
(default: True).
logs (bool): Whether to show the logs produced by the job. Only
meaningful when wait is True (default: True).
job_name (str): Training job name. If not specified, the estimator
generates a default job name, based on the training image name
and current timestamp.
* (sagemaker.session.FileSystemInput) - channel configuration for
a file system data source that can provide additional information as well as
the path to the training dataset.

wait (bool): Whether the call should wait until the job completes (default: True).
logs (bool): Whether to show the logs produced by the job.
Only meaningful when wait is True (default: True).
job_name (str): Training job name. If not specified, the estimator generates
a default job name, based on the training image name and current timestamp.
"""
self._prepare_for_training(job_name=job_name)
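
A sketch of the accepted input types (bucket and file system IDs are hypothetical):

estimator.fit('s3://my-bucket/train')  # single S3 prefix

estimator.fit({'train': 's3://my-bucket/train',
               'test': s3_input('s3://my-bucket/test')})  # dict of channels

estimator.fit(FileSystemInput(file_system_id='fs-1',
                              file_system_type='EFS',
                              directory_path='/train',
                              file_system_access_mode='ro'))  # file system data source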

146 changes: 146 additions & 0 deletions src/sagemaker/inputs.py
@@ -0,0 +1,146 @@
# Copyright 2017-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Amazon SageMaker channel configurations for S3 data sources and file system data sources"""
from __future__ import absolute_import, print_function

FILE_SYSTEM_TYPES = ["FSxLustre", "EFS"]
FILE_SYSTEM_ACCESS_MODES = ["ro", "rw"]


class s3_input(object):
"""Amazon SageMaker channel configurations for S3 data sources.

Attributes:
config (dict[str, dict]): A SageMaker ``DataSource`` referencing
a SageMaker ``S3DataSource``.
"""

def __init__(
self,
s3_data,
distribution="FullyReplicated",
compression=None,
content_type=None,
record_wrapping=None,
s3_data_type="S3Prefix",
input_mode=None,
attribute_names=None,
shuffle_config=None,
):
"""Create a definition for input data used by an SageMaker training job.
See AWS documentation on the ``CreateTrainingJob`` API for more details on the parameters.

Args:
s3_data (str): Defines the location of s3 data to train on.
distribution (str): Valid values: 'FullyReplicated', 'ShardedByS3Key'
(default: 'FullyReplicated').
compression (str): Valid values: 'Gzip', None (default: None). This is used only in
Pipe input mode.
content_type (str): MIME type of the input data (default: None).
record_wrapping (str): Valid values: 'RecordIO' (default: None).
s3_data_type (str): Valid values: 'S3Prefix', 'ManifestFile', 'AugmentedManifestFile'.
If 'S3Prefix', ``s3_data`` defines a prefix of s3 objects to train on.
All objects with s3 keys beginning with ``s3_data`` will be used to train.
If 'ManifestFile' or 'AugmentedManifestFile', then ``s3_data`` defines a
single S3 manifest file or augmented manifest file (respectively),
listing the S3 data to train on. Both the ManifestFile and
AugmentedManifestFile formats are described in the SageMaker API documentation:
https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
input_mode (str): Optional override for this channel's input mode (default: None).
By default, channels will use the input mode defined on
``sagemaker.estimator.EstimatorBase.input_mode``, but they will ignore
that setting if this parameter is set.

* None - Amazon SageMaker will use the input mode specified in the ``Estimator``
* 'File' - Amazon SageMaker copies the training dataset from the S3 location to
a local directory.
* 'Pipe' - Amazon SageMaker streams data directly from S3 to the container via
a Unix-named pipe.

attribute_names (list[str]): A list of one or more attribute names to use that are
found in a specified AugmentedManifestFile.
shuffle_config (ShuffleConfig): If specified this configuration enables shuffling on
this channel. See the SageMaker API documentation for more info:
https://docs.aws.amazon.com/sagemaker/latest/dg/API_ShuffleConfig.html
"""

self.config = {
"DataSource": {
"S3DataSource": {
"S3DataDistributionType": distribution,
"S3DataType": s3_data_type,
"S3Uri": s3_data,
}
}
}

if compression is not None:
Contributor:
[Minor] Can these be written as one-liners with "or"?

Contributor Author:
not exactly sure what you're suggesting here. are you thinking self.config["CompressionType"] = compression or None?

Contributor:
That's what I was thinking. Am I missing something that would make that incorrect?

Contributor Author (@laurenyu, Aug 23, 2019):
boto3 may complain about parameters being set to None. The error will be something like "incorrect parameter type <NoneType>, should be one of: <str>"

self.config["CompressionType"] = compression
if content_type is not None:
self.config["ContentType"] = content_type
if record_wrapping is not None:
self.config["RecordWrapperType"] = record_wrapping
if input_mode is not None:
self.config["InputMode"] = input_mode
if attribute_names is not None:
self.config["DataSource"]["S3DataSource"]["AttributeNames"] = attribute_names
if shuffle_config is not None:
self.config["ShuffleConfig"] = {"Seed": shuffle_config.seed}


class FileSystemInput(object):
"""Amazon SageMaker channel configurations for file system data sources.

Attributes:
config (dict[str, dict]): A SageMaker File System ``DataSource``.
"""

def __init__(
self, file_system_id, file_system_type, directory_path, file_system_access_mode="ro"
):
"""Create a new file system input used by an SageMaker training job.

Args:
file_system_id (str): An Amazon file system ID starting with 'fs-'.
file_system_type (str): The type of file system used for the input.
Valid values: 'EFS', 'FSxLustre'.
directory_path (str): Relative path to the root directory (mount point) in
the file system.
Reference: https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html and
https://docs.aws.amazon.com/fsx/latest/LustreGuide/mount-fs-auto-mount-onreboot.html
file_system_access_mode (str): Permissions for read and write.
Valid values: 'ro' or 'rw'. Defaults to 'ro'.
"""

if file_system_type not in FILE_SYSTEM_TYPES:
raise ValueError(
"Unrecognized file system type: %s. Valid values: %s."
% (file_system_type, ", ".join(FILE_SYSTEM_TYPES))
)

if file_system_access_mode not in FILE_SYSTEM_ACCESS_MODES:
raise ValueError(
"Unrecognized file system access mode: %s. Valid values: %s."
% (file_system_access_mode, ", ".join(FILE_SYSTEM_ACCESS_MODES))
)

self.config = {
"DataSource": {
"FileSystemDataSource": {
"FileSystemId": file_system_id,
"FileSystemType": file_system_type,
"DirectoryPath": directory_path,
"FileSystemAccessMode": file_system_access_mode,
}
}
}
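
A short usage sketch (hypothetical file system ID) of the request structure this class builds and of its
fail-fast validation:

fs_input = FileSystemInput(file_system_id='fs-1',
                           file_system_type='EFS',
                           directory_path='/mnist',
                           file_system_access_mode='ro')
fs_input.config
# {'DataSource': {'FileSystemDataSource': {'FileSystemId': 'fs-1',
#                                          'FileSystemType': 'EFS',
#                                          'DirectoryPath': '/mnist',
#                                          'FileSystemAccessMode': 'ro'}}}

FileSystemInput(file_system_id='fs-1',
                file_system_type='NFS',  # not a supported type
                directory_path='/mnist')  # raises ValueError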