Reinvent 2024 early #4946

Merged: 79 commits, Dec 4, 2024
e267244
Base model trainer (#1521)
benieric Sep 30, 2024
9774f9e
feature: support script mode with local train.sh (#1523)
benieric Oct 3, 2024
fba3285
Image Spec refactoring and updates (#1525)
nargokul Oct 3, 2024
6a0224f
Add unit tests for ModelTrainer (#1527)
benieric Oct 3, 2024
7446b09
Add example notebook (#1528)
benieric Oct 7, 2024
cb7af78
Add enviornment variable bootstrapping script (#1530)
benieric Oct 8, 2024
80a1b89
feature: add utility function to capture local snapshot (#1524)
pintaoz-aws Oct 8, 2024
93a3c6d
Support intelligent parameters (#1540)
pintaoz-aws Oct 15, 2024
4fe8738
Revert Image Spec (#1541)
nargokul Oct 15, 2024
72e4266
Cleanup ModelTrainer (#1542)
benieric Oct 15, 2024
89edb6d
General image builder (#1546)
pintaoz-aws Oct 18, 2024
b40a499
Latest Container Image (#1545)
nargokul Oct 21, 2024
c3f432c
Cleanup ModelTrainer code (#1552)
benieric Oct 24, 2024
2e17bcb
feat: add pre-processing and post-processing logic to inference_spec …
pravali96 Nov 1, 2024
21a11a9
Add Distributed Training Support Model Trainer (#1536)
benieric Nov 4, 2024
a406f64
Add path to set Additional Settings in ModelTrainer (#1555)
benieric Nov 5, 2024
8cc19a3
Mask Sensitive Env Logs in Container (#1568)
benieric Nov 7, 2024
a8ed4ec
Fix bug in script mode setup ModelTrainer (#1575)
benieric Nov 8, 2024
ce55d45
Feature: ModelBuilder supports HuggingFace Models with benchmark data…
xiongz945 Nov 9, 2024
1ad75c9
Simplify Config Class Names and DistributedRunner structures (#1573)
benieric Nov 11, 2024
2aad9cd
Remove ignored files
benieric Nov 11, 2024
fa6ae28
Add in_process mode support for DJL and TorchServe servers (#1570)
pravali96 Nov 11, 2024
24b0dc0
Pass hyperparameters as CLI args (#1577)
benieric Nov 12, 2024
053808f
Trainer handshake (#1535)
nargokul Nov 12, 2024
debcdc2
Add Support for Training Recipes (#1565)
benieric Nov 12, 2024
8cf9631
Support building image from Dockerfile (#1571)
pintaoz-aws Nov 12, 2024
51fb427
Use exact python path in trainer template (#1584)
benieric Nov 14, 2024
70ae24f
Unified Deployment interface in Model Builder (#1549)
nargokul Nov 14, 2024
6b90f89
Add recipes examples (#1582)
benieric Nov 14, 2024
67f535d
update notebooks (#1588)
benieric Nov 14, 2024
e4701e8
update notebooks (#1592)
benieric Nov 14, 2024
5a37fc5
Single container local training (#1556)
pintaoz-aws Nov 14, 2024
5dae384
Bug fixes (#1596)
pintaoz-aws Nov 15, 2024
b29da8f
Update ModelTrainer Notebooks (#1597)
benieric Nov 15, 2024
2718402
add inference morpheus nbs (#1594)
gwang111 Nov 15, 2024
758a311
Fix: move the functionality from latest_container_image to retrieve (…
chad119 Nov 15, 2024
7ef6b99
Add bugbash bootstrapping (#1598)
benieric Nov 15, 2024
2cc4495
Fix: remove the special condition and fix the unit test (#1601)
chad119 Nov 15, 2024
df990c0
Notebooks update for Bugbash (#1595)
nargokul Nov 15, 2024
1a3330a
Add Rich Logging to Model Builder (#1604)
nargokul Nov 18, 2024
611ea9a
Fix: codestyles (#1606)
benieric Nov 19, 2024
fa02963
add modelID support to model builder InProcess model (#1580)
pravali96 Nov 19, 2024
a9dd628
Update kandinsky in ModelTrainer and allow setting requirements (#1587)
benieric Nov 19, 2024
5036fb8
[Updated] Add telemetry to ModelTrainer, Estimator and ModelBuilder (…
zhaoqizqwang Nov 20, 2024
1e17a1e
Integration tests for Model Builder Handshake (#1610)
nargokul Nov 20, 2024
aa2e62d
Use sagemaker core Session (#1607)
benieric Nov 20, 2024
2438a3f
Skip JS model mapping with env vars or image URI provided (#1599)
xiongz945 Nov 20, 2024
0e74aff
pin xgboost dlc to 1.7.1 to fix test (#1616)
gwang111 Nov 21, 2024
6985ec5
Revert image builder (#1614)
pintaoz-aws Nov 21, 2024
bd8e42a
add integ test for base_model_builder_deploy and remove print stateme…
chad119 Nov 21, 2024
d544e39
Fix tests and codestyle (#1619)
zhaoqizqwang Nov 21, 2024
0a488fe
Fix: Correctly serialize SM_HPS env var (#1611)
benieric Nov 21, 2024
784a18f
Intelligent defaults for Model Trainer (#1586)
nargokul Nov 22, 2024
9d1a418
add in-process mode definition to docs (#1622)
pravali96 Nov 22, 2024
a3e328e
Update ModelTrainer Interface Parameters (#1617)
benieric Nov 22, 2024
b50330f
Model Trainer Bucket improvements (#1618)
nargokul Nov 23, 2024
37f079c
Add interface units for ModelTrainer (#1631)
benieric Nov 26, 2024
cbba5eb
Update hyperpod recipe uris (#1629)
benieric Nov 26, 2024
4f31237
Integ tests for local mode model trainer (#1623)
pintaoz-aws Nov 27, 2024
720faab
Morpheus tests (#1633)
nargokul Nov 27, 2024
98d1d23
remove example notebooks artifacts (#1634)
benieric Nov 27, 2024
96db5c7
feat: Partner App Auth Provider for SDK support (#1548)
edwardps Oct 22, 2024
cff8216
change: fix the file uploading signature verification error (#1551)
edwardps Oct 24, 2024
45b89cc
Feature: Support GPU training recipes with Sagemaker Python SDK (#1516)
schinmayee Sep 19, 2024
13e10c9
Feature: Support Neuron training recipes. (#1526)
schinmayee Oct 4, 2024
1f34950
Feature: Resolve recipes correctly before launching (#1529)
schinmayee Oct 8, 2024
e68f4f5
Feature: Add unit tests for recipes and minor bug fixes. (#1532)
schinmayee Oct 11, 2024
2cc2caf
Feature: Move image uris and git repos for training recipes to json (…
schinmayee Oct 21, 2024
9480ee0
Update MANIFEST.in so that wheel builds correctly (#1563)
schinmayee Nov 2, 2024
30dfdca
Remove default values for fields in recipe_overrides and fix recipe p…
schinmayee Nov 5, 2024
c0e3958
Change default source directory to current, add option to specify sou…
schinmayee Nov 15, 2024
ce2376f
Changes for SMP v2.7.0 (#1609)
adtian2 Nov 22, 2024
74d6b7c
Update URIs to public for training recipes (#1621)
schinmayee Nov 22, 2024
fdf2e9a
Neuron URIs update (#1626)
schinmayee Nov 25, 2024
3bce287
Usage docs for training recipes (#1630)
schinmayee Nov 26, 2024
bd4a6cc
Add model trainer documentation (#1639)
benieric Dec 4, 2024
9a5b32f
Enable the Recipe tests marked with @pytest.mark.skip(reason="Hyperpo…
nargokul Dec 4, 2024
659244b
Add graphne to the doc requirements
pintaoz-aws Dec 4, 2024
ad3538b
Add graphene to doc requirements
pintaoz-aws Dec 4, 2024
3 changes: 3 additions & 0 deletions .gitignore
@@ -32,6 +32,9 @@ env/
.python-version
*.html
**/_repack_script_launcher.sh
src/sagemaker/modules/train/container_drivers/sm_train.sh
src/sagemaker/modules/train/container_drivers/sourcecode.json
src/sagemaker/modules/train/container_drivers/distributed.json
tests/data/**/_repack_model.py
tests/data/experiment/sagemaker-dev-1.0.tar.gz
src/sagemaker/serve/tmp_workspace
1 change: 1 addition & 0 deletions .pydocstylerc
@@ -2,3 +2,4 @@
inherit = false
ignore = D104,D107,D202,D203,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
match = (?!record_pb2).*\.py
match-dir = (?!.*test).*
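The two negative-lookahead patterns above can be sanity-checked with Python's ``re`` module. A quick illustration (``fullmatch`` is used here to approximate how pydocstyle applies the patterns to file and directory names; the sample names are made up):

```python
import re

# match: check .py files, except the generated protobuf module
match_file = re.compile(r"(?!record_pb2).*\.py")
# match-dir: skip any directory whose name contains "test"
match_dir = re.compile(r"(?!.*test).*")

assert match_file.fullmatch("model_trainer.py") is not None
assert match_file.fullmatch("record_pb2.py") is None

assert match_dir.fullmatch("modules") is not None
assert match_dir.fullmatch("unit_tests") is None
```

So the new ``match-dir`` line excludes test directories from docstring checks while leaving source directories covered.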
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -1,8 +1,10 @@
recursive-include src/sagemaker *.py

include src/sagemaker/image_uri_config/*.json
include src/sagemaker/pytorch/training_recipes.json
include src/sagemaker/serve/schema/*.json
include src/sagemaker/serve/requirements.txt
include src/sagemaker/modules/train/sm_recipes/training_recipes.json
recursive-include requirements *

include VERSION
1 change: 1 addition & 0 deletions doc/api/training/index.rst
@@ -5,6 +5,7 @@ Training APIs
.. toctree::
:maxdepth: 4

model_trainer
algorithm
analytics
automl
17 changes: 17 additions & 0 deletions doc/api/training/model_trainer.rst
@@ -0,0 +1,17 @@
ModelTrainer
------------

.. autoclass:: sagemaker.modules.train.model_trainer.ModelTrainer
:members:

Configs
~~~~~~~

.. automodule:: sagemaker.modules.configs
:members:

Distributed
~~~~~~~~~~~

.. automodule:: sagemaker.modules.distributed
:members:
132 changes: 125 additions & 7 deletions doc/frameworks/pytorch/using_pytorch.rst
@@ -21,12 +21,9 @@ To train a PyTorch model by using the SageMaker Python SDK:
.. |create pytorch estimator| replace:: Create a ``sagemaker.pytorch.PyTorch`` Estimator
.. _create pytorch estimator: #create-an-estimator

.. |call fit| replace:: Call the estimator's ``fit`` method
.. _call fit: #call-the-fit-method

1. `Prepare a training script <#prepare-a-pytorch-training-script>`_
1. `Prepare a training script <#prepare-a-pytorch-training-script>`_ OR `Choose an Amazon SageMaker HyperPod recipe`_
2. |create pytorch estimator|_
3. |call fit|_
3. `Call the estimator's fit method or ModelTrainer's train method`_

Prepare a PyTorch Training Script
=================================
@@ -175,6 +172,16 @@ see `AWS Deep Learning Containers <https://github.com/aws/deep-learning-containe
- `Images for HuggingFace <https://github.com/aws/deep-learning-containers/tree/master/huggingface>`__


Choose an Amazon SageMaker HyperPod recipe
==========================================

Alternatively, instead of providing your own training script, you can choose an
`Amazon SageMaker HyperPod recipe <https://github.com/aws/sagemaker-hyperpod-recipes>`_ to launch training for a supported model.
With a recipe you only need to decide which recipe to run; you can also modify it as explained in the next section.



Create an Estimator
===================

@@ -196,10 +203,121 @@ directories ('train' and 'test').
'test': 's3://my-data-bucket/path/to/my/test/data'})


Amazon SageMaker HyperPod recipes
---------------------------------
Alternatively, if you are using Amazon SageMaker HyperPod recipes, follow these instructions:

Prerequisites: you need ``git`` installed on your client to access Amazon SageMaker HyperPod recipes code.

Call the fit Method
===================
When using a recipe, set the ``training_recipe`` arg in place of providing a training script.
This can be a recipe from the `sagemaker-hyperpod-recipes repository <https://github.com/aws/sagemaker-hyperpod-recipes>`_,
a local file, or a custom URL. Note that you must override the following using
``recipe_overrides``:

* directory paths in the recipe, so they point to the local container paths the Python SDK expects
* the output S3 URIs
* the Hugging Face access token
* any other recipe fields you wish to edit

The code snippet below shows an example.
Please refer to `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
for more details about the expected local paths in the container and the Amazon SageMaker
HyperPod recipes tutorial for more examples.
You can override the fields by either setting ``recipe_overrides`` or
providing a modified ``training_recipe`` through a local file or a custom url.
When using the recipe, any provided ``entry_point`` will be ignored.

SageMaker will automatically set up the distribution args.
It will also determine the image to use for your model and device type,
but you can override this with the ``image_uri`` arg.

You can also override the number of nodes in the recipe with the ``instance_count`` arg to the estimator.
``source_dir`` will default to current working directory unless specified.
A local copy of training scripts and recipe will be saved in the ``source_dir``.
You can specify any additional packages you want to install for training in an optional ``requirements.txt`` in the ``source_dir``.

Note: for Llama 3.2 multi-modal models, you need to upgrade the ``transformers`` library by providing a ``requirements.txt`` in the ``source_dir`` that pins ``transformers==4.45.2``.
Please refer to the Amazon SageMaker HyperPod recipes documentation for more details.


Here is an example usage for the recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.


.. code:: python

recipe_overrides = {
"run": {
"results_dir": "/opt/ml/model",
},
"exp_manager": {
"exp_dir": "",
"explicit_log_dir": "/opt/ml/output/tensorboard",
"checkpoint_dir": "/opt/ml/checkpoints",
},
"model": {
"data": {
"train_dir": "/opt/ml/input/data/train",
"val_dir": "/opt/ml/input/data/val",
},
},
}
pytorch_estimator = PyTorch(
output_path=output_path,
base_job_name="llama-recipe",
role=role,
instance_type="ml.p5.48xlarge",
training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
recipe_overrides=recipe_overrides,
sagemaker_session=sagemaker_session,
tensorboard_output_config=tensorboard_output_config,
)
pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
'test': 's3://my-data-bucket/path/to/my/test/data'})

# Or alternatively with ModelTrainer
recipe_overrides = {
"run": {
"results_dir": "/opt/ml/model",
},
"exp_manager": {
"exp_dir": "",
"explicit_log_dir": "/opt/ml/output/tensorboard",
"checkpoint_dir": "/opt/ml/checkpoints",
},
"model": {
"data": {
"train_dir": "/opt/ml/input/data/train",
"val_dir": "/opt/ml/input/data/val",
},
},
}

model_trainer = ModelTrainer.from_recipe(
output_path=output_path,
base_job_name="llama-recipe",
training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
recipe_overrides=recipe_overrides,
compute=Compute(instance_type="ml.p5.48xlarge"),
sagemaker_session=sagemaker_session
).with_tensorboard_output_config(
tensorboard_output_config=tensorboard_output_config
)

train_input = Input(
channel_name="train",
data_source="s3://my-data-bucket/path/to/my/training/data"
)

test_input = Input(
channel_name="test",
data_source="s3://my-data-bucket/path/to/my/test/data"
)

model_trainer.train(input_data_config=[train_input, test_input])


Call the estimator's fit method or ModelTrainer's train method
==============================================================

You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
arguments.
45 changes: 43 additions & 2 deletions doc/overview.rst
@@ -4,6 +4,7 @@ Using the SageMaker Python SDK

SageMaker Python SDK provides several high-level abstractions for working with Amazon SageMaker. These are:

- **ModelTrainer**: New interface encapsulating training on SageMaker.
- **Estimators**: Encapsulate training on SageMaker.
- **Models**: Encapsulate built ML models.
- **Predictors**: Provide real-time inference and transformation using Python data-types against a SageMaker endpoint.
@@ -24,8 +25,8 @@ Train a Model with the SageMaker Python SDK
To train a model by using the SageMaker Python SDK, you:

1. Prepare a training script
2. Create an estimator
3. Call the ``fit`` method of the estimator
2. Create a ModelTrainer or Estimator
3. Call the ``train`` method of the ModelTrainer or the ``fit`` method of the Estimator

After you train a model, you can save it, and then serve the model as an endpoint to get real-time inferences or get inferences for an entire dataset by using batch transform.

@@ -85,6 +86,46 @@ If you want to use, for example, boolean hyperparameters, you need to specify ``
For more on training environment variables, please visit `SageMaker Containers <https://github.com/aws/sagemaker-containers>`_.


Using ModelTrainer
==================

To use the ModelTrainer class, you provide a few essential parameters, such as the training image URI and the source code configuration, and ModelTrainer spins up the SageMaker training job for you.

For more information about the class definitions, see `ModelTrainer <https://sagemaker.readthedocs.io/en/stable/api/training/model_trainer.html>`_.

Example: Launching a Training Job with a Custom Script

.. code:: python

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, InputData

# Image URI for the training job
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# Define the script to be run
source_code = SourceCode(
source_dir="basic-script-mode",
requirements="requirements.txt",
entry_script="custom_script.py",
)

# Define the ModelTrainer
model_trainer = ModelTrainer(
training_image=pytorch_image,
source_code=source_code,
base_job_name="script-mode",
)

# Pass the input data
input_data = InputData(
channel_name="train",
data_source=training_input_path, # S3 path where training data is stored
)

# Start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

Using Estimators
================

1 change: 1 addition & 0 deletions doc/requirements.txt
@@ -5,3 +5,4 @@ packaging==20.9
jinja2==3.1.4
schema==0.7.5
accelerate>=0.24.1,<=0.27.0
graphene<4.0
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -35,10 +35,12 @@ dependencies = [
"boto3>=1.34.142,<2.0",
"cloudpickle==2.2.1",
"docker",
"fastapi",
"google-pasta",
"importlib-metadata>=1.4.0,<7.0",
"jsonschema",
"numpy>=1.9.0,<2.0",
"omegaconf>=2.2,<2.3",
"packaging>=20.0",
"pandas",
"pathos",
@@ -53,6 +55,7 @@ dependencies = [
"tblib>=1.7.0,<4",
"tqdm",
"urllib3>=1.26.8,<3.0.0",
"uvicorn"
]

[project.scripts]
1 change: 1 addition & 0 deletions requirements/extras/test_requirements.txt
@@ -49,3 +49,4 @@ uvicorn>=0.30.1
fastapi==0.115.4
nest-asyncio
sagemaker-mlflow>=0.1.0
deepdiff>=8.0.0
1 change: 1 addition & 0 deletions src/sagemaker/__init__.py
@@ -74,5 +74,6 @@
)

from sagemaker.debugger import ProfilerConfig, Profiler # noqa: F401
from sagemaker.partner_app.auth_provider import PartnerAppAuthProvider # noqa: F401

__version__ = importlib_metadata.version("sagemaker")
Empty file.
@@ -0,0 +1,27 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Config Classes for taking in parameters for Batch Inference"""

from __future__ import absolute_import
from pydantic import BaseModel


class BatchTransformInferenceConfig(BaseModel):
"""Config class for Batch Transform Inference

* Can be used to deploy from ModelBuilder
"""

instance_count: int
instance_type: str
output_path: str
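As a quick illustration, a config like this can be constructed and validated through pydantic. This is a sketch: the class is re-declared here so the snippet is self-contained, and the field values are invented.

```python
from pydantic import BaseModel, ValidationError


class BatchTransformInferenceConfig(BaseModel):
    """Re-declaration of the config class above, for illustration only."""

    instance_count: int
    instance_type: str
    output_path: str


config = BatchTransformInferenceConfig(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output",
)

# pydantic enforces the declared field types at construction time
try:
    BatchTransformInferenceConfig(
        instance_count="two",  # not coercible to int
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/batch-output",
    )
except ValidationError:
    print("invalid instance_count rejected")
```

Because the config is a plain pydantic model, callers get type errors up front rather than at job-submission time.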
22 changes: 22 additions & 0 deletions src/sagemaker/config/config_schema.py
@@ -116,6 +116,7 @@
REGION_NAME = "region_name"
TELEMETRY_OPT_OUT = "TelemetryOptOut"
NOTEBOOK_JOB = "NotebookJob"
MODEL_TRAINER = "ModelTrainer"


def _simple_path(*args: str):
@@ -142,6 +143,7 @@ def _simple_path(*args: str):
)
TRAINING_JOB_ROLE_ARN_PATH = _simple_path(SAGEMAKER, TRAINING_JOB, ROLE_ARN)
TRAINING_JOB_VPC_CONFIG_PATH = _simple_path(SAGEMAKER, TRAINING_JOB, VPC_CONFIG)
TRAINING_JOB_TAGS_PATH = _simple_path(SAGEMAKER, TRAINING_JOB, TAGS)
TRAINING_JOB_SECURITY_GROUP_IDS_PATH = _simple_path(
TRAINING_JOB_VPC_CONFIG_PATH, SECURITY_GROUP_IDS
)
@@ -656,6 +658,25 @@ def _simple_path(*args: str):
"minItems": 1,
"maxItems": 15,
},
"role": {
TYPE: "string",
"pattern": r"^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$",
"minLength": 20,
"maxLength": 2048,
},
"baseJobName": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"sourceCode": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"distributed": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"compute": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"networking": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"stoppingCondition": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"trainingImage": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"trainingImageConfig": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"algorithmName": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"outputDataConfig": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"trainingInputMode": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"environment": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
"hyperparameters": {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
},
PROPERTIES: {
SCHEMA_VERSION: {
@@ -709,6 +730,7 @@ def _simple_path(*args: str):
},
},
},
MODEL_TRAINER: {TYPE: OBJECT, ADDITIONAL_PROPERTIES: True},
ESTIMATOR: {
TYPE: OBJECT,
ADDITIONAL_PROPERTIES: False,
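The ``role`` pattern added to the schema can be exercised directly. A quick check of the ARN regex from the diff above (the example ARNs are invented):

```python
import re

# Pattern copied from the "role" property added to the config schema
ROLE_PATTERN = re.compile(
    r"^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$"
)

# Standard partition, and a non-standard partition suffix like aws-cn
assert ROLE_PATTERN.match("arn:aws:iam::123456789012:role/MySageMakerRole")
assert ROLE_PATTERN.match("arn:aws-cn:iam::123456789012:role/service-role/Custom")

# Account IDs must be exactly 12 digits
assert not ROLE_PATTERN.match("arn:aws:iam::1234:role/TooFewAccountDigits")
```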