Commit 2cf69b2

GaryTu1020 authored and chuyang-deng committed
feature: Git integration for CodeCommit (#927)
* add functions, tests and doc for CodeCommit
1 parent fb309bc commit 2cf69b2

8 files changed: +405 -78 lines changed

doc/overview.rst

Lines changed: 27 additions & 7 deletions
@@ -185,7 +185,7 @@ Here is an example:
 
 Use Scripts Stored in a Git Repository
 --------------------------------------
-When you create an estimator, you can specify a training script that is stored in a GitHub or other Git repository as the entry point for the estimator, so that you don't have to download the scripts locally.
+When you create an estimator, you can specify a training script that is stored in a GitHub (or other Git) or CodeCommit repository as the entry point for the estimator, so that you don't have to download the scripts locally.
 If you do so, source directory and dependencies should be in the same repo if they are needed. Git support can be enabled simply by providing ``git_config`` parameter
 when creating an ``Estimator`` object. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies``
 should be relative paths in the Git repo if provided.
@@ -195,19 +195,26 @@ The ``git_config`` parameter includes fields ``repo``, ``branch``, ``commit``,
 repository where your training script is stored. If you don't provide ``branch``, the default value 'master' is used.
 If you don't provide ``commit``, the latest commit in the specified branch is used.
 
-``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. Set ``2FA_enabled`` to 'True' if
-two-factor authentication is enabled for the GitHub (or other Git) account, otherwise set it to 'False'.
-If you do not provide a value for ``2FA_enabled``, a default value of 'False' is used.
+``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. For GitHub
+(or other Git) accounts, set ``2FA_enabled`` to 'True' if two-factor authentication is enabled for the
+account, otherwise set it to 'False'. If you do not provide a value for ``2FA_enabled``, a default
+value of 'False' is used. CodeCommit does not support two-factor authentication, so do not provide
+"2FA_enabled" with CodeCommit repositories.
 
+For GitHub or other Git repositories,
 If ``repo`` is an SSH URL, you should either have no passphrase for the SSH key pairs, or have the ``ssh-agent`` configured
 so that you are not prompted for the SSH passphrase when you run a ``git clone`` command with SSH URLs. For SSH URLs, it
-does not matter whether two-factor authentication is enabled.
-
-If ``repo`` is an https URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
+does not matter whether two-factor authentication is enabled. If ``repo`` is an HTTPS URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
 used for authentication if provided (``token`` prioritized). When 2FA is enabled, only token will be used for
 authentication if provided. If required authentication info is not provided, python SDK will try to use local
 credentials storage to authenticate. If that fails either, an error message will be thrown.
 
+For CodeCommit repos, please make sure you have completed the authentication setup: https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up.html.
+2FA is not supported by CodeCommit, so ``2FA_enabled`` should not be provided. There is no token in CodeCommit, so
+``token`` should not be provided either. If ``repo`` is an SSH URL, the requirements are the same as GitHub repos.
+If ``repo`` is an HTTPS URL, ``username``+``password`` will be used for authentication if they are provided; otherwise,
+Python SDK will try to use either CodeCommit credential helper or local credential storage for authentication.
+
 Here are some examples of creating estimators with Git support:
 
 .. code:: python
@@ -276,6 +283,19 @@ Here are some examples of creating estimators with Git support:
     train_instance_count=1,
     train_instance_type='local')
 
+.. code:: python
+
+    # This example specifies a CodeCommit repository, and try to authenticate with provided username+password
+    git_config = {'repo': 'https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name',
+                  'username': 'username',
+                  'password': 'passw0rd!'}
+
+    mx_estimator = MXNet(entry_point='mxnet/mnist.py',
+                         role='SageMakerRole',
+                         git_config=git_config,
+                         train_instance_count=1,
+                         train_instance_type='ml.c4.xlarge')
+
 Git support can be used not only for training jobs, but also for hosting models. The usage is the same as the above,
 and ``git_config`` should be provided when creating model objects, e.g. ``TensorFlowModel``, ``MXNetModel``, ``PyTorchModel``.
 
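Note on the documentation change above: it says ``2FA_enabled`` and ``token`` should not be provided for CodeCommit repositories. A minimal sketch of cleaning a config before passing it on might look like the following. The helper name `strip_codecommit_unsupported` is hypothetical and is not part of the SageMaker Python SDK; the sketch assumes the caller already knows the repo is a CodeCommit repo.

```python
# Hypothetical helper for illustration only; not part of the SageMaker Python SDK.
UNSUPPORTED_FOR_CODECOMMIT = ("2FA_enabled", "token")


def strip_codecommit_unsupported(git_config):
    """Return a copy of git_config without the fields the docs say to omit for CodeCommit."""
    return {key: value for key, value in git_config.items() if key not in UNSUPPORTED_FOR_CODECOMMIT}


git_config = {
    "repo": "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name",
    "username": "username",
    "password": "passw0rd!",
    "2FA_enabled": False,  # not meaningful for CodeCommit
}

cleaned = strip_codecommit_unsupported(git_config)
print(sorted(cleaned))  # ['password', 'repo', 'username']
```

The original dict is left unchanged, so the same config can still be inspected or logged by the caller.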

src/sagemaker/estimator.py

Lines changed: 23 additions & 14 deletions
@@ -976,11 +976,10 @@ def __init__(
 
                 You can assign entry_point='src/train.py'.
             git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
-                ``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token`` (default: None). The fields are
-                optional except ``repo``. If ``branch`` is not specified, master branch will be used. If ``commit``
-                is not specified, the latest commit in the required branch will be used. 'branch' and 'commit' are
-                optional. If 'branch' is not specified, 'master' branch will be used. If 'commit' is not specified,
-                the latest commit in the required branch will be used.
+                ``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The ``repo`` field is required.
+                All other fields are optional. ``repo`` specifies the Git repository where your training script is
+                stored. If you don't provide ``branch``, the default value 'master' is used. If you don't provide
+                ``commit``, the latest commit in the specified branch is used.
                 Example:
 
                     The following config:
@@ -991,15 +990,25 @@ def __init__(
 
                     results in cloning the repo specified in 'repo', then checkout the 'master' branch, and checkout
                     the specified commit.
-                ``2FA_enabled``, ``username``, ``password`` and ``token`` are for authentication purpose.
-                ``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided,
-                we consider 2FA as disabled. For GitHub and other Git repos, when ssh urls are provided, it does not
-                make a difference whether 2FA is enabled or disabled; an ssh passphrase should be in local storage.
-                When https urls are provided: if 2FA is disabled, then either token or username+password will
-                be used for authentication if provided (token prioritized); if 2FA is enabled, only token will
-                be used for authentication if provided. If required authentication info is not provided, python SDK
-                will try to use local credentials storage to authenticate. If that fails either, an error message will
-                be thrown.
+                ``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. For GitHub
+                (or other Git) accounts, set ``2FA_enabled`` to 'True' if two-factor authentication is enabled for the
+                account, otherwise set it to 'False'. If you do not provide a value for ``2FA_enabled``, a default
+                value of 'False' is used. CodeCommit does not support two-factor authentication, so do not provide
+                "2FA_enabled" with CodeCommit repositories.
+
+                For GitHub and other Git repos, when SSH URLs are provided, it doesn't matter whether 2FA is
+                enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent
+                configured so that you will not be prompted for SSH passphrase when you do 'git clone' command with SSH
+                URLs. When HTTPS URLs are provided: if 2FA is disabled, then either token or username+password will be
+                used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
+                authentication if provided. If required authentication info is not provided, python SDK will try to use
+                local credentials storage to authenticate. If that fails either, an error message will be thrown.
+
+                For CodeCommit repos, 2FA is not supported, so '2FA_enabled' should not be provided. There is no token
+                in CodeCommit, so 'token' should not be provided too. When 'repo' is an SSH URL, the requirements are
+                the same as GitHub-like repos. When 'repo' is an HTTPS URL, username+password will be used for
+                authentication if they are provided; otherwise, python SDK will try to use either CodeCommit credential
+                helper or local credential storage for authentication.
             source_dir (str): Path (absolute or relative) to a directory with any other training
                 source code dependencies aside from the entry point file (default: None). Structure within this
                 directory are preserved when training on Amazon SageMaker. If 'git_config' is provided,
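
Note on the docstring above: the authentication rules it describes can be read as a small decision procedure. The sketch below is one hedged reading of that prose, not the SDK's code. The function name `describe_auth`, the returned labels, and the `ssh://` / `git@` prefixes used to detect SSH URLs are assumptions made for illustration.

```python
def describe_auth(git_config):
    """Return a label for the authentication path the docstring describes.

    Illustrative only: the URL prefixes and the returned labels are assumptions.
    """
    repo = git_config["repo"]
    has_token = "token" in git_config
    has_user_pass = "username" in git_config and "password" in git_config

    # SSH URLs: credentials in git_config are not used, and 2FA does not matter.
    if repo.startswith(("ssh://", "git@")):
        return "ssh key"

    # CodeCommit over HTTPS: no 2FA and no tokens.
    if repo.startswith("https://git-codecommit"):
        if has_user_pass:
            return "username+password"
        return "credential helper or local credential storage"

    # GitHub-like over HTTPS: a token is used with or without 2FA and takes priority.
    if has_token:
        return "token"
    # username+password is only used when 2FA is disabled (the default).
    if has_user_pass and git_config.get("2FA_enabled") is not True:
        return "username+password"
    return "local credential storage"


examples = [
    {"repo": "https://github.com/org/repo.git", "token": "t", "username": "u", "password": "p"},
    {"repo": "https://github.com/org/repo.git", "2FA_enabled": True, "username": "u", "password": "p"},
    {"repo": "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name", "token": "t"},
    {"repo": "git@github.com:org/repo.git", "username": "u"},
]
for config in examples:
    print(config["repo"], "->", describe_auth(config))
```

The second example shows the case the docstring implies but does not spell out: with 2FA enabled and no token, username and password are not used, so only local credential storage remains.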

src/sagemaker/git_utils.py

Lines changed: 64 additions & 13 deletions
@@ -25,18 +25,26 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
     and set ``entry_point``, ``source_dir`` and ``dependencies`` to the right file or directory in the repo cloned.
 
     Args:
-        git_config (dict[str, object]): Git configurations used for cloning files, including ``repo``, ``branch``,
-            ``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The fields are optional except
-            ``repo``. If ``branch`` is not specified, master branch will be used. If ``commit`` is not specified,
-            the latest commit in the required branch will be used. ``2FA_enabled``, ``username``, ``password`` and
-            ``token`` are for authentication purpose.
-            ``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided, we
-            consider 2FA as disabled. For GitHub and other Git repos, when ssh urls are provided, it does not make a
-            difference whether 2FA is enabled or disabled; an ssh passphrase should be in local storage. When
-            https urls are provided: if 2FA is disabled, then either token or username+password will be used for
-            authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
+        git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
+            ``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The ``repo`` field is required.
+            All other fields are optional. ``repo`` specifies the Git repository where your training script is stored.
+            If you don't provide ``branch``, the default value 'master' is used. If you don't provide ``commit``,
+            the latest commit in the specified branch is used. ``2FA_enabled``, ``username``, ``password`` and
+            ``token`` are for authentication purpose. If ``2FA_enabled`` is not provided, we consider 2FA as disabled.
+
+            For GitHub and GitHub-like repos, when SSH URLs are provided, it doesn't matter whether 2FA is
+            enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent
+            configured so that you will not be prompted for SSH passphrase when you do 'git clone' command with SSH
+            URLs. When https URLs are provided: if 2FA is disabled, then either token or username+password will be
+            used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
             authentication if provided. If required authentication info is not provided, python SDK will try to use
             local credentials storage to authenticate. If that fails either, an error message will be thrown.
+
+            For CodeCommit repos, 2FA is not supported, so '2FA_enabled' should not be provided. There is no token in
+            CodeCommit, so 'token' should not be provided too. When 'repo' is an SSH URL, the requirements are the
+            same as GitHub-like repos. When 'repo' is an https URL, username+password will be used for
+            authentication if they are provided; otherwise, python SDK will try to use either CodeCommit credential
+            helper or local credential storage for authentication.
         entry_point (str): A relative location to the Python source file which should be executed as the entry point
             to training or model hosting in the Git repo.
         source_dir (str): A relative location to a directory with other training or model hosting source code
@@ -115,7 +123,12 @@ def _generate_and_run_clone_command(git_config, dest_dir):
     Raises:
         CalledProcessError: If failed to clone git repo.
     """
-    _clone_command_for_github_like(git_config, dest_dir)
+    if git_config["repo"].startswith("https://git-codecommit") or git_config["repo"].startswith(
+        "ssh://git-codecommit"
+    ):
+        _clone_command_for_codecommit(git_config, dest_dir)
+    else:
+        _clone_command_for_github_like(git_config, dest_dir)
 
 
 def _clone_command_for_github_like(git_config, dest_dir):
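
Note on the routing added above: `str.startswith` also accepts a tuple of prefixes, so the two-call `or` expression could arguably be written as one call. The following is only a sketch of that equivalent form, not a change made in this commit. The names `CODECOMMIT_PREFIXES` and `is_codecommit_repo` are hypothetical.

```python
# Hypothetical names for illustration; the commit itself uses two startswith() calls joined by `or`.
CODECOMMIT_PREFIXES = ("https://git-codecommit", "ssh://git-codecommit")


def is_codecommit_repo(repo_url):
    """Return True when the URL begins with one of the CodeCommit prefixes checked in the diff."""
    return repo_url.startswith(CODECOMMIT_PREFIXES)


print(is_codecommit_repo("https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name"))  # True
print(is_codecommit_repo("ssh://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name"))  # True
print(is_codecommit_repo("https://github.com/org/git-codecommit-mirror.git"))  # False
```

Because the check is a prefix match, a GitHub URL that merely contains `git-codecommit` later in its path is still routed to the GitHub-like branch.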
@@ -136,14 +149,14 @@ def _clone_command_for_github_like(git_config, dest_dir):
     if not is_https and not is_ssh:
         raise ValueError("Invalid Git url provided.")
     if is_ssh:
-        _clone_command_for_github_like_ssh(git_config, dest_dir)
+        _clone_command_for_ssh(git_config, dest_dir)
     elif "2FA_enabled" in git_config and git_config["2FA_enabled"] is True:
         _clone_command_for_github_like_https_2fa_enabled(git_config, dest_dir)
     else:
         _clone_command_for_github_like_https_2fa_disabled(git_config, dest_dir)
 
 
-def _clone_command_for_github_like_ssh(git_config, dest_dir):
+def _clone_command_for_ssh(git_config, dest_dir):
     if "username" in git_config or "password" in git_config or "token" in git_config:
         warnings.warn("SSH cloning, authentication information in git config will be ignored.")
     _run_clone_command(git_config["repo"], dest_dir)
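
Note on the renamed SSH helper: since it is now shared by GitHub-like and CodeCommit repos, the "credentials ignored" warning applies to both. A hedged way to observe that warning without running `git` is sketched below. `check_ssh_git_config` is a hypothetical stand-in that copies only the credential check from the diff and omits the clone step.

```python
import warnings


def check_ssh_git_config(git_config):
    """Stand-in for the credential check in the diff; it does not clone anything."""
    if "username" in git_config or "password" in git_config or "token" in git_config:
        warnings.warn("SSH cloning, authentication information in git config will be ignored.")


git_config = {
    "repo": "ssh://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name",
    "username": "username",
}

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_ssh_git_config(git_config)

messages = [str(w.message) for w in caught]
print(messages)
```

The same pattern could be used in a unit test to assert that a config with stray credentials triggers exactly one warning.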
@@ -173,6 +186,44 @@ def _clone_command_for_github_like_https_2fa_enabled(git_config, dest_dir):
     _run_clone_command(updated_url, dest_dir)
 
 
+def _clone_command_for_codecommit(git_config, dest_dir):
+    """check if a git_config param representing a CodeCommit repo is valid, if it is, create the command to
+    git clone the repo, and run it.
+
+    Args:
+        git_config ((dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
+            and ``commit``.
+        dest_dir (str): The local directory to clone the Git repo into.
+
+    Raises:
+        ValueError: If git_config['repo'] is in the wrong format.
+        CalledProcessError: If failed to clone git repo.
+    """
+    is_https = git_config["repo"].startswith("https://git-codecommit")
+    is_ssh = git_config["repo"].startswith("ssh://git-codecommit")
+    if not is_https and not is_ssh:
+        raise ValueError("Invalid Git url provided.")
+    if "2FA_enabled" in git_config:
+        warnings.warn("CodeCommit does not support 2FA, '2FA_enabled' will be ignored.")
+    if "token" in git_config:
+        warnings.warn("There are no tokens in CodeCommit, the token provided will be ignored.")
+    if is_ssh:
+        _clone_command_for_ssh(git_config, dest_dir)
+    else:
+        _clone_command_for_codecommit_https(git_config, dest_dir)
+
+
+def _clone_command_for_codecommit_https(git_config, dest_dir):
+    updated_url = git_config["repo"]
+    if "username" in git_config and "password" in git_config:
+        updated_url = _insert_username_and_password_to_repo_url(
+            url=git_config["repo"], username=git_config["username"], password=git_config["password"]
+        )
+    elif "username" in git_config or "password" in git_config:
+        warnings.warn("Credentials provided in git config will be ignored.")
+    _run_clone_command(updated_url, dest_dir)
+
+
 def _run_clone_command(repo_url, dest_dir):
     """Run the 'git clone' command with the repo url and the directory to clone the repo into.
 
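
Note on `_clone_command_for_codecommit_https`: it calls `_insert_username_and_password_to_repo_url`, whose body is not part of this diff. As a rough illustration of what embedding credentials in an HTTPS URL could involve, here is a sketch under stated assumptions. The function `insert_credentials` is hypothetical and is not claimed to match the SDK's helper; percent-encoding with `urllib.parse.quote` is an assumption about how special characters would be handled.

```python
from urllib.parse import quote


def insert_credentials(url, username, password):
    """Hypothetical stand-in: return an HTTPS URL with percent-encoded credentials embedded."""
    prefix = "https://"
    if not url.startswith(prefix):
        raise ValueError("Expected an HTTPS URL.")
    # safe="" encodes every reserved character, so '@', ':' and '/' in credentials cannot break the URL.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{prefix}{user}:{secret}@{url[len(prefix):]}"


updated_url = insert_credentials(
    "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name",
    "username",
    "passw0rd!",
)
print(updated_url)
# https://username:passw0rd%21@git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name
```

A URL built this way carries the password in plain sight, so it should not be logged or echoed in error messages.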
