Commit a1916a8

Add benchmarking script (#86)
* Add benchmarking script
1 parent 1338820 commit a1916a8

40 files changed (+11564, -0 lines changed)

benchmarks/README.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# TensorFlow benchmarking scripts

This folder contains the TensorFlow training scripts from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks.

## Basic usage
**execute_tensorflow_training.py** uses the SageMaker Python SDK to start training jobs, one per combination of instance type and Python version. It takes the following parameters:

- role: SageMaker role used for training (required)
- region: SageMaker region (defaults to us-west-2)
- py-versions: One or more of py2 and py3, separated by spaces, for example `py2 py3` (defaults to py3)
- instance-types: A space-separated list of SageMaker instance types, for example `ml.p2.xlarge ml.c4.xlarge` (required). Use `local` for local mode training.
- checkpoint-path: The S3 location where the model checkpoints and TensorBoard events are saved after training
- wait: If set, the script waits for each training job to finish

Any unknown arguments are passed to the training script as additional arguments. Each one is expected to be a flag followed by a single value, for example `--model resnet50`.

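The forwarding of unknown arguments can be pictured with `argparse`'s `parse_known_args`, which is the call the launcher in this commit uses. The sketch below is a simplified, self-contained illustration rather than the launcher itself: `split_args` and `to_hyperparameters` are hypothetical helper names, and it assumes every forwarded flag is followed by exactly one value.

```python
import argparse


def split_args(argv):
    """Separate launcher options from arguments meant for the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--instance-types', nargs='+', required=True)
    parser.add_argument('-r', '--role', required=True)
    parser.add_argument('-w', '--wait', action='store_true')
    # parse_known_args returns (namespace, list_of_unrecognized_strings).
    return parser.parse_known_args(argv)


def to_hyperparameters(script_args):
    """Pair up '--key value' tokens (assumption: one value per flag)."""
    keys = [arg.lstrip('-') for arg in script_args[::2]]
    values = script_args[1::2]
    return dict(zip(keys, values))


args, script_args = split_args(
    ['-t', 'local', '-r', 'SageMakerRole', '--num_epochs', '1', '--wait'])
print(args.instance_types)
print(to_hyperparameters(script_args))
```

Values arrive as strings, so a flag without a value (a bare boolean switch, for instance) would shift the pairing; passing such flags with an explicit value avoids that.
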
## Examples:

```bash
./execute_tensorflow_training.py -t local -r SageMakerRole --num_epochs 1 --wait

./execute_tensorflow_training.py -r SageMakerRole -t ml.c4.xlarge ml.c5.xlarge --model resnet50
```

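When several instance types or Python versions are given, the launcher starts one job per combination and derives a job name prefix from each pair. The following sketch mirrors the string slicing used in the launcher script; `job_base_names` is a hypothetical helper, and it assumes instance type names of the form `ml.<two-character family>.<size>`.

```python
import itertools


def job_base_names(instance_types, py_versions):
    """Return one job name prefix per (instance type, Python version) pair."""
    names = []
    for instance_type, py_version in itertools.product(instance_types, py_versions):
        family = instance_type[3:5]  # 'ml.c4.xlarge' -> 'c4'
        size = instance_type[6:]     # 'ml.c4.xlarge' -> 'xlarge'
        names.append('%s-%s-%s' % (py_version, family, size))
    return names


print(job_base_names(['ml.c4.xlarge', 'ml.c5.xlarge'], ['py2', 'py3']))
# ['py2-c4-xlarge', 'py3-c4-xlarge', 'py2-c5-xlarge', 'py3-c5-xlarge']
```

Names that do not follow that pattern, such as `local`, produce a less readable prefix with the same slicing.
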
## Using other models, datasets and benchmark configurations
```python tf_cnn_benchmarks/tf_cnn_benchmarks.py --help``` shows all the options that the script has.

## TensorBoard events and checkpoints

During training, TensorBoard events and checkpoints are saved under the S3 location given by the script argument ```checkpoint-path```. Each job writes to its own subfolder, named after its Python version and instance type. The default location can be overridden by setting ```checkpoint-path```:

```bash
python execute_tensorflow_training.py ... --checkpoint-path s3://my/bucket/output/data
```
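
As a rough illustration of how the checkpoint location is used, the sketch below builds the per-job output folder and the TensorBoard command that the launcher prints. `model_dir_for` and `tensorboard_command` are hypothetical helper names; the command string is copied from the launcher script, and `posixpath.join` is used here (instead of the launcher's `os.path.join`) so the separator is always `/`.

```python
import posixpath


def model_dir_for(checkpoint_path, base_name):
    """Folder that receives one job's checkpoints and TensorBoard events."""
    return posixpath.join(checkpoint_path, base_name)


def tensorboard_command(region, checkpoint_path):
    """Shell command for viewing the events of all jobs under checkpoint_path."""
    cmd = ('S3_USE_HTTPS=0 S3_VERIFY_SSL=0 AWS_REGION=%s '
           'tensorboard --host localhost --port 6006 --logdir %s')
    return cmd % (region, checkpoint_path)


print(model_dir_for('s3://my/bucket/output/data', 'py3-c4-xlarge'))
print(tensorboard_command('us-west-2', 's3://my/bucket/output/data'))
```

Pointing `--logdir` at the parent location rather than a single job folder lets TensorBoard show the runs side by side.
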
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
#!/usr/bin/env python

from __future__ import absolute_import

import argparse
import itertools
import os

from sagemaker import Session
from sagemaker.estimator import Framework
from sagemaker.tensorflow import TensorFlow

# Bound method: calling default_bucket() returns the name of the session's default S3 bucket.
default_bucket = Session().default_bucket
# Directory containing this script; the training code lives in a subfolder of it.
dir_path = os.path.dirname(os.path.realpath(__file__))

# Defaults passed to tf_cnn_benchmarks.py; command-line script arguments override them.
_DEFAULT_HYPERPARAMETERS = {
    'batch_size': 32,
    'model': 'resnet32',
    'num_epochs': 10,
    'data_format': 'NHWC',
    'summary_verbosity': 1,
    'save_summaries_steps': 10,
    'data_name': 'cifar10'
}


class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode-beta"

    create_model = TensorFlow.create_model

    def __init__(self, py_version='py3', **kwargs):
        super(ScriptModeTensorFlow, self).__init__(**kwargs)
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '1.10.0'


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--instance-types', nargs='+', required=True,
                        help='<Required> One or more SageMaker instance types, or "local"')
    parser.add_argument('-r', '--role', required=True, help='<Required> SageMaker role used for training')
    parser.add_argument('-w', '--wait', action='store_true', help='Wait for each training job to finish')
    parser.add_argument('--region', default='us-west-2')
    parser.add_argument('--py-versions', nargs='+', default=['py3'], help='One or more of py2 and py3')
    # default_bucket() returns a bare bucket name, so the s3:// scheme is added here.
    parser.add_argument('--checkpoint-path',
                        default='s3://%s/benchmarks/checkpoints' % default_bucket(),
                        help='The S3 location where the model checkpoints and tensorboard events are saved after training')

    # Unrecognized arguments are returned separately and forwarded to the training script.
    return parser.parse_known_args()


def main(args, script_args):
    # Launch one training job per (instance type, Python version) combination.
    for instance_type, py_version in itertools.product(args.instance_types, args.py_versions):
        # For 'ml.c4.xlarge' and 'py3' this yields 'py3-c4-xlarge'.
        base_name = '%s-%s-%s' % (py_version, instance_type[3:5], instance_type[6:])
        model_dir = os.path.join(args.checkpoint_path, base_name)

        job_hps = create_hyperparameters(model_dir, script_args)

        print('hyperparameters:')
        print(job_hps)

        estimator = ScriptModeTensorFlow(
            entry_point='tf_cnn_benchmarks.py',
            role=args.role,  # use the role given on the command line
            source_dir=os.path.join(dir_path, 'tf_cnn_benchmarks'),
            base_job_name=base_name,
            train_instance_count=1,
            hyperparameters=job_hps,
            train_instance_type=instance_type,
            py_version=py_version,  # otherwise every job would run with the py3 default
        )

        input_dir = 's3://sagemaker-sample-data-%s/spark/mnist/train/' % args.region
        estimator.fit({'train': input_dir}, wait=args.wait)

    print("To use TensorBoard, execute the following command:")
    cmd = 'S3_USE_HTTPS=0 S3_VERIFY_SSL=0 AWS_REGION=%s tensorboard --host localhost --port 6006 --logdir %s'
    print(cmd % (args.region, args.checkpoint_path))


def create_hyperparameters(model_dir, script_args):
    job_hps = _DEFAULT_HYPERPARAMETERS.copy()

    # Checkpoints and evaluation summaries share the same per-job folder.
    job_hps.update({'train_dir': model_dir, 'eval_dir': model_dir})

    # script_args is expected to alternate flags and values, e.g. ['--model', 'resnet50'].
    script_arg_keys_without_dashes = [key[2:] if key.startswith('--') else key[1:] for key in script_args[::2]]
    script_arg_values = script_args[1::2]
    job_hps.update(dict(zip(script_arg_keys_without_dashes, script_arg_values)))

    return job_hps


if __name__ == '__main__':
    args, script_args = get_args()
    main(args, script_args)
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
# tf_cnn_benchmarks: High performance benchmarks

tf_cnn_benchmarks contains implementations of several popular convolutional
models, and is designed to be as fast as possible. tf_cnn_benchmarks supports
both running on a single machine and running in distributed mode across multiple
hosts. See the [High-Performance models
guide](https://www.tensorflow.org/performance/performance_models) for more
information.

These models utilize many of the strategies in the [TensorFlow Performance
Guide](https://www.tensorflow.org/performance/performance_guide). Benchmark
results can be found [here](https://www.tensorflow.org/performance/benchmarks).

These models are designed for performance. For models that have clean and
easy-to-read implementations, see the [TensorFlow Official
Models](https://github.com/tensorflow/models/tree/master/official).

## Getting Started

To run ResNet50 with synthetic data without distortions with a single GPU, run

```
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
```

Note that the master branch of tf_cnn_benchmarks requires the latest nightly
version of TensorFlow. You can install the nightly version by running `pip
install tf-nightly-gpu` in a clean environment, or by installing TensorFlow from
source. We sometimes create a branch of tf_cnn_benchmarks, in the form of
cnn_tf_vX.Y_compatible, that is compatible with TensorFlow version X.Y. For
example, branch
[cnn_tf_v1.9_compatible](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.9_compatible/scripts/tf_cnn_benchmarks)
works with TensorFlow 1.9.

Some important flags are

* model: Model to use, e.g. resnet50, inception3, vgg16, and alexnet.
* num_gpus: Number of GPUs to use.
* data_dir: Path to data to process. If not set, synthetic data is used. To
  use Imagenet data use these
  [instructions](https://github.com/tensorflow/models/tree/master/research/inception#getting-started)
  as a starting point.
* batch_size: Batch size for each GPU.
* variable_update: The method for managing variables: parameter_server,
  replicated, distributed_replicated, independent
* local_parameter_device: Device to use as parameter server: cpu or gpu.

To see the full list of flags, run `python tf_cnn_benchmarks.py --help`.

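When the benchmark is driven from another program, the flags above can be kept in a dictionary and turned into a command line. This is only a sketch under assumptions: `build_command` is a hypothetical helper, the flag names and values are the ones from the single-GPU example above, and the command is printed rather than executed.

```python
import shlex
import sys


def build_command(flags, script='tf_cnn_benchmarks.py'):
    """Turn a {flag_name: value} mapping into an argument list for the script."""
    cmd = [sys.executable, script]
    for name, value in flags.items():
        cmd.append('--%s=%s' % (name, value))
    return cmd


flags = {
    'num_gpus': 1,
    'batch_size': 32,
    'model': 'resnet50',
    'variable_update': 'parameter_server',
}
cmd = build_command(flags)
# Quote each part so the printed line can be pasted into a shell.
print(' '.join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` from the directory that contains `tf_cnn_benchmarks.py`.
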
To run ResNet50 with real data with 8 GPUs, run:

```
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}
```
This will train a ResNet-50 model on ImageNet with a total batch size of 2048
(256 per GPU) on 8 GPUs. The model should train to around 76% accuracy.

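The total batch size quoted above follows from `batch_size` being a per-GPU value. A minimal sketch of the arithmetic, where `global_batch_size` is a hypothetical helper name:

```python
def global_batch_size(per_gpu_batch_size, num_gpus):
    """Number of examples processed per training step across all GPUs."""
    return per_gpu_batch_size * num_gpus


# --batch_size=256 with --num_gpus=8, as in the command above.
print(global_batch_size(256, 8))  # 2048
```
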
## Running the tests

To run the tests, run

```bash
pip install portpicker
python run_tests.py && python run_tests.py --run_distributed_tests
```

Note the tests require portpicker.

The command above runs a subset of tests that is both fast and fairly
comprehensive. Alternatively, all the tests can be run, but this will take a
long time:

```bash
python run_tests.py --full_tests && python run_tests.py --full_tests --run_distributed_tests
```

We will run all tests on every PR before merging them, so it is not necessary
to pass `--full_tests` when running tests yourself.

To run an individual test, such as method `testParameterServer` of test class
`TfCnnBenchmarksTest` of module `benchmark_cnn_test`, run

```bash
python -m unittest -v benchmark_cnn_test.TfCnnBenchmarksTest.testParameterServer
```
