
Commit 25e92d9 (parent: 31d744a)

    revise README.md and Makefile

4 files changed: 154 additions, 56 deletions

examples/MNIST/Makefile

Lines changed: 35 additions & 3 deletions

@@ -8,8 +8,37 @@ check_PROGRAMS = mnist_main.py
 MNIST_URL = https://raw.githubusercontent.com/pytorch/examples/main/mnist/main.py
 
 mnist_main.py:
-	curl -Ls $(MNIST_URL) -o $@
-	patch -st $@ < mnist.patch
+	@curl -Ls $(MNIST_URL) -o $@
+	@patch -st $@ < mnist.patch
+
+# https://yann.lecun.com/exdb/mnist
+MNIST_DATA_URL = https://github.com/golbin/TensorFlow-MNIST/raw/master/mnist/data
+
+MNIST_DATASETS = train-images-idx3-ubyte \
+                 train-labels-idx1-ubyte \
+                 t10k-images-idx3-ubyte \
+                 t10k-labels-idx1-ubyte
+
+MNIST_DATASETS_GZ = $(MNIST_DATASETS:=.gz)
+
+train-images-idx3-ubyte:
+	@curl -LOs $(MNIST_DATA_URL)/$@.gz
+	@gunzip $@.gz
+
+train-labels-idx1-ubyte:
+	@curl -LOs $(MNIST_DATA_URL)/$@.gz
+	@gunzip $@.gz
+
+t10k-images-idx3-ubyte:
+	@curl -LOs $(MNIST_DATA_URL)/$@.gz
+	@gunzip $@.gz
+
+t10k-labels-idx1-ubyte:
+	@curl -LOs $(MNIST_DATA_URL)/$@.gz
+	@gunzip $@.gz
+
+mnist_images.nc: $(MNIST_DATASETS)
+	@python create_mnist_netcdf.py
 
 
 all:
@@ -21,5 +50,8 @@ ptests check: mnist_main.py mnist_images.nc
 	@echo ""
 
 clean:
-	rm -rf mnist_main.py
+	rm -f mnist_main.py
+	rm -f $(MNIST_DATASETS)
+	rm -f $(MNIST_DATASETS_GZ)
+	rm -f mnist_images.nc
 
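The four dataset rules above download gzipped files in the IDX binary layout
documented at the MNIST page cited in the Makefile comment: a big-endian
integer header followed by raw unsigned bytes. As a rough illustration of
what a reader such as `create_mnist_netcdf.py` has to parse, a minimal
sketch (helper names hypothetical, not part of this commit) could look like:

```python
import struct
import numpy as np

def read_idx_images(path):
    # IDX image header: magic 2051, count, rows, cols (big-endian uint32)
    with open(path, "rb") as f:
        magic, num, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(num, rows, cols)

def read_idx_labels(path):
    # IDX label header: magic 2049, count (big-endian uint32)
    with open(path, "rb") as f:
        magic, num = struct.unpack(">II", f.read(8))
        assert magic == 2049, "not an IDX label file"
        return np.frombuffer(f.read(), dtype=np.uint8)

# file names match the Makefile targets above
images = read_idx_images("train-images-idx3-ubyte")   # shape (60000, 28, 28)
labels = read_idx_labels("train-labels-idx1-ubyte")   # shape (60000,)
```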

examples/MNIST/README.md

Lines changed: 102 additions & 38 deletions

@@ -1,45 +1,109 @@
-# PnetCDF-python MNIST example
+# MNIST example using PnetCDF-Python to Read Input Data
 
-This directory contains the description and run instructions for the MNIST example Python programs that utilize PnetCDF for file I/O and parallel training with MNIST data.
-
-## Directory Structure
-
-- **MNIST_data**: This folder contains a mini MNIST test dataset stored in a NetCDF file (`mnist_images_mini.nc`). The file includes:
-  - 60 training samples
-  - 12 testing samples
-
-- **MNIST_codes**: This folder contains the example MNIST training code. The example code is based on the [PyTorch MNIST example](https://github.com/pytorch/examples/tree/main/mnist) and uses `DistributedDataParallel` for parallel training.
+This directory contains files for running the Pytorch example program
+[MNIST](https://github.com/pytorch/examples/tree/main/mnist),
+using the Pytorch module `DistributedDataParallel` for parallel training and
+`PnetCDF-Python` for reading data from a NetCDF file.
 
+---
 ## Running the MNIST Example Program
 
-To run the MNIST example program, use the `mpiexec` command. The example below runs the program on 4 MPI processes.
-
-### Command:
-
-```sh
-mpiexec -n 4 python main.py
-```
-
-### Expected Output:
-
-When using 4 MPI processes, the output is expected to be similar to the following:
-
-```sh
-nprocs = 4 rank = 0 device = cpu mpi_size = 4 mpi_rank = 0
-nprocs = 4 rank = 2 device = cpu mpi_size = 4 mpi_rank = 2
-nprocs = 4 rank = 1 device = cpu mpi_size = 4 mpi_rank = 1
-nprocs = 4 rank = 3 device = cpu mpi_size = 4 mpi_rank = 3
-
-Train Epoch: 1 Average Loss: 2.288340
-Test set: Average loss: 2.7425, Accuracy: 0/12 (0%)
-
-Train Epoch: 2 Average Loss: 2.490800
-Test set: Average loss: 1.9361, Accuracy: 6/12 (50%)
-
-Train Epoch: 3 Average Loss: 2.216520
-Test set: Average loss: 1.8703, Accuracy: 7/12 (58%)
-```
-
+* First, run the commands below to generate the Python program file and the
+  NetCDF input file.
+  ```sh
+  make mnist_main.py
+  make mnist_images.nc
+  ```
+* Run the command below to train the model using 4 MPI processes.
+  ```sh
+  mpiexec -n 4 python mnist_main.py --batch-size 4 --test-batch-size 2 --epochs 3 --input-file mnist_images.nc
+  ```
+
+## Testing
+* Command `make check` will do the following.
+  + Downloads the Python source code
+    [main.py](https://github.com/pytorch/examples/blob/main/mnist/main.py)
+    from [Pytorch Examples](https://github.com/pytorch/examples) as file
+    `mnist_main.py`.
+  + Applies the patch file [mnist.patch](./mnist.patch) to `mnist_main.py`.
+  + Downloads the MNIST data sets from
+    [golbin/TensorFlow-MNIST](https://github.com/golbin/TensorFlow-MNIST/raw/master/mnist/data).
+  + Runs the utility program [create_mnist_netcdf.py](./create_mnist_netcdf.py)
+    to extract a subset of the images into a NetCDF file.
+  + Runs the training program `mnist_main.py`.
+
+* Testing output shown on screen:
+  ```
+  ======================================================================
+  examples/MNIST: Parallel testing on 4 MPI processes
+  ======================================================================
+  Train Epoch: 1 [0/60 (0%)]	Loss: 2.514259
+  Train Epoch: 1 [10/60 (67%)]	Loss: 1.953820
+
+  Test set: Average loss: 2.2113, Accuracy: 4/12 (33%)
+
+  Train Epoch: 2 [0/60 (0%)]	Loss: 2.359334
+  Train Epoch: 2 [10/60 (67%)]	Loss: 2.092178
+
+  Test set: Average loss: 1.4825, Accuracy: 6/12 (50%)
+
+  Train Epoch: 3 [0/60 (0%)]	Loss: 2.067438
+  Train Epoch: 3 [10/60 (67%)]	Loss: 0.010670
+
+  Test set: Average loss: 1.2531, Accuracy: 7/12 (58%)
+  ```
+
+## mnist_main.py command-line options
+```
+  -h, --help           show this help message and exit
+  --batch-size N       input batch size for training (default: 64)
+  --test-batch-size N  input batch size for testing (default: 1000)
+  --epochs N           number of epochs to train (default: 14)
+  --lr LR              learning rate (default: 1.0)
+  --gamma M            Learning rate step gamma (default: 0.7)
+  --no-cuda            disables CUDA training
+  --no-mps             disables macOS GPU training
+  --dry-run            quickly check a single pass
+  --seed S             random seed (default: 1)
+  --log-interval N     how many batches to wait before logging training status
+  --save-model         For Saving the current Model
+  --input-file INPUT_FILE
+                       NetCDF file storing train and test samples
+```
+
+## create_mnist_netcdf.py command-line options
+```
+  -h, --help           show this help message and exit
+  --verbose            Verbose mode
+  --train-size N       Number of training samples extracted from the input file (default: 60)
+  --test-size N        Number of testing samples extracted from the input file (default: 12)
+  --train-data-file TRAIN_DATA_FILE
+                       (Optional) input file name of training data
+  --train-label-file TRAIN_LABEL_FILE
+                       (Optional) input file name of training labels
+  --test-data-file TEST_DATA_FILE
+                       (Optional) input file name of testing data
+  --test-label-file TEST_LABEL_FILE
+                       (Optional) input file name of testing labels
+```
+
+---
+## Files in this directory
+* [mnist.patch](./mnist.patch) --
+  a patch file to be applied to
+  [main.py](https://github.com/pytorch/examples/blob/main/mnist/main.py)
+  once it has been downloaded from
+  [Pytorch Examples](https://github.com/pytorch/examples) and before
+  running the model training.
+
+* [comm_file.py](./comm_file.py) --
+  implements the parallel environment for training the model in parallel.
+
+* [pnetcdf_io.py](./pnetcdf_io.py) --
+  implements the file I/O using PnetCDF-Python.
+
+* [create_mnist_netcdf.py](./create_mnist_netcdf.py) --
+  a utility Python program that reads the MNIST files, extracts a subset of
+  the samples, and stores them into a newly created file in NetCDF format.
+
+---
 ### Notes:
 - The test set accuracy may vary slightly depending on how the data is distributed across the MPI processes.
 - The accuracy and loss reported after each epoch are averaged across all MPI processes.
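For context on the `--input-file` option above: `pnetcdf_io.py` reads the
training and testing samples from the NetCDF file in parallel. A minimal
sketch of that read path, assuming the `pnetcdf.File` API of PnetCDF-Python
and the variable names `train_samples`/`train_labels` introduced by
mnist.patch in this commit (illustrative only, not the repository's code):

```python
from mpi4py import MPI
import pnetcdf

comm = MPI.COMM_WORLD

# every rank opens the same file collectively with the MPI communicator
f = pnetcdf.File("mnist_images.nc", mode="r", comm=comm)

images = f.variables["train_samples"]  # e.g. 60 samples of 28 x 28 pixels
labels = f.variables["train_labels"]

sample = images[0]   # numpy array holding the first image
digit  = labels[0]   # its label

f.close()
```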

examples/MNIST/mnist.patch

Lines changed: 15 additions & 15 deletions

Note: '-'/'+' line pairs below with identical visible content differ only in
trailing whitespace, which this commit strips from the patch file.

@@ -1,5 +1,5 @@
 --- mnist_main_original.py	2024-08-10 17:30:08.552324326 -0500
-+++ pnetcdf_mnist.py	2024-08-10 18:02:49.008705003 -0500
++++ pnetcdf_mnist.py	2024-08-11 16:10:31.895471785 -0500
 @@ -1,3 +1,8 @@
 +#
 +# Copyright (C) 2024, Northwestern University and Argonne National Laboratory
@@ -15,10 +15,10 @@
  from torch.optim.lr_scheduler import StepLR
 +from torch.nn.parallel import DistributedDataParallel as DDP
 +from torch.utils.data.distributed import DistributedSampler
-+ 
++
 +import comm_file, pnetcdf_io
 +from mpi4py import MPI
-+ 
++
  class Net(nn.Module):
      def __init__(self):
 @@ -42,14 +51,13 @@
@@ -32,27 +32,27 @@
              100. * batch_idx / len(train_loader), loss.item()))
          if args.dry_run:
              break
-+ 
++
 -
  def test(model, device, test_loader):
      model.eval()
      test_loss = 0
 @@ -62,9 +70,14 @@
          pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
          correct += pred.eq(target.view_as(pred)).sum().item()
-+ 
++
 +    # aggregate loss among all ranks
 +    test_loss = comm.mpi_comm.allreduce(test_loss, op=MPI.SUM)
 +    correct = comm.mpi_comm.allreduce(correct, op=MPI.SUM)
 +
     test_loss /= len(test_loader.dataset)
-+ 
++
 -    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
 +    if rank == 0:
 +        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
         test_loss, correct, len(test_loader.dataset),
         100. * correct / len(test_loader.dataset)))
-+ 
++
 @@ -94,6 +107,8 @@
          help='how many batches to wait before logging training status')
      parser.add_argument('--save-model', action='store_true', default=False,
@@ -65,7 +65,7 @@
 @@ -107,7 +122,7 @@
     else:
         device = torch.device("cpu")
-+ 
++
 -    train_kwargs = {'batch_size': args.batch_size}
 +    train_kwargs = {'batch_size': args.batch_size//nprocs}
     test_kwargs = {'batch_size': args.test_batch_size}
@@ -84,8 +84,8 @@
 +
 +    # Open files storing training and testing samples
 +    infile = args.input_file
-+    train_file = pnetcdf_io.dataset(infile, 'train_images', 'train_labels', transform, comm.mpi_comm)
-+    test_file = pnetcdf_io.dataset(infile, 'test_images', 'test_labels', transform, comm.mpi_comm)
++    train_file = pnetcdf_io.dataset(infile, 'train_samples', 'train_labels', transform, comm.mpi_comm)
++    test_file = pnetcdf_io.dataset(infile, 'test_samples', 'test_labels', transform, comm.mpi_comm)
 +
 +    # create distributed samplers
 +    train_sampler = DistributedSampler(train_file, num_replicas=nprocs, rank=rank, shuffle=True)
@@ -94,14 +94,14 @@
 +    # add distributed samplers to DataLoaders
 +    train_loader = torch.utils.data.DataLoader(train_file, sampler=train_sampler, **train_kwargs)
 +    test_loader = torch.utils.data.DataLoader(test_file, sampler=test_sampler, **test_kwargs, drop_last=False)
-+ 
++
     model = Net().to(device)
 +
 +    # use DDP
 +    model = DDP(model, device_ids=[device] if use_cuda else None)
 +
     optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
-+ 
++
     scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
     for epoch in range(1, args.epochs + 1):
 +        # train sampler set epoch
@@ -111,16 +111,16 @@
         train(args, model, device, train_loader, optimizer, epoch)
         test(model, device, test_loader)
         scheduler.step()
-+ 
++
     if args.save_model:
 -        torch.save(model.state_dict(), "mnist_cnn.pt")
 +        if rank == 0:
 +            torch.save(model.state_dict(), "mnist_cnn.pt")
-+ 
++
 +    # close files
 +    train_file.close()
 +    test_file.close()
-+ 
++
  if __name__ == '__main__':
 +    ## initialize parallel environment
 +    comm, device = comm_file.init_parallel()
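The patch relies on two things it does not itself set up: an object whose
`mpi_comm` member backs the allreduce calls, and an initialized
torch.distributed process group, which `DDP` requires before it can wrap the
model. A hedged sketch of an initializer in the spirit of
`comm_file.init_parallel()`, assuming mpi4py and the gloo backend
(illustrative only, not the repository's comm_file.py):

```python
import os
import torch
import torch.distributed as dist
from mpi4py import MPI

class ParallelEnv:
    def __init__(self, comm):
        self.mpi_comm = comm          # used as comm.mpi_comm in the patch
        self.rank = comm.Get_rank()
        self.nprocs = comm.Get_size()

def init_parallel():
    comm = MPI.COMM_WORLD
    # rendezvous address for torch.distributed; single-node assumption here
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=comm.Get_rank(),
                            world_size=comm.Get_size())
    if torch.cuda.is_available():
        device = torch.device("cuda", comm.Get_rank() % torch.cuda.device_count())
    else:
        device = torch.device("cpu")
    return ParallelEnv(comm), device
```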

examples/Makefile

Lines changed: 2 additions & 0 deletions

@@ -61,4 +61,6 @@ ptest8:
 
 clean:
 	rm -rf ${OUTPUT_DIR}
+	cd Pytorch_DDP && make clean
+	cd MNIST && make clean
 
