|
1 |
| -# PnetCDF-python MNIST example |
| 1 | +# MNIST Example Using PnetCDF-Python to Read Input Data |
2 | 2 |
|
3 |
| -This directory contains the description and run instructions for the MNIST example Python programs that utilize PnetCDF for file I/O and parallel training with MNIST data. |
4 |
| - |
5 |
| -## Directory Structure |
6 |
| - |
7 |
| -- **MNIST_data**: This folder contains a mini MNIST test dataset stored in a NetCDF file (`mnist_images_mini.nc`). The file includes: |
8 |
| - - 60 training samples |
9 |
| - - 12 testing samples |
10 |
| - |
11 |
| -- **MNIST_codes**: This folder contains the example MNIST training code. The example code is based on the [PyTorch MNIST example](https://github.com/pytorch/examples/tree/main/mnist) and uses `DistributedDataParallel` for parallel training. |
| 3 | +This directory contains files for running the PyTorch example program, |
| 4 | +[MNIST](https://github.com/pytorch/examples/tree/main/mnist), |
| 5 | +using the PyTorch module `DistributedDataParallel` for parallel training and |
| 6 | +`PnetCDF-Python` for reading data from a NetCDF file. |
12 | 7 |
|
| 8 | +--- |
13 | 9 | ## Running the MNIST Example Program
|
14 | 10 |
|
15 |
| -To run the MNIST example program, use the `mpiexec` command. The example below runs the program on 4 MPI processes. |
16 |
| - |
17 |
| -### Command: |
18 |
| - |
19 |
| -```sh |
20 |
| -mpiexec -n 4 python main.py |
21 |
| -``` |
22 |
| - |
23 |
| -### Expected Output: |
24 |
| - |
25 |
| -When using 4 MPI processes, the output is expected to be similar to the following: |
26 |
| - |
27 |
| -```sh |
28 |
| -nprocs = 4 rank = 0 device = cpu mpi_size = 4 mpi_rank = 0 |
29 |
| -nprocs = 4 rank = 2 device = cpu mpi_size = 4 mpi_rank = 2 |
30 |
| -nprocs = 4 rank = 1 device = cpu mpi_size = 4 mpi_rank = 1 |
31 |
| -nprocs = 4 rank = 3 device = cpu mpi_size = 4 mpi_rank = 3 |
32 |
| - |
33 |
| -Train Epoch: 1 Average Loss: 2.288340 |
34 |
| -Test set: Average loss: 2.7425, Accuracy: 0/12 (0%) |
35 |
| - |
36 |
| -Train Epoch: 2 Average Loss: 2.490800 |
37 |
| -Test set: Average loss: 1.9361, Accuracy: 6/12 (50%) |
38 |
| - |
39 |
| -Train Epoch: 3 Average Loss: 2.216520 |
40 |
| -Test set: Average loss: 1.8703, Accuracy: 7/12 (58%) |
41 |
| -``` |
42 |
| - |
| 11 | +* First, run the commands below to generate the Python program file and the NetCDF file. |
| 12 | + ```sh |
| 13 | + make mnist_main.py |
| 14 | + make mnist_images.nc |
| 15 | + ``` |
| 16 | +* Run the command below to train the model using 4 MPI processes. |
| 17 | + ```sh |
| 18 | + mpiexec -n 4 python mnist_main.py --batch-size 4 --test-batch-size 2 --epochs 3 --input-file mnist_images.nc |
| 19 | + ``` |
| 20 | + |
| 21 | +## Testing |
| 22 | +* Command `make check` does the following: |
| 23 | + + Downloads the Python source file |
| 24 | + [main.py](https://github.com/pytorch/examples/blob/main/mnist/main.py) |
| 25 | + from [PyTorch Examples](https://github.com/pytorch/examples) as file |
| 26 | + `mnist_main.py`. |
| 27 | + + Applies patch file [mnist.patch](./mnist.patch) to `mnist_main.py`. |
| 28 | + + Downloads the MNIST data sets from []() |
| 29 | + + Runs the utility program [create_mnist_netcdf.py](./create_mnist_netcdf.py) |
| 30 | + to extract a subset of images into a NetCDF file. |
| 31 | + + Runs the training program `mnist_main.py`. |
| 32 | + |
| 33 | +* Example test output shown on screen: |
| 34 | + ``` |
| 35 | + ===================================================================== |
| 36 | + examples/MNIST: Parallel testing on 4 MPI processes |
| 37 | + ====================================================================== |
| 38 | + Train Epoch: 1 [0/60 (0%)] Loss: 2.514259 |
| 39 | + Train Epoch: 1 [10/60 (67%)] Loss: 1.953820 |
| 40 | +
|
| 41 | + Test set: Average loss: 2.2113, Accuracy: 4/12 (33%) |
| 42 | +
|
| 43 | + Train Epoch: 2 [0/60 (0%)] Loss: 2.359334 |
| 44 | + Train Epoch: 2 [10/60 (67%)] Loss: 2.092178 |
| 45 | +
|
| 46 | + Test set: Average loss: 1.4825, Accuracy: 6/12 (50%) |
| 47 | +
|
| 48 | + Train Epoch: 3 [0/60 (0%)] Loss: 2.067438 |
| 49 | + Train Epoch: 3 [10/60 (67%)] Loss: 0.010670 |
| 50 | +
|
| 51 | + Test set: Average loss: 1.2531, Accuracy: 7/12 (58%) |
| 52 | + ``` |
| 53 | + |
| 54 | +## mnist_main.py command-line options |
| 55 | + ``` |
| 56 | + -h, --help show this help message and exit |
| 57 | + --batch-size N input batch size for training (default: 64) |
| 58 | + --test-batch-size N input batch size for testing (default: 1000) |
| 59 | + --epochs N number of epochs to train (default: 14) |
| 60 | + --lr LR learning rate (default: 1.0) |
| 61 | + --gamma M Learning rate step gamma (default: 0.7) |
| 62 | + --no-cuda disables CUDA training |
| 63 | + --no-mps disables macOS GPU training |
| 64 | + --dry-run quickly check a single pass |
| 65 | + --seed S random seed (default: 1) |
| 66 | + --log-interval N how many batches to wait before logging training status |
| 67 | + --save-model For Saving the current Model |
| 68 | + --input-file INPUT_FILE |
| 69 | + NetCDF file storing train and test samples |
| 70 | + ``` |
| 71 | + |
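The option list above can be illustrated with a small `argparse` sketch. This is not the code in `mnist_main.py`; it only declares three of the listed options to show how the command line from the run instructions would be parsed. Making `--input-file` required is an assumption made here for illustration.

```python
import argparse

# Sketch only: declares a subset of the options listed above.
parser = argparse.ArgumentParser(description="MNIST option parsing (sketch)")
parser.add_argument("--batch-size", type=int, default=64, metavar="N",
                    help="input batch size for training (default: 64)")
parser.add_argument("--epochs", type=int, default=14, metavar="N",
                    help="number of epochs to train (default: 14)")
# Assumption: the option is treated as required in this sketch.
parser.add_argument("--input-file", type=str, required=True,
                    help="NetCDF file storing train and test samples")

# Parse the same values used in the run command of this README.
args = parser.parse_args(
    ["--batch-size", "4", "--epochs", "3", "--input-file", "mnist_images.nc"]
)
print(args.batch_size, args.epochs, args.input_file)
```

Note that `argparse` converts the dashes in option names to underscores, so `--input-file` is read back as `args.input_file`.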
| 72 | +## create_mnist_netcdf.py command-line options |
| 73 | + ``` |
| 74 | + -h, --help show this help message and exit |
| 75 | + --verbose Verbose mode |
| 76 | + --train-size N Number of training samples extracted from the input file (default: 60) |
| 77 | + --test-size N Number of testing samples extracted from the input file (default: 12) |
| 78 | + --train-data-file TRAIN_DATA_FILE |
| 79 | + (Optional) input file name of training data |
| 80 | + --train-label-file TRAIN_LABEL_FILE |
| 81 | + (Optional) input file name of training labels |
| 82 | + --test-data-file TEST_DATA_FILE |
| 83 | + (Optional) input file name of testing data |
| 84 | + --test-label-file TEST_LABEL_FILE |
| 85 | + (Optional) input file name of testing labels |
| 86 | + ``` |
| 87 | + |
| 88 | +--- |
| 89 | +## Files in this directory |
| 90 | +* [mnist.patch](./mnist.patch) -- |
| 91 | + a patch file to be applied to |
| 92 | + [main.py](https://github.com/pytorch/examples/blob/main/mnist/main.py) |
| 93 | + after it is downloaded from [PyTorch Examples](https://github.com/pytorch/examples) |
| 94 | + and before running the model training. |
| 95 | + |
| 96 | +* [comm_file.py](./comm_file.py) -- |
| 97 | + implements the parallel environment used to train the model across MPI processes. |
| 98 | + |
| 99 | +* [pnetcdf_io.py](./pnetcdf_io.py) -- |
| 100 | + implements the file I/O using PnetCDF-Python. |
| 101 | + |
| 102 | +* [create_mnist_netcdf.py](./create_mnist_netcdf.py) -- |
| 103 | + a utility Python program that reads the MNIST files, extracts a subset of the |
| 104 | + samples, and stores them in a newly created file in NetCDF format. |
| 105 | + |
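To make the "extract a subset of the samples" step concrete, here is a minimal sketch of reading the first few images from a buffer in the raw IDX format used by the original MNIST distribution (big-endian header: magic number 2051, image count, rows, columns, followed by unsigned-byte pixels). It is an assumption that the input is uncompressed IDX. The helper name `read_idx_images` is hypothetical, the actual `create_mnist_netcdf.py` may read its input differently, and writing the result to NetCDF is omitted.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions


def read_idx_images(buf, count):
    """Return the first `count` images (as flat byte strings) and their shape.

    Hypothetical helper for illustration; `buf` holds a whole IDX image file.
    """
    magic, total, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    if count > total:
        raise ValueError(f"requested {count} images, file holds {total}")
    size = rows * cols
    images = [buf[16 + i * size: 16 + (i + 1) * size] for i in range(count)]
    return images, (rows, cols)


# Synthetic stand-in for a real file: 3 images of 2x2 pixels.
header = struct.pack(">IIII", IDX_IMAGE_MAGIC, 3, 2, 2)
pixels = bytes(range(12))
images, shape = read_idx_images(header + pixels, count=2)
print(len(images), shape)
```

With the real data, `count` would play the role of `--train-size` or `--test-size`.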
| 106 | +--- |
43 | 107 | ### Notes:
|
44 | 108 | - The test set accuracy may vary slightly depending on how the data is distributed across the MPI processes.
|
45 | 109 | - The accuracy and loss reported after each epoch are averaged across all MPI processes.
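
The two notes above can be illustrated without MPI. The sketch below simulates 4 processes in plain Python: a round-robin split of the 60 training samples (one common partitioning scheme, assumed here rather than taken from the example code), followed by averaging one value per process. The per-rank losses are made-up numbers, and both function names are hypothetical. In a real run the average would typically be computed with an MPI reduction instead of a Python list.

```python
def partition_indices(num_samples, nprocs, rank):
    """Round-robin split: rank r takes samples r, r + nprocs, r + 2*nprocs, ..."""
    return list(range(rank, num_samples, nprocs))


def average_across_ranks(per_rank_values):
    """Average one value per process, as a reduction followed by a divide would."""
    return sum(per_rank_values) / len(per_rank_values)


nprocs = 4
shards = [partition_indices(60, nprocs, rank) for rank in range(nprocs)]
shard_sizes = [len(shard) for shard in shards]

# Made-up per-process test losses, for illustration only.
per_rank_loss = [1.0, 2.0, 3.0, 2.0]
mean_loss = average_across_ranks(per_rank_loss)
print(shard_sizes, mean_loss)
```

Because each process sees a different shard, its local loss and accuracy differ, which is why the reported values depend on how the samples are distributed.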
|
|