|
| 1 | + |
| 2 | +# Convolutional neural network training scripts |
| 3 | + |
| 4 | +This script implements a number of popular CNN models and demonstrates |
| 5 | +efficient single-node training on multi-GPU systems. It can be used for |
| 6 | +benchmarking, training and evaluation of models. |
| 7 | + |
| 8 | +Uber's Horovod data-parallel framework is used for parallelization. |
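To make the data-parallel idea concrete, here is a minimal pure-Python sketch of the gradient averaging that an allreduce step performs across workers. It does not call Horovod; the function name `allreduce_average` and the toy gradient values are hypothetical and exist only for illustration.

```python
# Toy illustration of data-parallel gradient averaging (not Horovod's API).
# Each worker computes gradients on its own shard of the batch; an allreduce
# then leaves every worker holding the element-wise mean of all gradients.

def allreduce_average(per_worker_grads):
    """Return the element-wise mean of equally sized gradient lists."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


# Hypothetical gradients from 4 workers for a 3-parameter model.
grads = [
    [0.5, -1.0, 2.0],
    [1.5, -3.0, 2.0],
    [0.5, 1.0, 4.0],
    [1.5, -1.0, 0.0],
]
averaged = allreduce_average(grads)
print(averaged)
```

Because every worker applies the same averaged gradient, the model replicas stay in sync after each step.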
| 9 | + |
| 10 | +## ImageNet data preprocessing |
| 11 | + |
| 12 | +See [this file](data_preprocessing/README.md) for instructions on downloading |
| 13 | +and preprocessing the ImageNet data set. |
| 14 | + |
| 15 | +## ResNet50 training example |
| 16 | + |
| 17 | +The following command initiates training of the ResNet50 model distributed |
| 18 | +across 8 GPUs using fp16 arithmetic. We assume ImageNet is saved in TFRecord |
| 19 | +format at `/data/imagenet_tfrecord`. |
| 20 | + |
| 21 | +``` |
| 22 | + $ mpiexec -np 8 python nvcnn_hvd.py \ |
| 23 | + --model=resnet50 \ |
| 24 | + --data_dir=/data/imagenet_tfrecord \ |
| 25 | + --batch_size=256 \ |
| 26 | + --fp16 \ |
| 27 | + --larc_mode=clip \ |
| 28 | + --larc_eta=0.003 \ |
| 29 | + --loss_scale=128 \ |
| 30 | + --log_dir=./checkpoint-dir \ |
| 31 | + --save_interval=3600 \ |
| 32 | + --num_epochs=90 \ |
| 33 | + --display_every=100 \ |
| 34 | + --learning_rate=2.0 |
| 35 | +``` |
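The arithmetic implied by these flags can be sketched as follows. This is a rough estimate, not output from the script: it assumes `--batch_size` is the per-GPU batch size and that a partial final batch counts as one step, neither of which is stated here. The training-set size is the standard ImageNet-1k (ILSVRC 2012) figure.

```python
import math

# Assumption: --batch_size is the per-GPU (per-process) batch size.
NUM_GPUS = 8               # from `mpiexec -np 8`
BATCH_PER_GPU = 256        # from `--batch_size=256`
NUM_EPOCHS = 90            # from `--num_epochs=90`
TRAIN_IMAGES = 1_281_167   # ImageNet-1k (ILSVRC 2012) training set size

global_batch = NUM_GPUS * BATCH_PER_GPU
# Assumption: a partial final batch still counts as one step.
steps_per_epoch = math.ceil(TRAIN_IMAGES / global_batch)
total_steps = steps_per_epoch * NUM_EPOCHS

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
```

Under these assumptions the run processes 2048 images per step.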
| 36 | + |
| 37 | +After training, the network should achieve a top-1 accuracy of around 75.5% on |
| 38 | +the validation data set. |
| 39 | + |
| 40 | +## Inception V3 training example |
| 41 | + |
| 42 | +The following command initiates training of the Inception V3 model distributed |
| 43 | +across 8 GPUs using fp16 arithmetic. We assume ImageNet is saved in TFRecord |
| 44 | +format at `/data/imagenet_tfrecord`. |
| 45 | + |
| 46 | +``` |
| 47 | + $ mpiexec -np 8 python nvcnn_hvd.py \ |
| 48 | + --model=inception3 \ |
| 49 | + --data_dir=/data/imagenet_tfrecord \ |
| 50 | + --batch_size=128 \ |
| 51 | + --fp16 \ |
| 52 | + --larc_mode=clip \ |
| 53 | + --larc_eta=0.003 \ |
| 54 | + --loss_scale=128 \ |
| 55 | + --log_dir=./checkpoint-dir \ |
| 56 | + --save_interval=3600 \ |
| 57 | + --num_epochs=90 \ |
| 58 | + --display_every=100 \ |
| 59 | + --learning_rate=1.0 |
| 60 | +``` |
| 61 | + |
| 62 | +## Evaluating accuracy with the validation set |
| 63 | + |
| 64 | +Model parameters are stored in FP32 precision when training with either FP32 or |
| 65 | +FP16 arithmetic. Thus the `--fp16` flag is not needed for eval jobs. Also, |
| 66 | +evaluation is performed on a single GPU. The following command performs |
| 67 | +evaluation of a trained model. |
| 68 | + |
| 69 | +``` |
| 70 | + $ python nvcnn_hvd.py --model=resnet50 \ |
| 71 | + --data_dir=/data/imagenet_tfrecord \ |
| 72 | + --batch_size=256 \ |
| 73 | + --log_dir=./checkpoint-dir \ |
| 74 | + --eval |
| 75 | +``` |
| 76 | + |
| 77 | +After training, ResNet50 and Inception V3 should achieve top-1 accuracies of |
| 78 | +75.5% and 77.8%, respectively, on the ImageNet validation set. |
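For reference, top-1 accuracy is the fraction of images whose highest-scoring class matches the true label. The sketch below shows the metric on made-up scores; `top1_accuracy` is a hypothetical helper and is not taken from `nvcnn_hvd.py`.

```python
def top1_accuracy(scores, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Made-up class scores for 4 images over 3 classes.
scores = [
    [0.1, 0.7, 0.2],   # predicts class 1
    [0.8, 0.1, 0.1],   # predicts class 0
    [0.2, 0.3, 0.5],   # predicts class 2
    [0.4, 0.5, 0.1],   # predicts class 1
]
labels = [1, 0, 1, 1]
print(top1_accuracy(scores, labels))
```

Here three of the four predictions match their labels, so the function returns 0.75.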
| 79 | + |
| 80 | +## Notes |
| 81 | + |
| 82 | +With the `--fp16` flag the model is trained using 16-bit floating-point |
| 83 | +operations. This provides optimized performance on the Tensor Cores of Volta GPUs. |
| 84 | +For more information on training with FP16 arithmetic see |
| 85 | +[Training with Mixed Precision](http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). |
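The `--loss_scale=128` flag in the training commands relates to a known FP16 limitation: very small gradient values round to zero in half precision. The sketch below illustrates the effect with Python's standard `struct` module, which supports the IEEE 754 half-precision format. The gradient value is hypothetical, and this is not how the training script itself applies the scale.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


LOSS_SCALE = 128.0     # matches --loss_scale=128 in the training commands
tiny_gradient = 1e-8   # hypothetical gradient, too small for FP16

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(tiny_gradient)

# Scaled before the FP16 round trip, it survives as a subnormal value.
scaled = to_fp16(tiny_gradient * LOSS_SCALE)
recovered = scaled / LOSS_SCALE   # unscaling done in higher precision

print(unscaled)
print(recovered)
```

The recovered value is nonzero and within a few percent of the original, whereas the unscaled value is lost entirely.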
| 86 | + |
| 87 | +If executing the training command above as root (for example in a Docker |
| 88 | +container), `mpiexec` requires an additional `--allow-run-as-root` flag. |
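One way to handle the root case in a launcher script is sketched below. `build_mpiexec_command` is a hypothetical helper, not part of this repository, and it only assembles the argument list; running it is left to the caller.

```python
import shlex


def build_mpiexec_command(num_procs, script_args, as_root=False):
    """Assemble the mpiexec command line as a list of arguments."""
    command = ["mpiexec"]
    if as_root:
        # Flag named in the note above; needed when launching as root.
        command.append("--allow-run-as-root")
    command += ["-np", str(num_procs), "python", "nvcnn_hvd.py"]
    command += list(script_args)
    return command


args = ["--model=resnet50", "--data_dir=/data/imagenet_tfrecord", "--fp16"]
root_command = build_mpiexec_command(8, args, as_root=True)
print(shlex.join(root_command))
```

On Unix-like systems, `as_root` could be set from `os.geteuid() == 0`, and the resulting list passed to `subprocess.run`.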
| 89 | + |