
Commit 2fdaec9

Add TensorFlow examples
1 parent 36dfac2 commit 2fdaec9

15 files changed

Lines changed: 75511 additions & 0 deletions

.gitmodules

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
[submodule "TensorFlow/OpenSeq2Seq"]
2+
path = TensorFlow/OpenSeq2Seq
3+
url = https://github.com/NVIDIA/OpenSeq2Seq
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
2+
# Convolutional neural network training scripts
3+
4+
This script implements a number of popular CNN models and demonstrates
5+
efficient single-node training on multi-GPU systems. It can be used for
6+
benchmarking, training and evaluation of models.
7+
8+
Uber's Horovod data-parallel framework is used for parallelization.
9+
10+
## ImageNet data preprocessing
11+
12+
See [this file](data_preprocessing/README.md) for instructions on downloading
13+
and preprocessing the ImageNet data set.
14+
15+
## ResNet50 training example
16+
17+
The following command initiates training of the ResNet50 model distributed
18+
across 8 GPUs using fp16 arithmetic. We assume ImageNet is saved in TFRecord
19+
format at /data/imagenet_tfrecord.
20+
21+
```
$ mpiexec -np 8 python nvcnn_hvd.py \
    --model=resnet50 \
    --data_dir=/data/imagenet_tfrecord \
    --batch_size=256 \
    --fp16 \
    --larc_mode=clip \
    --larc_eta=0.003 \
    --loss_scale=128 \
    --log_dir=./checkpoint-dir \
    --save_interval=3600 \
    --num_epochs=90 \
    --display_every=100 \
    --learning_rate=2.0
```
36+
37+
After training, the network should achieve a top-1 accuracy of around 75.5% on
38+
the validation data set.
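
As a rough sanity check on the command above, the sketch below works out the global batch size and the number of optimizer steps it implies. It assumes that `--batch_size` is the per-GPU (per-process) batch size and that the training split is the standard ILSVRC-2012 set of 1,281,167 images. Neither is stated in this README, so treat both as assumptions.

```python
import math

# Assumptions (not stated in this README): --batch_size is per GPU, and the
# training split is ILSVRC-2012 with 1,281,167 images.
NUM_GPUS = 8                  # from "mpiexec -np 8"
PER_GPU_BATCH = 256           # from "--batch_size=256"
NUM_EPOCHS = 90               # from "--num_epochs=90"
NUM_TRAIN_IMAGES = 1_281_167  # assumed dataset size

# With data parallelism, every process handles its own batch each step.
global_batch = NUM_GPUS * PER_GPU_BATCH
steps_per_epoch = math.ceil(NUM_TRAIN_IMAGES / global_batch)
total_steps = steps_per_epoch * NUM_EPOCHS

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
```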
39+
40+
## Inception V3 training example
41+
42+
The following command initiates training of the Inception V3 model distributed
43+
across 8 GPUs using fp16 arithmetic. We assume ImageNet is saved in TFRecord
44+
format at /data/imagenet_tfrecord.
45+
46+
```
$ mpiexec -np 8 python nvcnn_hvd.py \
    --model=inception3 \
    --data_dir=/data/imagenet_tfrecord \
    --batch_size=128 \
    --fp16 \
    --larc_mode=clip \
    --larc_eta=0.003 \
    --loss_scale=128 \
    --log_dir=./checkpoint-dir \
    --save_interval=3600 \
    --num_epochs=90 \
    --display_every=100 \
    --learning_rate=1.0
```
61+
62+
## Evaluating accuracy with the validation set
63+
64+
Model parameters are stored in FP32 precision when training with either FP32 or
65+
FP16 arithmetic. Thus the `--fp16` flag is not needed for eval jobs. Also,
66+
evaluation is performed on a single GPU. The following command performs
67+
evaluation of a trained model.
68+
69+
```
$ python nvcnn_hvd.py --model=resnet50 \
    --data_dir=/data/imagenet_tfrecord \
    --batch_size=256 \
    --log_dir=./checkpoint-dir \
    --eval
```
76+
77+
After training, ResNet50 and Inception V3 should achieve top-1 accuracies of
78+
75.5% and 77.8%, respectively, on the ImageNet validation set.
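
As background on why parameters are kept in FP32 even when the arithmetic is FP16, the sketch below emulates half and single precision with Python's `struct` module. It is only an illustration of rounding behaviour, not code from `nvcnn_hvd.py`; the weight and update values are hypothetical.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical small update (learning rate times gradient)

# Emulate "weight += update" with the result stored in each precision.
fp16_weight = round_to_fp16(round_to_fp16(weight) + round_to_fp16(update))
fp32_weight = round_to_fp32(round_to_fp32(weight) + round_to_fp32(update))

print("fp16 weight after update:", fp16_weight)  # the update is rounded away
print("fp32 weight after update:", fp32_weight)  # the update is retained
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so an update of 0.0001 is lost entirely, while FP32 storage keeps it.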
79+
80+
## Notes
81+
82+
With the `--fp16` flag the model is trained using 16-bit floating-point
83+
operations. This provides optimized performance on Volta's Tensor Cores.
84+
For more information on training with FP16 arithmetic see
85+
[Training with Mixed Precision](http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
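
The training commands above also pass `--loss_scale=128`. The sketch below shows why a static loss scale helps in FP16, assuming the usual recipe: multiply the loss (and therefore, by the chain rule, every gradient) by the scale before the backward pass, then divide the gradients by the same factor in higher precision. The script's actual implementation is not shown in this README, and the gradient value here is hypothetical.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 128.0  # from "--loss_scale=128"
true_grad = 1e-8    # hypothetical tiny gradient value

# Without scaling, the gradient is below FP16's smallest representable
# magnitude and is flushed to zero.
unscaled = round_to_fp16(true_grad)

# With scaling, the same gradient lands in FP16's representable range.
scaled = round_to_fp16(true_grad * LOSS_SCALE)

# Unscale outside FP16 (for example in FP32) before the weight update.
recovered = scaled / LOSS_SCALE

print("fp16 gradient without scaling:", unscaled)
print("gradient recovered after scaling:", recovered)
```

The recovered value is not exact, because the scaled gradient is still rounded to FP16, but it is no longer zero.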
86+
87+
If executing the training command above as root (for example in a Docker
88+
container), `mpiexec` requires an additional `--allow-run-as-root` flag.
89+
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
ImageNet Dataset Build Scripts
2+
==========================
3+
4+
What is this?
5+
------------------
6+
7+
This directory includes all the scripts necessary to build the ImageNet dataset in a sharded protobuf representation, which is recommended by the TensorFlow team for performance reasons. The protobuf files will include a large number of JPEG images in one file, along with the image metadata (image class, bounding boxes, etc.). This will ensure good performance on both SSDs and magnetic hard drives. The protobufs will contain TFRecord data types, which are standard for TensorFlow.
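
To make the idea of sharding concrete, here is a minimal sketch of how a list of image files could be divided into contiguous shards, each of which would become one protobuf file. The helper names and the file-naming pattern are hypothetical; the scripts in this directory may split and name their shards differently.

```python
def shard_ranges(num_files: int, num_shards: int) -> list[tuple[int, int]]:
    """Split range(num_files) into num_shards contiguous, near-equal slices."""
    bounds = [i * num_files // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_shards)]

def shard_name(split: str, index: int, num_shards: int) -> str:
    """Hypothetical naming pattern, e.g. 'train-00003-of-01024'."""
    return f"{split}-{index:05d}-of-{num_shards:05d}"

# Toy example: 10 files spread over 4 shards.
ranges = shard_ranges(num_files=10, num_shards=4)
names = [shard_name("train", i, 4) for i in range(4)]

for name, (start, end) in zip(names, ranges):
    print(f"{name}: files[{start}:{end}]")
```

Each shard is read sequentially at training time, which is what avoids the per-file seek cost of reading millions of small JPEGs.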
8+
9+
10+
Performance considerations
11+
----------------------------------------
12+
13+
This script is largely based on TensorFlow's ImageNet preprocessing script for the Inception v3 model (see [here](https://github.com/tensorflow/models/tree/master/inception/inception/data)). The advantages of a dataset preprocessed in this fashion are listed below.
14+
15+
1. Protobufs containing many JPEGs are much faster to process than reading raw JPEGs, especially on magnetic disks, by avoiding seek time. This also tends to help on SSDs, because of sequential reads, which are still a bit faster than random reads. This was the case in the original TensorFlow preprocessing script.
16+
2. This version of the preprocessing scripts is independent of [Bazel](https://bazel.build/), Google's build tool. The Bazel requirement to "build" the Python and shell scripts is unnecessary and is a heavyweight step that has been avoided here. The scripts work the same way as the public Google scripts, but one can run them immediately without needing to set up Bazel or go through incantations like ```bazel build inception/download_and_preprocess_imagenet``` (see [here](https://github.com/tensorflow/models/tree/master/inception)). This is a modification to the original Google script.
17+
3. Pre-resizing while building the dataset is essential for good performance. The speedup from uniformly sized images can be significant relative to the original ImageNet images when running AlexNet-OWT (the gains are smaller for compute-heavy models such as Inception v3 and ResNet-50). Note that the current implementation does not preserve the aspect ratio while creating uniformly sized images. An alternative would be to resize while preserving the aspect ratio, then crop. This script is meant to help the user with preprocessing the dataset for performance reasons, but it may need to be tweaked to provide the best machine learning results. It's open-source so the user can modify it to their liking. This is a modification to the original Google script.
18+
4. Efficient storage - the original JPEGs are stored with a [quality factor](https://en.wikipedia.org/wiki/JPEG) of 100, but the color distortions tend not to happen until Q drops below 85. These scripts store the images with a Q factor of 90 by default, which reduces the image size by 75% while causing minimal color distortions of no consequence for convolutional neural network training. The impact of this is very significant due to a reduction in I/O load, particularly in a multi-GPU setting, when more total disk accesses need to take place to feed more than one GPU. This is a modification to the original Google script.
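
Item 3 mentions resizing while preserving the aspect ratio and then cropping as an alternative to the current behaviour. The sketch below computes only the geometry of that alternative (the intermediate size and the centre-crop box) next to the distortion introduced by squashing straight to a square. The function names are hypothetical and no image library is involved; it is not part of the scripts in this directory.

```python
def resize_then_center_crop(width: int, height: int, target: int):
    """Scale the short side to `target`, then take a centred target x target crop.

    Returns the intermediate (width, height) and the crop box as
    (left, top, right, bottom) in intermediate-image coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def stretch_factor(width: int, height: int) -> float:
    """Aspect-ratio distortion when an image is squashed directly to a square."""
    return max(width, height) / min(width, height)

# Toy example: a 500x375 image reduced to 256x256.
size, box = resize_then_center_crop(500, 375, 256)
print("intermediate size:", size)
print("crop box:", box)
print("distortion if squashed instead:", stretch_factor(500, 375))
```

The trade-off is that cropping discards pixels at the edges of the long side, whereas squashing keeps every pixel but distorts shapes.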
19+
20+
----------
21+
22+
How to run the scripts?
23+
--------------------------------
24+
25+
1. Create an ImageNet account at http://image-net.org. You will need a user ID
26+
and the access key provided upon registration.
27+
2. Run the download-imagenet.sh script. You will be asked for your ImageNet user ID, ImageNet password, and the directory in which to store the dataset. If the tarballs containing the dataset are already in the target directory, re-running this script will not download them again. However, the tarballs will still be unpacked again, so if you have already run this script, don't run it again.
28+
3. Run the generate_tfrecord_protos.sh script. You will be asked about the location of the files downloaded in step 2, as well as the directory in which to store the protobuf files to be used by TensorFlow. Additional questions will pertain to whether original or pre-resized images are to be stored (it is strongly recommended that pre-resizing be chosen), the height and width of the images after resizing (if pre-resizing is chosen), and the JPEG Q factor (Q=90 is recommended).
29+
30+
> **Note:**
31+
32+
> Running this script requires a lot of memory. Make sure you have at least 16 GB RAM free on your machine, and preferably 32 GB. This is due to a bug in TensorFlow that is currently being investigated. It has nothing to do with the script, but rather with TensorFlow core. It shows up only with datasets containing millions of files. Once the ImageNet dataset is reduced to a few thousand files (as it will be after these scripts are run, because many JPEGs are stored in each protobuf), we could no longer reproduce it while training models.
33+
34+
> If the script fails, examine the dmesg output - it likely failed due to an out-of-memory error. If that was the case, try freeing up memory. There is an included Python script called purge_mem_caches.py, which can be run on the host. This usually helps fix things that can't be resolved by just shutting down applications, such as purging virtual memory pages left by previous runs of TensorFlow that weren't cleaned up by either the application or the OS. Note that running this script in the container itself, rather than the host, won't have any effect. Note also that this script has a dependency on psutil and pexpect, Python packages that can be installed using pip.
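
Before starting the conversion, it may be worth checking how much memory is actually available. The sketch below parses `/proc/meminfo`-style text and compares the `MemAvailable` figure against the 16 GB minimum from the note above. It is a hypothetical helper, not part of the included scripts; it assumes a Linux host and treats the threshold as GiB.

```python
def available_gib(meminfo_text: str) -> float:
    """Return the MemAvailable value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found")

def enough_memory(meminfo_text: str, required_gib: float = 16) -> bool:
    """True if the available memory meets the required amount."""
    return available_gib(meminfo_text) >= required_gib

# Sample text standing in for the contents of /proc/meminfo.
sample = (
    "MemTotal:       65843012 kB\n"
    "MemFree:         1203456 kB\n"
    "MemAvailable:   20971520 kB\n"
)
print("available GiB:", available_gib(sample))
print("meets 16 GiB minimum:", enough_memory(sample))
print("meets 32 GiB preference:", enough_memory(sample, required_gib=32))
```

On a real host, the text would come from `open("/proc/meminfo").read()` instead of the sample string.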
