Skip to content

Latest commit

 

History

History
414 lines (318 loc) · 16.5 KB

File metadata and controls

414 lines (318 loc) · 16.5 KB

Benchmarks

Overview

A selection of image classification models were tested across multiple platforms to create a point of reference for the TensorFlow community. The Methodology section details how the tests were executed and has links to the scripts used.

Results for image classification models

InceptionV3 (arXiv:1512.00567), ResNet-50 (arXiv:1512.03385), ResNet-152 (arXiv:1512.03385), VGG16 (arXiv:1409.1556), and AlexNet were tested using the ImageNet data set. Tests were run on Google Compute Engine, Amazon Elastic Compute Cloud (Amazon EC2), and an NVIDIA® DGX-1™. Most of the tests were run with both synthetic and real data. Testing with synthetic data was done by using a tf.Variable set to the same shape as the data expected by each model for ImageNet. We believe it is important to include real data measurements when benchmarking a platform. This load tests both the underlying hardware and the framework at preparing data for actual training. We start with synthetic data to remove disk I/O as a variable and to set a baseline. Real data is then used to verify that the TensorFlow input pipeline and the underlying disk I/O are saturating the compute units.

Training with NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)

Details and additional results are in the Details for NVIDIA® DGX-1™ (NVIDIA® Tesla® P100) section.

Training with NVIDIA® Tesla® K80

Details and additional results are in the Details for Google Compute Engine (NVIDIA® Tesla® K80) and Details for Amazon EC2 (NVIDIA® Tesla® K80) sections.

Distributed training with NVIDIA® Tesla® K80

Details and additional results are in the Details for Amazon EC2 Distributed (NVIDIA® Tesla® K80) section.

Compare synthetic with real data training

NVIDIA® Tesla® P100

NVIDIA® Tesla® K80

Details for NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)

Environment

  • Instance type: NVIDIA® DGX-1™
  • GPU: 8x NVIDIA® Tesla® P100
  • OS: Ubuntu 16.04 LTS with tests run via Docker
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: Local SSD
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3, ResNet-50, ResNet-152, and VGG16 were tested with a batch size of 32. Those results are in the other results section.

Options InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
Batch size per GPU 64 64 64 512 64
Optimizer sgd sgd sgd sgd sgd

Configuration used for each model.

Model variable_update local_parameter_device
InceptionV3 parameter_server cpu
ResNet50 parameter_server cpu
ResNet152 parameter_server cpu
AlexNet replicated (with NCCL) n/a
VGG16 replicated (with NCCL) n/a

Results

Training synthetic data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 142 219 91.8 2987 154
2 284 422 181 5658 295
4 569 852 356 10509 584
8 1131 1734 716 17822 1081

Training real data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 142 218 91.4 2890 154
2 278 425 179 4448 284
4 551 853 359 7105 534
8 1079 1630 708 N/A 898

Training AlexNet with real data on 8 GPUs was excluded from the graph and table above due to it maxing out the input pipeline.

Other Results

The results below are all with a batch size of 32.

Training synthetic data

GPUs InceptionV3 ResNet-50 ResNet-152 VGG16
1 128 195 82.7 144
2 259 368 160 281
4 520 768 317 549
8 995 1485 632 820

Training real data

GPUs InceptionV3 ResNet-50 ResNet-152 VGG16
1 130 193 82.4 144
2 257 369 159 253
4 507 760 317 457
8 966 1410 609 690

Details for Google Compute Engine (NVIDIA® Tesla® K80)

Environment

  • Instance type: n1-standard-32-k80x8
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1.7 TB Shared SSD persistent disk (800 MB/s)
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the other results section.

Options InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
Batch size per GPU 64 64 32 512 32
Optimizer sgd sgd sgd sgd sgd

The configuration used for each model was variable_update equal to parameter_server and local_parameter_device equal to cpu.

Results

Training synthetic data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 30.5 51.9 20.0 656 35.4
2 57.8 99.0 38.2 1209 64.8
4 116 195 75.8 2328 120
8 227 387 148 4640 234

Training real data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 30.6 51.2 20.0 639 34.2
2 58.4 98.8 38.3 1136 62.9
4 115 194 75.4 2067 118
8 225 381 148 4056 230

Other Results

Training synthetic data

GPUs InceptionV3 (batch size 32) ResNet-50 (batch size 32)
1 29.3 49.5
2 55.0 95.4
4 109 183
8 216 362

Training real data

GPUs InceptionV3 (batch size 32) ResNet-50 (batch size 32)
1 29.5 49.3
2 55.4 95.3
4 110 186
8 216 359

Details for Amazon EC2 (NVIDIA® Tesla® K80)

Environment

  • Instance type: p2.8xlarge
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1TB Amazon EFS (burst 100 MiB/sec for 12 hours, continuous 50 MiB/sec)
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the other results section.

Options InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
Batch size per GPU 64 64 32 512 32
Optimizer sgd sgd sgd sgd sgd

Configuration used for each model.

Model variable_update local_parameter_device
InceptionV3 parameter_server cpu
ResNet-50 replicated (without NCCL) gpu
ResNet-152 replicated (without NCCL) gpu
AlexNet parameter_server gpu
VGG16 parameter_server gpu

Results

Training synthetic data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 30.8 51.5 19.7 684 36.3
2 58.7 98.0 37.6 1244 69.4
4 117 195 74.9 2479 141
8 230 384 149 4853 260

Training real data

GPUs InceptionV3 ResNet-50 ResNet-152 AlexNet VGG16
1 30.5 51.3 19.7 674 36.3
2 59.0 94.9 38.2 1227 67.5
4 118 188 75.2 2201 136
8 228 373 149 N/A 242

Training AlexNet with real data on 8 GPUs was excluded from the graph and table above due to our EFS setup not providing enough throughput.

Other Results

Training synthetic data

GPUs InceptionV3 (batch size 32) ResNet-50 (batch size 32)
1 29.9 49.0
2 57.5 94.1
4 114 184
8 216 355

Training real data

GPUs InceptionV3 (batch size 32) ResNet-50 (batch size 32)
1 30.0 49.1
2 57.5 95.1
4 113 185
8 212 353

Details for Amazon EC2 Distributed (NVIDIA® Tesla® K80)

Environment

  • Instance type: p2.8xlarge
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1.0 TB EFS (burst 100 MB/sec for 12 hours, continuous 50 MB/sec)
  • DataSet: ImageNet
  • Test Date: May 2017

The batch size and optimizer used for the tests are listed in the table. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the other results section.

Options InceptionV3 ResNet-50 ResNet-152
Batch size per GPU 64 64 32
Optimizer sgd sgd sgd

Configuration used for each model.

Model variable_update local_parameter_device cross_replica_sync
InceptionV3 distributed_replicated n/a True
ResNet-50 distributed_replicated n/a True
ResNet-152 distributed_replicated n/a True

To simplify server setup, EC2 instances (p2.8xlarge) running worker servers also ran parameter servers. Equal numbers of parameter servers and worker servers were used with the following exceptions:

  • InceptionV3: 8 instances / 6 parameter servers
  • ResNet-50: (batch size 32) 8 instances / 4 parameter servers
  • ResNet-152: 8 instances / 4 parameter servers

Results

Training synthetic data

GPUs InceptionV3 ResNet-50 ResNet-152
1 29.7 52.4 19.4
8 229 378 146
16 459 751 291
32 902 1388 565
64 1783 2744 981

Other Results

Training synthetic data

GPUs InceptionV3 (batch size 32) ResNet-50 (batch size 32)
1 29.2 48.4
8 219 333
16 427 667
32 820 1180
64 1608 2315

Methodology

This script was run on the various platforms to generate the above results. @{$performance_models$High-Performance Models} details techniques in the script along with examples of how to execute the script.

In order to create results that are as repeatable as possible, each test was run 5 times and then the times were averaged together. GPUs are run in their default state on the given platform. For NVIDIA® Tesla® K80 this means leaving on GPU Boost. For each test, 10 warmup steps are done and then the next 100 steps are averaged.