
Commit 1bd98ac

mmarcinkiewicznv authored and kkudrynski committed
[RN50/MXNet] Release 22.10
1 parent eb35710 commit 1bd98ac


7 files changed

+45
-32
lines changed

MxNet/Classification/RN50v1.5/Dockerfile

Lines changed: 3 additions & 1 deletion

@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/mxnet:20.12-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/mxnet:22.10-py3
 
 FROM $FROM_IMAGE_NAME
 
@@ -7,4 +7,6 @@ WORKDIR /workspace/rn50
 COPY requirements.txt .
 RUN pip install -r requirements.txt
 
+ENV MXNET_CUDNN_AUTOTUNE_DEFAULT=0
+
 COPY . .

MxNet/Classification/RN50v1.5/README.md

Lines changed: 27 additions & 24 deletions
@@ -168,7 +168,7 @@ The following section lists the requirements that you need to meet in order to s
 
 This repository contains Dockerfile which extends the MXNet NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 - [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- [MXNet 20.12-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
+- [MXNet 22.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
 Supported GPUs:
 - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
 - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
@@ -585,18 +585,18 @@ The following sections provide details on how we achieved our performance and ac
 
 **90 epochs configuration**
 
-Our results were obtained by running 8 times the `./runner -n <number of gpus> -b 256 --dtype float32` script for TF32 and the `./runner -n <number of gpus> -b 256` script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+Our results were obtained by running 8 times the `./runner -n <number of gpus> -b 512 --dtype float32` script for TF32 and the `./runner -n <number of gpus> -b 512` script for mixed precision in the mxnet-22.10-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
 
 | **GPUs** | **Accuracy - mixed precision** | **Accuracy - TF32** | **Time to train - mixed precision** | **Time to train - TF32** | **Time to train - speedup** |
-|:---:|:---:|:---:|:---:|:---:|:---:|
-|1|77.185|77.184|14.6|31.26|2.13|
-|8|77.185|77.184|1.8|4.0|2.12|
+|:---:|:---:|:---:|:--:|:---:|:---:|
+|1|77.185|77.184|8.75|29.39|3.36|
+|8|77.185|77.184|1.14|3.82|3.35|
 
 ##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
 **90 epochs configuration**
 
-Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` training script for FP32 and the `./runner -n <number of gpus> -b 192` training script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` training script for FP32 and the `./runner -n <number of gpus> -b 192` training script for mixed precision in the mxnet-22.10-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
 
 | **GPUs** | **Accuracy - mixed precision** | **Accuracy - FP32** | **Time to train - mixed precision** | **Time to train - FP32** | **Time to train - speedup** |
 |:---:|:---:|:---:|:---:|:---:|:---:|
@@ -641,18 +641,17 @@ Here are example graphs of FP32 and mixed precision training on 8 GPU 250 epochs
 ##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
 
 The following results were obtained by running the
-`python benchmark.py -n 1,2,4,8 -b 256 --dtype float32 -o benchmark_report_tf32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for TF32 and the
-`python benchmark.py -n 1,2,4,8 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+`python benchmark.py -n 1,4,8 -b 512 --dtype float32 -o benchmark_report_tf32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for TF32 and the
+`python benchmark.py -n 1,4,8 -b 512 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision in the mxnet-22.10-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
 
 Training performance reported as Total IPS (data + compute time taken into account).
 Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
 
 | **GPUs** | **Throughput - mixed precision** | **Throughput - TF32** | **Throughput speedup (TF32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - TF32** |
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|1|2180 |1022 |2.18 |1.00 |1.00 |
-|2|4332 |2032 |2.13 |1.98 |1.98 |
-|4|8587 |4035 |2.12 |3.93 |3.94 |
-|8|16925|8001 |2.11 |7.76 |7.82 |
+|1|3410.52 |1055.78 |3.23 |1.00 |1.00 |
+|4|13442.66 |4182.30 |3.24 |3.97 |3.96 |
+|8|26673.72|8247.44 |3.23 |7.82 |7.81 |
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
@@ -693,23 +692,24 @@ Weak scaling is calculated as a ratio of speed for given number of GPUs to speed
 
 The following results were obtained by running the
 `python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
-`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_tf32.json -i 500 -e 3 -w 1 --mode val` script for TF32 in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_tf32.json -i 500 -e 3 -w 1 --mode val` script for TF32 in the mxnet-22.10-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
 
 Inference performance reported as Total IPS (data + compute time taken into account).
 Reported mixed precision speedups are relative to TF32 numbers for corresponding configuration.
 
 | **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| 1 | 463 | 1.72 | 2.15 | 1.72 | 2.10 | 1.58 | 2.23 | 1.58 | 2.39 | 1.56 | 2.94 | 1.79 |
-| 2 | 880 | 1.62 | 2.27 | 1.62 | 2.14 | 1.66 | 2.52 | 1.54 | 2.73 | 1.50 | 3.70 | 1.42 |
-| 4 | 1668| 1.76 | 2.39 | 1.76 | 2.21 | 1.86 | 2.70 | 1.66 | 3.30 | 1.44 | 5.72 | 1.01 |
-| 8 | 2522| 1.75 | 3.17 | 1.75 | 2.74 | 2.00 | 4.26 | 1.35 | 5.36 | 1.10 | 10.43| 0.65 |
-| 16 | 3704| 1.90 | 4.31 | 1.90 | 3.83 | 2.13 | 6.00 | 1.43 | 7.20 | 1.24 | 12.77| 0.85 |
-| 32 | 2964| 1.51 | 10.79| 1.51 | 10.45| 1.52 | 14.52| 1.37 | 16.07| 1.32 | 22.76| 1.21 |
-| 64 | 4547| 1.80 | 14.07| 1.80 | 13.75| 1.82 | 17.16| 1.67 | 19.04| 1.59 | 28.12| 1.28 |
-| 128 | 5530| 1.94 | 23.14| 1.94 | 23.63| 1.82 | 29.04| 1.71 | 32.75| 1.56 | 41.45| 1.34 |
-| 192 | 6198| 2.19 | 30.97| 2.19 | 31.02| 2.21 | 40.04| 1.81 | 44.03| 1.68 | 51.44| 1.51 |
-| 256 | 6120| 2.19 | 41.82| 2.19 | 42.01| 2.19 | 50.72| 1.89 | 55.09| 1.77 | 63.08| 1.60 |
+| 1 | 1431.99 | 1.9 | 0.7 | 1.9 | 0.68 | 1.95 | 0.71 | 1.9 | 0.84 | 1.65 | 0.88 | 1.7 |
+| 2 | 2530.66 | 2.19 | 0.79 | 2.19 | 0.74 | 2.31 | 0.86 | 2.05 | 0.93 | 2.0 | 2.0 | 0.97 |
+| 4 | 3680.74 | 2.11 | 1.09 | 2.11 | 0.92 | 2.49 | 1.21 | 1.98 | 1.64 | 1.51 | 6.03 | 0.45 |
+| 8 | 2593.88 | 1.11 | 3.08 | 1.11 | 2.89 | 1.17 | 4.09 | 0.89 | 4.72 | 0.8 | 9.85 | 0.55 |
+| 16 | 4340.08 | 1.52 | 3.69 | 1.52 | 3.31 | 1.68 | 4.73 | 1.24 | 6.3 | 0.95 | 12.31 | 0.54 |
+| 32 | 6808.22 | 2.1 | 4.7 | 2.1 | 4.0 | 2.46 | 6.44 | 1.58 | 9.01 | 1.15 | 15.88 | 0.68 |
+| 64 | 7659.96 | 2.21 | 8.36 | 2.21 | 7.44 | 2.48 | 10.76 | 1.75 | 13.91 | 1.37 | 21.96 | 0.9 |
+| 128 | 8017.67 | 2.23 | 15.96 | 2.23 | 15.0 | 2.37 | 18.95 | 1.9 | 21.65 | 1.67 | 30.36 | 1.23 |
+| 192 | 8240.8 | 2.26 | 23.3 | 2.26 | 22.49 | 2.33 | 25.65 | 2.07 | 27.54 | 1.94 | 37.19 | 1.5 |
+| 256 | 7909.62 | 2.15 | 32.37 | 2.15 | 31.66 | 2.2 | 34.27 | 2.05 | 37.02 | 1.9 | 42.83 | 1.66 |
+| 512 | 7213.43 | 2.07 | 70.98 | 2.07 | 70.48 | 2.08 | 73.21 | 2.04 | 74.38 | 2.03 | 79.15 | 1.99 |
 
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
@@ -771,7 +771,10 @@ Reported mixed precision speedups are relative to FP32 numbers for corresponding
 3. February, 2021
     * DGX-A100 performance results
     * Container version upgraded to 20.12
-
+4. December, 2022
+    * Container version upgraded to 22.10
+    * Updated the A100 performance results. V100 and T4 performance results reflect the performance using the 20.12 container
+
 
 ### Known Issues
 
MxNet/Classification/RN50v1.5/benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ def int_list(x):
     try:
 
         with open(log_file, 'r') as f:
-            lines = f.read().splitlines()
+            lines = [line for line in f.read().splitlines() if 'step' in line]
         log_data = [json.loads(line[5:]) for line in lines]
         epochs_report = list(filter(lambda x: len(x['step']) == 1, log_data))
 

MxNet/Classification/RN50v1.5/dali.py

Lines changed: 3 additions & 3 deletions
@@ -26,12 +26,12 @@ def add_dali_args(parser):
     group = parser.add_argument_group('DALI data backend', 'entire group applies only to dali data backend')
     group.add_argument('--dali-separ-val', action='store_true',
                        help='each process will perform independent validation on whole val-set')
-    group.add_argument('--dali-threads', type=int, default=4, help="number of threads" +\
+    group.add_argument('--dali-threads', type=int, default=6, help="number of threads" +\
                        "per GPU for DALI")
     group.add_argument('--dali-validation-threads', type=int, default=10, help="number of threads" +\
                        "per GPU for DALI for validation")
-    group.add_argument('--dali-prefetch-queue', type=int, default=2, help="DALI prefetch queue depth")
-    group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=64, help="Memory padding value for nvJPEG (in MB)")
+    group.add_argument('--dali-prefetch-queue', type=int, default=5, help="DALI prefetch queue depth")
+    group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=256, help="Memory padding value for nvJPEG (in MB)")
     group.add_argument('--dali-fuse-decoder', type=int, default=1, help="0 or 1 whether to fuse decoder or not")
 
     group.add_argument('--dali-nvjpeg-width-hint', type=int, default=5980, help="Width hint value for nvJPEG (in pixels)")

MxNet/Classification/RN50v1.5/fit.py

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ def should_end(self) -> bool:
         return bool(self.t[0] > 0)
 
     def _signal_handler(self, signum, frame):
-        print("Signal reveived")
+        print("Signal received")
         self.t[0] = 1
 
 
MxNet/Classification/RN50v1.5/runner

Lines changed: 9 additions & 1 deletion
@@ -73,11 +73,19 @@ os.environ['MXNET_EXEC_ENABLE_ADDTO'] = "1"
 os.environ['MXNET_USE_TENSORRT'] = "0"
 os.environ['MXNET_GPU_WORKER_NTHREADS'] = "2"
 os.environ['MXNET_GPU_COPY_NTHREADS'] = "1"
-os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = "54"
+os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = "60"
 os.environ['HOROVOD_CYCLE_TIME'] = "0.1"
 os.environ['HOROVOD_FUSION_THRESHOLD'] = "67108864"
 os.environ['MXNET_HOROVOD_NUM_GROUPS'] = "16"
 os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = "999"
 os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD'] = "25"
 
+os.environ['MXNET_ENABLE_CUDA_GRAPHS'] = "1"
+os.environ['MXNET_ASYNC_GPU_ENGINE'] = "1"
+os.environ['HOROVOD_ENABLE_ASYNC_COMPLETION'] = "1"
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = "0"
+os.environ['HOROVOD_BATCH_D2D_MEMCOPIES'] = "1"
+os.environ['HOROVOD_GROUPED_ALLREDUCES'] = "1"
+os.environ['OMP_NUM_THREADS'] = "1"
+
 os.execvp(command[0], command)
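The runner's pattern is: set the tuning knobs in `os.environ`, then replace the current process with the training command, which inherits them. A minimal sketch with a hypothetical child command (the real runner builds a train.py invocation):

```python
import os
import sys

# Export a subset of the new 22.10 knobs; os.execvp's replacement
# process inherits the current environment.
os.environ.update({
    'MXNET_ENABLE_CUDA_GRAPHS': "1",
    'MXNET_ASYNC_GPU_ENGINE': "1",
    'MXNET_CUDNN_AUTOTUNE_DEFAULT': "0",
    'OMP_NUM_THREADS': "1",
})

# Hypothetical stand-in for the training command: just echo one knob back.
command = [sys.executable, '-c',
           "import os; print(os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'])"]
os.execvp(command[0], command)  # never returns; the child prints 0
```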

MxNet/Classification/RN50v1.5/train.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 #
 # -----------------------------------------------------------------------
 #
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
