
Deploying the TFT model on Triton Inference Server

This folder contains instructions for deploying the model to run inference on Triton Inference Server, as well as a detailed performance analysis. The purpose of this document is to help you achieve the best inference performance.

Table of contents

  • Solution overview
  • Setup
  • Quick Start Guide
  • Performance

Solution overview

Introduction

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
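
For example, once the server is running, its HTTP endpoint can be probed with plain curl. This is a minimal sketch assuming Triton's default HTTP port 8000 and a model named `tft`; both are assumptions, not values fixed by this README.

```
# Check server readiness and fetch model metadata over Triton's
# standard KServe v2 HTTP routes (default port 8000).
# The model name "tft" is an assumption; substitute the deployed name.
curl -s localhost:8000/v2/health/ready
curl -s localhost:8000/v2/models/tft
```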

This README provides step-by-step deployment instructions for models generated during training (as described in the model README). Additionally, this README provides the corresponding deployment scripts that ensure optimal GPU utilization during inferencing on Triton Inference Server.

Deployment process

The deployment process consists of two steps:

  1. Conversion.

    The purpose of conversion is to find the best-performing model format supported by Triton Inference Server. Triton Inference Server uses a number of runtime backends, such as TensorRT, LibTorch, and ONNX Runtime, to support various model types. Refer to the Triton documentation for a list of available backends.

  2. Configuration.

    Configuration of the model on Triton Inference Server, which generates the necessary configuration files (a sketch of one such file is shown below).
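
To make the configuration step concrete, the sketch below writes a minimal Triton model configuration by hand. It mirrors the parameters reported in the tables that follow (max batch size 1024, two model instances, dynamic batching), but the model name `tft`, the repository path, and the exact file contents are assumptions; the provided scripts generate the real files.

```
# Hand-written sketch of a Triton model repository entry for a
# TensorRT plan; the deployment scripts generate the real configuration.
mkdir -p model_repository/tft/1
cat > model_repository/tft/config.pbtxt <<'EOF'
name: "tft"
platform: "tensorrt_plan"
max_batch_size: 1024
instance_group [
  { count: 2, kind: KIND_GPU }
]
dynamic_batching { }
EOF
```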

After deployment, Triton Inference Server is used to evaluate the converted model in two steps:

  1. Accuracy tests.

    Produce results that are tested against the given accuracy thresholds.

  2. Performance tests.

    Produce latency and throughput results for offline (static batching) and online (dynamic batching) scenarios.

All steps are executed by the provided runner script. Refer to the Quick Start Guide.

Setup

Ensure you have the following components:

Quick Start Guide

Running the following scripts will build and launch the container with all required dependencies for native PyTorch as well as Triton Inference Server. This is necessary for running inference and can also be used for data download, processing, and training of the model.

  1. Clone the repository.

```
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Forecasting/TFT
```

  2. Prepare the dataset. Please use the data download steps from the Main QSG.

  3. Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and the necessary dependencies.

```
bash ./triton/scripts/docker/build.sh
bash ./triton/scripts/docker/interactive.sh /path/to/your/data/
```

  4. Execute the runner script (note that the run scripts are prepared per NVIDIA GPU):
NVIDIA A30: bash ./triton/runner/start_NVIDIA-A30.sh

NVIDIA DGX-1 (1x V100 32GB): bash ./triton/runner/start_NVIDIA-DGX-1-\(1x-V100-32GB\).sh

NVIDIA DGX A100 (1x A100 80GB): bash ./triton/runner/start_NVIDIA-DGX-A100-\(1x-A100-80GB\).sh

NVIDIA T4: bash ./triton/runner/start_NVIDIA-T4.sh

If you encounter an error like "the provided PTX was compiled with an unsupported toolchain", follow the steps in the Step by step deployment process.

Performance

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Offline scenario

The offline scenario assumes the client and server are located on the same host. The tests use the following configuration (an illustrative command is sketched after the list):

  • tensors are passed through shared memory between the client and the server; the Perf Analyzer flag shared-memory=system is used
  • a single request with a static batch size is sent from the client to the server
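
The following is a minimal sketch of an offline-style measurement for one static batch size, assuming a deployed model named `tft`; the runner script performs the actual sweep across batch sizes.

```
# One static batch size, concurrency 1, tensors passed via system
# shared memory, matching the offline test setup described above.
# The model name "tft" is an assumption; flag values are illustrative.
perf_analyzer -m tft \
  -b 64 \
  --concurrency-range 1 \
  --shared-memory=system \
  --measurement-request-count 500
```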

Offline: NVIDIA A30, NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 595.0 0.0 0.2 0.1 0.1 1.3 0.0 0.0 1.7 1.7 1.8 1.8 1.7
2 1 804.6 0.0 0.1 0.0 0.1 2.1 0.1 0.0 2.5 2.6 2.6 2.6 2.5
4 1 1500.0 0.0 0.2 0.1 0.1 2.2 0.1 0.0 2.7 2.7 2.7 2.8 2.7
8 1 2696.0 0.1 0.2 0.1 0.1 2.5 0.0 0.0 2.9 3.0 3.1 3.3 3.0
16 1 4704.0 0.1 0.2 0.1 0.1 2.9 0.0 0.0 3.4 3.5 3.6 3.8 3.4
32 1 8576.0 0.1 0.2 0.0 0.1 3.2 0.1 0.0 3.7 3.9 3.9 4.0 3.7
64 1 14101.3 0.1 0.2 0.0 0.1 4.0 0.0 0.0 4.5 4.6 4.7 5.2 4.5
128 1 19227.2 0.1 0.2 0.1 0.1 6.1 0.0 0.0 6.5 6.7 8.0 8.3 6.6
256 1 24401.3 0.1 0.3 0.1 0.2 9.8 0.0 0.0 10.4 10.5 11.4 11.6 10.5
512 1 27235.7 0.1 0.4 0.1 1.0 17.1 0.1 0.0 18.8 18.8 18.8 18.8 18.8
1024 1 28782.6 0.1 0.4 0.1 1.9 32.9 0.2 0.0 35.5 35.6 35.6 35.7 35.5

Offline: NVIDIA A30, NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 605.4 0.0 0.2 0.0 0.1 1.3 0.0 0.0 1.6 1.7 1.7 1.7 1.6
2 1 840.0 0.0 0.1 0.0 0.1 2.1 0.0 0.0 2.4 2.4 2.4 2.5 2.4
4 1 1638.0 0.0 0.1 0.0 0.1 2.2 0.0 0.0 2.4 2.5 2.5 2.6 2.4
8 1 2876.0 0.0 0.1 0.0 0.1 2.5 0.0 0.0 2.8 2.9 2.9 2.9 2.8
16 1 5168.0 0.0 0.1 0.0 0.1 2.8 0.0 0.0 3.1 3.3 3.3 3.4 3.1
32 1 8576.0 0.0 0.1 0.0 0.1 3.3 0.0 0.0 3.7 3.9 4.0 4.1 3.7
64 1 14592.0 0.0 0.1 0.0 0.1 4.0 0.0 0.0 4.3 4.5 4.5 4.7 4.4
128 1 19520.0 0.0 0.1 0.0 0.1 6.2 0.0 0.0 6.5 6.6 7.9 8.3 6.5
256 1 24832.0 0.0 0.2 0.0 0.2 9.8 0.0 0.0 10.2 10.4 10.9 11.1 10.3
512 1 27235.7 0.1 0.4 0.1 1.1 17.0 0.1 0.0 18.8 18.8 18.8 18.9 18.8
1024 1 28725.7 0.1 0.4 0.1 2.0 32.9 0.2 0.0 35.6 35.7 35.7 35.8 35.6

Offline: NVIDIA A30, PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 126.5 0.1 0.4 0.1 0.1 7.2 0.0 0.0 7.8 8.0 8.8 9.5 7.9
2 1 234.8 0.1 0.4 0.1 0.1 7.8 0.0 0.0 8.3 9.9 10.1 10.3 8.5
4 1 431.1 0.1 0.4 0.1 0.1 8.5 0.0 0.0 8.6 10.3 10.4 10.5 9.2
8 1 860.8 0.1 0.4 0.1 0.2 8.5 0.0 0.0 8.9 10.5 10.7 10.8 9.3
16 1 1747.2 0.1 0.5 0.1 0.2 8.3 0.0 0.0 8.8 10.5 10.6 10.7 9.1
32 1 3205.8 0.1 0.4 0.1 0.2 9.1 0.0 0.0 9.8 11.2 11.3 11.4 10.0
64 1 6249.6 0.1 0.4 0.1 0.3 8.9 0.4 0.0 9.7 11.5 11.5 11.6 10.2
128 1 9216.0 0.1 0.3 0.1 0.5 8.9 3.9 0.0 13.9 14.1 14.2 14.4 13.9
256 1 11369.7 0.1 0.3 0.1 0.9 5.3 15.8 0.0 22.5 22.7 22.7 23.0 22.5
512 1 12383.8 0.1 0.3 0.1 1.6 5.4 33.8 0.0 41.3 41.5 41.6 41.7 41.3
1024 1 12849.9 0.1 0.4 0.1 3.2 5.6 70.2 0.0 79.6 80.0 80.1 80.3 79.6

Offline: NVIDIA A30, PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 189.0 0.1 0.3 0.0 0.1 4.8 0.0 0.0 4.6 7.4 7.4 8.5 5.3
2 1 252.9 0.1 0.4 0.1 0.1 7.2 0.0 0.0 7.9 8.0 8.0 8.1 7.9
4 1 500.0 0.1 0.4 0.1 0.1 7.3 0.0 0.0 8.0 8.0 8.0 9.2 8.0
8 1 998.0 0.1 0.3 0.1 0.1 7.4 0.0 0.0 8.0 8.0 8.1 8.2 8.0
16 1 1996.0 0.1 0.3 0.1 0.1 7.4 0.0 0.0 8.0 8.1 8.1 9.1 8.0
32 1 3750.4 0.1 0.4 0.1 0.1 7.8 0.0 0.0 8.5 8.6 8.7 10.3 8.5
64 1 7179.4 0.1 0.4 0.1 0.2 7.7 0.4 0.0 8.9 9.0 9.1 9.4 8.9
128 1 9946.0 0.1 0.3 0.1 0.3 7.3 4.8 0.0 12.8 13.3 13.6 13.7 12.8
256 1 11821.5 0.0 0.2 0.0 0.6 5.0 15.8 0.0 21.6 21.8 21.8 21.8 21.6
512 1 12825.0 0.0 0.2 0.0 0.8 5.0 33.8 0.0 40.0 40.3 40.5 40.6 39.8
1024 1 13284.7 0.0 0.2 0.0 1.8 5.3 69.7 0.0 77.3 77.7 77.8 77.9 77.1

Offline: NVIDIA DGX-1 (1x V100 32GB), NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 416.5 0.1 0.2 0.1 0.1 1.8 0.0 0.0 2.4 2.5 2.5 2.6 2.4
2 1 770.6 0.1 0.3 0.1 0.2 1.9 0.0 0.0 2.6 2.6 2.7 2.7 2.6
4 1 1427.3 0.1 0.2 0.1 0.2 2.2 0.0 0.0 2.8 2.9 2.9 3.0 2.8
8 1 2604.0 0.1 0.3 0.1 0.2 2.4 0.0 0.0 3.1 3.2 3.2 3.3 3.1
16 1 4480.0 0.1 0.3 0.1 0.2 2.9 0.0 0.0 3.6 3.7 3.7 3.8 3.6
32 1 7274.7 0.1 0.2 0.1 0.2 3.9 0.0 0.0 4.4 4.5 4.5 4.6 4.4
64 1 10922.7 0.1 0.2 0.1 0.2 5.3 0.0 0.0 5.8 6.0 6.0 6.1 5.8
128 1 13744.5 0.1 0.2 0.1 0.2 8.7 0.0 0.0 9.3 9.4 9.4 9.6 9.3
256 1 17341.8 0.1 0.2 0.1 0.3 14.0 0.0 0.0 14.7 14.9 14.9 15.1 14.7
512 1 20439.0 0.1 0.2 0.1 0.5 24.1 0.0 0.0 25.0 25.1 25.2 25.6 25.0
1024 1 23410.2 0.1 0.3 0.1 0.7 42.5 0.0 0.0 43.6 43.8 43.9 44.6 43.7

Offline: NVIDIA DGX-1 (1x V100 32GB), NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 406.0 0.1 0.2 0.1 0.2 1.8 0.0 0.0 2.4 2.5 2.5 2.6 2.5
2 1 775.0 0.1 0.2 0.1 0.2 2.0 0.0 0.0 2.6 2.7 2.7 2.8 2.6
4 1 1431.3 0.1 0.2 0.1 0.2 2.2 0.0 0.0 2.8 3.0 3.0 3.2 2.8
8 1 2644.0 0.1 0.2 0.1 0.1 2.5 0.0 0.0 3.0 3.1 3.1 3.1 3.0
16 1 4824.0 0.1 0.2 0.1 0.2 2.7 0.0 0.0 3.3 3.4 3.4 3.5 3.3
32 1 7637.3 0.1 0.2 0.1 0.2 3.6 0.0 0.0 4.2 4.3 4.3 4.4 4.2
64 1 10919.0 0.1 0.3 0.1 0.2 5.2 0.0 0.0 5.8 5.9 6.0 6.0 5.8
128 1 13488.5 0.1 0.2 0.1 0.2 8.8 0.0 0.0 9.4 9.7 9.8 10.0 9.5
256 1 17216.0 0.1 0.2 0.1 0.3 14.2 0.0 0.0 14.8 15.0 15.1 15.2 14.8
512 1 20596.6 0.1 0.3 0.1 0.5 23.9 0.0 0.0 24.8 25.0 25.1 25.3 24.8
1024 1 23456.8 0.1 0.2 0.1 0.7 42.6 0.0 0.0 43.7 44.3 44.4 44.9 43.6

Offline: NVIDIA DGX-1 (1x V100 32GB), PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 134.2 0.1 0.3 0.1 0.1 6.9 0.0 0.0 8.1 8.3 8.4 9.1 7.4
2 1 271.5 0.0 0.2 0.1 0.1 6.9 0.0 0.0 7.2 8.2 8.3 8.3 7.3
4 1 524.9 0.1 0.3 0.1 0.1 7.1 0.0 0.0 8.3 8.5 8.9 9.6 7.6
8 1 1044.0 0.1 0.3 0.1 0.1 7.1 0.0 0.0 8.4 8.5 8.6 9.5 7.6
16 1 2119.5 0.1 0.3 0.1 0.1 7.0 0.0 0.0 8.2 8.4 8.5 8.8 7.5
32 1 3775.2 0.1 0.3 0.1 0.1 7.9 0.0 0.0 9.2 9.4 9.4 9.5 8.4
64 1 6424.3 0.1 0.3 0.1 0.1 7.9 1.5 0.0 9.9 10.1 10.1 10.6 9.9
128 1 8528.0 0.1 0.2 0.1 0.2 8.0 6.4 0.0 15.1 15.2 15.3 15.4 15.0
256 1 10644.4 0.1 0.3 0.1 0.3 8.0 15.3 0.0 24.1 24.3 24.3 24.7 24.0
512 1 12213.7 0.1 0.3 0.1 0.5 7.3 33.8 0.0 41.9 42.1 42.1 42.2 41.9
1024 1 13153.4 0.1 0.3 0.1 0.8 6.6 69.9 0.0 77.7 77.8 77.9 78.1 77.7

Offline: NVIDIA DGX-1 (1x V100 32GB), PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 158.0 0.1 0.2 0.1 0.1 5.9 0.0 0.0 6.4 6.5 6.6 6.7 6.3
2 1 312.5 0.1 0.3 0.1 0.1 5.9 0.0 0.0 6.5 6.6 6.6 6.8 6.4
4 1 608.0 0.1 0.3 0.1 0.1 6.0 0.0 0.0 6.6 6.8 6.8 7.0 6.6
8 1 1208.0 0.1 0.2 0.1 0.1 6.1 0.0 0.0 6.7 6.8 6.9 6.9 6.6
16 1 2456.0 0.1 0.3 0.1 0.1 5.9 0.0 0.0 6.5 6.6 6.7 7.3 6.5
32 1 4352.0 0.1 0.3 0.1 0.1 6.8 0.0 0.0 7.3 7.4 7.5 8.1 7.3
64 1 6366.9 0.1 0.3 0.1 0.1 7.2 2.3 0.0 10.0 10.1 10.1 10.2 10.0
128 1 8544.0 0.1 0.3 0.1 0.2 7.3 7.0 0.0 14.9 15.1 15.1 15.3 15.0
256 1 10687.1 0.1 0.3 0.1 0.3 7.3 15.9 0.0 23.9 24.0 24.0 24.1 23.9
512 1 12189.3 0.1 0.3 0.1 0.5 7.2 33.9 0.0 42.0 42.1 42.1 42.2 42.0
1024 1 13153.1 0.1 0.3 0.1 0.8 7.0 69.5 0.0 77.8 77.9 77.9 78.1 77.8

Offline: NVIDIA DGX A100 (1x A100 80GB), NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 663.0 0.0 0.1 0.0 0.1 1.3 0.0 0.0 1.4 1.6 1.6 4.7 1.5
2 1 879.0 0.0 0.1 0.0 0.1 2.1 0.0 0.0 2.3 2.4 2.4 2.4 2.3
4 1 1638.0 0.0 0.1 0.0 0.1 2.2 0.0 0.0 2.4 2.5 2.5 2.5 2.4
8 1 3080.0 0.0 0.1 0.0 0.1 2.4 0.0 0.0 2.6 2.6 2.7 2.7 2.6
16 1 5808.0 0.0 0.1 0.0 0.1 2.5 0.0 0.0 2.7 2.8 2.8 2.9 2.8
32 1 10688.0 0.0 0.1 0.0 0.1 2.7 0.0 0.0 3.0 3.1 3.1 3.1 3.0
64 1 17664.0 0.0 0.1 0.0 0.1 3.4 0.0 0.0 3.6 3.8 3.9 3.9 3.6
128 1 24362.7 0.0 0.1 0.0 0.2 4.9 0.0 0.0 5.2 5.5 5.5 5.6 5.2
256 1 35136.0 0.0 0.1 0.0 0.2 6.9 0.0 0.0 7.3 7.5 7.5 7.7 7.3
512 1 49493.3 0.0 0.1 0.0 0.2 9.9 0.0 0.0 10.2 10.4 10.5 12.9 10.3
1024 1 54061.8 0.0 0.1 0.0 0.5 18.2 0.1 0.0 18.8 18.9 19.0 22.3 18.9

Offline: NVIDIA DGX A100 (1x A100 80GB), NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 716.0 0.0 0.1 0.0 0.1 1.2 0.0 0.0 1.4 1.4 1.4 2.1 1.4
2 1 878.0 0.0 0.1 0.0 0.1 2.1 0.0 0.0 2.3 2.4 2.4 2.4 2.3
4 1 1653.2 0.0 0.1 0.0 0.1 2.2 0.0 0.0 2.4 2.5 2.5 2.5 2.4
8 1 3192.0 0.0 0.1 0.0 0.1 2.3 0.0 0.0 2.5 2.5 2.6 2.6 2.5
16 1 5920.0 0.0 0.1 0.0 0.1 2.5 0.0 0.0 2.7 2.8 2.8 2.8 2.7
32 1 10624.0 0.0 0.1 0.0 0.1 2.8 0.0 0.0 3.0 3.1 3.1 3.1 3.0
64 1 18358.8 0.0 0.1 0.0 0.1 3.2 0.0 0.0 3.5 3.5 3.6 3.6 3.5
128 1 24738.4 0.0 0.1 0.0 0.2 4.8 0.0 0.0 5.2 5.3 5.3 5.4 5.2
256 1 35776.0 0.0 0.1 0.0 0.2 6.8 0.0 0.0 7.1 7.3 7.4 7.5 7.1
512 1 49834.7 0.0 0.1 0.0 0.2 9.9 0.0 0.0 10.2 10.3 10.3 11.3 10.3
1024 1 53350.4 0.0 0.1 0.0 0.4 18.6 0.0 0.0 19.1 19.2 19.3 22.4 19.2

Offline: NVIDIA DGX A100 (1x A100 80GB), PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 205.0 0.0 0.1 0.0 0.1 4.6 0.0 0.0 4.8 4.9 4.9 5.3 4.9
2 1 396.0 0.0 0.1 0.0 0.1 4.8 0.0 0.0 5.0 5.2 5.4 5.5 5.0
4 1 788.0 0.0 0.1 0.0 0.1 4.8 0.0 0.0 5.0 5.1 5.3 5.5 5.1
8 1 1544.0 0.0 0.1 0.0 0.1 4.9 0.0 0.0 5.1 5.4 5.5 5.6 5.2
16 1 3081.6 0.0 0.1 0.0 0.1 4.9 0.0 0.0 5.1 5.4 5.5 5.6 5.2
32 1 5802.7 0.0 0.1 0.0 0.1 5.2 0.0 0.0 5.5 5.5 5.8 5.9 5.5
64 1 10624.0 0.0 0.1 0.0 0.1 5.3 0.5 0.0 6.0 6.1 6.2 6.4 6.0
128 1 15203.4 0.0 0.1 0.0 0.2 5.3 2.8 0.0 8.4 8.6 8.7 8.9 8.4
256 1 19821.7 0.0 0.1 0.0 0.3 5.3 7.2 0.0 13.0 13.1 13.3 13.4 12.9
512 1 23123.4 0.0 0.1 0.0 0.4 5.3 16.2 0.0 22.2 22.3 22.4 22.4 22.1
1024 1 25159.9 0.0 0.1 0.0 0.9 5.7 33.9 0.0 40.7 40.8 40.9 40.9 40.6

Offline: NVIDIA DGX A100 (1x A100 80GB), PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 200.3 0.0 0.1 0.0 0.1 4.7 0.0 0.0 5.0 5.1 5.3 5.4 5.0
2 1 393.3 0.0 0.1 0.0 0.1 4.8 0.0 0.0 5.1 5.1 5.4 5.5 5.1
4 1 774.7 0.0 0.1 0.0 0.1 4.9 0.0 0.0 5.1 5.2 5.5 5.8 5.2
8 1 1525.3 0.0 0.1 0.0 0.1 5.0 0.0 0.0 5.2 5.5 5.6 5.7 5.2
16 1 3028.3 0.0 0.1 0.0 0.1 5.0 0.0 0.0 5.2 5.6 5.7 5.7 5.3
32 1 5696.0 0.0 0.1 0.0 0.1 5.3 0.0 0.0 5.6 5.7 5.9 6.0 5.6
64 1 10645.3 0.0 0.1 0.0 0.1 5.4 0.3 0.0 6.0 6.2 6.2 6.3 6.0
128 1 15229.0 0.0 0.2 0.0 0.2 5.4 2.6 0.0 8.4 8.6 8.7 8.8 8.4
256 1 19965.1 0.0 0.1 0.0 0.3 5.4 7.0 0.0 12.8 13.2 13.3 13.3 12.8
512 1 23319.3 0.0 0.1 0.0 0.5 5.4 15.9 0.0 21.9 22.1 22.2 22.2 21.9
1024 1 25452.5 0.0 0.1 0.0 0.9 5.8 33.3 0.0 40.2 40.4 40.5 40.6 40.2

Offline: NVIDIA T4, NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA T4|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 415.0 0.1 0.4 0.1 0.2 1.6 0.0 0.0 2.4 2.5 2.5 2.5 2.4
2 1 781.6 0.1 0.4 0.1 0.2 1.7 0.0 0.0 2.5 2.6 2.6 2.6 2.5
4 1 1617.2 0.1 0.3 0.1 0.2 1.8 0.0 0.0 2.5 2.5 2.5 2.6 2.5
8 1 2998.5 0.1 0.3 0.1 0.2 2.0 0.0 0.0 2.7 2.7 2.7 2.7 2.6
16 1 4504.0 0.1 0.5 0.1 0.2 2.7 0.0 0.0 3.5 3.6 3.6 3.6 3.5
32 1 6483.2 0.1 0.5 0.1 0.2 4.0 0.0 0.0 4.9 5.0 5.0 5.0 4.9
64 1 9197.7 0.1 0.5 0.0 0.2 6.1 0.0 0.0 6.9 7.0 7.0 7.0 6.9
128 1 11136.0 0.0 0.3 0.1 0.2 10.8 0.0 0.0 11.5 11.6 11.6 11.6 11.5
256 1 12682.5 0.1 0.5 0.1 0.2 19.2 0.0 0.0 20.1 20.2 20.3 20.3 20.1
512 1 12628.1 0.1 0.5 0.1 0.4 39.5 0.0 0.0 40.5 40.7 40.7 40.8 40.5
1024 1 13054.4 0.1 0.5 0.1 0.6 77.1 0.0 0.0 78.4 78.9 79.0 79.2 78.4

Offline: NVIDIA T4, NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA T4|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 455.5 0.1 0.3 0.0 0.1 1.6 0.0 0.0 2.2 2.3 2.3 2.3 2.2
2 1 872.0 0.1 0.3 0.1 0.1 1.7 0.0 0.0 2.3 2.4 2.4 2.4 2.3
4 1 1622.0 0.1 0.2 0.1 0.1 1.9 0.0 0.0 2.5 2.5 2.5 2.6 2.4
8 1 2882.6 0.1 0.4 0.1 0.1 2.0 0.0 0.0 2.8 2.9 2.9 2.9 2.8
16 1 4488.0 0.1 0.5 0.1 0.1 2.8 0.0 0.0 3.6 3.6 3.6 3.6 3.5
32 1 6592.0 0.1 0.5 0.1 0.1 4.1 0.0 0.0 4.8 4.9 4.9 4.9 4.8
64 1 9341.7 0.1 0.4 0.1 0.1 6.1 0.0 0.0 6.8 6.9 6.9 7.0 6.8
128 1 10899.5 0.1 0.5 0.1 0.1 10.9 0.0 0.0 11.7 11.8 11.8 11.8 11.7
256 1 12681.3 0.1 0.4 0.1 0.2 19.3 0.0 0.0 20.1 20.3 20.3 20.4 20.1
512 1 12651.9 0.1 0.5 0.1 0.3 39.5 0.0 0.0 40.4 40.6 40.7 40.8 40.4
1024 1 13003.2 0.1 0.4 0.1 0.6 77.3 0.0 0.0 78.6 79.0 79.2 79.3 78.6

Offline: NVIDIA T4, PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA T4|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 127.8 0.1 0.6 0.2 0.1 6.8 0.0 0.0 7.7 8.6 8.9 9.4 7.8
2 1 251.0 0.1 0.6 0.1 0.1 6.9 0.0 0.0 7.8 8.8 9.2 9.6 7.9
4 1 498.9 0.1 0.6 0.2 0.1 7.0 0.0 0.0 8.0 8.5 9.1 9.3 8.0
8 1 975.8 0.1 0.6 0.2 0.1 7.1 0.0 0.0 8.1 8.7 8.8 9.4 8.2
16 1 1913.6 0.1 0.6 0.2 0.2 7.2 0.1 0.0 8.3 8.8 8.9 9.2 8.3
32 1 2820.9 0.1 0.6 0.1 0.2 7.5 2.8 0.0 11.3 11.6 11.6 11.8 11.3
64 1 3366.1 0.1 0.6 0.1 0.2 8.1 9.9 0.0 18.9 19.3 19.4 19.7 19.0
128 1 3786.8 0.1 0.6 0.1 0.1 4.5 28.4 0.0 33.8 34.1 34.1 34.3 33.8
256 1 3948.1 0.1 0.6 0.1 0.2 4.4 59.4 0.0 64.7 65.5 65.8 66.0 64.7
512 1 4079.3 0.1 0.6 0.1 0.4 4.5 119.7 0.0 125.2 127.1 127.6 128.3 125.3
1024 1 4095.5 0.1 0.6 0.1 0.8 4.5 243.8 0.0 250.0 251.7 252.0 252.6 249.9

Offline: NVIDIA T4, PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA T4|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
1 1 136.0 0.1 0.5 0.1 0.1 6.6 0.0 0.0 7.3 7.9 8.1 8.5 7.3
2 1 242.8 0.1 0.6 0.1 0.1 7.2 0.0 0.0 8.1 8.7 9.0 9.4 8.2
4 1 479.9 0.1 0.6 0.2 0.1 7.3 0.0 0.0 8.2 8.9 9.2 9.6 8.3
8 1 943.8 0.1 0.6 0.2 0.2 7.4 0.0 0.0 8.4 9.1 9.2 9.5 8.4
16 1 2239.4 0.1 0.5 0.1 0.1 4.2 2.1 0.0 7.1 7.2 7.2 7.3 7.1
32 1 2975.5 0.1 0.5 0.1 0.1 4.5 5.5 0.0 10.7 10.9 10.9 10.9 10.7
64 1 3436.1 0.1 0.5 0.1 0.1 5.7 12.0 0.0 18.6 19.1 19.3 19.5 18.6
128 1 3786.8 0.1 0.5 0.1 0.2 5.7 27.1 0.0 33.7 34.0 34.1 34.2 33.7
256 1 3963.6 0.1 0.6 0.1 0.3 7.0 56.4 0.0 64.5 65.2 65.4 65.8 64.5
512 1 4103.6 0.1 0.6 0.1 0.4 6.1 117.4 0.0 124.6 126.3 126.6 127.1 124.7
1024 1 4120.2 0.1 0.4 0.1 1.0 7.1 239.7 0.0 248.3 250.3 250.9 251.8 248.3

Online scenario

The online scenario assumes the client and server are located on different hosts. The tests use the following configuration (an illustrative command is sketched after the list):

  • tensors are passed through HTTP from the client to the server
  • concurrent requests are sent from the client to the server; the final batch is created on the server side
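
The following is a minimal sketch of an online-style measurement, sweeping client concurrency while the server forms batches through dynamic batching; the model name `tft` is again an assumption.

```
# Concurrent requests over HTTP; the server-side dynamic batcher
# assembles the final batches, matching the online test setup above.
# The model name "tft" is an assumption; flag values are illustrative.
perf_analyzer -m tft \
  -i http \
  -b 16 \
  --concurrency-range 8:256:8 \
  --measurement-request-count 500
```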

Online: NVIDIA A30, NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 15360.0 0.1 0.3 3.6 0.1 4.0 0.0 0.0 8.2 8.3 8.4 8.7 8.2
16 16 15696.0 0.1 0.5 8.5 0.2 6.9 0.1 0.0 16.4 20.2 20.4 22.2 16.2
16 24 17072.0 0.1 0.8 10.8 0.2 10.2 0.1 0.0 22.3 30.5 31.9 33.4 22.2
16 32 16640.0 0.1 1.0 14.5 0.3 14.4 0.1 0.0 32.0 36.1 36.6 39.2 30.3
16 40 19120.0 0.1 1.6 13.8 0.3 17.2 0.1 0.0 34.9 43.8 46.3 48.5 33.1
16 48 15984.0 0.1 1.7 16.1 0.4 27.9 0.1 0.0 49.2 52.5 53.0 53.5 46.2
16 56 16528.0 0.1 1.9 21.7 0.4 26.3 0.0 0.0 52.6 56.2 56.4 57.0 50.4
16 64 16256.0 0.1 2.2 30.6 0.3 27.0 0.0 0.0 63.8 66.2 66.5 66.9 60.3
16 72 17696.0 0.1 2.5 34.4 0.4 25.8 0.0 0.0 65.5 68.9 69.6 70.3 63.3
16 80 16976.0 0.1 2.1 38.8 0.4 32.0 0.1 0.0 78.7 82.1 82.6 82.9 73.4
16 88 20464.0 0.1 2.7 32.0 0.6 30.5 0.0 0.0 62.7 79.0 80.0 80.8 66.0
16 96 20064.0 0.1 2.9 39.5 0.6 31.3 0.1 0.0 75.6 79.8 80.6 81.0 74.3
16 104 20768.0 0.1 3.9 38.1 0.7 34.1 0.1 0.0 79.3 82.7 83.3 83.7 77.0
16 112 22032.0 0.1 3.5 43.1 0.7 33.1 0.1 0.0 83.0 84.1 84.3 84.5 80.5
16 120 21584.0 0.1 3.4 49.9 0.8 33.0 0.1 0.0 92.2 93.1 93.2 94.2 87.3
16 128 23280.0 0.1 2.4 41.9 0.7 37.3 0.1 0.0 84.4 94.2 103.3 104.8 82.5
16 136 23232.0 0.1 3.6 52.6 0.7 32.7 0.1 0.0 92.4 93.4 93.7 94.4 89.7
16 144 24224.0 0.1 3.7 50.7 0.8 34.6 0.1 0.0 92.8 95.0 96.1 102.7 90.0
16 152 23232.0 0.1 2.7 64.5 0.7 33.4 0.1 0.0 102.5 112.5 117.3 123.3 101.6
16 160 21040.0 0.1 4.6 72.2 0.8 38.0 0.1 0.0 127.8 130.2 130.8 150.9 115.8
16 168 23848.2 0.1 4.5 66.3 0.9 35.8 0.1 0.0 109.8 111.1 111.3 111.7 107.7
16 176 23280.0 0.1 4.8 60.5 0.8 40.5 0.1 0.0 109.4 117.4 130.9 133.3 106.8
16 184 21594.4 0.3 2.8 87.2 0.9 36.6 0.1 0.0 130.0 145.0 145.2 146.6 127.8
16 192 20816.0 0.3 3.5 99.0 0.9 36.5 0.1 0.0 145.1 147.1 148.0 165.5 140.3
16 200 20224.0 0.3 3.5 104.1 0.8 37.4 0.1 0.0 145.7 147.6 148.1 165.8 146.1
16 208 21744.0 0.2 3.9 98.5 1.0 39.0 0.2 0.0 145.8 150.7 166.3 168.3 142.8
16 216 20112.0 0.4 2.7 117.8 0.8 34.0 0.2 0.0 156.1 157.2 157.4 157.8 156.0
16 224 23504.0 0.4 5.2 99.3 0.9 39.3 0.2 0.0 147.0 151.3 167.6 168.0 145.3
16 232 24352.0 0.5 3.6 93.6 1.0 41.3 0.2 0.0 144.9 148.2 167.3 169.5 140.2
16 240 25760.0 0.4 2.8 89.5 0.9 45.9 0.1 0.0 140.8 159.9 171.6 181.1 139.7
16 248 23872.0 0.5 2.5 114.7 1.0 34.7 0.1 0.0 156.6 158.2 158.8 164.2 153.4
16 256 24960.0 0.5 3.4 105.6 1.1 40.0 0.1 0.0 152.3 173.8 182.2 188.4 150.8

Online: NVIDIA A30, NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 15104.0 0.1 0.5 3.6 0.1 4.0 0.1 0.0 8.4 8.4 8.5 8.5 8.4
16 16 15328.0 0.1 0.7 8.5 0.2 7.1 0.1 0.0 16.8 20.8 21.1 23.1 16.6
16 24 17072.0 0.1 1.2 10.4 0.3 10.2 0.1 0.0 23.6 30.2 30.6 32.2 22.3
16 32 16176.0 0.1 1.8 14.0 0.3 14.4 0.1 0.0 33.5 35.9 36.0 36.5 30.6
16 40 18288.0 0.1 1.7 17.3 0.3 14.5 0.1 0.0 35.8 39.6 39.9 41.3 34.0
16 48 17136.0 0.1 2.0 18.0 0.4 22.8 0.1 0.0 45.6 51.5 52.5 53.9 43.4
16 56 16992.0 0.1 2.9 22.3 0.5 26.1 0.1 0.0 55.4 56.8 57.2 57.5 51.9
16 64 17552.0 0.1 2.8 25.2 0.5 26.7 0.1 0.0 56.2 65.9 66.3 66.6 55.4
16 72 19552.0 0.1 3.3 28.8 0.6 25.4 0.1 0.0 65.2 66.6 67.0 69.4 58.3
16 80 21072.0 0.1 3.2 26.2 0.7 29.3 0.2 0.0 62.3 65.4 66.0 66.3 59.7
16 88 19392.0 0.1 2.3 36.0 0.8 30.6 0.1 0.0 68.1 82.9 83.7 84.1 69.9
16 96 19168.0 0.1 3.5 38.0 0.7 33.9 0.2 0.0 79.2 80.2 80.6 83.3 76.3
16 104 17920.0 0.1 3.1 51.8 0.8 32.2 0.2 0.0 92.5 93.4 93.8 94.3 88.2
16 112 21296.0 0.1 3.8 39.7 1.0 34.7 0.2 0.0 83.4 84.3 84.8 104.0 79.4
16 120 22032.0 0.1 3.1 45.0 0.8 33.0 0.2 0.0 82.9 93.0 93.5 94.7 82.2
16 128 21882.1 0.1 3.1 53.6 0.9 32.5 0.2 0.0 93.0 93.6 93.8 94.4 90.4
16 136 25552.0 0.1 3.8 41.3 1.0 37.3 0.2 0.0 83.9 93.7 105.3 108.0 83.7
16 144 21904.0 0.1 5.5 60.9 0.8 33.6 0.2 0.0 103.9 113.3 113.4 132.9 101.1
16 152 21456.0 0.1 3.6 66.5 0.8 35.6 0.2 0.0 109.4 110.0 110.2 110.5 106.8
16 160 23040.0 0.2 3.3 59.4 0.9 40.4 0.2 0.0 109.7 129.7 130.1 130.9 104.3
16 168 19600.0 0.2 0.9 88.8 0.8 34.2 0.1 0.0 128.7 131.4 144.9 145.6 125.0
16 176 20880.0 0.2 4.6 84.9 0.9 34.9 0.1 0.0 129.2 130.0 130.6 133.1 125.6
16 184 22409.6 0.2 6.5 78.3 1.1 40.1 0.1 0.0 129.6 146.7 147.9 149.9 126.2
16 192 19456.0 0.2 3.9 101.8 0.9 35.5 0.2 0.0 145.9 147.1 147.3 147.7 142.4
16 200 20155.8 0.2 3.7 105.2 1.0 35.6 0.1 0.0 146.6 147.3 147.7 148.3 145.9
16 208 21040.0 0.3 3.8 100.1 0.8 40.2 0.1 0.0 145.7 165.6 166.2 172.1 145.4
16 216 20784.0 0.4 2.7 117.4 0.8 34.0 0.1 0.0 155.5 156.4 156.6 156.9 155.3
16 224 23344.0 0.5 3.6 99.0 0.8 41.6 0.1 0.0 149.9 157.3 173.8 190.6 145.7
16 232 21760.0 0.4 3.2 117.4 0.9 34.2 0.2 0.0 156.7 157.3 157.5 158.1 156.3
16 240 20784.0 0.2 4.4 126.7 1.0 34.1 0.1 0.0 166.6 169.1 169.5 169.8 166.6
16 248 26352.0 0.3 3.7 107.7 1.1 32.3 0.1 0.0 146.9 149.2 163.2 169.4 145.3
16 256 23408.0 0.4 4.9 116.1 1.1 42.3 0.1 0.0 163.0 197.6 201.1 204.3 164.9

Online: NVIDIA A30, PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 5528.0 0.1 0.8 8.1 0.5 13.1 0.3 0.0 26.2 28.1 28.7 30.3 22.8
16 16 9120.0 0.1 0.6 10.3 0.7 10.5 5.3 0.0 30.8 33.5 34.7 35.8 27.5
16 24 10384.0 0.1 0.8 14.0 1.1 10.6 9.3 0.0 39.3 42.4 43.1 46.0 35.8
16 32 11076.9 0.1 1.2 18.8 1.4 10.2 13.2 0.0 48.5 51.1 51.5 54.6 44.9
16 40 11328.0 0.1 2.0 21.6 2.3 10.7 18.4 0.0 58.8 62.0 63.2 67.5 55.1
16 48 11296.0 0.1 3.2 25.3 5.1 9.3 22.1 0.0 67.7 73.3 76.0 79.1 65.1
16 56 11440.0 0.1 3.3 29.6 5.0 9.9 26.1 0.0 77.3 82.5 83.9 92.3 74.0
16 64 11600.0 0.1 2.9 35.5 7.6 9.3 29.0 0.0 88.5 95.2 98.9 113.5 84.4
16 72 11316.7 0.1 4.3 38.1 16.0 7.7 29.3 0.0 99.4 103.1 123.0 125.8 95.5
16 80 11664.0 0.1 4.0 46.0 18.0 7.5 28.0 0.0 108.4 112.7 116.1 126.0 103.7
16 88 11472.0 0.1 3.0 47.8 19.8 8.2 34.4 0.0 119.7 128.6 131.9 135.5 113.3
16 96 11760.0 0.1 4.4 53.1 22.1 7.3 36.1 0.0 128.7 131.5 132.1 133.3 123.1
16 104 11840.0 0.1 5.4 59.4 5.7 9.8 51.0 0.0 132.7 138.7 138.9 175.8 131.5
16 112 11728.0 0.1 4.2 59.1 16.9 8.8 51.3 0.0 146.7 162.7 164.0 168.4 140.3
16 120 11796.2 0.1 5.3 54.2 20.6 7.6 61.4 0.0 155.3 164.2 172.6 173.1 149.2
16 128 12272.0 0.1 6.3 64.6 16.7 7.6 61.5 0.0 165.7 175.9 194.4 197.7 156.8
16 136 11680.0 0.1 6.0 74.7 33.5 6.6 48.7 0.0 178.5 183.0 183.9 186.4 169.5
16 144 11408.0 0.1 5.5 76.6 33.3 7.1 55.4 0.0 190.7 198.8 203.2 204.6 178.0
16 152 11456.0 0.1 4.7 87.4 28.8 7.2 60.8 0.0 193.9 199.5 200.2 201.1 189.0
16 160 11444.6 0.2 4.7 94.3 24.3 7.0 67.1 0.0 198.0 199.4 199.5 199.6 197.5
16 168 11040.0 0.1 7.5 89.1 35.2 6.8 70.2 0.0 214.2 220.1 222.9 225.2 208.9
16 176 11536.0 0.2 4.7 97.1 39.1 7.0 67.9 0.0 221.9 239.7 242.6 255.8 216.0
16 184 11136.0 0.1 6.5 101.3 41.8 7.1 67.2 0.0 231.3 236.7 240.0 240.4 224.1
16 192 11376.0 0.2 6.4 106.9 47.0 7.6 68.9 0.0 245.5 252.9 256.1 265.9 237.1
16 200 11840.0 0.3 5.0 110.3 46.4 7.0 72.7 0.0 255.0 262.0 267.0 267.9 241.8
16 208 11680.0 0.2 5.3 122.0 37.8 7.6 78.0 0.0 252.1 254.0 309.6 311.0 250.9
16 216 11280.0 0.2 6.0 151.5 41.8 6.9 59.4 0.0 270.5 279.9 283.2 283.9 265.8
16 224 11152.0 0.4 5.9 127.1 51.8 7.0 79.1 0.0 280.9 283.7 284.6 285.1 271.3
16 232 10848.0 0.2 5.0 158.1 41.7 7.8 72.7 0.0 287.4 306.0 315.8 316.9 285.5
16 240 11088.0 0.2 10.1 166.0 34.4 7.2 78.0 0.0 296.1 318.6 348.7 354.4 295.8
16 248 10485.5 0.3 5.8 174.3 40.1 7.2 75.4 0.0 307.6 316.7 322.0 323.7 303.2
16 256 11168.0 0.4 4.5 178.3 45.8 7.1 77.2 0.0 320.5 341.6 342.6 348.6 313.2

Online: NVIDIA A30, PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA A30|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 6544.0 0.1 0.5 7.0 0.4 8.8 2.6 0.0 22.1 23.9 24.5 25.8 19.3
16 16 9456.0 0.1 0.6 9.7 0.8 8.7 6.9 0.0 30.5 32.8 33.4 34.2 26.6
16 24 10704.0 0.1 0.8 13.8 0.9 8.5 11.3 0.0 39.0 41.9 42.2 42.7 35.4
16 32 11472.0 0.1 0.9 18.3 1.3 8.4 15.0 0.0 48.1 50.2 51.1 51.9 44.0
16 40 11568.0 0.1 1.3 21.8 1.5 8.6 20.1 0.0 57.7 60.4 60.8 62.3 53.4
16 48 12000.0 0.1 2.8 24.6 1.3 8.7 25.6 0.0 66.3 68.3 68.6 69.3 63.1
16 56 12048.0 0.1 3.1 20.9 1.6 8.3 37.6 0.0 75.2 77.2 77.9 78.8 71.5
16 64 11824.0 0.1 2.8 29.1 1.8 8.5 38.8 0.0 85.2 87.8 88.4 89.3 81.0
16 72 11888.0 0.1 2.2 36.1 2.0 8.8 40.8 0.0 93.9 96.0 96.5 101.8 90.0
16 80 11712.0 0.1 3.7 44.4 10.6 8.1 36.3 0.0 107.1 119.0 121.6 128.2 103.3
16 88 12240.0 0.1 4.5 44.7 5.7 7.9 48.6 0.0 115.8 119.8 130.2 153.3 111.5
16 96 11888.0 0.1 3.0 48.8 10.6 7.8 50.0 0.0 127.1 135.0 152.9 179.4 120.3
16 104 12096.0 0.1 3.4 59.4 10.2 7.4 48.6 0.0 134.8 139.1 146.7 158.2 129.1
16 112 11408.0 0.1 5.3 57.8 27.2 5.8 46.0 0.0 146.4 147.8 149.7 155.4 142.2
16 120 11812.2 0.1 6.7 63.8 14.0 6.8 57.3 0.0 153.3 157.9 160.4 161.9 148.7
16 128 11632.0 0.1 4.9 69.6 15.9 7.3 59.2 0.0 163.6 177.1 180.0 205.3 157.0
16 136 11620.4 0.1 3.5 76.0 9.8 8.2 68.3 0.0 172.9 182.9 195.5 196.8 166.0
16 144 11824.0 0.1 3.3 81.3 24.9 7.0 60.9 0.0 181.9 187.9 210.9 211.8 177.5
16 152 12032.0 0.1 3.8 85.9 22.9 7.1 67.1 0.0 192.9 219.2 239.1 252.4 187.0
16 160 12048.0 0.1 4.0 89.0 21.3 6.5 72.7 0.0 199.7 206.4 230.8 246.6 193.7
16 168 11456.0 0.1 4.4 93.2 30.2 5.7 70.5 0.0 208.4 209.8 211.8 212.0 204.3
16 176 11584.0 0.2 5.7 100.5 38.5 6.5 64.0 0.0 219.8 221.4 222.1 223.7 215.4
16 184 12096.0 0.2 5.6 103.2 40.9 6.0 69.2 0.0 230.2 233.5 233.8 233.9 225.0
16 192 11200.0 0.2 6.2 107.5 35.4 6.5 79.3 0.0 241.6 251.3 254.8 255.0 235.0
16 200 10880.0 0.3 5.0 113.9 31.7 7.0 88.9 0.0 255.2 267.0 294.9 296.2 246.8
16 208 11984.0 0.1 6.4 116.5 45.0 6.2 78.1 0.0 261.3 267.0 268.0 268.4 252.3
16 216 11632.0 0.2 6.9 121.8 39.8 6.8 90.8 0.0 275.9 280.9 282.2 282.5 266.4
16 224 11140.9 0.3 6.6 128.6 49.4 6.8 84.3 0.0 284.0 288.6 294.6 295.2 275.8
16 232 11568.0 0.2 5.2 162.0 15.2 8.1 89.0 0.0 285.6 312.9 315.5 335.5 279.7
16 240 11696.0 0.3 5.3 167.3 40.9 6.2 75.4 0.0 300.4 309.2 317.6 318.4 295.3
16 248 11040.0 0.2 8.0 174.9 32.4 7.1 82.8 0.0 307.4 327.0 370.7 371.9 305.6
16 256 10528.0 0.5 4.0 179.5 42.6 6.8 80.8 0.0 321.4 325.7 326.0 327.2 314.2

Online: NVIDIA DGX-1 (1x V100 32GB), NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 11776.0 0.1 0.5 4.7 0.2 5.3 0.0 0.0 10.8 10.9 11.0 11.0 10.7
16 16 11360.0 0.1 0.7 11.7 0.2 9.4 0.0 0.0 23.1 28.6 32.0 32.2 22.1
16 24 12656.0 0.1 1.0 15.8 0.3 12.8 0.0 0.0 33.8 34.3 34.4 37.7 30.1
16 32 11968.0 0.1 1.6 20.9 0.4 18.8 0.0 0.0 44.2 48.0 48.1 48.7 41.8
16 40 14640.0 0.1 1.5 20.9 0.4 19.6 0.0 0.0 47.6 48.0 48.0 48.1 42.6
16 48 13280.0 0.1 1.6 32.8 0.4 21.3 0.0 0.0 62.9 63.4 63.5 63.6 56.3
16 56 13232.0 0.1 1.9 28.4 0.6 33.8 0.0 0.0 66.9 71.8 72.2 72.3 64.8
16 64 12656.0 0.1 1.9 42.4 0.6 32.3 0.0 0.0 82.2 83.0 83.6 83.8 77.3
16 72 16671.3 0.1 2.0 40.8 0.5 24.0 0.0 0.0 73.4 74.0 83.6 84.0 67.5
16 80 16384.0 0.1 2.1 36.3 0.6 34.6 0.1 0.0 76.8 77.3 77.4 77.6 73.7
16 88 13728.0 0.1 2.3 53.4 0.6 38.5 0.0 0.0 100.5 101.3 101.5 101.7 95.0
16 96 15104.0 0.1 3.0 53.7 0.7 39.6 0.1 0.0 101.2 101.8 102.0 102.2 97.1
16 104 14512.0 0.1 2.0 66.6 0.7 38.5 0.1 0.0 111.1 111.5 111.7 111.9 107.9
16 112 18464.0 0.1 3.0 49.7 1.0 40.8 0.1 0.0 96.6 101.7 101.9 102.2 94.7
16 120 17760.0 0.1 2.9 63.4 1.2 37.7 0.1 0.0 112.1 113.4 113.8 113.9 105.4
16 128 17808.0 0.1 3.9 64.6 0.9 39.5 0.1 0.0 111.7 112.3 112.5 112.5 109.0
16 136 16848.0 0.1 2.7 74.9 0.8 41.1 0.1 0.0 129.9 130.6 130.7 130.7 119.7
16 144 19216.0 0.1 3.7 66.2 1.0 38.9 0.1 0.0 112.5 113.3 113.5 114.1 110.1
16 152 20864.0 0.1 4.3 65.4 1.0 39.1 0.2 0.0 112.3 113.4 113.7 114.9 110.2
16 160 18288.0 0.1 3.8 81.3 1.2 42.7 0.1 0.0 131.4 133.1 134.3 135.1 129.2
16 168 19152.0 0.2 3.1 81.6 1.1 42.6 0.1 0.0 131.2 131.6 131.7 131.8 128.7
16 176 15152.0 0.2 2.5 127.3 0.9 42.8 0.1 0.0 174.9 175.3 175.4 175.4 173.9
16 184 15824.0 0.1 3.9 126.7 1.0 42.8 0.1 0.0 175.5 176.1 176.3 176.4 174.6
16 192 18096.0 0.2 3.0 113.1 1.0 40.2 0.1 0.0 155.7 174.7 174.9 175.0 157.6
16 200 18128.0 0.2 3.1 121.0 1.1 39.1 0.1 0.0 165.0 165.9 166.2 166.6 164.7
16 208 16720.0 0.3 3.1 127.9 1.2 42.9 0.2 0.0 176.3 178.0 178.9 179.2 175.5
16 216 18221.8 0.4 2.4 127.4 1.1 42.6 0.1 0.0 174.9 175.2 175.3 175.4 174.0
16 224 18944.0 0.3 3.1 127.4 1.1 42.8 0.1 0.0 175.8 176.3 176.4 176.5 174.9
16 232 19484.5 0.4 3.3 126.9 1.2 42.7 0.1 0.0 175.2 176.5 176.8 177.2 174.7
16 240 17696.0 0.5 2.1 147.7 1.2 40.8 0.1 0.0 199.8 200.7 200.8 201.1 192.3
16 248 17856.0 0.5 3.0 150.1 1.1 41.3 0.1 0.0 199.8 201.0 201.2 201.5 196.1
16 256 17712.0 0.6 2.6 155.2 1.2 41.4 0.2 0.0 201.5 202.3 202.6 202.7 201.2

Online: NVIDIA DGX-1 (1x V100 32GB), NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 12083.9 0.1 0.4 4.6 0.2 5.1 0.0 0.0 10.5 10.7 10.7 10.8 10.5
16 16 11248.0 0.1 0.7 11.3 0.2 10.1 0.0 0.0 23.6 28.8 32.4 32.7 22.5
16 24 12048.0 0.1 0.8 15.3 0.3 14.0 0.0 0.0 32.5 38.9 42.4 42.7 30.6
16 32 13808.0 0.1 1.0 14.8 0.3 19.3 0.1 0.0 38.6 42.5 42.6 44.0 35.5
16 40 14160.0 0.1 1.8 22.2 0.4 19.7 0.0 0.0 44.3 53.9 54.1 57.7 44.1
16 48 13664.0 0.1 2.1 25.4 0.6 27.1 0.0 0.0 58.5 67.6 68.2 68.3 55.3
16 56 14624.0 0.1 1.4 34.6 0.5 22.1 0.0 0.0 63.5 63.8 63.8 74.0 58.8
16 64 18784.0 0.1 1.7 27.6 0.5 22.9 0.0 0.0 53.9 58.2 58.5 63.6 52.7
16 72 15584.0 0.1 2.8 33.5 0.6 34.3 0.0 0.0 76.2 77.3 77.4 77.6 71.3
16 80 14000.0 0.1 2.2 52.8 0.6 32.8 0.0 0.0 91.7 92.7 92.8 92.8 88.4
16 88 13760.0 0.1 2.4 55.0 0.6 38.9 0.1 0.0 100.5 101.6 101.7 102.0 96.9
16 96 18864.0 0.1 2.8 41.3 0.8 33.8 0.1 0.0 82.1 83.0 83.3 83.4 78.8
16 104 18000.0 0.1 3.0 52.9 0.7 32.7 0.1 0.0 91.9 92.8 92.9 93.0 89.4
16 112 16896.0 0.1 3.3 56.5 0.9 39.1 0.1 0.0 102.0 103.7 111.8 112.4 100.0
16 120 20144.0 0.1 3.2 52.5 0.8 33.6 0.1 0.0 92.7 93.7 93.8 93.9 90.3
16 128 19024.0 0.1 2.9 55.0 1.0 40.4 0.1 0.0 101.8 102.9 103.1 103.2 99.5
16 136 20560.0 0.1 3.8 55.1 1.0 39.4 0.1 0.0 101.8 102.9 103.0 103.2 99.5
16 144 17264.0 0.2 2.7 81.1 1.0 42.5 0.1 0.0 130.5 131.2 131.3 131.7 127.6
16 152 18352.0 0.2 2.8 82.8 0.9 37.6 0.1 0.0 125.2 125.5 125.6 125.7 124.4
16 160 16016.0 0.1 1.0 99.0 0.8 37.6 0.1 0.0 135.9 154.3 154.3 154.4 138.7
16 168 19200.0 0.1 3.7 81.0 1.1 42.6 0.2 0.0 131.1 132.0 132.2 132.3 128.7
16 176 16480.0 0.1 2.5 112.7 0.9 40.8 0.1 0.0 156.3 174.0 174.2 174.3 157.1
16 184 16528.0 0.2 4.1 120.3 1.0 41.3 0.1 0.0 174.3 174.9 175.1 175.6 167.1
16 192 18512.0 0.3 2.3 109.9 1.1 40.8 0.1 0.0 156.5 158.0 158.5 158.7 154.6
16 200 16735.3 0.2 3.0 126.4 1.0 42.7 0.1 0.0 174.2 174.9 175.1 175.2 173.5
16 208 17584.0 0.3 2.9 126.9 1.1 42.5 0.1 0.0 175.0 175.4 175.5 176.0 173.9
16 216 18301.7 0.4 2.6 127.2 1.1 42.5 0.1 0.0 174.8 175.1 175.2 175.4 174.0
16 224 19952.0 0.4 2.6 127.2 1.1 39.1 0.1 0.0 170.7 172.2 172.5 173.2 170.6
16 232 19536.0 0.5 2.6 127.0 1.2 42.5 0.1 0.0 174.8 175.4 175.5 175.7 173.9
16 240 18592.0 0.4 2.9 144.2 1.3 41.5 0.1 0.0 190.5 191.6 191.8 192.1 190.3
16 248 17952.0 0.3 3.3 154.6 1.1 40.2 0.1 0.0 200.4 201.1 201.4 202.0 199.8
16 256 19616.0 0.5 2.8 144.7 1.3 41.3 0.1 0.0 190.8 192.4 192.6 193.2 190.6

Online: NVIDIA DGX-1 (1x V100 32GB), PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 5008.0 0.1 0.6 9.4 0.4 11.3 3.7 0.0 29.2 30.5 31.3 32.9 25.5
16 16 7016.0 0.1 0.7 13.5 0.8 11.7 8.9 0.0 41.2 42.9 43.4 44.2 35.7
16 24 8560.0 0.1 1.0 17.5 1.0 11.9 12.7 0.0 49.4 51.3 51.9 53.1 44.2
16 32 9264.0 0.1 1.1 21.4 1.4 11.9 17.0 0.0 57.9 59.1 59.3 59.6 52.9
16 40 10336.0 0.1 1.9 23.2 1.5 12.0 22.3 0.0 65.8 67.6 67.9 68.2 60.9
16 48 10064.0 0.1 2.6 22.0 1.7 11.8 32.6 0.0 75.7 76.6 76.7 77.4 70.8
16 56 10512.0 0.1 2.5 20.1 1.8 11.6 44.8 0.0 85.6 86.8 87.8 88.0 80.9
16 64 10848.0 0.1 3.1 30.1 1.9 11.7 42.2 0.0 93.8 95.9 96.0 99.7 89.2
16 72 10800.0 0.1 2.9 22.0 2.0 11.3 61.7 0.0 104.0 104.8 105.6 107.4 99.8
16 80 10976.0 0.1 2.8 38.7 2.2 11.3 52.2 0.0 111.6 112.5 113.3 116.0 107.3
16 88 11200.0 0.1 3.4 47.7 3.1 11.7 50.9 0.0 120.7 122.2 124.2 124.7 116.8
16 96 11152.0 0.1 2.8 54.7 3.3 11.0 54.2 0.0 130.4 132.2 133.0 133.9 126.1
16 104 11312.0 0.1 4.2 60.6 7.2 12.2 51.5 0.0 138.5 144.9 161.8 173.3 135.8
16 112 11216.0 0.1 4.6 67.1 3.2 10.5 60.7 0.0 150.1 151.5 152.3 154.1 146.2
16 120 10736.0 0.1 4.6 73.0 10.8 10.3 58.1 0.0 161.5 162.4 166.4 173.6 157.0
16 128 11504.0 0.1 3.5 77.2 7.0 9.8 66.2 0.0 168.8 171.6 172.7 186.1 163.8
16 136 11120.0 0.1 4.5 81.4 8.8 10.3 68.5 0.0 177.7 179.5 181.3 191.2 173.5
16 144 11808.0 0.1 4.7 84.3 8.4 10.7 73.0 0.0 185.0 193.4 196.4 202.1 181.2
16 152 11168.0 0.1 3.7 91.8 28.3 8.6 63.1 0.0 199.6 203.2 203.3 209.8 195.7
16 160 11392.0 0.1 5.2 84.7 21.9 9.6 81.9 0.0 205.7 220.0 248.4 248.8 203.4
16 168 11696.0 0.1 4.9 103.6 10.9 10.1 82.6 0.0 216.4 224.8 269.6 270.7 212.1
16 176 10912.0 0.1 5.9 105.3 30.6 9.9 73.6 0.0 230.7 235.1 235.4 235.7 225.3
16 184 11312.0 0.2 4.2 110.4 28.5 9.5 82.6 0.0 239.8 248.2 271.9 272.2 235.3
16 192 10992.0 0.1 5.4 113.3 43.4 8.6 70.0 0.0 246.1 248.0 248.3 248.8 241.0
16 200 11360.0 0.1 5.8 116.5 36.6 9.9 77.5 0.0 251.4 259.3 272.8 273.2 246.4
16 208 11360.0 0.1 6.1 122.2 43.4 8.5 77.2 0.0 259.1 263.0 265.2 265.9 257.6
16 216 11296.0 0.3 3.3 129.2 37.6 8.7 88.9 0.0 272.2 275.7 275.9 276.3 267.9
16 224 10800.0 0.2 5.2 132.7 43.4 8.3 86.3 0.0 277.4 281.9 282.2 282.9 276.1
16 232 11184.0 0.4 3.2 170.0 12.8 10.5 91.9 0.0 276.9 334.5 335.1 335.5 288.8
16 240 10992.0 0.4 6.2 175.9 27.0 9.4 84.9 0.0 301.9 342.6 348.0 348.2 303.8
16 248 10432.0 0.4 3.8 179.2 12.9 10.8 98.1 0.0 314.7 356.4 376.4 377.8 305.2
16 256 10896.0 0.5 3.7 185.5 38.1 8.6 83.4 0.0 323.5 329.8 332.4 332.7 319.6

Online: NVIDIA DGX-1 (1x V100 32GB), PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX-1 (1x V100 32GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 4992.0 0.1 0.6 9.5 0.4 11.2 3.6 0.0 28.9 29.9 30.2 32.2 25.3
16 16 7192.0 0.1 0.7 12.8 0.9 11.8 8.9 0.0 41.1 43.1 43.5 44.2 35.2
16 24 8496.0 0.1 0.9 16.1 1.1 11.7 13.7 0.0 49.2 51.3 52.5 53.4 43.6
16 32 9264.0 0.1 1.1 19.2 1.8 13.1 17.0 0.0 57.4 58.9 59.0 60.7 52.2
16 40 9808.0 0.1 1.4 21.5 1.8 13.1 23.5 0.0 66.0 66.4 66.5 66.6 61.4
16 48 10528.0 0.1 3.2 18.6 1.6 11.6 36.3 0.0 75.6 77.1 78.3 78.6 71.3
16 56 10480.0 0.1 2.9 20.1 1.7 11.5 44.5 0.0 85.7 86.5 86.6 87.4 80.8
16 64 10352.0 0.1 2.7 21.9 2.0 11.3 51.6 0.0 94.4 95.7 96.5 97.0 89.6
16 72 10864.0 0.1 3.3 24.1 2.2 11.6 58.0 0.0 103.6 105.6 106.1 107.1 99.4
16 80 10992.0 0.1 2.7 35.9 2.3 11.2 54.2 0.0 111.0 111.9 112.8 115.5 106.3
16 88 11648.0 0.1 3.1 46.1 2.3 11.4 53.5 0.0 120.3 121.4 122.1 125.9 116.5
16 96 11140.9 0.1 3.7 55.3 2.6 11.3 52.6 0.0 129.6 131.3 133.1 138.9 125.6
16 104 11280.0 0.1 3.2 61.2 3.1 10.5 57.0 0.0 138.8 140.7 140.7 144.1 135.1
16 112 11824.0 0.1 3.9 65.2 3.6 11.0 60.1 0.0 147.9 149.8 150.2 154.3 143.8
16 120 10864.0 0.1 3.6 71.2 4.6 11.2 62.9 0.0 157.6 158.7 159.4 166.0 153.5
16 128 11552.0 0.1 4.7 75.8 5.0 11.0 66.6 0.0 166.2 170.8 174.3 177.3 163.0
16 136 11152.0 0.1 5.0 81.2 12.7 9.5 66.0 0.0 177.9 181.8 187.7 194.7 174.5
16 144 11008.0 0.1 4.1 87.5 25.8 8.6 61.2 0.0 191.5 193.4 193.6 195.5 187.3
16 152 10992.0 0.1 6.1 89.5 18.9 9.0 71.5 0.0 200.3 207.5 207.7 208.1 195.1
16 160 10656.0 0.1 5.5 91.2 30.9 8.8 68.7 0.0 210.2 215.1 215.6 221.5 205.3
16 168 11024.0 0.1 4.8 96.1 34.5 8.6 70.2 0.0 219.3 224.1 224.8 225.3 214.3
16 176 10864.0 0.1 4.7 101.8 36.7 8.4 70.7 0.0 227.6 229.0 229.2 229.3 222.4
16 184 10896.0 0.1 5.4 107.4 38.1 8.5 73.6 0.0 237.6 242.9 243.1 244.1 233.2
16 192 10992.0 0.1 3.2 115.2 20.8 10.0 93.2 0.0 244.9 257.2 280.7 280.9 242.5
16 200 11552.0 0.2 4.9 118.6 44.4 8.5 73.4 0.0 254.1 257.2 257.2 257.6 250.0
16 208 11236.8 0.2 1.9 124.8 21.1 10.8 101.0 0.0 263.9 281.4 287.4 288.0 259.8
16 216 11504.0 0.2 4.4 126.3 48.3 8.4 79.7 0.0 273.0 275.6 275.9 276.0 267.3
16 224 11056.0 0.4 4.7 131.6 28.3 9.9 102.3 0.0 285.1 290.2 304.5 304.8 277.3
16 232 10528.0 0.3 4.2 169.8 36.7 9.1 73.4 0.0 295.4 317.8 318.4 319.0 293.5
16 240 10485.5 0.2 4.6 173.9 38.0 8.4 76.7 0.0 302.6 303.9 304.2 304.7 301.8
16 248 11168.0 0.3 6.6 175.1 32.5 9.0 88.1 0.0 314.0 331.7 333.7 334.1 311.6
16 256 10384.0 0.4 3.3 184.6 40.0 8.4 82.2 0.0 318.6 321.9 322.1 322.4 318.8

Online: NVIDIA DGX A100 (1x A100 80GB), NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 18304.0 0.0 0.3 3.1 0.1 3.3 0.0 0.0 6.9 7.0 7.1 7.4 6.9
16 16 20448.0 0.0 0.5 6.6 0.1 5.2 0.0 0.0 12.5 15.5 15.6 17.1 12.4
16 24 24448.0 0.0 0.7 8.3 0.2 6.3 0.1 0.0 17.4 17.6 17.7 17.8 15.5
16 32 25312.0 0.0 0.8 10.2 0.2 8.5 0.1 0.0 22.8 24.4 24.7 24.9 19.8
16 40 23232.0 0.0 1.2 14.2 0.4 11.3 0.1 0.0 28.7 30.3 30.4 30.5 27.1
16 48 25296.0 0.0 1.4 9.1 0.4 18.6 0.1 0.0 31.0 32.7 32.7 33.0 29.7
16 56 26560.0 0.0 1.4 16.2 0.4 14.8 0.1 0.0 34.4 40.2 40.4 40.6 32.9
16 64 26848.0 0.0 2.0 16.6 0.4 17.8 0.1 0.0 38.6 39.0 39.1 39.2 36.9
16 72 27632.0 0.0 1.8 22.4 0.5 16.6 0.1 0.0 42.2 47.5 47.7 48.2 41.4
16 80 27808.0 0.0 1.9 25.7 0.5 16.9 0.1 0.0 47.9 48.2 48.4 48.8 45.2
16 88 29152.0 0.0 2.5 22.8 0.6 21.1 0.1 0.0 48.7 49.4 50.4 50.6 47.2
16 96 26352.0 0.0 2.0 33.5 0.6 20.1 0.2 0.0 58.2 58.8 58.9 59.1 56.5
16 104 31824.0 0.0 2.1 27.9 0.8 20.5 0.2 0.0 53.0 53.5 53.6 53.7 51.6
16 112 34992.0 0.0 3.2 24.8 0.9 21.8 0.2 0.0 51.8 59.5 61.5 67.9 50.9
16 120 34496.0 0.0 1.9 29.8 0.9 22.3 0.2 0.0 58.8 66.3 66.7 72.2 55.2
16 128 36784.0 0.0 2.7 30.6 1.1 20.0 0.2 0.0 54.4 59.0 59.1 59.6 54.5
16 136 36912.0 0.0 2.3 33.8 0.9 20.4 0.2 0.0 59.0 59.3 59.5 59.6 57.7
16 144 32672.0 0.1 2.7 42.2 1.1 21.9 0.2 0.0 69.1 71.4 72.9 73.8 68.2
16 152 36576.0 0.1 1.6 37.4 1.3 23.4 0.2 0.0 66.4 70.2 77.5 78.2 63.9
16 160 37824.0 0.1 2.2 42.0 0.9 20.9 0.2 0.0 67.1 72.1 77.5 81.7 66.3
16 168 35536.0 0.1 1.8 49.0 0.8 21.1 0.2 0.0 77.4 81.7 81.9 82.0 72.9
16 176 35488.0 0.1 2.6 51.3 0.8 21.5 0.2 0.0 81.6 82.2 82.4 90.9 76.5
16 184 33744.0 0.1 3.7 56.2 0.8 22.4 0.2 0.0 81.8 91.8 92.1 99.1 83.3
16 192 38032.0 0.1 2.4 51.4 1.1 22.4 0.2 0.0 82.5 83.2 88.0 92.1 77.7
16 200 39632.0 0.1 2.5 49.4 0.9 23.9 0.2 0.0 78.3 83.0 83.3 90.1 76.9
16 208 34400.0 0.1 2.1 66.7 1.1 21.9 0.2 0.0 92.5 93.1 93.3 93.5 92.2
16 216 31712.0 0.1 2.3 80.2 0.9 20.9 0.2 0.0 104.7 105.1 105.2 105.7 104.5
16 224 38016.0 0.1 2.4 65.3 1.2 21.4 0.2 0.0 90.2 93.1 93.2 93.3 90.7
16 232 37168.0 0.1 1.8 72.2 1.1 19.7 0.2 0.0 95.2 95.8 95.9 96.0 95.1
16 240 40832.0 0.1 2.1 60.9 0.9 24.6 0.2 0.0 87.7 105.3 108.2 112.9 88.8
16 248 38272.0 0.1 2.4 71.3 1.3 23.1 0.2 0.0 99.2 102.3 110.3 110.8 98.5
16 256 33472.0 0.1 2.4 90.1 1.1 21.9 0.2 0.0 115.9 116.9 117.4 117.8 115.9

Online: NVIDIA DGX A100 (1x A100 80GB), NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|NVIDIA TensorRT|
|Precision|FP16|
|Model format|NVIDIA TensorRT|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|NVIDIA TensorRT Capture CUDA Graph|Disabled|
|Dataset|traffic|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 18816.0 0.0 0.2 3.1 0.1 3.3 0.0 0.0 6.8 6.8 6.9 6.9 6.8
16 16 20720.0 0.0 0.4 6.5 0.2 5.0 0.1 0.0 12.4 15.6 15.9 17.1 12.2
16 24 23424.0 0.0 0.6 8.9 0.2 6.4 0.1 0.0 17.6 19.5 19.6 19.8 16.2
16 32 23840.0 0.0 1.2 10.4 0.4 9.2 0.1 0.0 23.1 23.4 23.5 23.6 21.3
16 40 27972.0 0.0 1.3 11.2 0.4 9.6 0.1 0.0 23.8 25.2 25.3 25.5 22.6
16 48 28704.0 0.0 1.5 13.3 0.4 11.2 0.1 0.0 28.6 29.0 29.1 30.6 26.5
16 56 26464.0 0.0 1.8 17.3 0.7 13.1 0.1 0.0 32.6 40.4 40.6 40.8 33.1
16 64 27536.0 0.0 1.4 21.8 0.3 12.5 0.1 0.0 37.9 38.3 38.7 40.7 36.2
16 72 33680.0 0.0 1.5 13.5 0.8 17.8 0.1 0.0 35.0 38.4 38.8 40.4 33.7
16 80 27984.0 0.0 1.6 25.5 0.5 16.6 0.1 0.0 47.7 48.2 48.3 48.6 44.4
16 88 36464.0 0.0 1.9 16.8 0.9 18.2 0.2 0.0 39.0 40.7 40.9 41.1 37.9
16 96 35792.0 0.0 1.9 21.1 0.7 17.4 0.1 0.0 42.7 43.0 43.1 43.2 41.4
16 104 35536.0 0.0 2.1 25.9 0.7 17.6 0.1 0.0 48.0 48.2 48.4 48.6 46.4
16 112 30448.0 0.0 2.0 33.5 0.9 20.1 0.1 0.0 58.2 58.7 58.9 59.0 56.8
16 120 32480.0 0.0 2.9 32.9 0.8 20.3 0.2 0.0 58.6 59.0 59.2 60.4 57.2
16 128 34528.0 0.0 2.7 33.1 1.0 20.4 0.2 0.0 58.7 59.1 59.2 59.3 57.4
16 136 37424.0 0.1 1.8 34.3 0.9 19.9 0.2 0.0 58.9 59.4 60.0 60.3 57.1
16 144 33552.0 0.0 2.5 41.1 0.9 21.8 0.2 0.0 68.9 69.2 69.3 69.5 66.6
16 152 35104.0 0.1 2.2 43.0 1.0 21.4 0.2 0.0 69.2 72.3 76.7 81.6 67.7
16 160 31984.0 0.1 2.3 52.8 0.9 20.4 0.2 0.0 81.4 82.0 91.3 91.4 76.7
16 168 35456.0 0.1 2.4 49.3 0.9 20.9 0.2 0.0 71.3 91.3 91.6 92.1 73.8
16 176 33200.0 0.1 2.2 57.0 1.0 20.8 0.2 0.0 82.1 84.1 91.7 92.2 81.2
16 184 32752.0 0.1 1.6 60.2 0.9 21.0 0.2 0.0 81.8 92.0 92.3 92.4 84.1
16 192 36192.0 0.1 2.4 54.7 1.1 23.1 0.2 0.0 84.2 92.2 92.3 93.0 81.7
16 200 37424.0 0.1 2.8 56.8 0.9 20.8 0.2 0.0 82.0 82.2 82.3 82.4 81.6
16 208 35616.0 0.1 2.1 63.3 0.9 22.8 0.2 0.0 91.7 100.4 104.0 104.6 89.3
16 216 37200.0 0.1 2.6 63.9 1.1 21.0 0.2 0.0 89.2 89.5 89.6 89.7 88.8
16 224 32512.0 0.1 2.1 80.5 0.9 20.7 0.2 0.0 104.6 105.0 105.1 105.6 104.5
16 232 40944.0 0.1 2.0 59.3 1.0 24.4 0.2 0.0 89.3 93.4 100.7 101.8 87.0
16 240 37952.0 0.1 2.2 74.6 1.0 17.7 0.2 0.0 94.0 101.3 101.6 103.8 95.7
16 248 37744.0 0.2 2.2 74.6 1.0 23.0 0.2 0.0 101.8 113.0 113.4 114.6 101.1
16 256 31120.0 0.1 2.0 100.8 0.9 20.1 0.1 0.0 124.2 124.9 125.1 125.5 124.2

Online: NVIDIA DGX A100 (1x A100 80GB), PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

|Parameter Name|Parameter Value|
|---|---|
|GPU|NVIDIA DGX A100 (1x A100 80GB)|
|Backend|PyTorch|
|Precision|FP16|
|Model format|TorchScript Trace|
|Max batch size|1024|
|Number of model instances|2|
|Export Precision|FP32|
|Dataset|electricity|
|Device|gpu|
|Request Count|500|

Results table columns: Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms)
16 8 8080.0 0.0 0.2 5.0 0.2 10.1 0.1 0.0 19.3 20.4 20.5 20.9 15.6
16 16 12275.7 0.0 0.4 7.8 0.4 10.1 1.7 0.0 23.3 25.3 25.9 26.3 20.4
16 24 15072.0 0.0 0.6 10.2 0.5 10.5 2.9 0.0 27.3 28.4 28.8 29.6 24.8
16 32 17616.0 0.0 1.0 11.7 0.6 12.0 3.1 0.0 30.9 32.0 32.3 32.6 28.5
16 40 19024.0 0.0 0.9 14.2 0.8 11.7 5.3 0.0 34.9 36.7 37.4 47.0 32.9
16 48 19312.0 0.1 2.1 12.1 1.1 11.8 12.2 0.0 39.9 46.1 49.0 54.4 39.2
16 56 20848.0 0.0 1.4 17.9 1.1 10.0 11.1 0.0 43.6 44.9 46.0 50.8 41.6
16 64 21456.0 0.0 1.9 14.9 1.4 9.7 18.6 0.0 48.2 50.1 51.0 51.3 46.5
16 72 21600.0 0.0 4.1 19.6 1.1 10.4 16.9 0.0 53.9 54.5 54.7 55.8 52.0
16 80 22192.0 0.1 2.1 24.1 2.2 9.5 18.0 0.0 57.9 60.0 61.5 63.2 56.0
16 88 22304.0 0.0 2.1 27.6 3.2 8.8 19.4 0.0 63.5 66.0 66.1 77.3 61.2
16 96 22176.0 0.0 2.6 29.3 4.1 8.7 21.6 0.0 68.6 71.9 76.1 79.0 66.3
16 104 22416.0 0.0 4.4 30.2 1.6 10.8 24.1 0.0 73.4 75.0 75.9 76.5 71.1
16 112 22096.0 0.1 2.9 33.8 10.6 7.4 23.1 0.0 81.6 83.9 84.4 90.5 77.8
16 120 22320.0 0.1 3.0 34.8 10.2 7.9 25.9 0.0 85.6 90.2 102.7 116.7 81.9
16 128 22544.0 0.1 2.9 38.9 12.9 7.1 25.4 0.0 91.8 95.3 103.6 105.4 87.3
16 136 22704.0 0.1 3.8 40.5 13.9 7.1 25.9 0.0 95.4 97.8 98.6 114.4 91.3
16 144 22224.0 0.1 2.3 42.4 18.0 6.8 26.6 0.0 101.8 107.1 108.3 108.4 96.1
16 152 22992.0 0.1 3.3 45.4 19.0 6.8 26.6 0.0 105.8 107.6 108.0 108.8 101.2
16 160 23328.0 0.1 2.5 47.8 11.5 7.6 34.7 0.0 106.5 121.2 123.0 140.4 104.2
16 168 22448.0 0.1 3.7 50.4 15.0 8.8 32.7 0.0 112.6 123.8 126.9 131.8 110.6
16 176 22640.0 0.1 3.6 53.3 14.9 7.7 35.1 0.0 118.0 124.1 128.9 144.0 114.7
16 184 22937.1 0.1 4.0 52.5 23.3 7.1 32.7 0.0 124.3 126.2 127.4 128.0 119.6
16 192 23768.2 0.1 3.6 56.4 20.6 7.1 36.2 0.0 127.9 130.7 136.4 139.0 124.0
16 200 23584.0 0.1 3.9 57.8 24.4 7.2 35.5 0.0 136.1 139.0 140.3 140.7 128.7
16 208 23192.8 0.1 4.8 62.0 20.9 7.8 38.9 0.0 140.9 145.3 170.9 187.7 134.5
16 216 22873.1 0.1 3.6 80.7 17.8 7.4 32.5 0.0 145.1 152.1 158.8 159.7 142.0
16 224 23360.0 0.1 3.7 76.7 19.9 7.4 36.1 0.0 145.4 153.1 166.4 168.8 144.0
16 232 23152.0 0.1 3.8 83.3 17.8 7.8 38.2 0.0 151.2 162.3 176.8 185.3 150.9
16 240 22384.0 0.1 4.1 88.6 21.1 7.1 34.2 0.0 157.6 161.1 166.3 170.4 155.1
16 248 22608.0 0.2 4.5 93.4 18.5 9.3 34.8 0.0 163.3 172.8 186.2 199.5 160.8
16 256 22320.0 0.1 3.0 94.1 16.6 8.1 41.7 0.0 165.4 178.2 188.9 202.4 163.7

Online: NVIDIA DGX A100 (1x A100 80GB), PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

| Parameter Name | Parameter Value |
|:---------------|:----------------|
| GPU | NVIDIA DGX A100 (1x A100 80GB) |
| Backend | PyTorch |
| Precision | FP16 |
| Model format | TorchScript Trace |
| Max batch size | 1024 |
| Number of model instances | 2 |
| Export Precision | FP32 |
| Dataset | traffic |
| Device | gpu |
| Request Count | 500 |
Results Table
| Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 8 | 8032.0 | 0.0 | 0.3 | 5.0 | 0.2 | 10.0 | 0.1 | 0.0 | 19.3 | 20.2 | 20.4 | 21.0 | 15.6 |
| 16 | 16 | 12784.0 | 0.0 | 0.4 | 7.5 | 0.4 | 9.8 | 1.6 | 0.0 | 22.8 | 23.6 | 23.8 | 24.3 | 19.8 |
| 16 | 24 | 15888.0 | 0.0 | 0.7 | 9.3 | 0.5 | 9.8 | 3.6 | 0.0 | 26.5 | 27.3 | 27.5 | 27.7 | 23.9 |
| 16 | 32 | 17952.0 | 0.0 | 0.7 | 10.8 | 0.6 | 9.8 | 6.1 | 0.0 | 30.5 | 31.1 | 31.4 | 31.6 | 28.0 |
| 16 | 40 | 19376.0 | 0.0 | 1.0 | 12.6 | 0.7 | 9.7 | 8.1 | 0.0 | 34.5 | 35.3 | 35.5 | 35.7 | 32.2 |
| 16 | 48 | 20528.0 | 0.0 | 1.4 | 15.9 | 0.9 | 9.6 | 8.6 | 0.0 | 38.7 | 39.5 | 39.8 | 40.1 | 36.4 |
| 16 | 56 | 20848.0 | 0.0 | 1.2 | 18.5 | 0.9 | 10.3 | 10.7 | 0.0 | 43.8 | 45.2 | 45.6 | 46.3 | 41.7 |
| 16 | 64 | 21968.0 | 0.0 | 1.6 | 20.6 | 0.9 | 10.2 | 12.5 | 0.0 | 48.0 | 48.7 | 48.9 | 49.3 | 45.9 |
| 16 | 72 | 22144.0 | 0.1 | 1.7 | 20.8 | 1.2 | 9.8 | 16.7 | 0.0 | 52.5 | 53.6 | 54.1 | 54.7 | 50.3 |
| 16 | 80 | 22656.0 | 0.0 | 2.2 | 23.2 | 2.6 | 9.0 | 18.4 | 0.0 | 57.6 | 59.4 | 59.8 | 62.7 | 55.5 |
| 16 | 88 | 23208.8 | 0.0 | 2.6 | 26.3 | 2.0 | 9.9 | 18.7 | 0.0 | 61.5 | 62.6 | 62.9 | 68.4 | 59.5 |
| 16 | 96 | 22464.0 | 0.0 | 2.6 | 27.4 | 2.6 | 9.0 | 23.7 | 0.0 | 67.3 | 69.6 | 73.2 | 79.3 | 65.4 |
| 16 | 104 | 22752.0 | 0.0 | 2.9 | 31.8 | 3.7 | 8.7 | 22.9 | 0.0 | 72.4 | 76.1 | 78.1 | 85.2 | 70.0 |
| 16 | 112 | 23352.6 | 0.1 | 3.6 | 31.8 | 1.5 | 10.6 | 27.3 | 0.0 | 76.3 | 80.4 | 82.2 | 87.4 | 74.9 |
| 16 | 120 | 22592.0 | 0.1 | 3.7 | 34.0 | 7.5 | 8.1 | 28.6 | 0.0 | 83.8 | 86.1 | 88.0 | 107.9 | 81.9 |
| 16 | 128 | 22288.0 | 0.1 | 3.7 | 38.1 | 8.8 | 8.1 | 26.6 | 0.0 | 87.9 | 99.0 | 100.6 | 113.3 | 85.4 |
| 16 | 136 | 23440.0 | 0.1 | 3.1 | 38.2 | 16.5 | 6.7 | 25.4 | 0.0 | 94.0 | 99.6 | 100.7 | 102.5 | 90.1 |
| 16 | 144 | 22864.0 | 0.1 | 2.8 | 43.7 | 14.4 | 7.3 | 27.5 | 0.0 | 99.4 | 102.7 | 104.8 | 121.1 | 95.7 |
| 16 | 152 | 23224.8 | 0.1 | 3.9 | 45.5 | 11.7 | 7.6 | 31.4 | 0.0 | 103.0 | 108.4 | 116.6 | 128.1 | 100.2 |
| 16 | 160 | 22496.0 | 0.1 | 4.3 | 46.8 | 13.1 | 7.7 | 34.3 | 0.0 | 110.5 | 115.9 | 125.3 | 136.9 | 106.2 |
| 16 | 168 | 23760.0 | 0.1 | 3.4 | 49.5 | 18.7 | 7.2 | 29.3 | 0.0 | 111.9 | 113.3 | 113.8 | 135.5 | 108.1 |
| 16 | 176 | 23328.0 | 0.1 | 3.9 | 51.5 | 21.3 | 7.6 | 29.1 | 0.0 | 116.8 | 120.4 | 121.2 | 124.7 | 113.5 |
| 16 | 184 | 23440.0 | 0.1 | 4.1 | 52.6 | 21.0 | 6.9 | 34.0 | 0.0 | 123.0 | 127.5 | 128.1 | 129.3 | 118.6 |
| 16 | 192 | 23728.0 | 0.1 | 3.7 | 56.8 | 19.4 | 7.0 | 35.9 | 0.0 | 122.8 | 123.1 | 123.2 | 123.3 | 122.8 |
| 16 | 200 | 23808.0 | 0.1 | 4.8 | 57.8 | 23.0 | 7.0 | 33.6 | 0.0 | 128.3 | 132.6 | 133.2 | 136.8 | 126.3 |
| 16 | 208 | 23856.0 | 0.1 | 4.2 | 59.0 | 25.7 | 7.2 | 35.1 | 0.0 | 138.1 | 140.9 | 141.2 | 141.6 | 131.2 |
| 16 | 216 | 23200.0 | 0.1 | 3.6 | 64.5 | 23.8 | 6.9 | 36.7 | 0.0 | 135.5 | 136.1 | 136.6 | 136.7 | 135.6 |
| 16 | 224 | 24384.0 | 0.1 | 4.8 | 67.1 | 24.7 | 6.7 | 36.5 | 0.0 | 139.9 | 140.9 | 141.1 | 142.8 | 139.9 |
| 16 | 232 | 23040.0 | 0.1 | 4.1 | 83.9 | 20.1 | 7.0 | 33.5 | 0.0 | 152.9 | 158.9 | 168.2 | 169.6 | 148.6 |
| 16 | 240 | 23496.5 | 0.1 | 3.1 | 87.0 | 20.9 | 7.1 | 35.2 | 0.0 | 156.1 | 159.9 | 168.7 | 171.1 | 153.3 |
| 16 | 248 | 23072.0 | 0.1 | 4.1 | 95.5 | 13.4 | 8.5 | 38.0 | 0.0 | 161.2 | 178.6 | 179.7 | 193.0 | 159.5 |
| 16 | 256 | 21952.0 | 0.1 | 4.0 | 97.0 | 15.3 | 7.7 | 38.3 | 0.0 | 164.7 | 186.0 | 192.8 | 194.8 | 162.4 |

Online: NVIDIA T4, NVIDIA TensorRT with FP16, Dataset: electricity

Our results were obtained using the following configuration:

| Parameter Name | Parameter Value |
|:---------------|:----------------|
| GPU | NVIDIA T4 |
| Backend | NVIDIA TensorRT |
| Precision | FP16 |
| Model format | NVIDIA TensorRT |
| Max batch size | 1024 |
| Number of model instances | 2 |
| Export Precision | FP32 |
| NVIDIA TensorRT Capture CUDA Graph | Disabled |
| Dataset | electricity |
| Device | gpu |
| Request Count | 500 |
Results Table
| Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 8 | 10048.0 | 0.1 | 0.7 | 5.3 | 0.1 | 6.3 | 0.0 | 0.0 | 12.6 | 12.7 | 12.7 | 12.8 | 12.6 |
| 16 | 16 | 8464.0 | 0.1 | 1.0 | 15.6 | 0.2 | 13.0 | 0.0 | 0.0 | 30.5 | 41.0 | 41.5 | 41.7 | 29.9 |
| 16 | 24 | 9472.0 | 0.1 | 1.4 | 19.2 | 0.2 | 17.9 | 0.0 | 0.0 | 41.4 | 57.5 | 57.8 | 62.8 | 38.9 |
| 16 | 32 | 9568.0 | 0.1 | 2.0 | 20.2 | 0.3 | 30.3 | 0.0 | 0.0 | 57.4 | 61.5 | 61.6 | 61.7 | 53.1 |
| 16 | 40 | 9616.0 | 0.1 | 2.4 | 31.6 | 0.3 | 29.4 | 0.0 | 0.0 | 70.4 | 71.3 | 71.6 | 72.0 | 63.9 |
| 16 | 48 | 9872.0 | 0.1 | 3.8 | 34.9 | 0.5 | 35.9 | 0.1 | 0.0 | 71.1 | 108.0 | 108.8 | 109.3 | 75.3 |
| 16 | 56 | 9024.0 | 0.1 | 2.8 | 54.7 | 0.3 | 36.5 | 0.0 | 0.0 | 100.7 | 101.2 | 101.7 | 101.8 | 94.5 |
| 16 | 64 | 9536.0 | 0.1 | 4.1 | 37.6 | 0.6 | 61.2 | 0.1 | 0.0 | 108.4 | 109.0 | 109.3 | 109.5 | 103.7 |
| 16 | 72 | 8016.0 | 0.1 | 3.7 | 74.4 | 0.5 | 53.0 | 0.0 | 0.0 | 137.2 | 138.0 | 138.3 | 138.5 | 131.7 |
| 16 | 80 | 9328.0 | 0.1 | 3.8 | 71.0 | 0.6 | 57.2 | 0.1 | 0.0 | 137.5 | 138.6 | 139.6 | 139.8 | 132.7 |
| 16 | 88 | 8240.0 | 0.1 | 3.0 | 85.8 | 0.6 | 61.5 | 0.0 | 0.0 | 158.5 | 175.1 | 176.1 | 176.9 | 151.0 |
| 16 | 96 | 9504.0 | 0.1 | 3.8 | 91.9 | 0.6 | 57.2 | 0.0 | 0.0 | 158.4 | 159.8 | 160.6 | 196.6 | 153.7 |
| 16 | 104 | 9526.5 | 0.2 | 3.6 | 96.2 | 0.8 | 69.6 | 0.0 | 0.0 | 175.4 | 176.3 | 176.3 | 176.6 | 170.4 |
| 16 | 112 | 9424.0 | 0.2 | 3.8 | 94.8 | 0.9 | 70.9 | 0.1 | 0.0 | 175.9 | 176.9 | 177.0 | 177.1 | 170.6 |
| 16 | 120 | 9280.0 | 0.2 | 4.0 | 116.7 | 0.9 | 69.5 | 0.1 | 0.0 | 196.2 | 196.8 | 196.9 | 197.2 | 191.4 |
| 16 | 128 | 9552.0 | 0.2 | 4.3 | 116.8 | 0.9 | 69.3 | 0.1 | 0.0 | 196.4 | 197.2 | 197.4 | 197.6 | 191.5 |
| 16 | 136 | 10165.8 | 0.3 | 3.3 | 117.3 | 1.0 | 69.4 | 0.1 | 0.0 | 196.9 | 197.4 | 197.6 | 197.8 | 191.4 |
| 16 | 144 | 10400.0 | 0.3 | 4.6 | 115.3 | 1.0 | 70.9 | 0.1 | 0.0 | 196.6 | 197.2 | 197.4 | 197.7 | 192.1 |
| 16 | 152 | 9350.6 | 0.3 | 5.1 | 146.4 | 1.0 | 77.2 | 0.1 | 0.0 | 234.6 | 235.3 | 235.6 | 236.0 | 230.1 |
| 16 | 160 | 9744.0 | 0.3 | 4.8 | 145.9 | 1.1 | 77.0 | 0.1 | 0.0 | 234.1 | 234.9 | 235.3 | 235.6 | 229.2 |
| 16 | 168 | 7520.0 | 0.5 | 2.7 | 220.8 | 0.9 | 77.2 | 0.1 | 0.0 | 311.0 | 312.4 | 312.5 | 312.8 | 301.9 |
| 16 | 176 | 7880.1 | 0.5 | 4.0 | 227.3 | 0.9 | 77.0 | 0.1 | 0.0 | 311.6 | 312.7 | 312.8 | 313.1 | 309.8 |
| 16 | 184 | 9760.0 | 0.8 | 5.3 | 183.3 | 1.0 | 73.3 | 0.1 | 0.0 | 256.0 | 275.9 | 276.2 | 276.4 | 263.9 |
| 16 | 192 | 9312.0 | 0.8 | 3.8 | 197.8 | 0.9 | 70.4 | 0.1 | 0.0 | 275.1 | 275.9 | 276.0 | 276.5 | 273.9 |
| 16 | 200 | 8880.0 | 0.9 | 3.5 | 229.1 | 1.0 | 77.2 | 0.1 | 0.0 | 312.8 | 313.9 | 314.0 | 314.2 | 311.7 |
| 16 | 208 | 10992.0 | 1.1 | 3.4 | 188.8 | 1.1 | 71.6 | 0.2 | 0.0 | 266.3 | 266.9 | 267.1 | 267.5 | 266.1 |
| 16 | 216 | 9600.0 | 0.8 | 4.8 | 228.0 | 1.1 | 77.2 | 0.1 | 0.0 | 313.0 | 314.2 | 314.5 | 315.4 | 311.9 |
| 16 | 224 | 9776.0 | 1.1 | 3.8 | 228.5 | 1.1 | 77.2 | 0.1 | 0.0 | 313.0 | 313.7 | 313.8 | 314.0 | 311.9 |
| 16 | 232 | 10928.0 | 1.1 | 3.5 | 220.3 | 1.1 | 69.4 | 0.1 | 0.0 | 296.0 | 296.9 | 297.0 | 297.4 | 295.5 |
| 16 | 240 | 10752.0 | 1.3 | 4.2 | 228.7 | 1.1 | 77.2 | 0.2 | 0.0 | 313.3 | 314.0 | 314.1 | 314.3 | 312.8 |
| 16 | 248 | 9878.1 | 1.4 | 5.1 | 249.7 | 1.2 | 74.8 | 0.2 | 0.0 | 332.9 | 334.1 | 334.3 | 334.6 | 332.4 |
| 16 | 256 | 10368.0 | 1.2 | 4.7 | 251.1 | 1.1 | 74.9 | 0.2 | 0.0 | 333.6 | 334.4 | 334.6 | 335.3 | 333.2 |

Online: NVIDIA T4, NVIDIA TensorRT with FP16, Dataset: traffic

Our results were obtained using the following configuration:

| Parameter Name | Parameter Value |
|:---------------|:----------------|
| GPU | NVIDIA T4 |
| Backend | NVIDIA TensorRT |
| Precision | FP16 |
| Model format | NVIDIA TensorRT |
| Max batch size | 1024 |
| Number of model instances | 2 |
| Export Precision | FP32 |
| NVIDIA TensorRT Capture CUDA Graph | Disabled |
| Dataset | traffic |
| Device | gpu |
| Request Count | 500 |
Results Table
| Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 8 | 10176.0 | 0.1 | 0.7 | 5.2 | 0.1 | 6.2 | 0.0 | 0.0 | 12.4 | 12.5 | 12.6 | 12.6 | 12.4 |
| 16 | 16 | 8880.0 | 0.1 | 0.9 | 14.6 | 0.1 | 12.4 | 0.0 | 0.0 | 28.6 | 37.0 | 41.6 | 41.9 | 28.3 |
| 16 | 24 | 9520.0 | 0.1 | 1.3 | 19.9 | 0.2 | 17.8 | 0.0 | 0.0 | 41.6 | 50.9 | 57.3 | 61.9 | 39.4 |
| 16 | 32 | 9152.0 | 0.1 | 2.1 | 21.0 | 0.3 | 30.8 | 0.0 | 0.0 | 57.9 | 62.3 | 63.1 | 65.2 | 54.3 |
| 16 | 40 | 9712.0 | 0.1 | 2.7 | 30.0 | 0.3 | 31.6 | 0.0 | 0.0 | 70.7 | 71.2 | 71.4 | 71.6 | 64.8 |
| 16 | 48 | 8000.0 | 0.1 | 3.4 | 28.3 | 0.4 | 61.5 | 0.1 | 0.0 | 95.8 | 104.0 | 104.1 | 104.2 | 93.7 |
| 16 | 56 | 9376.0 | 0.1 | 3.9 | 24.7 | 0.6 | 64.1 | 0.1 | 0.0 | 95.4 | 104.5 | 105.3 | 106.0 | 93.4 |
| 16 | 64 | 8192.0 | 0.1 | 3.4 | 55.8 | 0.5 | 58.8 | 0.0 | 0.0 | 124.4 | 124.7 | 125.2 | 125.3 | 118.7 |
| 16 | 72 | 8432.0 | 0.1 | 2.2 | 73.0 | 0.5 | 51.0 | 0.0 | 0.0 | 137.8 | 138.8 | 139.1 | 139.4 | 126.9 |
| 16 | 80 | 8944.0 | 0.1 | 4.3 | 71.9 | 0.5 | 55.9 | 0.1 | 0.0 | 137.2 | 138.6 | 138.8 | 139.0 | 132.7 |
| 16 | 88 | 7936.0 | 0.1 | 3.0 | 93.5 | 0.7 | 72.3 | 0.1 | 0.0 | 175.2 | 176.1 | 176.3 | 176.4 | 169.6 |
| 16 | 96 | 9152.0 | 0.2 | 3.0 | 92.8 | 0.7 | 56.4 | 0.1 | 0.0 | 159.0 | 159.4 | 159.5 | 159.8 | 153.1 |
| 16 | 104 | 9510.5 | 0.1 | 3.5 | 93.2 | 0.7 | 57.0 | 0.1 | 0.0 | 159.3 | 159.9 | 159.9 | 160.1 | 154.6 |
| 16 | 112 | 10709.3 | 0.2 | 2.8 | 91.4 | 0.9 | 61.3 | 0.1 | 0.0 | 159.2 | 160.2 | 160.4 | 196.7 | 156.7 |
| 16 | 120 | 8848.0 | 0.2 | 3.5 | 116.2 | 0.9 | 70.3 | 0.1 | 0.0 | 196.7 | 198.1 | 198.5 | 199.3 | 191.2 |
| 16 | 128 | 9472.0 | 0.2 | 3.8 | 118.7 | 0.8 | 68.4 | 0.1 | 0.0 | 196.6 | 197.2 | 197.3 | 197.4 | 192.0 |
| 16 | 136 | 10208.0 | 0.2 | 4.1 | 117.3 | 0.9 | 69.6 | 0.1 | 0.0 | 196.9 | 197.8 | 198.1 | 199.0 | 192.2 |
| 16 | 144 | 8599.4 | 0.2 | 4.2 | 146.6 | 0.9 | 77.2 | 0.1 | 0.0 | 234.1 | 235.2 | 235.7 | 236.0 | 229.3 |
| 16 | 152 | 9110.9 | 0.3 | 4.2 | 146.5 | 1.0 | 77.3 | 0.1 | 0.0 | 235.0 | 235.6 | 235.7 | 236.0 | 229.4 |
| 16 | 160 | 7680.0 | 0.4 | 3.2 | 196.0 | 0.8 | 72.5 | 0.1 | 0.0 | 274.5 | 275.2 | 275.6 | 276.1 | 273.1 |
| 16 | 168 | 9968.0 | 0.5 | 4.3 | 147.3 | 1.2 | 77.3 | 0.1 | 0.0 | 234.8 | 236.1 | 236.3 | 236.7 | 230.7 |
| 16 | 176 | 9248.0 | 0.6 | 3.4 | 197.3 | 0.9 | 71.7 | 0.1 | 0.0 | 275.6 | 276.8 | 276.9 | 277.1 | 274.0 |
| 16 | 184 | 8871.1 | 0.6 | 4.2 | 203.9 | 1.1 | 70.7 | 0.1 | 0.0 | 275.5 | 313.3 | 313.9 | 314.6 | 280.6 |
| 16 | 192 | 11252.7 | 0.5 | 5.4 | 151.3 | 1.5 | 77.1 | 0.1 | 0.0 | 235.9 | 237.3 | 237.6 | 238.7 | 235.9 |
| 16 | 200 | 10896.0 | 0.8 | 3.9 | 175.2 | 1.2 | 73.2 | 0.2 | 0.0 | 255.9 | 256.5 | 256.6 | 257.4 | 254.4 |
| 16 | 208 | 11040.0 | 1.1 | 3.5 | 195.6 | 1.1 | 73.1 | 0.1 | 0.0 | 275.9 | 276.8 | 276.9 | 277.1 | 274.6 |
| 16 | 216 | 10384.0 | 1.1 | 4.0 | 215.2 | 1.1 | 71.2 | 0.1 | 0.0 | 295.2 | 296.3 | 296.7 | 297.4 | 292.8 |
| 16 | 224 | 10752.0 | 0.9 | 4.5 | 224.8 | 1.4 | 70.8 | 0.1 | 0.0 | 297.4 | 317.0 | 317.4 | 318.4 | 302.5 |
| 16 | 232 | 10144.0 | 1.0 | 3.7 | 244.1 | 1.0 | 75.1 | 0.2 | 0.0 | 324.5 | 332.0 | 332.9 | 333.0 | 325.0 |
| 16 | 240 | 10560.0 | 1.2 | 4.4 | 228.1 | 1.1 | 77.3 | 0.2 | 0.0 | 313.6 | 314.8 | 315.0 | 315.2 | 312.3 |
| 16 | 248 | 10896.0 | 1.5 | 4.0 | 245.3 | 1.2 | 75.3 | 0.2 | 0.0 | 326.0 | 334.1 | 334.5 | 335.4 | 327.5 |
| 16 | 256 | 11264.0 | 1.5 | 4.3 | 230.6 | 1.7 | 77.0 | 0.2 | 0.0 | 315.4 | 316.4 | 316.6 | 317.0 | 315.4 |

Online: NVIDIA T4, PyTorch with FP16, Dataset: electricity

Our results were obtained using the following configuration:

| Parameter Name | Parameter Value |
|:---------------|:----------------|
| GPU | NVIDIA T4 |
| Backend | PyTorch |
| Precision | FP16 |
| Model format | TorchScript Trace |
| Max batch size | 1024 |
| Number of model instances | 2 |
| Export Precision | FP32 |
| Dataset | electricity |
| Device | gpu |
| Request Count | 500 |
Results Table
| Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 8 | 3264.0 | 0.1 | 0.6 | 13.9 | 0.8 | 8.9 | 14.2 | 0.0 | 43.8 | 47.8 | 50.1 | 52.1 | 38.5 |
| 16 | 16 | 3669.3 | 0.1 | 1.0 | 26.2 | 2.0 | 9.1 | 30.3 | 0.0 | 76.8 | 82.8 | 84.7 | 86.7 | 68.7 |
| 16 | 24 | 3760.0 | 0.1 | 1.6 | 37.0 | 2.7 | 9.1 | 50.0 | 0.0 | 111.8 | 114.0 | 114.5 | 117.8 | 100.4 |
| 16 | 32 | 3818.7 | 0.1 | 1.3 | 58.1 | 1.9 | 9.0 | 61.7 | 0.0 | 143.8 | 146.6 | 148.1 | 150.5 | 132.2 |
| 16 | 40 | 3801.4 | 0.1 | 3.0 | 69.5 | 2.0 | 8.9 | 80.0 | 0.0 | 175.5 | 180.4 | 180.8 | 181.7 | 163.4 |
| 16 | 48 | 3822.7 | 0.1 | 3.4 | 77.8 | 6.0 | 9.1 | 98.1 | 0.0 | 205.7 | 209.7 | 211.7 | 216.0 | 194.6 |
| 16 | 56 | 3785.4 | 0.1 | 4.7 | 77.8 | 4.2 | 8.8 | 128.9 | 0.0 | 236.4 | 239.9 | 241.8 | 242.0 | 224.5 |
| 16 | 64 | 3669.3 | 0.1 | 4.8 | 65.2 | 10.4 | 8.4 | 169.2 | 0.0 | 270.8 | 277.5 | 278.0 | 278.2 | 258.2 |
| 16 | 72 | 3769.4 | 0.1 | 4.6 | 129.8 | 5.5 | 8.2 | 140.6 | 0.0 | 300.9 | 305.2 | 306.5 | 306.8 | 288.8 |
| 16 | 80 | 3528.0 | 0.1 | 4.7 | 102.8 | 15.8 | 7.3 | 190.4 | 0.0 | 335.5 | 342.8 | 342.9 | 384.7 | 321.2 |
| 16 | 88 | 3594.7 | 0.1 | 4.0 | 158.6 | 15.5 | 9.1 | 163.3 | 0.0 | 363.4 | 369.4 | 370.6 | 420.0 | 350.6 |
| 16 | 96 | 3700.1 | 0.1 | 4.4 | 187.4 | 22.6 | 8.4 | 159.2 | 0.0 | 394.9 | 397.8 | 398.7 | 412.2 | 382.2 |
| 16 | 104 | 3710.8 | 0.1 | 6.4 | 191.4 | 31.9 | 8.7 | 178.8 | 0.0 | 430.1 | 432.2 | 463.7 | 465.9 | 417.4 |
| 16 | 112 | 3680.0 | 0.1 | 6.1 | 213.8 | 33.0 | 8.5 | 187.7 | 0.0 | 461.4 | 464.6 | 465.3 | 465.5 | 449.4 |
| 16 | 120 | 3616.0 | 0.1 | 7.5 | 158.8 | 27.8 | 7.7 | 274.8 | 0.0 | 489.4 | 493.1 | 500.8 | 501.0 | 476.8 |
| 16 | 128 | 3514.7 | 0.2 | 5.2 | 188.4 | 83.0 | 8.0 | 223.8 | 0.0 | 525.3 | 531.1 | 531.6 | 573.8 | 508.6 |
| 16 | 136 | 3716.1 | 0.2 | 5.4 | 243.3 | 67.8 | 8.0 | 210.6 | 0.0 | 547.8 | 551.0 | 551.6 | 552.1 | 535.2 |
| 16 | 144 | 3168.0 | 0.2 | 3.6 | 263.3 | 76.0 | 8.6 | 213.1 | 0.0 | 583.8 | 720.5 | 720.8 | 721.4 | 564.8 |
| 16 | 152 | 3642.7 | 0.2 | 6.6 | 232.6 | 57.1 | 7.4 | 292.4 | 0.0 | 607.9 | 609.5 | 610.0 | 619.0 | 596.4 |
| 16 | 160 | 3512.0 | 0.3 | 3.6 | 280.5 | 119.6 | 7.3 | 221.4 | 0.0 | 647.3 | 650.8 | 651.4 | 666.6 | 632.7 |
| 16 | 168 | 3206.4 | 0.2 | 6.4 | 283.2 | 116.6 | 7.9 | 243.7 | 0.0 | 669.6 | 670.4 | 670.5 | 670.7 | 657.9 |
| 16 | 176 | 3550.8 | 0.4 | 6.3 | 334.8 | 109.5 | 7.0 | 239.9 | 0.0 | 710.4 | 714.1 | 720.1 | 722.4 | 697.9 |
| 16 | 184 | 3462.3 | 0.4 | 5.4 | 334.5 | 141.1 | 6.6 | 235.4 | 0.0 | 739.5 | 741.4 | 755.4 | 755.7 | 723.5 |
| 16 | 192 | 3232.0 | 0.4 | 6.8 | 350.1 | 135.7 | 7.2 | 255.5 | 0.0 | 769.6 | 774.4 | 786.3 | 786.6 | 755.7 |
| 16 | 200 | 3578.7 | 0.5 | 5.9 | 366.7 | 157.9 | 6.5 | 250.9 | 0.0 | 801.4 | 807.8 | 808.4 | 808.8 | 788.3 |
| 16 | 208 | 3384.0 | 0.4 | 5.7 | 384.7 | 134.6 | 7.5 | 283.0 | 0.0 | 827.6 | 832.8 | 836.8 | 837.3 | 816.0 |
| 16 | 216 | 2952.0 | 0.7 | 5.4 | 419.1 | 145.7 | 6.8 | 265.2 | 0.0 | 844.8 | 851.7 | 851.8 | 852.1 | 842.9 |
| 16 | 224 | 3198.4 | 0.8 | 1.5 | 491.9 | 138.6 | 6.9 | 231.5 | 0.0 | 882.4 | 900.1 | 901.0 | 904.3 | 871.1 |
| 16 | 232 | 3370.7 | 1.1 | 6.2 | 436.3 | 169.3 | 7.0 | 281.1 | 0.0 | 900.1 | 906.2 | 906.4 | 906.6 | 900.9 |
| 16 | 240 | 3514.7 | 1.2 | 4.7 | 457.9 | 188.6 | 7.5 | 278.4 | 0.0 | 941.9 | 947.9 | 948.0 | 948.2 | 938.4 |
| 16 | 248 | 3294.9 | 1.1 | 6.2 | 572.9 | 132.5 | 8.2 | 259.2 | 0.0 | 981.8 | 987.8 | 990.1 | 990.2 | 980.0 |
| 16 | 256 | 3144.0 | 0.7 | 8.5 | 602.8 | 120.8 | 7.3 | 269.7 | 0.0 | 1010.5 | 1247.8 | 1248.0 | 1248.8 | 1009.9 |

Online: NVIDIA T4, PyTorch with FP16, Dataset: traffic

Our results were obtained using the following configuration:

| Parameter Name | Parameter Value |
|:---------------|:----------------|
| GPU | NVIDIA T4 |
| Backend | PyTorch |
| Precision | FP16 |
| Model format | TorchScript Trace |
| Max batch size | 1024 |
| Number of model instances | 2 |
| Export Precision | FP32 |
| Dataset | traffic |
| Device | gpu |
| Request Count | 500 |
Results Table
| Batch | Concurrency | Inferences/Second | Client Send (ms) | Network+Server Send/Recv (ms) | Server Queue (ms) | Server Compute Input (ms) | Server Compute Infer (ms) | Server Compute Output (ms) | Client Recv (ms) | p50 latency (ms) | p90 latency (ms) | p95 latency (ms) | p99 latency (ms) | avg latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 8 | 3486.8 | 0.1 | 0.8 | 10.6 | 1.6 | 10.0 | 13.2 | 0.0 | 43.3 | 47.9 | 48.4 | 49.4 | 36.5 |
| 16 | 16 | 3668.1 | 0.1 | 0.9 | 25.4 | 2.2 | 9.1 | 30.4 | 0.0 | 77.2 | 82.6 | 83.9 | 87.3 | 68.0 |
| 16 | 24 | 3764.1 | 0.1 | 1.4 | 40.4 | 2.2 | 9.1 | 46.5 | 0.0 | 111.1 | 115.9 | 116.9 | 117.6 | 99.7 |
| 16 | 32 | 3822.7 | 0.1 | 2.2 | 56.6 | 1.8 | 8.9 | 61.3 | 0.0 | 142.5 | 145.5 | 147.1 | 151.0 | 130.9 |
| 16 | 40 | 3785.4 | 0.1 | 2.6 | 69.6 | 1.9 | 8.9 | 79.1 | 0.0 | 174.4 | 179.3 | 180.0 | 181.6 | 162.2 |
| 16 | 48 | 3854.7 | 0.1 | 4.3 | 67.3 | 4.2 | 8.9 | 107.5 | 0.0 | 205.1 | 209.3 | 209.5 | 212.6 | 192.4 |
| 16 | 56 | 3786.7 | 0.1 | 3.2 | 99.9 | 5.0 | 8.5 | 108.0 | 0.0 | 236.7 | 240.9 | 242.2 | 242.8 | 224.7 |
| 16 | 64 | 3882.7 | 0.1 | 6.3 | 65.8 | 8.2 | 8.3 | 168.3 | 0.0 | 269.1 | 275.5 | 276.0 | 378.1 | 257.1 |
| 16 | 72 | 3690.7 | 0.1 | 6.5 | 103.0 | 11.5 | 8.0 | 159.3 | 0.0 | 300.2 | 303.5 | 304.8 | 391.1 | 288.5 |
| 16 | 80 | 3669.3 | 0.1 | 6.9 | 95.3 | 19.2 | 7.0 | 193.2 | 0.0 | 333.9 | 338.4 | 338.6 | 339.3 | 321.8 |
| 16 | 88 | 3646.2 | 0.1 | 4.8 | 145.9 | 22.0 | 7.1 | 171.3 | 0.0 | 364.1 | 368.4 | 368.6 | 368.7 | 351.2 |
| 16 | 96 | 3712.0 | 0.1 | 6.3 | 174.7 | 32.3 | 7.0 | 159.8 | 0.0 | 394.4 | 399.8 | 400.2 | 400.6 | 380.1 |
| 16 | 104 | 3701.3 | 0.1 | 5.2 | 192.4 | 39.3 | 7.1 | 169.3 | 0.0 | 427.6 | 434.3 | 434.4 | 435.1 | 413.5 |
| 16 | 112 | 3686.2 | 0.1 | 5.8 | 204.9 | 41.2 | 6.9 | 186.4 | 0.0 | 458.5 | 462.0 | 462.3 | 464.8 | 445.5 |
| 16 | 120 | 3600.0 | 0.2 | 5.6 | 221.5 | 28.2 | 7.2 | 211.1 | 0.0 | 487.2 | 491.1 | 491.7 | 491.9 | 473.7 |
| 16 | 128 | 3656.0 | 0.2 | 9.2 | 157.3 | 27.6 | 6.8 | 307.7 | 0.0 | 518.4 | 525.4 | 525.5 | 526.8 | 508.7 |
| 16 | 136 | 3710.8 | 0.2 | 6.8 | 249.1 | 83.8 | 7.3 | 191.2 | 0.0 | 552.1 | 555.3 | 562.4 | 562.6 | 538.2 |
| 16 | 144 | 3593.5 | 0.2 | 5.3 | 267.5 | 77.6 | 6.8 | 213.9 | 0.0 | 583.8 | 586.1 | 587.0 | 587.8 | 571.3 |
| 16 | 152 | 3630.8 | 0.2 | 6.8 | 258.2 | 98.5 | 7.3 | 230.0 | 0.0 | 613.0 | 618.2 | 621.6 | 622.2 | 600.9 |
| 16 | 160 | 3464.0 | 0.2 | 8.6 | 259.1 | 112.2 | 6.8 | 240.4 | 0.0 | 640.7 | 644.5 | 644.6 | 644.8 | 627.2 |
| 16 | 168 | 3240.0 | 0.3 | 6.4 | 278.2 | 104.2 | 7.2 | 261.6 | 0.0 | 672.9 | 676.3 | 676.5 | 677.1 | 657.9 |
| 16 | 176 | 3376.0 | 0.3 | 6.2 | 298.0 | 126.7 | 6.1 | 254.5 | 0.0 | 701.3 | 706.9 | 707.0 | 707.2 | 691.8 |
| 16 | 184 | 3632.0 | 0.3 | 7.2 | 334.7 | 125.6 | 7.4 | 249.8 | 0.0 | 737.0 | 741.4 | 745.2 | 745.6 | 725.0 |
| 16 | 192 | 3504.0 | 0.5 | 7.5 | 362.4 | 125.7 | 7.2 | 252.9 | 0.0 | 766.8 | 768.9 | 769.1 | 769.3 | 756.1 |
| 16 | 200 | 3246.4 | 0.5 | 5.1 | 360.5 | 161.5 | 6.7 | 247.9 | 0.0 | 794.4 | 797.6 | 797.7 | 798.1 | 782.2 |
| 16 | 208 | 3344.0 | 0.4 | 5.6 | 463.1 | 109.0 | 7.1 | 234.1 | 0.0 | 827.3 | 830.1 | 830.4 | 859.6 | 819.4 |
| 16 | 216 | 3192.0 | 0.4 | 9.0 | 409.4 | 153.2 | 6.9 | 268.5 | 0.0 | 859.0 | 862.5 | 862.6 | 862.8 | 847.3 |
| 16 | 224 | 3312.0 | 0.5 | 6.5 | 424.0 | 179.8 | 6.6 | 257.1 | 0.0 | 888.1 | 893.6 | 900.8 | 901.6 | 874.5 |
| 16 | 232 | 3449.5 | 0.5 | 7.0 | 517.0 | 114.4 | 7.3 | 265.1 | 0.0 | 913.9 | 915.8 | 920.3 | 924.9 | 911.4 |
| 16 | 240 | 3392.0 | 0.7 | 12.9 | 555.7 | 100.4 | 8.9 | 289.1 | 0.0 | 952.8 | 1071.4 | 1138.9 | 1139.4 | 967.6 |
| 16 | 248 | 3321.6 | 0.7 | 6.1 | 474.4 | 132.1 | 8.3 | 339.2 | 0.0 | 959.6 | 967.6 | 968.1 | 968.5 | 960.8 |
| 16 | 256 | 3152.0 | 0.7 | 6.1 | 583.5 | 118.6 | 7.7 | 287.4 | 0.0 | 1008.6 | 1026.3 | 1042.2 | 1042.6 | 1004.0 |

Advanced

| Inference runtime | Mnemonic used in scripts |
|:------------------|:-------------------------|
| TorchScript Tracing | ts-trace |
| TorchScript Scripting | ts-script |
| ONNX | onnx |
| NVIDIA TensorRT | trt |

Step by step deployment process

The commands described below can be used for exporting, converting, and profiling the model.

Clone Repository

IMPORTANT: This step is executed on the host computer.

Clone Repository Command
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Forecasting/TFT
```

Setup Environment

Set up the environment on the host computer and start Triton Inference Server.

Setup Environment Command
```bash
source ./triton/scripts/setup_environment.sh
bash ./triton/scripts/docker/triton_inference_server.sh
```
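
The script above starts Triton Inference Server in a Docker container. For orientation, here is a minimal sketch of an equivalent manual launch, assuming the NGC Triton image tag 21.08 (matching the --container-version used during conversion below) and the default model store location; the actual script may differ in details:

```bash
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/runner_workspace/model_store:/models" \
  nvcr.io/nvidia/tritonserver:21.08-py3 \
  tritonserver --model-repository=/models --model-control-mode=explicit
```

Ports 8000, 8001, and 8002 are Triton's default HTTP, gRPC, and metrics endpoints, respectively; explicit model control matches the TRITON_LOAD_MODEL_METHOD used later.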

Prepare Dataset

Please use the data download steps from the Main QSG

Prepare Checkpoint

Please place a checkpoint.pt from TFT trained on electricity in runner_workspace/checkpoints/electricity_bin/. Note that the electricity_bin subdirectory may not have been created yet, as sketched below. Alternatively, you can download a zip archive of a trained checkpoint here
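
A minimal sketch of placing the checkpoint, assuming a hypothetical local path /path/to/your/checkpoint.pt:

```bash
# The electricity_bin subdirectory may not exist yet, so create it first.
mkdir -p runner_workspace/checkpoints/electricity_bin
cp /path/to/your/checkpoint.pt runner_workspace/checkpoints/electricity_bin/checkpoint.pt
```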

Setup Container

Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and dependencies.

Setup Container Command
```bash
bash ./triton/scripts/docker/build.sh
bash ./triton/scripts/docker/interactive.sh /path/to/your/data/
```

Prepare configuration

You can use environment variables to set the parameters of your inference configuration.

Example values of some key variables in one configuration:

Export Variables
```bash
WORKDIR="${WORKDIR:=$(pwd)}"
export DATASETS_DIR=${WORKDIR}/datasets
export WORKSPACE_DIR=${WORKDIR}/runner_workspace
export CHECKPOINTS_DIR=${WORKSPACE_DIR}/checkpoints
export MODEL_REPOSITORY_PATH=${WORKSPACE_DIR}/model_store
export SHARED_DIR=${WORKSPACE_DIR}/shared_dir
export MODEL_NAME=TFT
export ENSEMBLE_MODEL_NAME=
export TRITON_LOAD_MODEL_METHOD=explicit
export TRITON_INSTANCES=1
export FORMAT="trt"
export PRECISION="fp16"
export ACCELERATOR="none"
export TRITON_GPU_ENGINE_COUNT="2"
export CAPTURE_CUDA_GRAPH="0"
export BATCH_SIZE="1,2,4,8,16,32,64,128,256,512,1024"
export TRITON_MAX_QUEUE_DELAY="1"
export MAX_BATCH_SIZE="1024"
export BATCH_SIZES="1 2 4 8 16 32 64 128 256 512 1024"
export TRITON_PREFERRED_BATCH_SIZES="512 1024"
export EXPORT_FORMAT="onnx"
export EXPORT_PRECISION="fp32"
export DATASET="electricity_bin"
export DEVICE="gpu"
export REQUEST_COUNT="500"
export CHECKPOINT_VARIANT="electricity_bin"
export CHECKPOINT_DIR=${CHECKPOINTS_DIR}/${CHECKPOINT_VARIANT}
```
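
With this configuration the model is exported to ONNX (EXPORT_FORMAT="onnx") and then converted to a TensorRT engine (FORMAT="trt"). To instead exercise the TorchScript Trace path reported in the PyTorch tables above, a sketch of the overrides, using the mnemonics listed under Advanced:

```bash
export FORMAT="ts-trace"        # serve via the LibTorch (PyTorch) backend
export EXPORT_FORMAT="ts-trace" # export directly with TorchScript tracing
```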

Export Model

Export the model from Python source to the desired intermediate format (e.g., TorchScript or ONNX).

Export Model Command
```bash
if [[ "${EXPORT_FORMAT}" == "ts-trace" || "${EXPORT_FORMAT}" == "ts-script" ]]; then
    export FORMAT_SUFFIX="pt"
else
    export FORMAT_SUFFIX="${EXPORT_FORMAT}"
fi
python3 triton/export_model.py \
    --input-path triton/model.py \
    --input-type pyt \
    --output-path ${SHARED_DIR}/exported_model.${FORMAT_SUFFIX} \
    --output-type ${EXPORT_FORMAT} \
    --ignore-unknown-parameters \
    --onnx-opset 13 \
    \
    --checkpoint ${CHECKPOINT_DIR}/ \
    --precision ${EXPORT_PRECISION} \
    \
    --dataloader triton/dataloader.py \
    --dataset ${DATASETS_DIR}/${DATASET} \
    --batch-size 1
```
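
For an ONNX export, a quick structural sanity check can be run before conversion; a minimal sketch, assuming the onnx Python package is available in the container:

```bash
# Load the exported graph and validate it against the ONNX spec.
python3 -c "import onnx; onnx.checker.check_model(onnx.load('${SHARED_DIR}/exported_model.onnx'))"
```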

Convert Model

Convert the model from training to inference format (e.g. TensorRT).

Convert Model Command
```bash
if [[ "${EXPORT_FORMAT}" == "ts-trace" || "${EXPORT_FORMAT}" == "ts-script" ]]; then
    export FORMAT_SUFFIX="pt"
else
    export FORMAT_SUFFIX="${EXPORT_FORMAT}"
fi
model-navigator convert \
    --model-name ${MODEL_NAME} \
    --model-path ${SHARED_DIR}/exported_model.${FORMAT_SUFFIX} \
    --output-path ${SHARED_DIR}/converted_model \
    --target-formats ${FORMAT} \
    --target-precisions ${PRECISION} \
    --launch-mode local \
    --override-workspace \
    --verbose \
    \
    --onnx-opsets 13 \
    --max-batch-size ${MAX_BATCH_SIZE} \
    --container-version 21.08 \
    --max-workspace-size 10000000000 \
    --atol target__0=100 \
    --rtol target__0=100
```
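
The --atol and --rtol flags set deliberately loose absolute and relative tolerances for the conversion correctness check on the target__0 output. If the target format is TensorRT, the resulting engine can also be smoke-tested outside Triton; a sketch, assuming the converted file is a serialized TensorRT plan and that a matching trtexec is on PATH:

```bash
# Load the serialized engine and run a short standalone benchmark.
trtexec --loadEngine=${SHARED_DIR}/converted_model
```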

Deploy Model

Configure the model on Triton Inference Server and generate the model configuration in your model repository.

Deploy Model Command
```bash
if [[ "${FORMAT}" == "ts-trace" || "${FORMAT}" == "ts-script" ]]; then
    export CONFIG_FORMAT="torchscript"
else
    export CONFIG_FORMAT="${FORMAT}"
fi
model-navigator triton-config-model \
    --model-repository ${MODEL_REPOSITORY_PATH} \
    --model-name ${MODEL_NAME} \
    --model-version 1 \
    --model-path ${SHARED_DIR}/converted_model \
    --model-format ${CONFIG_FORMAT} \
    --model-control-mode ${TRITON_LOAD_MODEL_METHOD} \
    --load-model \
    --load-model-timeout-s 100 \
    --verbose \
    \
    --backend-accelerator ${ACCELERATOR} \
    --tensorrt-precision ${PRECISION} \
    --tensorrt-capture-cuda-graph \
    --tensorrt-max-workspace-size 10000000000 \
    --max-batch-size ${MAX_BATCH_SIZE} \
    --batching dynamic \
    --preferred-batch-sizes ${TRITON_PREFERRED_BATCH_SIZES} \
    --max-queue-delay-us ${TRITON_MAX_QUEUE_DELAY} \
    --engine-count-per-device ${DEVICE}=${TRITON_GPU_ENGINE_COUNT}
```
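
Once loading succeeds, the server and model state can be verified over Triton's standard HTTP endpoints (KServe v2 routes; this assumes the default HTTP port 8000):

```bash
# Both endpoints return HTTP 200 when ready; -f makes curl fail otherwise.
curl -sf localhost:8000/v2/health/ready && echo "server ready"
curl -sf localhost:8000/v2/models/${MODEL_NAME}/ready && echo "model ready"
```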

Prepare Triton Profiling Data

Prepare the data used for profiling on the Triton server.

Prepare Triton Profiling Data Command
```bash
mkdir -p ${SHARED_DIR}/input_data

python triton/prepare_input_data.py \
    --input-data-dir ${SHARED_DIR}/input_data/ \
    --dataset ${DATASETS_DIR}/${DATASET} \
    --checkpoint ${CHECKPOINT_DIR}/
```
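
The script writes a data.json file in perf_analyzer's JSON input-data format; a quick sketch for inspecting the first entries:

```bash
# Pretty-print the generated profiling inputs and show the beginning.
python3 -m json.tool ${SHARED_DIR}/input_data/data.json | head -n 20
```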

Triton Performance Offline Test

This test maximizes throughput. It assumes the data is already available for inference, so requests can be issued at the maximum batch size immediately. Triton Inference Server supports offline scenarios with static batching: inference requests are served as they are received, without being combined on the server. The largest throughput improvements come from increasing the batch size, because the GPU operates more efficiently on larger batches.

Triton Performance Offline Test Command
```bash
python triton/run_performance_on_triton.py \
    --model-repository ${MODEL_REPOSITORY_PATH} \
    --model-name ${MODEL_NAME} \
    --input-data ${SHARED_DIR}/input_data/data.json \
    --batch-sizes ${BATCH_SIZE} \
    --number-of-triton-instances ${TRITON_INSTANCES} \
    --batching-mode static \
    --evaluation-mode offline \
    --measurement-request-count ${REQUEST_COUNT} \
    --warmup \
    --performance-tool perf_analyzer \
    --result-path ${SHARED_DIR}/triton_performance_offline.csv
```
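
The measurements land in the CSV file named above; a sketch for a quick look at the results:

```bash
# Align the comma-separated columns for readability and show the first rows.
column -s ',' -t < ${SHARED_DIR}/triton_performance_offline.csv | head
```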

Triton Performance Online Test

We want to maximize throughput within latency budget constraints. Dynamic batching is a feature of Triton Inference Server that allows the server to combine individual inference requests into dynamically created batches, resulting in reduced average latency.

Triton Performance Online Test Command
```bash
python triton/run_performance_on_triton.py \
    --model-repository ${MODEL_REPOSITORY_PATH} \
    --model-name ${MODEL_NAME} \
    --input-data ${SHARED_DIR}/input_data/data.json \
    --batch-sizes ${BATCH_SIZE} \
    --number-of-triton-instances ${TRITON_INSTANCES} \
    --number-of-model-instances ${TRITON_GPU_ENGINE_COUNT} \
    --batching-mode dynamic \
    --evaluation-mode online \
    --measurement-request-count 500 \
    --warmup \
    --performance-tool perf_analyzer \
    --result-path ${SHARED_DIR}/triton_performance_online.csv
```
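
The wrapper drives Triton's perf_analyzer under the hood. For reference, an approximately equivalent direct invocation for a single batch size, sweeping concurrency as in the tables above, might look like this (a sketch; assumes the default gRPC port 8001):

```bash
# Sweep concurrency 8..256 in steps of 8 at batch size 16 over gRPC.
perf_analyzer -m ${MODEL_NAME} \
  -u localhost:8001 -i grpc \
  -b 16 \
  --concurrency-range 8:256:8 \
  --input-data ${SHARED_DIR}/input_data/data.json
```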

Latency explanation

A typical Triton Inference Server pipeline can be broken down into the following steps:

  1. The client serializes the inference request into a message and sends it to the server (Client Send).
  2. The message travels over the network from the client to the server (Network).
  3. The message arrives at the server and is deserialized (Server Receive).
  4. The request is placed on the queue (Server Queue).
  5. The request is removed from the queue and computed (Server Compute).
  6. The completed request is serialized in a message and sent back to the client (Server Send).
  7. The completed message then travels over the network from the server to the client (Network).
  8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).

Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5. Because backend deep learning systems like TFT are rarely exposed directly to end users and instead interface only with local front-end servers, we can treat all clients as local for the purposes of this analysis.
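
As a sanity check on the tables above, the per-component columns sum (up to rounding) to the reported average latency, and throughput follows from Little's law: throughput ≈ batch size × concurrency / average latency. Taking the first row of the first results table (batch 16, concurrency 8): 0.0 + 0.2 + 3.1 + 0.1 + 3.3 + 0.0 + 0.0 ≈ 6.8 ms average latency, and 16 × 8 / 0.0068 s ≈ 18,800 inferences/second, consistent with the reported 18816.0.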

Release Notes

We’re constantly refining and improving our performance on AI and HPC workloads, even on the same hardware, with frequent updates to our software stack. For our latest performance data, refer to these pages for AI and HPC benchmarks.

Changelog

February 2022

  • Initial release

Known issues

  • There are no known issues with this model.