|
1 | | -# Tacotron 2 And WaveGlow v1.7 For PyTorch |
| 1 | +# Tacotron 2 And WaveGlow v1.10 For PyTorch |
2 | 2 |
|
3 | 3 | This repository provides a script and recipe to train Tacotron 2 and WaveGlow |
4 | 4 | v1.6 models to achieve state of the art accuracy, and is tested and maintained by NVIDIA. |
@@ -33,13 +33,13 @@ v1.6 models to achieve state of the art accuracy, and is tested and maintained b |
33 | 33 | * [Inference performance benchmark](#inference-performance-benchmark) |
34 | 34 | * [Results](#results) |
35 | 35 | * [Training accuracy results](#training-accuracy-results) |
36 | | - * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g) |
| 36 | + * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g) |
37 | 37 | * [Training performance results](#training-performance-results) |
38 | | - * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g) |
| 38 | + * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g) |
39 | 39 | * [Expected training time](#expected-training-time) |
40 | 40 | * [Inference performance results](#inference-performance-results) |
41 | | - * [NVIDIA V100 16G](#nvidia-v100-16g) |
42 | | - * [NVIDIA T4](#nvidia-t4) |
| 41 | + * [Inference performance: NVIDIA V100 16G](#inference-performance-nvidia-v100-16g) |
| 42 | + * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4) |
43 | 43 | * [Release notes](#release-notes) |
44 | 44 | * [Changelog](#changelog) |
45 | 45 | * [Known issues](#known-issues) |
@@ -471,7 +471,7 @@ To run inference, issue: |
471 | 471 | ```bash |
472 | 472 | python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase.txt --amp-run |
473 | 473 | ``` |
474 | | -Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained |
| 474 | +Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained |
475 | 475 | checkpoints for the respective models, and `phrases/phrase.txt` contains input |
476 | 476 | phrases. The number of text lines determines the inference batch size. Audio |
477 | 477 | will be saved in the output folder. The audio files [audio_fp16](./audio/audio_fp16.wav) |
@@ -564,7 +564,7 @@ and accuracy in training and inference. |
564 | 564 |
|
565 | 565 | #### Training accuracy results |
566 | 566 |
|
567 | | -##### NVIDIA DGX-1 (8x V100 16G) |
| 567 | +##### Training accuracy: NVIDIA DGX-1 (8x V100 16G) |
568 | 568 |
|
569 | 569 | Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{AMP,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch-19.06-py3 |
570 | 570 | NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. |
@@ -594,7 +594,7 @@ WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs) |
594 | 594 |
|
595 | 595 | #### Training performance results |
596 | 596 |
|
597 | | -##### NVIDIA DGX-1 (8x V100 16G) |
| 597 | +##### Training performance: NVIDIA DGX-1 (8x V100 16G) |
598 | 598 |
|
599 | 599 | Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{AMP,FP32}_DGX1_16GB_8GPU.sh` |
600 | 600 | training script in the PyTorch-19.06-py3 NGC container on NVIDIA DGX-1 with |
@@ -648,26 +648,27 @@ deviation, and latency confidence intervals. Throughput is measured |
648 | 648 | as the number of generated audio samples per second. RTF is the real-time factor |
649 | 649 | which tells how many seconds of speech are generated in 1 second of compute. |
650 | 650 |
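The two derived columns in the inference tables can be sanity-checked with a short sketch. The function names below are hypothetical and are not part of the repository. It is assumed here that RTF is approximately the average audio length divided by the average latency, and that the mixed-precision speed-up is the ratio of FP16 to FP32 throughput. Both assumptions are consistent with the V100 batch-size-1 rows reported in this section. The benchmark script may instead average per-run values, so small rounding differences on other rows are possible.

```python
# Minimal sketch; helper names are hypothetical, not taken from the repository.

def real_time_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of speech generated per second of compute."""
    return audio_seconds / latency_seconds


def mixed_precision_speedup(fp16_throughput: float, fp32_throughput: float) -> float:
    """Throughput ratio of mixed precision (FP16) over FP32 (assumed definition)."""
    return fp16_throughput / fp32_throughput


# Values copied from the V100 16G rows with batch size 1.
rtf_fp16 = real_time_factor(audio_seconds=7.00, latency_seconds=1.27)
speedup = mixed_precision_speedup(fp16_throughput=121_190, fp32_throughput=88_650)

print(f"RTF (FP16, batch 1): {rtf_fp16:.2f}")            # 5.51
print(f"Speed-up with mixed precision: {speedup:.2f}")   # 1.37
```

An RTF above 1 means audio is produced faster than real time; an RTF below 1, as in the T4 batch-size-4 rows, means synthesis takes longer than the duration of the generated speech.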
|
651 | | -##### NVIDIA V100 16G |
| 651 | +##### Inference performance: NVIDIA DGX-1 (1x V100 16G) |
652 | 652 |
|
653 | | -|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 50% (s)|Latency confidence interval 100% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF| |
654 | | -|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| |
655 | | -|1| 128| FP16| 1.73| 0.07| 1.72| 2.11| 89,162| 1.09| 601| 6.98| 4.04| |
656 | | -|4| 128| FP16| 4.21| 0.17| 4.19| 4.84| 145,800| 1.16| 600| 6.97| 1.65| |
657 | | -|1| 128| FP32| 1.85| 0.06| 1.84| 2.19| 81,868| 1.00| 590| 6.85| 3.71| |
658 | | -|4| 128| FP32| 4.80| 0.15| 4.79| 5.43| 125,930| 1.00| 590| 6.85| 1.43| |
| 653 | +|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 90% (s)|Latency confidence interval 95% (s)|Latency confidence interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF| |
| 654 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| |
| 655 | +|1| 128| FP16| 1.27| 0.06| 1.34| 1.38| 1.41| 121,190| 1.37| 603| 7.00| 5.51| |
| 656 | +|4| 128| FP16| 2.32| 0.09| 2.42| 2.45| 2.59| 277,711| 2.03| 628| 7.23| 3.12| |
| 657 | +|1| 128| FP32| 1.70| 0.05| 1.77| 1.79| 1.84| 88,650| 1.00| 590| 6.85| 4.03| |
| 658 | +|4| 128| FP32| 4.56| 0.12| 4.72| 4.77| 4.87| 136,518| 1.00| 608| 7.06| 1.55| |
659 | 659 |
|
660 | | -##### NVIDIA T4 |
| 660 | +##### Inference performance: NVIDIA T4 |
| 661 | + |
| 662 | +|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 90% (s)|Latency confidence interval 95% (s)|Latency confidence interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF| |
| 663 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| |
| 664 | +|1| 128| FP16| 3.13| 0.13| 3.28| 3.36| 3.46| 49,276| 1.26| 602| 6.99| 2.24| |
| 665 | +|4| 128| FP16| 11.98| 0.42| 12.44| 12.70| 13.29| 53,676| 1.23| 628| 7.29| 0.61| |
| 666 | +|1| 128| FP32| 3.88| 0.12| 4.04| 4.09| 4.19| 38,964| 1.00| 591| 6.86| 1.77| |
| 667 | +|4| 128| FP32| 14.34| 0.42| 14.89| 15.08| 15.55| 43,489| 1.00| 609| 7.07| 0.49| |
661 | 668 |
|
662 | | -|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 50% (s)|Latency confidence interval 100% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF| |
663 | | -|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| |
664 | | -|1| 128| FP16| 3.16| 0.13| 3.16| 3.81| 48,792| 1.23| 603| 7.00| 2.21| |
665 | | -|4| 128| FP16| 11.45| 0.49| 11.39| 14.38| 53,771| 1.22| 601| 6.98| 0.61| |
666 | | -|1| 128| FP32| 3.82| 0.11| 3.81| 4.24| 39,603| 1.00| 591| 6.86| 1.80| |
667 | | -|4| 128| FP32| 13.80| 0.45| 13.74| 16.09| 43,915| 1.00| 592| 6.87| 0.50| |
668 | 669 |
|
669 | 670 | Our results were obtained by running the `./run_latency_tests.sh` script in |
670 | | -the PyTorch-19.06-py3 NGC container. Please note that to reproduce the results, |
| 671 | +the PyTorch-19.09-py3 NGC container. Please note that to reproduce the results, |
671 | 672 | you need to provide pretrained checkpoints for Tacotron 2 and WaveGlow. Please |
672 | 673 | edit the script to provide your checkpoint filenames. |
673 | 674 |
|
@@ -696,7 +697,13 @@ August 2019 |
696 | 697 | September 2019 |
697 | 698 | * Introduced inference statistics |
698 | 699 |
|
| 700 | +October 2019 |
| 701 | +* Tacotron 2 inference with torch.jit.script |
| 702 | + |
| 703 | +November 2019 |
| 704 | +* Implemented training resume from checkpoint |
| 705 | +* Added notebook for running Tacotron 2 and WaveGlow in TRTIS. |
| 706 | + |
699 | 707 | ### Known issues |
700 | 708 |
|
701 | 709 | There are no known issues in this release. |
702 | | - |
|