
Commit b07f501

Merge pull request NVIDIA#199 from NVIDIA/nvpstr/release19.08_1
[Tacotron2] Added denoiser and inference stats, fixed typos
2 parents da8acb1 + 02b49ac commit b07f501

9 files changed

Lines changed: 545 additions & 66 deletions


PyTorch/SpeechSynthesis/Tacotron2/Dockerfile

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
FROM nvcr.io/nvidia/pytorch:19.07-py3
1+
FROM nvcr.io/nvidia/pytorch:19.08-py3
22

33
ADD . /workspace/tacotron2
44
WORKDIR /workspace/tacotron2

PyTorch/SpeechSynthesis/Tacotron2/README.md

Lines changed: 73 additions & 59 deletions
@@ -1,4 +1,4 @@
1-
# Tacotron 2 And WaveGlow v1.6 For PyTorch
1+
# Tacotron 2 And WaveGlow v1.7 For PyTorch
22

33
This repository provides a script and recipe to train Tacotron 2 and WaveGlow
44
v1.6 models to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
@@ -38,7 +38,8 @@ v1.6 models to achieve state of the art accuracy, and is tested and maintained b
3838
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
3939
* [Expected training time](#expected-training-time)
4040
* [Inference performance results](#inference-performance-results)
41-
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
41+
* [NVIDIA V100 16G](#nvidia-v100-16g)
42+
* [NVIDIA T4](#nvidia-t4)
4243
* [Release notes](#release-notes)
4344
* [Changelog](#changelog)
4445
* [Known issues](#known-issues)
@@ -99,7 +100,7 @@ into spherical Gaussian distribution through a series of flows. One step of a
99100
flow consists of an invertible convolution, followed by a modified WaveNet
100101
architecture that serves as an affine coupling layer. During inference, the
101102
network is inverted and audio samples are generated from the Gaussian
102-
distribution.
103+
distribution. Our implementation uses 512 residual channels in the coupling layer.
103104

104105
![](./img/waveglow_arch.png "WaveGlow architecture")
105106

@@ -130,16 +131,16 @@ The following features are supported by this model.
130131
|[AMP](https://nvidia.github.io/apex/amp.html) | Yes | Yes |
131132
|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes | Yes |
132133

133-
#### Features
134+
#### Features
134135

135-
AMP - a tool that enables Tensor Core-accelerated training. For more information,
136+
AMP - a tool that enables Tensor Core-accelerated training. For more information,
136137
refer to [Enabling mixed precision](#enabling-mixed-precision).
137138

138-
Apex DistributedDataParallel - a module wrapper that enables easy multiprocess
139-
distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
140-
`DistributedDataParallel` is optimized for use with NCCL. It achieves high
141-
performance by overlapping communication with computation during `backward()`
142-
and bucketing smaller gradient transfers to reduce the total number of transfers
139+
Apex DistributedDataParallel - a module wrapper that enables easy multiprocess
140+
distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
141+
`DistributedDataParallel` is optimized for use with NCCL. It achieves high
142+
performance by overlapping communication with computation during `backward()`
143+
and bucketing smaller gradient transfers to reduce the total number of transfers
143144
required.
144145

145146
## Mixed precision training
@@ -267,16 +268,9 @@ this script, issue:
267268
bash scripts/prepare_dataset.sh
268269
```
269270

270-
To preprocess the datasets for Tacotron 2 training, use the
271-
`./scripts/prepare_mels.sh` script:
272-
```bash
273-
bash scripts/prepare_mels.sh
274-
```
275-
276271
Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The
277-
`./LJSpeech-1.1` directory is mounted to the `/workspace/tacotron2/LJSpeech-1.1`
278-
location in the NGC container. The preprocessed mel-spectrograms are stored in the
279-
`./LJSpeech-1.1/mels` directory.
272+
`./LJSpeech-1.1` directory is mounted to the `/workspace/tacotron2/LJSpeech-1.1`
273+
location in the NGC container.
280274

281275
3. Build the Tacotron 2 and WaveGlow PyTorch NGC container.
282276
```bash
@@ -290,8 +284,14 @@ After you build the container image, you can start an interactive CLI session wi
290284
bash scripts/docker/interactive.sh
291285
```
292286

293-
The `interactive.sh` script requires that the location on the dataset is specified.
294-
For example, `LJSpeech-1.1`.
287+
The `interactive.sh` script requires that the location on the dataset is specified.
288+
For example, `LJSpeech-1.1`. To preprocess the datasets for Tacotron 2 training, use
289+
the `./scripts/prepare_mels.sh` script:
290+
```bash
291+
bash scripts/prepare_mels.sh
292+
```
293+
294+
The preprocessed mel-spectrograms are stored in the `./LJSpeech-1.1/mels` directory.
295295

296296
5. Start training.
297297
To start Tacotron 2 training, run:
@@ -313,8 +313,8 @@ Ensure your loss values are comparable to those listed in the table in the
313313
samples in the `./audio` folder. For details about generating audio, see the
314314
[Inference process](#inference-process) section below.
315315

316-
The training scripts automatically run the validation after each training
317-
epoch. The results from the validation are printed to the standard output
316+
The training scripts automatically run the validation after each training
317+
epoch. The results from the validation are printed to the standard output
318318
(`stdout`) and saved to the log files.
319319

320320
7. Start inference.
@@ -327,10 +327,10 @@ and `--waveglow` arguments.
327327
```bash
328328
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i phrases/phrase.txt --amp-run
329329
```
330-
331-
The speech is generated from lines of text in the file that is passed with
332-
`-i` argument. The number of lines determines inference batch size. To run
333-
inference in mixed precision, use the `--amp-run` flag. The output audio will
330+
331+
The speech is generated from lines of text in the file that is passed with
332+
`-i` argument. The number of lines determines inference batch size. To run
333+
inference in mixed precision, use the `--amp-run` flag. The output audio will
334334
be stored in the path specified by the `-o` argument.
335335

336336
## Advanced
@@ -390,11 +390,12 @@ WaveGlow models.
390390
#### WaveGlow parameters
391391

392392
* `--segment-length` - segment length of input audio processed by the neural network (8000)
393+
* `--wn-channels` - number of residual channels in the coupling layer networks (512)
393394

394395

395396
### Command-line options
396397

397-
To see the full list of available options and their descriptions, use the `-h`
398+
To see the full list of available options and their descriptions, use the `-h`
398399
or `--help` command line option, for example:
399400
```bash
400401
python train.py --help
@@ -470,8 +471,12 @@ To run inference, issue:
470471
```bash
471472
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase.txt --amp-run
472473
```
473-
Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained
474-
checkpoints for the respective models, and `phrases/phrase.txt` contains input phrases. The number of text lines determines the inference batch size. Audio will be saved in the output folder.
474+
Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained
475+
checkpoints for the respective models, and `phrases/phrase.txt` contains input
476+
phrases. The number of text lines determines the inference batch size. Audio
477+
will be saved in the output folder. The audio files [audio_fp16](./audio/audio_fp16.wav)
478+
and [audio_fp32](./audio/audio_fp32.wav) were generated using checkpoints from
479+
mixed precision and FP32 training, respectively.
475480

476481
You can find all the available options by calling `python inference.py --help`.
477482

@@ -548,9 +553,9 @@ To benchmark the inference performance on a batch size=1, run:
548553
```
549554

550555
The output log files will contain performance numbers for Tacotron 2 model
551-
(number of output mel-spectrograms per second, reported as `tacotron2_items_per_sec`)
552-
and for WaveGlow (number of output samples per second, reported as `waveglow_items_per_sec`).
553-
The `inference.py` script will run a few warmup iterations before running the benchmark.
556+
(number of output mel-spectrograms per second, reported as `tacotron2_items_per_sec`)
557+
and for WaveGlow (number of output samples per second, reported as `waveglow_items_per_sec`).
558+
The `inference.py` script will run a few warmup iterations before running the benchmark.
554559

555560
### Results
556561

@@ -635,31 +640,36 @@ The following table shows the expected training time for convergence for WaveGlo
635640

636641
#### Inference performance results
637642

638-
##### NVIDIA DGX-1 (8x V100 16G)
639-
640-
Our results were obtained by running the `./inference.py` inference script in
641-
the PyTorch-19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
642-
Performance numbers (in output mel-spectrograms per second for Tacotron 2 and
643-
output samples per second for WaveGlow) were averaged over 16 runs.
644-
645-
The following table shows the inference performance results for Tacotron 2 model.
646-
Results are measured in the number of output mel-spectrograms per second.
647-
648-
|Number of GPUs|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|
649-
|---:|---:|---:|---:|
650-
|**1**|625|613|1.02|
651-
652-
The following table shows the inference performance results for WaveGlow model.
653-
Results are measured in the number of output samples per second<sup>1</sup>.
654-
655-
|Number of GPUs|Number of samples used with mixed precision|Number of samples used with FP32|Speed-up with mixed precision|
656-
|---:|---:|---:|---:|
657-
|**1**|180474|162282|1.11|
658-
659-
<sup>1</sup>With sampling rate equal to 22050, one second of audio is generated from 22050 samples.
660-
661-
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
662-
643+
The following tables show inference statistics for the Tacotron2 and WaveGlow
644+
text-to-speech system, gathered from 1000 inference runs, on 1 V100 and 1 T4,
645+
respectively. Latency is measured from the start of Tacotron 2 inference to
646+
the end of WaveGlow inference. The tables include average latency, latency standard
647+
deviation, and latency confidence intervals. Throughput is measured
648+
as the number of generated audio samples per second. RTF is the real-time factor
649+
which tells how many seconds of speech are generated in 1 second of compute.
650+
651+
##### NVIDIA V100 16G
652+
653+
|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 50% (s)|Latency confidence interval 100% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF|
654+
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
655+
|1| 128| FP16| 1.73| 0.07| 1.72| 2.11| 89,162| 1.09| 601| 6.98| 4.04|
656+
|4| 128| FP16| 4.21| 0.17| 4.19| 4.84| 145,800| 1.16| 600| 6.97| 1.65|
657+
|1| 128| FP32| 1.85| 0.06| 1.84| 2.19| 81,868| 1.00| 590| 6.85| 3.71|
658+
|4| 128| FP32| 4.80| 0.15| 4.79| 5.43| 125,930| 1.00| 590| 6.85| 1.43|
659+
660+
##### NVIDIA T4
661+
662+
|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)|Latency confidence interval 50% (s)|Latency confidence interval 100% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg mels generated (81 mels=1 sec of speech)|Avg audio length (s)|Avg RTF|
663+
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
664+
|1| 128| FP16| 3.16| 0.13| 3.16| 3.81| 48,792| 1.23| 603| 7.00| 2.21|
665+
|4| 128| FP16| 11.45| 0.49| 11.39| 14.38| 53,771| 1.22| 601| 6.98| 0.61|
666+
|1| 128| FP32| 3.82| 0.11| 3.81| 4.24| 39,603| 1.00| 591| 6.86| 1.80|
667+
|4| 128| FP32| 13.80| 0.45| 13.74| 16.09| 43,915| 1.00| 592| 6.87| 0.50|
668+
669+
Our results were obtained by running the `./run_latency_tests.sh` script in
670+
the PyTorch-19.06-py3 NGC container. Please note that to reproduce the results,
671+
you need to provide pretrained checkpoints for Tacotron 2 and WaveGlow. Please
672+
edit the script to provide your checkpoint filenames.
663673

664674
## Release notes
665675

@@ -674,7 +684,7 @@ June 2019
674684
* Fixed dropouts on LSTMCells
675685

676686
July 2019
677-
* Changed measurement units for Tacotron 2 training and inference performance
687+
* Changed measurement units for Tacotron 2 training and inference performance
678688
benchmarks from input tokes per second to output mel-spectrograms per second
679689
* Introduced batched inference
680690
* Included warmup in the inference script
@@ -683,6 +693,10 @@ August 2019
683693
* Fixed inference results
684694
* Fixed initialization of Batch Normalization
685695

696+
September 2019
697+
* Introduced inference statistics
698+
686699
### Known issues
687700

688701
There are no known issues in this release.
702+
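
Note on the inference statistics added to the README above: the relationship between the latency, audio length, throughput, and RTF columns can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not the script used to produce the tables. The function names are hypothetical, the sampling rate of 22050 is taken from the `--sampling-rate` default, and the inputs are the averaged values from the V100 FP16 rows. Because the tables average per-run values, recomputing from averages should land close to the published figures rather than match them exactly.

```python
SAMPLING_RATE = 22050  # samples per second, the default of --sampling-rate


def real_time_factor(audio_length_s, latency_s):
    """Seconds of speech generated per second of compute."""
    return audio_length_s / latency_s


def throughput(batch_size, audio_length_s, latency_s, sampling_rate=SAMPLING_RATE):
    """Generated audio samples per second, summed over the whole batch."""
    return batch_size * audio_length_s * sampling_rate / latency_s


# Averaged values from the V100 table: batch size 1 and batch size 4, FP16.
rtf_bs1 = real_time_factor(audio_length_s=6.98, latency_s=1.73)
tput_bs1 = throughput(batch_size=1, audio_length_s=6.98, latency_s=1.73)
rtf_bs4 = real_time_factor(audio_length_s=6.97, latency_s=4.21)
tput_bs4 = throughput(batch_size=4, audio_length_s=6.97, latency_s=4.21)

print(f"bs=1: RTF {rtf_bs1:.2f}, throughput {tput_bs1:,.0f} samples/sec")
print(f"bs=4: RTF {rtf_bs4:.2f}, throughput {tput_bs4:,.0f} samples/sec")
```

Under these assumptions, RTF appears to be computed per utterance (audio length divided by end-to-end latency), while throughput counts the samples of every utterance in the batch.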

PyTorch/SpeechSynthesis/Tacotron2/common/stft.py

Lines changed: 1 addition & 0 deletions
@@ -124,6 +124,7 @@ def inverse(self, magnitude, phase):
124124
np.where(window_sum > tiny(window_sum))[0])
125125
window_sum = torch.autograd.Variable(
126126
torch.from_numpy(window_sum), requires_grad=False)
127+
window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum
127128
inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]
128129

129130
# scale by hop ratio

PyTorch/SpeechSynthesis/Tacotron2/inference.py

Lines changed: 9 additions & 3 deletions
@@ -41,6 +41,8 @@
4141

4242
from apex import amp
4343

44+
from waveglow.denoiser import Denoiser
45+
4446
def parse_args(parser):
4547
"""
4648
Parse commandline arguments.
@@ -53,7 +55,8 @@ def parse_args(parser):
5355
help='full path to the Tacotron2 model checkpoint file')
5456
parser.add_argument('--waveglow', type=str,
5557
help='full path to the WaveGlow model checkpoint file')
56-
parser.add_argument('-s', '--sigma-infer', default=0.6, type=float)
58+
parser.add_argument('-s', '--sigma-infer', default=0.9, type=float)
59+
parser.add_argument('-d', '--denoising-strength', default=0.01, type=float)
5760
parser.add_argument('-sr', '--sampling-rate', default=22050, type=int,
5861
help='Sampling rate')
5962
parser.add_argument('--amp-run', action='store_true',
@@ -212,6 +215,7 @@ def main():
212215
args.amp_run)
213216
waveglow = load_and_setup_model('WaveGlow', parser, args.waveglow,
214217
args.amp_run)
218+
denoiser = Denoiser(waveglow).cuda()
215219

216220
texts = []
217221
try:
@@ -242,6 +246,7 @@ def main():
242246
with torch.no_grad(), MeasureTime(measurements, "waveglow_time"):
243247
audios = waveglow.infer(mel, sigma=args.sigma_infer)
244248
audios = audios.float()
249+
audios = denoiser(audios, strength=args.denoising_strength).squeeze(1)
245250

246251
tacotron2_infer_perf = mel.size(0)*mel.size(2)/measurements['tacotron2_time']
247252
waveglow_infer_perf = audios.size(0)*audios.size(1)/measurements['waveglow_time']
@@ -254,9 +259,10 @@ def main():
254259
measurements['waveglow_time']))
255260

256261
for i, audio in enumerate(audios):
262+
audio = audio[:mel_lengths[i]*args.stft_hop_length]
263+
audio = audio/torch.max(torch.abs(audio))
257264
audio_path = args.output + "audio_"+str(i)+".wav"
258-
write(audio_path, args.sampling_rate,
259-
audio.data.cpu().numpy()[:mel_lengths[i]*args.stft_hop_length])
265+
write(audio_path, args.sampling_rate, audio.cpu().numpy())
260266

261267
LOGGER.iteration_stop()
262268
LOGGER.finish()
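
Note on the output loop in `inference.py` above: each utterance is now trimmed to `mel_lengths[i]*args.stft_hop_length` samples and then divided by its peak absolute value before being written. The sketch below illustrates that trim-then-normalize step in plain Python lists rather than PyTorch tensors, so it can be run without a GPU. The function name `trim_and_normalize` is hypothetical, and the guard for an all-zero signal is an addition of this sketch, not something the commit does.

```python
def trim_and_normalize(audio, mel_length, hop_length):
    """Drop padded samples, then scale so the peak magnitude is 1.0."""
    # Each mel frame corresponds to hop_length audio samples.
    trimmed = audio[:mel_length * hop_length]
    peak = max((abs(sample) for sample in trimmed), default=0.0)
    if peak == 0.0:
        # Assumption of this sketch: leave silent or empty audio unchanged
        # instead of dividing by zero.
        return list(trimmed)
    return [sample / peak for sample in trimmed]


# Toy signal: four valid samples followed by two padded samples.
samples = [0.1, -0.5, 0.25, 0.0, 0.9, 0.9]
normalized = trim_and_normalize(samples, mel_length=2, hop_length=2)
silent = trim_and_normalize([0.0, 0.0], mel_length=1, hop_length=2)
print(normalized)
```

Trimming before normalizing matters here: the padded tail (0.9 in the toy signal) would otherwise set the peak and leave the valid samples quieter than intended.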
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
bash test_infer.sh -bs 1 -il 128 -p amp --num-iters 1003 --tacotron2 checkpoint_Tacotron2_amp --waveglow checkpoint_WaveGlow_amp
2+
bash test_infer.sh -bs 4 -il 128 -p amp --num-iters 1003 --tacotron2 checkpoint_Tacotron2_amp --waveglow checkpoint_WaveGlow_amp
3+
bash test_infer.sh -bs 1 -il 128 -p fp32 --num-iters 1003 --tacotron2 checkpoint_Tacotron2_fp32 --waveglow checkpoint_WaveGlow_fp32
4+
bash test_infer.sh -bs 4 -il 128 -p fp32 --num-iters 1003 --tacotron2 checkpoint_Tacotron2_fp32 --waveglow checkpoint_WaveGlow_fp32
5+

PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py

Lines changed: 5 additions & 3 deletions
@@ -491,9 +491,6 @@ def infer(self, memory, memory_lengths):
491491
decoder_input = self.prenet(decoder_input, inference=True)
492492
mel_output, gate_output, alignment = self.decode(decoder_input)
493493

494-
mel_outputs += [mel_output.squeeze(1)]
495-
gate_outputs += [gate_output]
496-
alignments += [alignment]
497494
dec = torch.le(torch.sigmoid(gate_output.data),
498495
self.gate_threshold).to(torch.int32).squeeze(1)
499496

@@ -502,6 +499,11 @@ def infer(self, memory, memory_lengths):
502499

503500
if self.early_stopping and torch.sum(not_finished) == 0:
504501
break
502+
503+
mel_outputs += [mel_output.squeeze(1)]
504+
gate_outputs += [gate_output]
505+
alignments += [alignment]
506+
505507
if len(mel_outputs) == self.max_decoder_steps:
506508
print("Warning! Reached max decoder steps")
507509
break
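
Note on the reordering in `infer` above: as far as the hunks show, the three append statements were moved from before the early-stopping check to after it, so the frame on which every sequence in the batch has finished is no longer appended to the outputs. The sketch below is a simplified stand-in for that loop, assuming early stopping is enabled. The function `run_decoder` and its arguments are hypothetical, and step indices stand in for the mel frames.

```python
def run_decoder(all_finished_at_step, max_decoder_steps, append_before_check):
    """Simulate the decoder loop and return the indices of the kept frames.

    all_finished_at_step[i] is True when every sequence in the batch has
    crossed the gate threshold at step i.
    """
    outputs = []
    for step, all_finished in enumerate(all_finished_at_step):
        if append_before_check:
            outputs.append(step)  # old order: frame kept before the check
        if all_finished:
            break  # early stopping
        if not append_before_check:
            outputs.append(step)  # new order: frame kept only if decoding continues
        if len(outputs) == max_decoder_steps:
            break  # corresponds to the "Reached max decoder steps" warning
    return outputs


gate_trace = [False, False, True]  # every sequence finishes on the third step
old_frames = run_decoder(gate_trace, max_decoder_steps=10, append_before_check=True)
new_frames = run_decoder(gate_trace, max_decoder_steps=10, append_before_check=False)
print(old_frames, new_frames)
```

In this toy trace the old ordering keeps three frames and the new ordering keeps two, which is the one-frame difference the diff introduces at the stopping point.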
