Commit 77a1bb9

[Tacotron2/PyT] Updates: better perf, better trt7 support, new logging, bug fixes
1 parent 155578a commit 77a1bb9

25 files changed

Lines changed: 225 additions & 526 deletions
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+__pycache__/
+/checkpoints/
+/output/
+nvlog.json
Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-FROM nvcr.io/nvidia/pytorch:19.11-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.01-py3
+FROM ${FROM_IMAGE_NAME}

 ADD . /workspace/tacotron2
 WORKDIR /workspace/tacotron2
-RUN pip install -r requirements.txt
-RUN pip --no-cache-dir --no-cache install 'git+https://github.com/NVIDIA/dllogger'
+RUN pip install --no-cache-dir -r requirements.txt

PyTorch/SpeechSynthesis/Tacotron2/Dockerfile_trtis_client

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-FROM nvcr.io/nvidia/tensorrtserver:19.10-py3-clientsdk AS trt
+FROM nvcr.io/nvidia/tensorrtserver:20.01-py3-clientsdk AS trt
 FROM continuumio/miniconda3
 RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract mc iputils-ping wget

PyTorch/SpeechSynthesis/Tacotron2/README.md

Lines changed: 20 additions & 17 deletions
@@ -231,7 +231,7 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you
 have the following components:

 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* [PyTorch 20.01-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
 * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU

@@ -370,7 +370,7 @@ WaveGlow models.

 * `--epochs` - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
 * `--learning-rate` - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
-* `--batch-size` - batch size (Tacotron 2 FP16/FP32: 128/64, WaveGlow FP16/FP32: 10/4)
+* `--batch-size` - batch size (Tacotron 2 FP16/FP32: 104/48, WaveGlow FP16/FP32: 10/4)
 * `--amp-run` - use mixed precision training

 #### Shared audio/STFT parameters
@@ -496,21 +496,21 @@ To benchmark the training performance on a specific batch size, run:
 * For 1 GPU
 * FP16
 ```bash
-python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path> --amp-run
+python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp-run
 ```
 * FP32
 ```bash
-python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path>
+python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
 ```

 * For multiple GPUs
 * FP16
 ```bash
-python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path> --amp-run
+python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp-run
 ```
 * FP32
 ```bash
-python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path>
+python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
 ```

 **WaveGlow**
@@ -579,10 +579,10 @@ All of the results were produced using the `train.py` script as described in the
 | WaveGlow FP16 | -2.2054 | -5.7602 | -5.901 | -5.9706 | -6.0258 |
 | WaveGlow FP32 | -3.0327 | -5.858 | -6.0056 | -6.0613 | -6.1087 |

-Tacotron 2 FP16 loss - batch size 128 (mean and std over 16 runs)
+Tacotron 2 FP16 loss - batch size 104 (mean and std over 16 runs)
 ![](./img/tacotron2_amp_loss.png "Tacotron 2 FP16 loss")

-Tacotron 2 FP32 loss - batch size 64 (mean and std over 16 runs)
+Tacotron 2 FP32 loss - batch size 48 (mean and std over 16 runs)
 ![](./img/tacotron2_fp32_loss.png "Tacotron 2 FP16 loss")

 WaveGlow FP16 loss - batch size 10 (mean and std over 16 runs)
@@ -597,7 +597,7 @@ WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs)
 ##### Training performance: NVIDIA DGX-1 (8x V100 16G)

 Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{AMP,FP32}_DGX1_16GB_8GPU.sh`
-training script in the PyTorch-19.06-py3 NGC container on NVIDIA DGX-1 with
+training script in the PyTorch-19.12-py3 NGC container on NVIDIA DGX-1 with
 8x V100 16G GPUs. Performance numbers (in output mel-spectrograms per second for
 Tacotron 2 and output samples per second for WaveGlow) were averaged over
 an entire training epoch.
@@ -606,9 +606,9 @@ This table shows the results for Tacotron 2:

 |Number of GPUs|Batch size per GPU|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
 |---:|---:|---:|---:|---:|---:|---:|
-|1|128@FP16, 64@FP32 | 20,992 | 12,933 | 1.62 | 1.00 | 1.00 |
-|4|128@FP16, 64@FP32 | 74,989 | 46,115 | 1.63 | 3.57 | 3.57 |
-|8|128@FP16, 64@FP32 | 140,060 | 88,719 | 1.58 | 6.67 | 6.86 |
+|1|104@FP16, 48@FP32 | 15,313 | 9,674 | 1.58 | 1.00 | 1.00 |
+|4|104@FP16, 48@FP32 | 53,661 | 32,778 | 1.64 | 3.50 | 3.39 |
+|8|104@FP16, 48@FP32 | 100,422 | 59,549 | 1.69 | 6.56 | 6.16 |
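The derived columns in the updated Tacotron 2 rows can be re-checked from the two throughput columns. The sketch below assumes definitions inferred from the column headers rather than stated in the README: speed-up is mixed-precision throughput divided by FP32 throughput, and weak scaling is N-GPU throughput divided by 1-GPU throughput at the same precision. The names `throughput` and `derived_columns` are illustrative only.

```python
# Throughput (mels/s) copied from the updated rows: gpus -> (mixed precision, FP32).
throughput = {
    1: (15313, 9674),
    4: (53661, 32778),
    8: (100422, 59549),
}


def derived_columns(table):
    """Recompute the speed-up and weak-scaling columns from raw throughput.

    Assumed definitions (inferred from the column headers):
      speed-up     = mixed-precision throughput / FP32 throughput
      weak scaling = N-GPU throughput / 1-GPU throughput, per precision
    """
    base_amp, base_fp32 = table[1]
    rows = {}
    for gpus, (amp, fp32) in sorted(table.items()):
        rows[gpus] = (
            round(amp / fp32, 2),       # speed-up with mixed precision
            round(amp / base_amp, 2),   # weak scaling, mixed precision
            round(fp32 / base_fp32, 2), # weak scaling, FP32
        )
    return rows


for gpus, (speedup, scale_amp, scale_fp32) in derived_columns(throughput).items():
    print(f"{gpus} GPU(s): speed-up {speedup:.2f}, "
          f"weak scaling AMP {scale_amp:.2f}, FP32 {scale_fp32:.2f}")
```

Under those assumed definitions the recomputed values agree with the added rows (1.58, 1.64, 1.69 for speed-up; 3.50 and 6.56 for mixed-precision scaling; 3.39 and 6.16 for FP32 scaling).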

 The following table shows the results for WaveGlow:

@@ -626,9 +626,9 @@ The following table shows the expected training time for convergence for Tacotro

 |Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
 |---:|---:|---:|---:|---:|
-|1| 128@FP16, 64@FP32 | 153 | 234 | 1.53 |
-|4| 128@FP16, 64@FP32 | 42 | 64 | 1.54 |
-|8| 128@FP16, 64@FP32 | 22 | 33 | 1.52 |
+|1| 104@FP16, 48@FP32 | 193 | 312 | 1.62 |
+|4| 104@FP16, 48@FP32 | 53 | 85 | 1.58 |
+|8| 104@FP16, 48@FP32 | 31 | 45 | 1.47 |

 The following table shows the expected training time for convergence for WaveGlow (1001 epochs):

@@ -704,8 +704,11 @@ November 2019
 * Implemented training resume from checkpoint
 * Added notebook for running Tacotron 2 and WaveGlow in TRTIS.

-December 2019
-* Added `trt` subfolder for running Tacotron 2 and WaveGlow in TensorRT.
+December 2019
+* Added export and inference scripts for TensorRT. See [Tacotron2 TensorRT README](trt/README.md).
+
+January 2020
+* Updated batch sizes and performance results for Tacotron 2.

 ### Known issues

PyTorch/SpeechSynthesis/Tacotron2/common/stft.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ def __init__(self, filter_length=800, hop_length=200, win_length=800,

         forward_basis = torch.FloatTensor(fourier_basis[:, None, :])
         inverse_basis = torch.FloatTensor(
-            np.linalg.pinv(scale * fourier_basis).T[:, None, :])
+            np.linalg.pinv(scale * fourier_basis).T[:, None, :].astype(np.float32))

         if window is not None:
             assert(filter_length >= win_length)

PyTorch/SpeechSynthesis/Tacotron2/exports/export_tacotron2_ts.py

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,8 @@

 import torch
 import argparse
+import sys
+sys.path.append('./')
 from inference import checkpoint_from_distributed, unwrap_distributed, load_and_setup_model

 def parse_args(parser):

PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_onnx.py

Lines changed: 47 additions & 24 deletions
@@ -25,6 +25,7 @@
 #
 # *****************************************************************************

+import types
 import torch
 import argparse

@@ -113,32 +114,52 @@ def convert_1d_to_2d_(glow):

     glow.cuda()

-def test_inference(waveglow):

+def infer_onnx(self, spect, z, sigma=0.9):

-    from scipy.io.wavfile import write
+    spect = self.upsample(spect)
+    # trim conv artifacts. maybe pad spec to kernel multiple
+    time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
+    spect = spect[:, :, :-time_cutoff]

-    mel = torch.load("mel.pt").cuda()
-    # mel = torch.load("mel_spectrograms/LJ001-0015.wav.pt").cuda()
-    # mel = mel.unsqueeze(0)
-    mel_lengths = [mel.size(2)]
-    stride = 256
-    kernel_size = 1024
-    n_group = 8
-    z_size2 = (mel.size(2)-1)*stride+(kernel_size-1)+1
-    # corresponds to cutoff in infer_onnx
-    z_size2 = z_size2 - (kernel_size-stride)
-    z_size2 = z_size2//n_group
-    z = torch.randn(1, n_group, z_size2, 1).cuda()
-    mel = mel.unsqueeze(3)
+    length_spect_group = spect.size(2)//8
+    mel_dim = 80
+    batch_size = spect.size(0)

-    with torch.no_grad():
-        audios = waveglow(mel, z)
+    spect = torch.squeeze(spect, 3)
+    spect = spect.view((batch_size, mel_dim, length_spect_group, self.n_group))
+    spect = spect.permute(0, 2, 1, 3)
+    spect = spect.contiguous()
+    spect = spect.view((batch_size, length_spect_group, self.n_group*mel_dim))
+    spect = spect.permute(0, 2, 1)
+    spect = torch.unsqueeze(spect, 3)
+    spect = spect.contiguous()
+
+    audio = z[:, :self.n_remaining_channels, :, :]
+    z = z[:, self.n_remaining_channels:self.n_group, :, :]
+    audio = sigma*audio
+
+    for k in reversed(range(self.n_flows)):
+        n_half = int(audio.size(1) / 2)
+        audio_0 = audio[:, :n_half, :, :]
+        audio_1 = audio[:, n_half:(n_half+n_half), :, :]

-    for i, audio in enumerate(audios):
-        audio = audio[:mel_lengths[i]*256]
-        audio = audio/torch.max(torch.abs(audio))
-        write("audio_pyt.wav", 22050, audio.cpu().numpy())
+        output = self.WN[k]((audio_0, spect))
+        s = output[:, n_half:(n_half+n_half), :, :]
+        b = output[:, :n_half, :, :]
+        audio_1 = (audio_1 - b) / torch.exp(s)
+        audio = torch.cat([audio_0, audio_1], 1)
+
+        audio = self.convinv[k](audio)
+
+        if k % self.n_early_every == 0 and k > 0:
+            audio = torch.cat((z[:, :self.n_early_size, :, :], audio), 1)
+            z = z[:, self.n_early_size:self.n_group, :, :]
+
+    audio = torch.squeeze(audio, 3)
+    audio = audio.permute(0,2,1).contiguous().view(batch_size, (length_spect_group * self.n_group))
+
+    return audio


 def export_onnx(parser, args):
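The removed `test_inference` helper sized the latent `z` from the mel length, and the added `infer_onnx` trims `kernel_size - stride` samples after upsampling. A minimal sketch of that arithmetic follows, using the constants hard-coded in the removed code (`stride = 256`, `kernel_size = 1024`, `n_group = 8`). The function name `waveglow_shapes` is hypothetical and not part of the repository.

```python
def waveglow_shapes(mel_frames, stride=256, kernel_size=1024, n_group=8):
    """Return (audio_samples, z_length) for a mel-spectrogram with `mel_frames` frames.

    Mirrors the arithmetic of the removed test_inference helper; the default
    values are the constants that helper hard-coded.
    """
    # Length after the transposed-convolution upsampling.
    upsampled = (mel_frames - 1) * stride + (kernel_size - 1) + 1
    # infer_onnx drops the trailing (kernel_size - stride) samples.
    audio_samples = upsampled - (kernel_size - stride)
    # z has n_group channels, so its time axis is n_group times shorter.
    z_length = audio_samples // n_group
    return audio_samples, z_length


audio_samples, z_length = waveglow_shapes(620)
print(audio_samples, z_length)  # 158720 19840
```

For 620 frames this gives 158720 samples and a `z` length of 19840, which are the fixed sizes that the earlier TRTIS config template in this commit hard-coded before switching to `-1` dimensions. After trimming, the sample count reduces to `mel_frames * stride`.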
@@ -166,12 +187,16 @@ def export_onnx(parser, args):

     # export to ONNX
     convert_1d_to_2d_(waveglow)
-    waveglow.forward = waveglow.infer_onnx
+
+    fType = types.MethodType
+    waveglow.forward = fType(infer_onnx, waveglow)
+
     if args.amp_run:
         waveglow.half()
     mel = mel.unsqueeze(3)

     opset_version = 10
+
     torch.onnx.export(waveglow, (mel, z), args.output+"/"+"waveglow.onnx",
                       opset_version=opset_version,
                       do_constant_folding=True,
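The hunk above binds the module-level `infer_onnx` function to the model instance with `types.MethodType` instead of using a method defined on the class. A self-contained sketch of that binding pattern is shown here with a stand-in class; `TinyModel` and `scaled_forward` are hypothetical names, not repository code.

```python
import types


class TinyModel:
    """Stand-in for a model object; only the attribute used below matters."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x


def scaled_forward(self, x, sigma=0.9):
    # A free function written like a method: `self` is supplied by the binding.
    return self.scale * sigma * x


model = TinyModel(scale=2.0)
# Bind the function to this one instance; other TinyModel objects are unaffected.
model.forward = types.MethodType(scaled_forward, model)

print(model.forward(10.0))                  # dispatches to scaled_forward
print(TinyModel(scale=2.0).forward(10.0))   # still the class-level forward
```

Binding on the instance keeps the replacement local to the object being exported, which is one plausible reason to prefer it over editing the class.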
@@ -181,8 +206,6 @@ def export_onnx(parser, args):
                                     "z": {0: "batch_size", 2: "z_seq"},
                                     "audio": {0: "batch_size", 1: "audio_seq"}})

-    test_inference(waveglow)
-

 def main():

PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_trt_config.py

Lines changed: 10 additions & 6 deletions
@@ -64,20 +64,24 @@ def main():
     config_template = r"""
 name: "{model_name}"
 platform: "tensorrt_plan"
+default_model_filename: "waveglow_fp16.engine"
+
+max_batch_size: 1
+
 input {{
-  name: "0"
+  name: "mel"
   data_type: {fp_type}
-  dims: [1, 80, 620, 1]
+  dims: [80, -1, 1]
 }}
 input {{
-  name: "1"
+  name: "z"
   data_type: {fp_type}
-  dims: [1, 8, 19840, 1]
+  dims: [8, -1, 1]
 }}
 output {{
-  name: "1991"
+  name: "audio"
   data_type: {fp_type}
-  dims: [1, 158720]
+  dims: [-1]
 }}
 """
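The template is a raw string written for `str.format`: single braces such as `{model_name}` and `{fp_type}` are placeholders, while the doubled braces `{{` and `}}` render as literal braces in the generated config. A shortened sketch of how such a template is filled is given below. The values passed to `format` (`"waveglow"`, `"TYPE_FP16"`) are illustrative assumptions rather than values taken from the script.

```python
# Shortened copy of the template: only the first input block is kept.
config_template = r"""
name: "{model_name}"
platform: "tensorrt_plan"
max_batch_size: 1
input {{
  name: "mel"
  data_type: {fp_type}
  dims: [80, -1, 1]
}}
"""

# Placeholder values are assumptions chosen for illustration.
config = config_template.format(model_name="waveglow", fp_type="TYPE_FP16")
print(config)
```

The new `dims` entries drop the leading batch dimension and use `-1` for the time axis, which appears to match the switch to `max_batch_size: 1` and the dynamic axes declared in the ONNX export.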

PyTorch/SpeechSynthesis/Tacotron2/inference.py

Lines changed: 2 additions & 1 deletion
@@ -50,6 +50,7 @@ def parse_args(parser):
                         help='full path to the input text (phareses separated by new line)')
     parser.add_argument('-o', '--output', required=True,
                         help='output folder to save audio (file per phrase)')
+    parser.add_argument('--suffix', type=str, default="", help="output filename suffix")
     parser.add_argument('--tacotron2', type=str,
                         help='full path to the Tacotron2 model checkpoint file')
     parser.add_argument('--waveglow', type=str,
@@ -242,7 +243,7 @@ def main():
     for i, audio in enumerate(audios):
         audio = audio[:mel_lengths[i]*args.stft_hop_length]
         audio = audio/torch.max(torch.abs(audio))
-        audio_path = args.output + "audio_"+str(i)+".wav"
+        audio_path = args.output+"audio_"+str(i)+"_"+args.suffix+".wav"
         write(audio_path, args.sampling_rate, audio.cpu().numpy())

     DLLogger.flush()
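The new filename is built by plain string concatenation, so two details follow from the expression itself: `args.output` must already end with a path separator, and the default empty `--suffix` leaves an underscore directly before `.wav`. A small sketch makes both visible; the helper `build_audio_path` is hypothetical and only mirrors the concatenation in the hunk above.

```python
def build_audio_path(output, index, suffix=""):
    """Mirror of the concatenation used for audio_path in inference.py."""
    return output + "audio_" + str(index) + "_" + suffix + ".wav"


print(build_audio_path("output/", 0))          # output/audio_0_.wav
print(build_audio_path("output/", 0, "fp16"))  # output/audio_0_fp16.wav
print(build_audio_path("output", 0, "fp16"))   # outputaudio_0_fp16.wav
```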

PyTorch/SpeechSynthesis/Tacotron2/notebooks/trtis/README.md

Lines changed: 7 additions & 7 deletions
@@ -106,38 +106,38 @@ cd /workspace/onnx-tensorrt/build && cmake .. -DCMAKE_CXX_FLAGS=-isystem\ /usr/l
 In order to export the model into the ONNX intermediate representation, type:

 ```bash
-python exports/export_waveglow_onnx.py --waveglow <waveglow_checkpoint> --wn-channels 256 --amp-run
+python exports/export_waveglow_onnx.py --waveglow <waveglow_checkpoint> --wn-channels 256 --amp-run --output ./output
 ```

 This will save the model as `waveglow.onnx` (you can change its name with the flag `--output <filename>`).

 With the model exported to ONNX, type the following to obtain a TRT engine and save it as `trtis_repo/waveglow/1/model.plan`:

 ```bash
-onnx2trt <exported_waveglow_onnx> -o trtis_repo/waveglow/1/model.plan -b 1 -w 8589934592
+python trt/export_onnx2trt.py --waveglow <exported_waveglow_onnx> -o trtis_repo/waveglow/1/ --fp16
 ```

 ### Setup the TRTIS server.

 Download the TRTIS container by typing:
 ```bash
-docker pull nvcr.io/nvidia/tensorrtserver:19.10-py3
-docker tag nvcr.io/nvidia/tensorrtserver:19.10-py3 tensorrtserver:19.10
+docker pull nvcr.io/nvidia/tensorrtserver:20.01-py3
+docker tag nvcr.io/nvidia/tensorrtserver:20.01-py3 tensorrtserver:20.01
 ```

 ### Setup the TRTIS notebook client.

 Now go to the root directory of the Tacotron 2 repo, and type:

 ```bash
-docker build -f Dockerfile_trtis_client --network=host -t speech_ai__tts_only:demo .
+docker build -f Dockerfile_trtis_client --network=host -t speech_ai_tts_only:demo .
 ```

 ### Run the TRTIS server.

 To run the server, type in the root directory of the Tacotron 2 repo:
 ```bash
-NV_GPU=1 nvidia-docker run -ti --ipc=host --network=host --rm -p8000:8000 -p8001:8001 -v $PWD/trtis_repo/:/models tensorrtserver:19.10 trtserver --model-store=/models --log-verbose 1
+NV_GPU=1 nvidia-docker run -ti --ipc=host --network=host --rm -p8000:8000 -p8001:8001 -v $PWD/trtis_repo/:/models tensorrtserver:20.01 trtserver --model-store=/models --log-verbose 1
 ```

 The flag `NV_GPU` selects the GPU the server is going to see. If we want it to see all the available GPUs, then run the above command without this flag.
@@ -147,7 +147,7 @@ By default, the model repository will be in `trtis_repo/`.

 Leave the server running. In another terminal, type:
 ```bash
-docker run -it --rm --network=host --device /dev/snd:/dev/snd --device /dev/usb:/dev/usb speech_ai__tts_only:demo bash ./run_this.sh
+docker run -it --rm --network=host --device /dev/snd:/dev/snd --device /dev/usb:/dev/usb speech_ai_tts_only:demo bash ./run_this.sh
 ```

 Open the URL in a browser, open `notebook.ipynb`, click play, and enjoy.
