feifeibear
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/.gitignore b/PyTorch/SpeechSynthesis/Tacotron2/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/Dockerfile b/PyTorch/SpeechSynthesis/Tacotron2/Dockerfile
Lines changed: 3 additions & 3 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/Dockerfile_trtis_client b/PyTorch/SpeechSynthesis/Tacotron2/Dockerfile_trtis_client
Lines changed: 1 addition & 1 deletion
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/README.md b/PyTorch/SpeechSynthesis/Tacotron2/README.md
Lines changed: 20 additions & 17 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/common/stft.py b/PyTorch/SpeechSynthesis/Tacotron2/common/stft.py
Lines changed: 1 addition & 1 deletion
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/exports/export_tacotron2_ts.py b/PyTorch/SpeechSynthesis/Tacotron2/exports/export_tacotron2_ts.py
Lines changed: 2 additions & 0 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_onnx.py b/PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_onnx.py
Lines changed: 47 additions & 24 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_trt_config.py b/PyTorch/SpeechSynthesis/Tacotron2/exports/export_waveglow_trt_config.py
Lines changed: 10 additions & 6 deletions
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/inference.py b/PyTorch/SpeechSynthesis/Tacotron2/inference.py
Lines changed: 2 additions & 1 deletion
diff --git a/PyTorch/SpeechSynthesis/Tacotron2/notebooks/trtis/README.md b/PyTorch/SpeechSynthesis/Tacotron2/notebooks/trtis/README.md
Lines changed: 7 additions & 7 deletions
@@ -0,0 +1,4 @@
+__pycache__/
+/checkpoints/
+/output/
+nvlog.json
@@ -1,6 +1,6 @@
-FROM nvcr.io/nvidia/pytorch:19.11-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.01-py3
+FROM ${FROM_IMAGE_NAME}
 
 ADD . /workspace/tacotron2
 WORKDIR /workspace/tacotron2
-RUN pip install -r requirements.txt
-RUN pip --no-cache-dir --no-cache install  'git+https://github.com/NVIDIA/dllogger'
+RUN pip install --no-cache-dir -r requirements.txt
@@ -11,7 +11,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM nvcr.io/nvidia/tensorrtserver:19.10-py3-clientsdk AS trt
+FROM nvcr.io/nvidia/tensorrtserver:20.01-py3-clientsdk AS trt
 FROM continuumio/miniconda3
 RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract mc iputils-ping wget
 
 
@@ -231,7 +231,7 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you
 have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* [PyTorch 20.01-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
 * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
 
@@ -370,7 +370,7 @@ WaveGlow models.
 
 * `--epochs` - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
 * `--learning-rate` - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
-* `--batch-size` - batch size (Tacotron 2 FP16/FP32: 128/64, WaveGlow FP16/FP32: 10/4)
+* `--batch-size` - batch size (Tacotron 2 FP16/FP32: 104/48, WaveGlow FP16/FP32: 10/4)
 * `--amp-run` - use mixed precision training
 
 #### Shared audio/STFT parameters
@@ -496,21 +496,21 @@ To benchmark the training performance on a specific batch size, run:
 * For 1 GPU
 	* FP16
         ```bash
-        python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path> --amp-run
+        python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp-run
         ```
 	* FP32
         ```bash
-        python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path>
+        python train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
         ```
 
 * For multiple GPUs
 	* FP16
         ```bash
-        python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path> --amp-run
+        python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path> --amp-run
         ```
 	* FP32
         ```bash
-        python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --training-files filelists/ljs_audio_text_train_subset_2500_filelist.txt --dataset-path <dataset-path>
+        python -m multiproc train.py -m Tacotron2 -o <output_dir> -lr 1e-3 --epochs 10 -bs <batch_size> --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --log-file nvlog.json --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_subset_2500_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --dataset-path <dataset-path>
         ```
 
 **WaveGlow**
@@ -579,10 +579,10 @@ All of the results were produced using the `train.py` script as described in the
 | WaveGlow FP16  | -2.2054 | -5.7602 |  -5.901 | -5.9706 | -6.0258 |
 | WaveGlow FP32  | -3.0327 |  -5.858 | -6.0056 | -6.0613 | -6.1087 |
 
-Tacotron 2 FP16 loss - batch size 128 (mean and std over 16 runs)
+Tacotron 2 FP16 loss - batch size 104 (mean and std over 16 runs)
 ![](./img/tacotron2_amp_loss.png "Tacotron 2 FP16 loss")
 
-Tacotron 2 FP32 loss - batch size 64 (mean and std over 16 runs)
+Tacotron 2 FP32 loss - batch size 48 (mean and std over 16 runs)
 ![](./img/tacotron2_fp32_loss.png "Tacotron 2 FP16 loss")
 
 WaveGlow FP16 loss - batch size 10 (mean and std over 16 runs)
@@ -597,7 +597,7 @@ WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs)
 ##### Training performance: NVIDIA DGX-1 (8x V100 16G)
 
 Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{AMP,FP32}_DGX1_16GB_8GPU.sh`
-training script in the PyTorch-19.06-py3 NGC container on NVIDIA DGX-1 with
+training script in the PyTorch-19.12-py3 NGC container on NVIDIA DGX-1 with
 8x V100 16G GPUs. Performance numbers (in output mel-spectrograms per second for
 Tacotron 2 and output samples per second for WaveGlow) were averaged over
 an entire training epoch.
@@ -606,9 +606,9 @@ This table shows the results for Tacotron 2:
 
 |Number of GPUs|Batch size per GPU|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
 |---:|---:|---:|---:|---:|---:|---:|
-|1|128@FP16, 64@FP32 | 20,992  | 12,933 | 1.62 | 1.00 | 1.00 |
-|4|128@FP16, 64@FP32 | 74,989  | 46,115 | 1.63 | 3.57 | 3.57 |
-|8|128@FP16, 64@FP32 | 140,060 | 88,719 | 1.58 | 6.67 | 6.86 |
+|1|104@FP16, 48@FP32 | 15,313 | 9,674 | 1.58 | 1.00 | 1.00 |
+|4|104@FP16, 48@FP32 | 53,661 | 32,778 | 1.64 | 3.50 | 3.39 |
+|8|104@FP16, 48@FP32 | 100,422 | 59,549 | 1.69 | 6.56 | 6.16 |
 
 The following table shows the results for WaveGlow:
 
@@ -626,9 +626,9 @@ The following table shows the expected training time for convergence for Tacotro
 
 |Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
 |---:|---:|---:|---:|---:|
-|1| 128@FP16, 64@FP32 | 153 | 234 | 1.53 |
-|4| 128@FP16, 64@FP32 | 42 | 64 | 1.54 |
-|8| 128@FP16, 64@FP32 | 22 | 33 | 1.52 |
+|1| 104@FP16, 48@FP32 | 193 | 312 | 1.62 |
+|4| 104@FP16, 48@FP32 | 53 | 85 | 1.58 |
+|8| 104@FP16, 48@FP32 | 31 | 45 | 1.47 |
 
 The following table shows the expected training time for convergence for WaveGlow (1001 epochs):
 
@@ -704,8 +704,11 @@ November 2019
 * Implemented training resume from checkpoint
 * Added notebook for running Tacotron 2 and WaveGlow in TRTIS.
 
-December  2019
-* Added `trt` subfolder for running Tacotron 2 and WaveGlow in TensorRT.
+December 2019
+* Added export and inference scripts for TensorRT. See [Tacotron2 TensorRT README](trt/README.md).
+
+January 2020
+* Updated batch sizes and performance results for Tacotron 2.
 
 ### Known issues
 
 
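The derived columns of the updated Tacotron 2 throughput table in the README diff can be reproduced from the raw mels-per-second figures. The sketch below is only a consistency check and assumes two definitions that the README does not spell out: speed-up is mixed-precision throughput divided by FP32 throughput, and weak scaling is N-GPU throughput divided by 1-GPU throughput at the same precision. The names `throughput` and `derived_metrics` are illustrative.

```python
# Throughput (mel-spectrograms per second) copied from the updated table.
# Keys are GPU counts; values are (mixed precision, FP32).
throughput = {
    1: (15313, 9674),
    4: (53661, 32778),
    8: (100422, 59549),
}


def derived_metrics(table):
    """Return {gpus: (speed_up, amp_scaling, fp32_scaling)}, rounded to 2 places.

    Assumed definitions (not stated in the README):
      speed_up  = mixed-precision throughput / FP32 throughput
      *_scaling = N-GPU throughput / 1-GPU throughput
    """
    base_amp, base_fp32 = table[1]
    return {
        gpus: (
            round(amp / fp32, 2),
            round(amp / base_amp, 2),
            round(fp32 / base_fp32, 2),
        )
        for gpus, (amp, fp32) in table.items()
    }


for gpus, row in derived_metrics(throughput).items():
    print(gpus, row)
```

Under these assumptions the computed values agree with the three derived columns of the new table rows.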
@@ -58,7 +58,7 @@ def __init__(self, filter_length=800, hop_length=200, win_length=800,
 
         forward_basis = torch.FloatTensor(fourier_basis[:, None, :])
         inverse_basis = torch.FloatTensor(
-            np.linalg.pinv(scale * fourier_basis).T[:, None, :])
+            np.linalg.pinv(scale * fourier_basis).T[:, None, :].astype(np.float32))
 
         if window is not None:
             assert(filter_length >= win_length)
 
@@ -27,6 +27,8 @@
 
 import torch
 import argparse
+import sys
+sys.path.append('./')
 from inference import checkpoint_from_distributed, unwrap_distributed, load_and_setup_model
 
 def parse_args(parser):
 
@@ -25,6 +25,7 @@
 #
 # *****************************************************************************
 
+import types
 import torch
 import argparse
 
@@ -113,32 +114,52 @@ def convert_1d_to_2d_(glow):
 
     glow.cuda()
 
-def test_inference(waveglow):
 
+def infer_onnx(self, spect, z, sigma=0.9):
 
-    from scipy.io.wavfile import write
+    spect = self.upsample(spect)
+    # trim conv artifacts. maybe pad spec to kernel multiple
+    time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
+    spect = spect[:, :, :-time_cutoff]
 
-    mel = torch.load("mel.pt").cuda()
-    # mel = torch.load("mel_spectrograms/LJ001-0015.wav.pt").cuda()
-    # mel = mel.unsqueeze(0)
-    mel_lengths = [mel.size(2)]
-    stride = 256
-    kernel_size = 1024
-    n_group = 8
-    z_size2 = (mel.size(2)-1)*stride+(kernel_size-1)+1
-    # corresponds to cutoff in infer_onnx
-    z_size2 = z_size2 - (kernel_size-stride)
-    z_size2 = z_size2//n_group
-    z = torch.randn(1, n_group, z_size2, 1).cuda()
-    mel = mel.unsqueeze(3)
+    length_spect_group = spect.size(2)//8
+    mel_dim = 80
+    batch_size = spect.size(0)
 
-    with torch.no_grad():
-        audios = waveglow(mel, z)
+    spect = torch.squeeze(spect, 3)
+    spect = spect.view((batch_size, mel_dim, length_spect_group, self.n_group))
+    spect = spect.permute(0, 2, 1, 3)
+    spect = spect.contiguous()
+    spect = spect.view((batch_size, length_spect_group, self.n_group*mel_dim))
+    spect = spect.permute(0, 2, 1)
+    spect = torch.unsqueeze(spect, 3)
+    spect = spect.contiguous()
+
+    audio = z[:, :self.n_remaining_channels, :, :]
+    z = z[:, self.n_remaining_channels:self.n_group, :, :]
+    audio = sigma*audio
+
+    for k in reversed(range(self.n_flows)):
+        n_half = int(audio.size(1) / 2)
+        audio_0 = audio[:, :n_half, :, :]
+        audio_1 = audio[:, n_half:(n_half+n_half), :, :]
 
-    for i, audio in enumerate(audios):
-        audio = audio[:mel_lengths[i]*256]
-        audio = audio/torch.max(torch.abs(audio))
-        write("audio_pyt.wav", 22050, audio.cpu().numpy())
+        output = self.WN[k]((audio_0, spect))
+        s = output[:, n_half:(n_half+n_half), :, :]
+        b = output[:, :n_half, :, :]
+        audio_1 = (audio_1 - b) / torch.exp(s)
+        audio = torch.cat([audio_0, audio_1], 1)
+
+        audio = self.convinv[k](audio)
+
+        if k % self.n_early_every == 0 and k > 0:
+            audio = torch.cat((z[:, :self.n_early_size, :, :], audio), 1)
+            z = z[:, self.n_early_size:self.n_group, :, :]
+
+    audio = torch.squeeze(audio, 3)
+    audio = audio.permute(0,2,1).contiguous().view(batch_size, (length_spect_group * self.n_group))
+
+    return audio
 
 
 def export_onnx(parser, args):
@@ -166,12 +187,16 @@ def export_onnx(parser, args):
 
         # export to ONNX
         convert_1d_to_2d_(waveglow)
-        waveglow.forward = waveglow.infer_onnx
+
+        fType = types.MethodType
+        waveglow.forward = fType(infer_onnx, waveglow)
+
         if args.amp_run:
             waveglow.half()
         mel = mel.unsqueeze(3)
 
         opset_version = 10
+
         torch.onnx.export(waveglow, (mel, z), args.output+"/"+"waveglow.onnx",
                           opset_version=opset_version,
                           do_constant_folding=True,
@@ -181,8 +206,6 @@ def export_onnx(parser, args):
                                         "z":     {0: "batch_size", 2: "z_seq"},
                                         "audio": {0: "batch_size", 1: "audio_seq"}})
 
-    test_inference(waveglow)
-
 
 def main():
 
 
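The export script now binds the module-level `infer_onnx` function to the model instance with `types.MethodType` rather than pointing `forward` at a method defined on the class. A minimal sketch of that binding pattern with a toy class follows; `Toy` and `scaled` are hypothetical stand-ins and not part of the repository.

```python
import types


class Toy:
    """Stand-in for a model object; it only carries one attribute."""

    def __init__(self, gain):
        self.gain = gain

    def forward(self, x):
        return x


def scaled(self, x, sigma=0.9):
    # A free function written like a method: `self` is supplied by the binding.
    return self.gain * sigma * x


model = Toy(gain=2.0)
other = Toy(gain=2.0)

# Bind the function to this one instance, mirroring
# `waveglow.forward = types.MethodType(infer_onnx, waveglow)` in the diff.
model.forward = types.MethodType(scaled, model)

print(model.forward(10.0))  # calls scaled() with self bound to `model`
print(other.forward(10.0))  # other instances keep the class-level forward
```

The binding affects only the instance it is applied to, which keeps the export-specific forward pass out of the model class itself.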
@@ -64,20 +64,24 @@ def main():
     config_template = r"""
 name: "{model_name}"
 platform: "tensorrt_plan"
+default_model_filename: "waveglow_fp16.engine"
+
+max_batch_size: 1
+
 input {{
-  name: "0"
+  name: "mel"
   data_type: {fp_type}
-  dims: [1, 80, 620, 1]
+  dims: [80, -1, 1]
 }}
 input {{
-  name: "1"
+  name: "z"
   data_type: {fp_type}
-  dims: [1, 8, 19840, 1]
+  dims: [8, -1, 1]
 }}
 output {{
-  name: "1991"
+  name: "audio"
   data_type: {fp_type}
-  dims: [1, 158720]
+  dims: [-1]
 }}
 """
 
 
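The fixed dimensions removed from the config template (`620`, `19840`, `158720`) are consistent with a 620-frame mel-spectrogram under the arithmetic of the deleted `test_inference` helper in `export_waveglow_onnx.py`. The sketch below repeats that arithmetic, which may help when sizing `z` now that the sequence dimensions are dynamic. The function name is hypothetical, and the default constants (stride 256, kernel size 1024, 8 groups) are taken from the removed helper, so they are assumptions for any other checkpoint.

```python
def waveglow_shapes(mel_frames, stride=256, kernel_size=1024, n_group=8):
    """Return (z_length, audio_samples) for a mel-spectrogram of `mel_frames`."""
    # Length after the transposed-convolution upsampling.
    upsampled = (mel_frames - 1) * stride + (kernel_size - 1) + 1
    # Trim the tail, matching the cutoff applied in infer_onnx.
    upsampled -= kernel_size - stride
    # Samples are folded into n_group channels, so z is n_group times shorter.
    z_length = upsampled // n_group
    return z_length, z_length * n_group


print(waveglow_shapes(620))  # (19840, 158720)
```

With these constants the trimmed length reduces to `mel_frames * stride`, so `z` has `mel_frames * 32` steps per group.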
@@ -50,6 +50,7 @@ def parse_args(parser):
                         help='full path to the input text (phareses separated by new line)')
     parser.add_argument('-o', '--output', required=True,
                         help='output folder to save audio (file per phrase)')
+    parser.add_argument('--suffix', type=str, default="", help="output filename suffix")
     parser.add_argument('--tacotron2', type=str,
                         help='full path to the Tacotron2 model checkpoint file')
     parser.add_argument('--waveglow', type=str,
@@ -242,7 +243,7 @@ def main():
     for i, audio in enumerate(audios):
         audio = audio[:mel_lengths[i]*args.stft_hop_length]
         audio = audio/torch.max(torch.abs(audio))
-        audio_path = args.output + "audio_"+str(i)+".wav"
+        audio_path = args.output+"audio_"+str(i)+"_"+args.suffix+".wav"
         write(audio_path, args.sampling_rate, audio.cpu().numpy())
 
     DLLogger.flush()
 
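With the new `--suffix` option the output file name is built by plain string concatenation, so the default empty suffix produces a trailing underscore (for example `audio_0_.wav`), and `--output` has to end with a path separator. The sketch below contrasts that behaviour with a hypothetical alternative, not part of this change, that joins the path with `os.path.join` and omits the underscore when no suffix is given. Both function names are illustrative.

```python
import os


def audio_path_as_in_diff(output, index, suffix=""):
    # Mirrors the concatenation on the added line in inference.py.
    return output + "audio_" + str(index) + "_" + suffix + ".wav"


def audio_path(output_dir, index, suffix=""):
    """Build '<output_dir>/audio_<index>[_<suffix>].wav' (hypothetical helper)."""
    name = "audio_" + str(index)
    if suffix:
        name += "_" + suffix
    return os.path.join(output_dir, name + ".wav")


print(audio_path_as_in_diff("output/", 0))
print(audio_path("output", 0))
print(audio_path("output", 0, "fp16"))
```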
@@ -106,38 +106,38 @@ cd /workspace/onnx-tensorrt/build && cmake .. -DCMAKE_CXX_FLAGS=-isystem\ /usr/l
 In order to export the model into the ONNX intermediate representation, type:
 
 ```bash
-python exports/export_waveglow_onnx.py --waveglow <waveglow_checkpoint> --wn-channels 256 --amp-run
+python exports/export_waveglow_onnx.py --waveglow <waveglow_checkpoint> --wn-channels 256 --amp-run --output ./output
 ```
 
 This will save the model as `waveglow.onnx` (you can change its name with the flag `--output <filename>`).
 
 With the model exported to ONNX, type the following to obtain a TRT engine and save it as `trtis_repo/waveglow/1/model.plan`:
 
 ```bash
-onnx2trt <exported_waveglow_onnx> -o trtis_repo/waveglow/1/model.plan -b 1 -w 8589934592
+python trt/export_onnx2trt.py --waveglow  <exported_waveglow_onnx> -o trtis_repo/waveglow/1/ --fp16
 ```
 
 ### Setup the TRTIS server.
 
 Download the TRTIS container by typing:
 ```bash
-docker pull nvcr.io/nvidia/tensorrtserver:19.10-py3
-docker tag nvcr.io/nvidia/tensorrtserver:19.10-py3 tensorrtserver:19.10
+docker pull nvcr.io/nvidia/tensorrtserver:20.01-py3
+docker tag nvcr.io/nvidia/tensorrtserver:20.01-py3 tensorrtserver:20.01
 ```
 
 ### Setup the TRTIS notebook client.
 
 Now go to the root directory of the Tacotron 2 repo, and type: 
 
 ```bash
-docker build -f Dockerfile_trtis_client --network=host -t speech_ai__tts_only:demo .
+docker build -f Dockerfile_trtis_client --network=host -t speech_ai_tts_only:demo .
 ```
 
 ### Run the TRTIS server.
 
 To run the server, type in the root directory of the Tacotron 2 repo:
 ```bash
-NV_GPU=1 nvidia-docker run -ti --ipc=host --network=host --rm -p8000:8000 -p8001:8001 -v $PWD/trtis_repo/:/models tensorrtserver:19.10 trtserver --model-store=/models --log-verbose 1
+NV_GPU=1 nvidia-docker run -ti --ipc=host --network=host --rm -p8000:8000 -p8001:8001 -v $PWD/trtis_repo/:/models tensorrtserver:20.01 trtserver --model-store=/models --log-verbose 1
 ```
 
 The flag `NV_GPU` selects the GPU the server is going to see. If we want it to see all the available GPUs, then run the above command without this flag.
@@ -147,7 +147,7 @@ By default, the model repository will be in `trtis_repo/`.
 
 Leave the server running. In another terminal, type:
 ```bash
-docker run -it --rm --network=host --device /dev/snd:/dev/snd --device /dev/usb:/dev/usb speech_ai__tts_only:demo bash ./run_this.sh
+docker run -it --rm --network=host --device /dev/snd:/dev/snd --device /dev/usb:/dev/usb speech_ai_tts_only:demo bash ./run_this.sh
 ```
 
 Open the URL in a browser, open `notebook.ipynb`, click play, and enjoy.