diff --git a/‎Colab_Jasper_TRT_inference_demo.ipynb‎
Lines changed: 4835 additions & 0 deletions b/‎Colab_Jasper_TRT_inference_demo.ipynb‎
diff --git a/‎FasterTransformer/README.md‎
Lines changed: 49 additions & 12 deletions b/‎FasterTransformer/README.md‎
diff --git a/‎FasterTransformer/v2.1/.gitignore‎
Lines changed: 5 additions & 0 deletions b/‎FasterTransformer/v2.1/.gitignore‎
diff --git a/‎FasterTransformer/v2.1/.gitmodules‎
Lines changed: 7 additions & 0 deletions b/‎FasterTransformer/v2.1/.gitmodules‎
diff --git a/‎FasterTransformer/v2.1/CMakeLists.txt‎
Lines changed: 184 additions & 0 deletions b/‎FasterTransformer/v2.1/CMakeLists.txt‎
@@ -3,14 +3,16 @@
 This repository provides a script and recipe to run the highly optimized transformer for inference, and it is tested and maintained by NVIDIA.
 
 ## Table Of Contents
-- [Models overview](#model-overview)
-    * [FasterTransformer V1](#model-architecture)
-    * [FasterTransformer V2](#default-configuration)
-    * [Architecture matrix](#feature-support-matrix)
-- [Release notes](#release-notes)
-    * [Changelog](#changelog)
-    * [Known issues](#known-issues)
-
+- [FasterTransformer](#fastertransformer)
+  - [Table Of Contents](#table-of-contents)
+  - [Model overview](#model-overview)
+    - [FasterTransformer V1](#fastertransformer-v1)
+    - [FasterTransformer V2](#fastertransformer-v2)
+    - [FasterTransformer V2.1](#fastertransformer-v21)
+    - [Architecture matrix](#architecture-matrix)
+  - [Release notes](#release-notes)
+    - [Changelog](#changelog)
+  - [Known issues](#known-issues)
 
 ## Model overview
 
@@ -22,21 +24,56 @@ FasterTransformer V1 provides a highly optimized BERT equivalent Transformer lay
 
 FasterTransformer V2 adds a highly optimized OpenNMT-tf based decoder and decoding for inference to FasterTransformer V1, including a C++ API and a TensorFlow op. Experiments show that FasterTransformer V2 provides a 1.5x to 11x speedup on NVIDIA Tesla T4 and NVIDIA Tesla V100 for inference.
 
+### FasterTransformer V2.1
+
+FasterTransformer V2.1 optimizes some encoder and decoder kernels and adds support for PyTorch, for removing the padding of encoder inputs, and for sampling algorithms in decoding.
+
 ### Architecture matrix
 
 The following matrix shows the architecture differences between the models.
 
-| Architecure               | Encoder             |Decoder             |
-|-----------------------|--------------------------|---------------|
-|FasterTransformer V1  |  Yes |No |
-|FasterTransformer V2  |  Yes |Yes |
+| Architecture              | Encoder             | Decoder            | Decoding with beam search | Decoding with sampling |
+|---------------------------|---------------------|--------------------|---------------------------|------------------------|
+|FasterTransformer V1    |  Yes | No  | No  | No  |
+|FasterTransformer V2    |  Yes | Yes | Yes | No  |
+|FasterTransformer V2.1  |  Yes | Yes | Yes | Yes |
 
 
 ## Release notes
+
 FasterTransformer V1 will be deprecated in July 2020.
 
+FasterTransformer V2 will be deprecated in December 2020.
+
 ### Changelog
 
+June 2020
+- **Release FasterTransformer 2.1**
+- Add [effective transformer](https://github.com/bytedance/effective_transformer) support to the encoder.
+- Optimize the beam search kernels.
+- Add PyTorch op support.
+
+May 2020
+- Fix the bug that required the seq_len of the encoder to be larger than 3.
+- Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding. FasterTransformer does not compute the position encoding values; it only looks them up in the table.
+- Modify the method of loading the model in `translate_sample.py`.
+
+April 2020
+- Rename `decoding_opennmt.h` to `decoding_beamsearch.h`
+- Add DiverseSiblingsSearch for decoding.
+- Add sampling to decoding.
+  - The implementation is in `decoding_sampling.h`.
+  - Add top_k sampling and top_p sampling for decoding.
+- Refactor the TensorFlow custom op code.
+  - Merge `bert_transformer_op.h`, `bert_transformer_op.cu.cc` into `bert_transformer_op.cc`
+  - Merge `decoder.h`, `decoder.cu.cc` into `decoder.cc`
+  - Merge `decoding_beamsearch.h`, `decoding_beamsearch.cu.cc` into `decoding_beamsearch.cc`
+- Fix the bugs in the finalize function of `decoding.py`.
+- Fix the bug in the TensorFlow DiverseSiblingSearch.
+- Add the BLEU scorer `bleu_score.py` to `utils`. Note that the BLEU scorer requires Python 3.
+- Fuse the QKV GEMM of the encoder and the masked_multi_head_attention of the decoder.
+- Add dynamic batch size and dynamic sequence length features to all ops.
+
 March 2020
 - Add features in FasterTransformer 2.0
   - Fix the bug that the maximum sequence length of the decoder could not be larger than 128.
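
Note on the sampling entries in the changelog above: the README names top_k and top_p sampling but does not describe them. The following is a minimal pure-Python sketch of the two filtering rules as they are commonly defined. It is not the CUDA code in `decoding_sampling.h`, and every function name here (`softmax`, `top_k_candidates`, `top_p_candidates`, `sample`) is hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)
    return {i: probs[i] / mass for i in ranked}


def top_p_candidates(probs, p):
    """Keep the shortest most-probable prefix whose total mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}


def sample(candidates, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs draw the same token
probs = softmax([2.0, 1.0, 0.5, -1.0, -3.0])
top_k = top_k_candidates(probs, k=2)
top_p = top_p_candidates(probs, p=0.9)
print(sorted(top_k), sorted(top_p), sample(top_k, rng))
```

With these toy logits, top_k with `k=2` keeps token ids 0 and 1, while top_p with `p=0.9` also keeps id 2, because the first two tokens together carry less than 90% of the probability mass.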
 
@@ -0,0 +1,5 @@
+*~
+*.o
+build*/
+*.pyc
+.vscode/
@@ -0,0 +1,7 @@
+[submodule "sample/fastertransformer_bert/bert"]
+	path = sample/tensorflow_bert/bert
+	url = https://github.com/google-research/bert.git
+
+[submodule "OpenNMT-tf"]
+	path = OpenNMT-tf
+	url = https://github.com/OpenNMT/OpenNMT-tf
@@ -0,0 +1,184 @@
+# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+cmake_minimum_required(VERSION 3.8 FATAL_ERROR) # for PyTorch extensions, version should be greater than 3.13
+project(FasterTransformer LANGUAGES CXX CUDA)
+
+find_package(CUDA 10.0 REQUIRED)
+
+option(BUILD_TRT "Build in TensorRT mode" OFF)
+option(BUILD_TF "Build in TensorFlow mode" OFF)
+option(BUILD_THE "Build in PyTorch eager mode" OFF)
+option(BUILD_THS "Build in TorchScript class mode" OFF)
+option(BUILD_THSOP "Build in TorchScript OP mode" OFF)
+
+set(CXX_STD "11" CACHE STRING "C++ standard")
+
+set(CUDA_PATH ${CUDA_TOOLKIT_ROOT_DIR})
+
+set(TF_PATH "" CACHE STRING "TensorFlow path")
+#set(TF_PATH "/usr/local/lib/python3.5/dist-packages/tensorflow")
+
+if(BUILD_TF AND NOT TF_PATH)
+  message(FATAL_ERROR "TF_PATH must be set if BUILD_TF(=TensorFlow mode) is on.")
+endif()
+
+set(TRT_PATH "" CACHE STRING "TensorRT path")
+#set(TRT_PATH "/myspace/TensorRT-5.1.5.0")
+
+if(BUILD_TRT AND NOT TRT_PATH)
+  message(FATAL_ERROR "TRT_PATH must be set if BUILD_TRT(=TensorRT mode) is on.")
+endif()
+
+list(APPEND CMAKE_MODULE_PATH ${CUDA_PATH}/lib64)
+find_package(CUDA REQUIRED)
+
+# setting compiler flags
+set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}")	
+set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS}  -Xcompiler -Wall")
+
+if (SM STREQUAL 70 OR
+    SM STREQUAL 75 OR
+    SM STREQUAL 61 OR
+    SM STREQUAL 60)
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_${SM},code=\\\"sm_${SM},compute_${SM}\\\" -rdc=true")
+  if (SM STREQUAL 70 OR SM STREQUAL 75)
+    set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}    -DWMMA")
+    set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}  -DWMMA")
+    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DWMMA")
+  endif()
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  string(SUBSTRING ${SM} 0 1 SM_MAJOR)
+  string(SUBSTRING ${SM} 1 1 SM_MINOR)
+  set(ENV{TORCH_CUDA_ARCH_LIST} "${SM_MAJOR}.${SM_MINOR}")
+endif()
+message("-- Assign GPU architecture (sm=${SM})")
+
+else()
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS}  \
+                      -gencode=arch=compute_60,code=\\\"sm_60,compute_60\\\" \
+                      -gencode=arch=compute_61,code=\\\"sm_61,compute_61\\\" \
+                      -gencode=arch=compute_70,code=\\\"sm_70,compute_70\\\" \
+                      -gencode=arch=compute_75,code=\\\"sm_75,compute_75\\\" \
+                      -rdc=true")
+set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}    -DWMMA")
+set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}  -DWMMA")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DWMMA")
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  set(ENV{TORCH_CUDA_ARCH_LIST} "6.0;6.1;7.0;7.5")
+endif()
+message("-- Assign GPU architecture (sm=60,61,70,75)")
+endif()
+
+set(CMAKE_C_FLAGS_DEBUG    "${CMAKE_C_FLAGS_DEBUG}    -Wall -O0")
+set(CMAKE_CXX_FLAGS_DEBUG  "${CMAKE_CXX_FLAGS_DEBUG}  -Wall -O0")
+set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -O0 -G -Xcompiler -Wall  --ptxas-options=-v --resource-usage")
+
+set(CMAKE_CXX_STANDARD "${CXX_STD}")
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++${CXX_STD}")
+
+set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3")
+set(CMAKE_CUDA_FLAGS_RELEASE "${CMAKE_CUDA_FLAGS_RELEASE} -Xcompiler -O3")
+
+set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
+
+set(COMMON_HEADER_DIRS
+  ${PROJECT_SOURCE_DIR}
+  ${CUDA_PATH}/include
+)
+
+set(COMMON_LIB_DIRS
+  ${CUDA_PATH}/lib64
+)
+
+if(BUILD_TF)
+  list(APPEND COMMON_HEADER_DIRS ${TF_PATH}/include)
+  list(APPEND COMMON_LIB_DIRS ${TF_PATH})
+endif()
+
+if(BUILD_TRT)
+  list(APPEND COMMON_HEADER_DIRS ${TRT_PATH}/include)
+  list(APPEND COMMON_LIB_DIRS ${TRT_PATH}/lib)
+endif()
+
+set(PYTHON_PATH "python" CACHE STRING "Python path")
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from __future__ import print_function; import os; import torch;
+print(os.path.dirname(torch.__file__),end='');"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE TORCH_DIR)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "Torch config Error.")
+  endif()
+  list(APPEND CMAKE_PREFIX_PATH ${TORCH_DIR})
+  find_package(Torch REQUIRED)
+
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from __future__ import print_function; from distutils import sysconfig;
+print(sysconfig.get_python_inc());
+print(sysconfig.get_config_var('SO'));"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE _PYTHON_VALUES)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "Python config Error.")
+  endif()
+  string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES})
+  string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES})
+  list(GET _PYTHON_VALUES 0 PY_INCLUDE_DIR)
+  list(GET _PYTHON_VALUES 1 PY_SUFFIX)
+  list(APPEND COMMON_HEADER_DIRS ${PY_INCLUDE_DIR})
+
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from torch.utils import cpp_extension; print(' '.join(cpp_extension._prepare_ldflags([],True,False)),end='');"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE TORCH_LINK)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "PyTorch link config Error.")
+  endif()
+endif()
+
+
+include_directories(
+  ${COMMON_HEADER_DIRS}
+)
+
+link_directories(
+  ${COMMON_LIB_DIRS}
+)
+
+add_subdirectory(tools)
+add_subdirectory(fastertransformer)
+add_subdirectory(sample)
+
+if(BUILD_TF)
+  add_custom_target(copy ALL COMMENT "Copying tensorflow test scripts")
+  add_custom_command(TARGET copy
+      POST_BUILD
+      COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow/ ${PROJECT_BINARY_DIR} -r
+      COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow_bert ${PROJECT_BINARY_DIR}/tensorflow -r
+ )
+endif()
+
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  add_custom_target(copy ALL COMMENT "Copying pytorch test scripts")
+  add_custom_command(TARGET copy
+      POST_BUILD
+      COMMAND cp ${PROJECT_SOURCE_DIR}/sample/pytorch/ ${PROJECT_BINARY_DIR} -r
+      COMMAND mkdir -p ${PROJECT_BINARY_DIR}/pytorch/translation/data/
+      COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow/utils/translation/test.* ${PROJECT_BINARY_DIR}/pytorch/translation/data/
+ )
+endif()
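
Note on the `execute_process` calls in the CMakeLists.txt above: they shell out to Python to find the torch directory, the Python include directory and the extension suffix, using `distutils.sysconfig` and the `SO` config variable. `distutils` was removed from the standard library in Python 3.12, and `SO` has long been deprecated in favour of `EXT_SUFFIX`. The sketch below shows one way the same three values could be queried with the standard-library `sysconfig` module instead. It is an assumption about a possible replacement, not part of this commit, and the helper names `python_build_info` and `torch_dir` are hypothetical.

```python
import importlib.util
import os
import sysconfig


def python_build_info():
    """Return (include_dir, extension_suffix) for the running interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    # EXT_SUFFIX supersedes the deprecated 'SO' variable; fall back to 'SO'
    # only when EXT_SUFFIX is not defined for this interpreter.
    suffix = sysconfig.get_config_var("EXT_SUFFIX") or sysconfig.get_config_var("SO")
    return include_dir, suffix


def torch_dir():
    """Return the torch package directory, or None if torch is not installed."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


include_dir, suffix = python_build_info()
print(include_dir)
print(suffix)
print(torch_dir())
```

Running this with the interpreter passed as `PYTHON_PATH` prints the values that the CMake script stores in `PY_INCLUDE_DIR`, `PY_SUFFIX` and `TORCH_DIR`, which can help when diagnosing a "Python config Error." or "Torch config Error." message.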