Commit da2e33a

Merge branch 'master' into rn50_qat_v2
2 parents 6d2357a + 4d80805 commit da2e33a

488 files changed

Lines changed: 121909 additions & 4019 deletions

Colab_Jasper_TRT_inference_demo.ipynb

Lines changed: 4835 additions & 0 deletions
Large diffs are not rendered by default.

FasterTransformer/README.md

Lines changed: 49 additions & 12 deletions
@@ -3,14 +3,16 @@
 This repository provides a script and recipe to run the highly optimized transformer for inference, and it is tested and maintained by NVIDIA.
 
 ## Table Of Contents
-- [Models overview](#model-overview)
-    * [FasterTransformer V1](#model-architecture)
-    * [FasterTransformer V2](#default-configuration)
-    * [Architecture matrix](#feature-support-matrix)
-- [Release notes](#release-notes)
-    * [Changelog](#changelog)
-    * [Known issues](#known-issues)
-
+- [FasterTransformer](#fastertransformer)
+  - [Table Of Contents](#table-of-contents)
+  - [Model overview](#model-overview)
+    - [FasterTransformer V1](#fastertransformer-v1)
+    - [FasterTransformer V2](#fastertransformer-v2)
+    - [FasterTransformer V2.1](#fastertransformer-v21)
+    - [Architecture matrix](#architecture-matrix)
+  - [Release notes](#release-notes)
+    - [Changelog](#changelog)
+    - [Known issues](#known-issues)
 
 ## Model overview
 
@@ -22,21 +24,56 @@ FasterTransformer V1 provides a highly optimized BERT equivalent Transformer lay
 
 FasterTransformer V2 adds a highly optimized OpenNMT-tf based decoder and decoding for inference in FasterTransformer V1, including C++ API and TensorFlow op. The experiments show that FasterTransformer V2 can provide 1.5 to 11 times speedup on NVIDIA Tesla T4 and NVIDIA Tesla V100 for inference.
 
+### FasterTransformer V2.1
+
+FasterTransformer V2.1 optimizes some encoder and decoder kernels and adds support for PyTorch, for removing the padding of the encoder, and for sampling algorithms in decoding.
+
 ### Architecture matrix
 
 The following matrix shows the architecture differences between the versions.
 
-| Architecure           | Encoder                  |Decoder        |
-|-----------------------|--------------------------|---------------|
-|FasterTransformer V1   |  Yes                     |No             |
-|FasterTransformer V2   |  Yes                     |Yes            |
+| Architecture           | Encoder | Decoder | Decoding with beam search | Decoding with sampling |
+|------------------------|---------|---------|---------------------------|------------------------|
+| FasterTransformer V1   | Yes     | No      | No                        | No                     |
+| FasterTransformer V2   | Yes     | Yes     | Yes                       | No                     |
+| FasterTransformer V2.1 | Yes     | Yes     | Yes                       | Yes                    |
 
 
 ## Release notes
+
 FasterTransformer V1 will be deprecated in July 2020.
 
+FasterTransformer V2 will be deprecated in Dec 2020.
+
 ### Changelog
 
+June 2020
+- **Release the FasterTransformer 2.1**
+- Add [effective transformer](https://github.com/bytedance/effective_transformer) support to the encoder.
+- Optimize the beam search kernels.
+- Add PyTorch op support.
+
+May 2020
+- Fix the bug that seq_len of encoder must be larger than 3.
+- Add the position_encoding of decoding as an input of FasterTransformer decoding, which makes it convenient to use different types of position encoding. FasterTransformer does not compute the position encoding values; it only looks them up in the table.
+- Modify the method of loading the model in `translate_sample.py`.
+
+April 2020
+- Rename `decoding_opennmt.h` to `decoding_beamsearch.h`
+- Add DiverseSiblingsSearch for decoding.
+- Add sampling into Decoding
+  - The implementation is in `decoding_sampling.h`
+  - Add top_k sampling, top_p sampling for decoding.
+- Refactor the TensorFlow custom op code.
+  - Merge `bert_transformer_op.h`, `bert_transformer_op.cu.cc` into `bert_transformer_op.cc`
+  - Merge `decoder.h`, `decoder.cu.cc` into `decoder.cc`
+  - Merge `decoding_beamsearch.h`, `decoding_beamsearch.cu.cc` into `decoding_beamsearch.cc`
+- Fix the bugs of the finalize function in `decoding.py`.
+- Fix the bug of tf DiverseSiblingSearch.
+- Add BLEU scorer `bleu_score.py` into `utils`. Note that the BLEU score requires python3.
+- Fuse QKV Gemm of encoder and masked_multi_head_attention of decoder.
+- Add dynamic batch size and dynamic sequence length features into all ops.
+
 March 2020
 - Add feature in FasterTransformer 2.0
 - Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
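
Note on the changelog entry "Add top_k sampling, top_p sampling for decoding": the following is a minimal pure-Python sketch of what these two sampling strategies generally do to a next-token distribution. It is not the CUDA implementation in `decoding_sampling.h`; the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def top_p_filter(probs, p):
    """Keep the shortest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Illustrative values only: four candidate tokens with made-up logits.
probs = softmax([2.0, 1.0, 0.5, -1.0])
next_token_top_k = sample(top_k_filter(probs, k=2))
next_token_top_p = sample(top_p_filter(probs, p=0.9))
```

Top-k fixes the number of candidates, while top-p lets the candidate set grow or shrink with how peaked the distribution is; beam search, by contrast, is deterministic and tracks several hypotheses at once.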

FasterTransformer/v2.1/.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+*~
+*.o
+build*/
+*.pyc
+.vscode/

FasterTransformer/v2.1/.gitmodules

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[submodule "sample/fastertransformer_bert/bert"]
+    path = sample/tensorflow_bert/bert
+    url = https://github.com/google-research/bert.git
+
+[submodule "OpenNMT-tf"]
+    path = OpenNMT-tf
+    url = https://github.com/OpenNMT/OpenNMT-tf
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+cmake_minimum_required(VERSION 3.8 FATAL_ERROR) # for PyTorch extensions, version should be greater than 3.13
+project(FasterTransformer LANGUAGES CXX CUDA)
+
+find_package(CUDA 10.0 REQUIRED)
+
+option(BUILD_TRT "Build in TensorRT mode" OFF)
+option(BUILD_TF "Build in TensorFlow mode" OFF)
+option(BUILD_THE "Build in PyTorch eager mode" OFF)
+option(BUILD_THS "Build in TorchScript class mode" OFF)
+option(BUILD_THSOP "Build in TorchScript OP mode" OFF)
+
+set(CXX_STD "11" CACHE STRING "C++ standard")
+
+set(CUDA_PATH ${CUDA_TOOLKIT_ROOT_DIR})
+
+set(TF_PATH "" CACHE STRING "TensorFlow path")
+#set(TF_PATH "/usr/local/lib/python3.5/dist-packages/tensorflow")
+
+if(BUILD_TF AND NOT TF_PATH)
+  message(FATAL_ERROR "TF_PATH must be set if BUILD_TF(=TensorFlow mode) is on.")
+endif()
+
+set(TRT_PATH "" CACHE STRING "TensorRT path")
+#set(TRT_PATH "/myspace/TensorRT-5.1.5.0")
+
+if(BUILD_TRT AND NOT TRT_PATH)
+  message(FATAL_ERROR "TRT_PATH must be set if BUILD_TRT(=TensorRT mode) is on.")
+endif()
+
+list(APPEND CMAKE_MODULE_PATH ${CUDA_PATH}/lib64)
+find_package(CUDA REQUIRED)
+
+# setting compiler flags
+set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}")
+set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -Wall")
+
+if (SM STREQUAL 70 OR
+    SM STREQUAL 75 OR
+    SM STREQUAL 61 OR
+    SM STREQUAL 60)
+  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_${SM},code=\\\"sm_${SM},compute_${SM}\\\" -rdc=true")
+  if (SM STREQUAL 70 OR SM STREQUAL 75)
+    set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}    -DWMMA")
+    set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}  -DWMMA")
+    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DWMMA")
+  endif()
+  if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+    string(SUBSTRING ${SM} 0 1 SM_MAJOR)
+    string(SUBSTRING ${SM} 1 1 SM_MINOR)
+    set(ENV{TORCH_CUDA_ARCH_LIST} "${SM_MAJOR}.${SM_MINOR}")
+  endif()
+  message("-- Assign GPU architecture (sm=${SM})")
+
+else()
+  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS}  \
+                      -gencode=arch=compute_60,code=\\\"sm_60,compute_60\\\" \
+                      -gencode=arch=compute_61,code=\\\"sm_61,compute_61\\\" \
+                      -gencode=arch=compute_70,code=\\\"sm_70,compute_70\\\" \
+                      -gencode=arch=compute_75,code=\\\"sm_75,compute_75\\\" \
+                      -rdc=true")
+  set(CMAKE_C_FLAGS    "${CMAKE_C_FLAGS}    -DWMMA")
+  set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS}  -DWMMA")
+  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DWMMA")
+  if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+    set(ENV{TORCH_CUDA_ARCH_LIST} "6.0;6.1;7.0;7.5")
+  endif()
+  message("-- Assign GPU architecture (sm=60,61,70,75)")
+endif()
+
+set(CMAKE_C_FLAGS_DEBUG    "${CMAKE_C_FLAGS_DEBUG}    -Wall -O0")
+set(CMAKE_CXX_FLAGS_DEBUG  "${CMAKE_CXX_FLAGS_DEBUG}  -Wall -O0")
+set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -O0 -G -Xcompiler -Wall  --ptxas-options=-v --resource-usage")
+
+set(CMAKE_CXX_STANDARD "${CXX_STD}")
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
+set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --std=c++${CXX_STD}")
+
+set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3")
+set(CMAKE_CUDA_FLAGS_RELEASE "${CMAKE_CUDA_FLAGS_RELEASE} -Xcompiler -O3")
+
+set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
+
+set(COMMON_HEADER_DIRS
+  ${PROJECT_SOURCE_DIR}
+  ${CUDA_PATH}/include
+)
+
+set(COMMON_LIB_DIRS
+  ${CUDA_PATH}/lib64
+)
+
+if(BUILD_TF)
+  list(APPEND COMMON_HEADER_DIRS ${TF_PATH}/include)
+  list(APPEND COMMON_LIB_DIRS ${TF_PATH})
+endif()
+
+if(BUILD_TRT)
+  list(APPEND COMMON_HEADER_DIRS ${TRT_PATH}/include)
+  list(APPEND COMMON_LIB_DIRS ${TRT_PATH}/lib)
+endif()
+
+set(PYTHON_PATH "python" CACHE STRING "Python path")
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from __future__ import print_function; import os; import torch;
+print(os.path.dirname(torch.__file__),end='');"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE TORCH_DIR)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "Torch config Error.")
+  endif()
+  list(APPEND CMAKE_PREFIX_PATH ${TORCH_DIR})
+  find_package(Torch REQUIRED)
+
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from __future__ import print_function; from distutils import sysconfig;
+print(sysconfig.get_python_inc());
+print(sysconfig.get_config_var('SO'));"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE _PYTHON_VALUES)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "Python config Error.")
+  endif()
+  string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES})
+  string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES})
+  list(GET _PYTHON_VALUES 0 PY_INCLUDE_DIR)
+  list(GET _PYTHON_VALUES 1 PY_SUFFIX)
+  list(APPEND COMMON_HEADER_DIRS ${PY_INCLUDE_DIR})
+
+  execute_process(COMMAND ${PYTHON_PATH} "-c" "from torch.utils import cpp_extension; print(' '.join(cpp_extension._prepare_ldflags([],True,False)),end='');"
+                  RESULT_VARIABLE _PYTHON_SUCCESS
+                  OUTPUT_VARIABLE TORCH_LINK)
+  if (NOT _PYTHON_SUCCESS MATCHES 0)
+      message(FATAL_ERROR "PyTorch link config Error.")
+  endif()
+endif()
+
+
+include_directories(
+  ${COMMON_HEADER_DIRS}
+)
+
+link_directories(
+  ${COMMON_LIB_DIRS}
+)
+
+add_subdirectory(tools)
+add_subdirectory(fastertransformer)
+add_subdirectory(sample)
+
+if(BUILD_TF)
+  add_custom_target(copy ALL COMMENT "Copying tensorflow test scripts")
+  add_custom_command(TARGET copy
+    POST_BUILD
+    COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow/ ${PROJECT_BINARY_DIR} -r
+    COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow_bert ${PROJECT_BINARY_DIR}/tensorflow -r
+  )
+endif()
+
+if(BUILD_THE OR BUILD_THS OR BUILD_THSOP)
+  add_custom_target(copy ALL COMMENT "Copying pytorch test scripts")
+  add_custom_command(TARGET copy
+    POST_BUILD
+    COMMAND cp ${PROJECT_SOURCE_DIR}/sample/pytorch/ ${PROJECT_BINARY_DIR} -r
+    COMMAND mkdir -p ${PROJECT_BINARY_DIR}/pytorch/translation/data/
+    COMMAND cp ${PROJECT_SOURCE_DIR}/sample/tensorflow/utils/translation/test.* ${PROJECT_BINARY_DIR}/pytorch/translation/data/
+  )
+endif()
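
Note on the `SM` handling in the CMake script above: the branch picks one GPU architecture when `SM` is 60, 61, 70 or 75, and otherwise builds for all four. The Python sketch below mirrors that decision so the resulting values are easy to inspect. `arch_settings` is a hypothetical helper written for this note and is not part of the repository; in the real script, `TORCH_CUDA_ARCH_LIST` is exported only when one of the PyTorch build options (`BUILD_THE`, `BUILD_THS`, `BUILD_THSOP`) is on.

```python
SUPPORTED_SM = ("60", "61", "70", "75")


def arch_settings(sm=None):
    """Mirror the SM branch of the CMake script for a given -DSM value."""
    sm = "" if sm is None else str(sm)
    if sm in SUPPORTED_SM:
        # A single recognized architecture was requested.
        targets = [sm]
        # -DWMMA is only added for sm_70 and sm_75 in this branch.
        wmma = sm in ("70", "75")
    else:
        # Unset or unrecognized values fall back to all four targets,
        # and the script defines WMMA unconditionally here.
        targets = list(SUPPORTED_SM)
        wmma = True
    return {
        "gencode": [
            f'-gencode=arch=compute_{t},code="sm_{t},compute_{t}"'
            for t in targets
        ],
        "wmma": wmma,
        # Same split as string(SUBSTRING ...): "75" becomes "7.5".
        "torch_cuda_arch_list": ";".join(f"{t[0]}.{t[1]}" for t in targets),
    }


settings = arch_settings("75")
```

One consequence worth noticing is that an unsupported value such as `SM=80` is not rejected; it silently takes the fallback branch and builds for 60, 61, 70 and 75.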
