Skip to content

Commit bd89bca

Browse files
committed
[FasterTransformer] Add translation_sample, fix some bugs.
1 parent b162523 commit bd89bca

26 files changed

Lines changed: 38800 additions & 205 deletions

FasterTransformer/README.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,16 @@ FasterTransformer V1 will be deprecated on July 2020.
3737

3838
### Changelog
3939

40+
March 2020
41+
- Add feature in FasterTransformer 2.0
42+
- Fix the bug of maximum sequence length of decoder cannot be larger than 128.
43+
- Add `translate_sample.py` to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
44+
- Fix the bug that decoding does not check finish or not after each step.
45+
- Fix the bug of decoder about max_seq_len.
46+
- Modify the decoding model structure to fit the OpenNMT-tf decoding model.
47+
- Add a layer normalization layer after decoder.
48+
- Add a normalization for inputs of decoder
49+
4050
February 2020
4151
* Release the FasterTransformer 2.0
4252
* Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow OP.

FasterTransformer/v2/README.md

Lines changed: 58 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ This repository provides a script and recipe to run the highly optimized transfo
2323
* [Inference process](#inference-process)
2424
* [Encoder process](#encoder-process)
2525
* [Decoder and Decoding process](#decoder-and-decoding-process)
26+
* [Translation process](#translation-process)
2627
- [Performance](#performance)
2728
* [Encoder performance](#encoder-performance)
2829
* [Decoder performance on T4](#decoder-performance-on-t4)
@@ -51,7 +52,7 @@ FasterTransformer is built on top of CUDA and cuBLAS, providing the C++ API and
5152

5253
The following configurations are supported in the FasterTransformer encoder.
5354
- Batch size (B<sub>1</sub>): smaller or equal to 512
54-
- Sequence length (S): smaller or equal to 128
55+
- Sequence length (S): larger than 3 and smaller or equal to 1024
5556
- Head number (H) and size per head (N):
5657
- 12 heads * 64 per heads
5758
- 4 heads * 32 per heads
@@ -60,7 +61,7 @@ The following configurations are supported in the FasterTransformer encoder.
6061

6162
The following configurations are supported in the FasterTransformer decoder and decoding.
6263
- Batch size (B<sub>1</sub>) * beam width (B<sub>2</sub>): smaller than 1024
63-
- Sequence length (S): smaller or equal to 128
64+
- Sequence length (S): smaller than 1024
6465
- Head number (H): 8 and 12
6566
- Size per head (N): 64
6667
- Vocabulary size (V): from 64 to 30000
@@ -154,10 +155,9 @@ nvidia-docker run -ti nvcr.io/nvidia/tensorflow:19.07-py2 bash
154155

155156
```bash
156157
git clone https://github.com/NVIDIA/DeepLearningExamples
157-
cd DeepLearningExamples
158+
cd DeepLearningExamples/FasterTransformer/v2
158159
git submodule init
159160
git submodule update
160-
cd FasterTransformer/v2
161161
```
162162

163163
3. Build the project.
@@ -356,6 +356,7 @@ The `sample/` folder contains useful sample codes for FasterTransformer:
356356
* `sample/tensorflow/decoding_sample.py` - TensorFlow decoding sample codes
357357
* `sample/tensorflow/encoder_decoder_sample.py` - TensorFlow `encoder_decoder` sample codes
358358
* `sample/tensorflow/encoder_decoding_sample.py` - TensorFlow `encoder_decoding` sample codes
359+
* `sample/tensorflow/translate_sample.py` - TensorFlow translation sample codes
359360

360361
### Command-line options
361362

@@ -367,6 +368,7 @@ python decoder_sample.py --help
367368
python decoding_sample.py --help
368369
python encoder_decoder_sample.py --help
369370
python encoder_decoding_sample.py --help
371+
python translate_sample.py --help
370372
```
371373

372374
### Inference process
@@ -540,14 +542,16 @@ python decoder_sample.py \
540542
The outputs should be similar to the following:
541543

542544
```bash
543-
[[INFO][PYTHON] step:][1][max diff: ][9.77516174e-06][True]
544-
[[INFO][PYTHON] step:][2][max diff: ][1.04904175e-05][True]
545+
[[INFO][PYTHON] step:][0][max diff: ][5.00679e-06][ op val: ][2.3735888][ tf val: ][2.37359381][True]
546+
[[INFO][PYTHON] step:][1][max diff: ][4.64916229e-06][ op val: ][-0.588810563][ tf val: ][-0.588815212][True]
547+
[[INFO][PYTHON] step:][2][max diff: ][5.36441803e-06][ op val: ][-1.46514082][ tf val: ][-1.46514618][True]
545548
...
546-
[[INFO][PYTHON] step:][31][max diff: ][1.21593475e-05][True]
547-
[[INFO][PYTHON] step:][32][max diff: ][1.04382634e-05][True]
549+
[[INFO][PYTHON] step:][29][max diff: ][4.529953e-06][ op val: ][2.88768935][ tf val: ][2.88769388][True]
550+
[[INFO][PYTHON] step:][30][max diff: ][4.17232513e-06][ op val: ][-1.28717053][ tf val: ][-1.2871747][True]
551+
[[INFO][PYTHON] step:][31][max diff: ][4.05311584e-06][ op val: ][-1.01830876][ tf val: ][-1.01831281][True]
548552
```
549553

550-
The results show that the differences between the decoder of TensorFlow and decoder are smaller than threshold.
554+
The results show that the differences between the decoder of TensorFlow and decoder are smaller than threshold. Note that the differences are absolute differences, so the differences may be large when the op val is large. In this case, the differences are larger than the threshold and the checking will return "False", but it may be not affect the final results.
551555

552556
The option `decoder_type` decides to use the decoder of TensorFlow or decoder of FasterTransformer. `decoder_type 2` uses both decoders and compares their results.
553557

@@ -606,15 +610,13 @@ python decoding_sample.py \
606610
The outputs should be similar to the following:
607611

608612
```bash
609-
[INFO] Before finalize:
610-
result before finalize cross-check: True
613+
Output ids cross-check: True
611614

612615
Parent ids cross-check: True
613616

614-
sequence lengths cross-check: True
617+
Sequence lengths cross-check: True
615618

616-
[INFO] After finalize:
617-
result after cross-check: True
619+
Finalized output ids cross-check: True
618620
```
619621

620622
Note that the results of OP and the results of TensorFlow are often different in the random inputs and weights.
@@ -635,6 +637,34 @@ python encoder_decoding_sample.py \
635637
--data_type fp32
636638
```
637639

640+
#### Translation progress
641+
642+
For translation, we need to use some tools and library of OpenNMT-tf to prepocess the source sentence and build the encoder.
643+
Because the encoder of FasterTransformer is based on BERT, it cannot be restore the pretrained model. So, it requires to use the encoder of OpenNMT-tf.
644+
645+
1. Prepare the pretrained model and the data for translation.
646+
647+
```bash
648+
bash utils/translation/download_model_data.sh
649+
```
650+
651+
`download_model_data.sh` will prepare the `opennmt` folder, which contains the input embedding and the encoder, download the pretrained model, and download the test data into the `translation` folder. This is because the encoder of FasterTransformer is based on BERT, but not OpenNMT-tf, so we cannot restore the pretrained model of OpenNMT-tf for encoder. Therefore, translation requires the encoder of OpenNMT-tf.
652+
653+
Another problem is that the implementation of our tf_decoding and OpenNMT-tf decoding is a little different. For example, OpenNMT-tf uses one gemm to compute query, key and values in one time; but tf_decoding splits them into three gemms. So, the tool `utils/dump_model.py` will convert the pretrained model to fit the model structure of decoder of FasterTransformer.
654+
655+
```bash
656+
./bin/decoding_gemm 1 4 8 64 32001 100 512 0
657+
python translate_sample.py
658+
```
659+
660+
The outputs should be similar to the following:
661+
662+
```bash
663+
[INFO] opennmt: ▁28 - jährige r ▁Chef koch ▁to t ▁in ▁San ▁Francisco </s>
664+
[INFO] tf : ▁28 - jährige r ▁Chef koch ▁to t ▁in ▁San ▁Francisco </s>
665+
[INFO] op : ▁28 - jährige r ▁Chef koch ▁to t ▁in ▁San ▁Francisco </s>
666+
```
667+
638668
## Performance
639669

640670
Hardware settings:
@@ -752,6 +782,16 @@ bash scripts/profile_decoding_op_performance.sh
752782

753783
### Changelog
754784

785+
March 2020
786+
- Add feature in FasterTransformer 2.0
787+
- Fix the bug of maximum sequence length of decoder cannot be larger than 128.
788+
- Add `translate_sample.py` to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
789+
- Fix the bug that decoding does not check finish or not after each step.
790+
- Fix the bug of decoder about max_seq_len.
791+
- Modify the decoding model structure to fit the OpenNMT-tf decoding model.
792+
- Add a layer normalization layer after decoder.
793+
- Add a normalization for inputs of decoder
794+
755795
Febuary 2020
756796
- Release the FasterTransformer 2.0
757797
- Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow op.
@@ -764,10 +804,8 @@ July 2019
764804

765805
### Known issues
766806

767-
- sequence length of Decoder and Decoding should be smaller than 128.
768-
- batch_size should be smaller than 1024 in Decoder.
769-
- batch_size x beam_width should be smaller than 1024 in Decoding.
770-
- Results of TensorFlow and OP would be different in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
807+
- batch_size should be smaller or equal to 1024 in Decoder.
808+
- batch_size x beam_width should be smaller or equal to 1024 in Decoding.
809+
- Results of TensorFlow and OP would be different in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
771810
- Cmake 15 or Cmake 16 fail to build this project. Cmake 14 is no problem.
772-
- Max sequence length of encoder and decoder should be the same.
773-
811+
- Max sequence length of encoder and decoder should be the same.

FasterTransformer/v2/fastertransformer/allocator.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -104,7 +104,7 @@ class Allocator<AllocatorType::TF> : public IAllocator
104104

105105
auto flat = buf.flat<uint8>();
106106
void *ptr = (void *)flat.data();
107-
cudaMemset(ptr, 0, size);
107+
cudaMemset(ptr, 0, buf_size);
108108
return ptr;
109109
}
110110

FasterTransformer/v2/fastertransformer/beamsearch_opennmt.h

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,9 @@ void BeamSearch_OpenNMT(
4040
int *output_ids,
4141
const int batch_size, const int beam_width,
4242
const int vocab_size, const int hidden_dim, const int step,
43-
const int cache_size, const int decoder_layers, cudaStream_t stream)
43+
const int cache_size, const int decoder_layers, cudaStream_t stream,
44+
const int end_id,
45+
int *finished_count)
4446
{
4547
#ifdef NDEBUG
4648
/* adding cum_log_probs to log_probs */
@@ -75,11 +77,15 @@ void BeamSearch_OpenNMT(
7577
#endif
7678

7779
#ifdef NDEBUG
78-
update(log_probs, cum_log_probs, ids, finished, parent_ids, sequence_length, word_ids, output_ids,
79-
batch_size, beam_width, vocab_size, stream);
80+
update(log_probs, cum_log_probs, ids, finished,
81+
parent_ids, sequence_length, word_ids, output_ids,
82+
batch_size, beam_width, vocab_size, stream,
83+
end_id, finished_count);
8084
#else
81-
update(log_probs, cum_log_probs, ids, finished, parent_ids, sequence_length, word_ids, output_ids,
82-
batch_size, beam_width, vocab_size, stream);
85+
update(log_probs, cum_log_probs, ids, finished,
86+
parent_ids, sequence_length, word_ids, output_ids,
87+
batch_size, beam_width, vocab_size, stream,
88+
end_id, finished_count);
8389
cudaDeviceSynchronize();
8490
check_cuda_error(cudaGetLastError());
8591

@@ -89,13 +95,17 @@ void BeamSearch_OpenNMT(
8995
Note that update_kernel_check contains update and uses do not need to call it again.
9096
*/
9197
// update_kernel_check(log_probs, cum_log_probs, ids, finished, parent_ids, sequence_length, word_ids, output_ids,
92-
// batch_size, beam_width, vocab_size, stream);
98+
// batch_size, beam_width, vocab_size, stream, end_id, finished_count);
9399
#endif
94100

95101
#ifdef NDEBUG
96-
update_KV_cache<T>(key_cache, value_cache, parent_ids, batch_size, beam_width, hidden_dim, step, cache_size, decoder_layers, stream);
102+
update_KV_cache<T>(key_cache, value_cache, parent_ids, batch_size,
103+
beam_width, hidden_dim, step, cache_size,
104+
decoder_layers, stream);
97105
#else
98-
update_KV_cache<T>(key_cache, value_cache, parent_ids, batch_size, beam_width, hidden_dim, step, cache_size, decoder_layers, stream);
106+
update_KV_cache<T>(key_cache, value_cache, parent_ids, batch_size,
107+
beam_width, hidden_dim, step, cache_size,
108+
decoder_layers, stream);
99109
cudaDeviceSynchronize();
100110
check_cuda_error(cudaGetLastError());
101111

FasterTransformer/v2/fastertransformer/cuda/cuda_kernels.cu

Lines changed: 33 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -69,13 +69,12 @@ T blockReduceSum(T val)
6969
if(lane == 0)
7070
shared[wid] = val;
7171
__syncthreads();
72-
72+
7373
val = (threadIdx.x < (blockDim.x >> 5 )) ? shared[lane] : (T)0.0f;
7474
val = warpReduceSum(val);
7575
return val;
7676
}
7777

78-
7978
template <typename T>
8079
__inline__ __device__
8180
T warpReduceMax(T val)
@@ -386,12 +385,15 @@ void topK(const float* log_probs, int* ids, const int batch_size, const int beam
386385

387386
template <typename T>
388387
__global__
389-
void update_kernel(T* log_probs, T* cum_log_probs, int* ids, bool* finished, int* parent_ids, int* sequence_length,
390-
int* word_ids, int* output_ids,
391-
const int batch_size, const int beam_width, const int vocab_size)
388+
void update_kernel(T* log_probs, T* cum_log_probs,
389+
int* ids, bool* finished,
390+
int* parent_ids, int* sequence_length,
391+
int* word_ids, int* output_ids,
392+
const int batch_size, const int beam_width,
393+
const int vocab_size, const int end_id,
394+
int* finished_count)
392395
{
393396
int tid = threadIdx.x;
394-
395397
sequence_length[tid] = finished[tid] ? sequence_length[tid] : sequence_length[tid] + 1;
396398

397399
int beam_id = ids[tid];
@@ -401,10 +403,14 @@ void update_kernel(T* log_probs, T* cum_log_probs, int* ids, bool* finished, int
401403

402404
cum_log_probs[tid] = log_probs[ids[tid]];
403405
sequence_length[tid] = sequence_length[beam_id];
404-
finished[tid] = finished[beam_id];
406+
finished[tid] = word_id == end_id ? 1 : 0;
405407
parent_ids[tid] = beam_id;
406408
word_ids[tid] = word_id;
407409
output_ids[tid] = word_id;
410+
411+
// TODO use reduce sum to compute how many sentence are finished
412+
// int fi = finished[tid]
413+
// int total_finish = reduceSum(fi);
408414
}
409415

410416
template <typename T>
@@ -415,19 +421,25 @@ __global__ void embedding_lookup_kernel(const T* embedding_table, const int* wor
415421
from_tensor[write_pos] = embedding_table[word_ids[blockIdx.x] * hidden_units + threadIdx.x];
416422
}
417423

418-
void update(float* log_probs, float* cum_log_probs, int* ids, bool* finished, int* parent_ids, int* sequence_length,
419-
int* word_ids, int* output_ids,
420-
const int batch_size, const int beam_width, const int vocab_size, cudaStream_t stream)
424+
void update(float* log_probs, float* cum_log_probs,
425+
int* ids, bool* finished,
426+
int* parent_ids, int* sequence_length,
427+
int* word_ids, int* output_ids,
428+
const int batch_size, const int beam_width,
429+
const int vocab_size, cudaStream_t stream,
430+
const int end_id, int* finished_count)
421431
{
422432

423433
dim3 grid(1);
424434
dim3 block(batch_size * beam_width);
425435

426436
assert(block.x <= 1024);
427437

428-
update_kernel<float><<<grid, block, 0, stream>>>(log_probs, cum_log_probs, ids, finished, parent_ids, sequence_length,
429-
word_ids, output_ids,
430-
batch_size, beam_width, vocab_size);
438+
update_kernel<float><<<grid, block, 0, stream>>>(log_probs, cum_log_probs, ids,
439+
finished, parent_ids, sequence_length,
440+
word_ids, output_ids, batch_size,
441+
beam_width, vocab_size, end_id,
442+
finished_count);
431443
}
432444

433445
template <typename T>
@@ -565,14 +577,17 @@ __global__
565577
void sine_position_encoder_kernel(T* output, int step, int n){
566578
int tid = threadIdx.x;
567579
int bid = blockIdx.x;
568-
int half_n = n / 2;
580+
float half_n = (float)n / 2.;
581+
582+
// input = input * hidden_dim**0.5
583+
output[bid * n + tid] = output[bid * n + tid] * (T)sqrtf(float(n));
569584

570-
float log_timescale_increment = __logf(10000) / (( half_n - 1) * 1.f);
571-
float inv_timescales = __expf( (tid % half_n) * -1 * log_timescale_increment );
585+
float log_timescale_increment = __logf(10000) / (half_n - 1.f);
586+
float inv_timescales = __expf( (tid % (int)half_n) * -1 * log_timescale_increment );
572587
float scaled_time = inv_timescales * step;
573588

574589
T encoding_val = (tid < half_n) ? (T) __sinf(scaled_time) : (T) __cosf(scaled_time);
575-
output[bid * n + tid] = output[bid * n + tid] + encoding_val;
590+
output[bid * n + tid] = output[bid * n + tid] + encoding_val;
576591
}
577592

578593
template<typename T>
@@ -584,7 +599,7 @@ void sine_position_encoder(
584599
dim3 grid(m);
585600
dim3 block(n);
586601
assert(n <= 1024);
587-
sine_position_encoder_kernel<T><<<grid, block, 0, stream>>>(output, step + 1, n);
602+
sine_position_encoder_kernel<T><<<grid, block, 0, stream>>>(output, step, n);
588603
}
589604

590605
template void add_bias_act_kernelLauncher<float>(

0 commit comments

Comments
 (0)