Commit 4b09b34

BERT README update
1 parent 21024d9 commit 4b09b34

1 file changed

Lines changed: 36 additions & 38 deletions

TensorFlow/LanguageModeling/BERT/README.md

@@ -24,13 +24,13 @@ This repository provides a script and recipe to train BERT to achieve state of t
2424
* [Training accuracy results](#training-accuracy-results)
2525
* [Training stability test](#training-stability-test)
2626
* [Training performance results](#training-performance-results)
27-
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
28-
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
29-
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-1-16x-v100-32g)
27+
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
28+
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
29+
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
3030
* [Inference performance results](#inference-performance-results)
31-
* [NVIDIA DGX-1 16G (1x V100 16G)](#nvidia-dgx-1-16g-1x-v100-16g)
32-
* [NVIDIA DGX-1 32G (1x V100 32G)](#nvidia-dgx-1-32g-1x-v100-32g)
33-
* [NVIDIA DGX-2 32G (1x V100 32G)](#nvidia-dgx-1-32g-1x-v100-32g)
31+
* [NVIDIA DGX-1 16G (1x V100 16G)](#nvidia-dgx-1-16g-1x-v100-16g)
32+
* [NVIDIA DGX-1 32G (1x V100 32G)](#nvidia-dgx-1-32g-1x-v100-32g)
33+
* [NVIDIA DGX-2 32G (1x V100 32G)](#nvidia-dgx-2-32g-1x-v100-32g)
3434
* [Changelog](#changelog)
3535
* [Known issues](#known-issues)
3636

@@ -120,7 +120,7 @@ After you build the container image and download the data, you can start an inte
120120
bash scripts/docker/launch.sh
121121
```
122122

123-
The `interactive.sh` script assumes that the datasets are in the following locations by default after downloading data.
123+
The `launch.sh` script assumes that the datasets are in the following locations by default after downloading data.
124124
- SQuaD v1.1 - `data/squad/v1.1`
125125
- BERT - `data/pretrained_models_google/uncased_L-24_H-1024_A-16`
126126
- Wikipedia - `data/wikipedia_corpus/final_tfrecords_sharded`
@@ -194,8 +194,8 @@ Aside from options to set hyperparameters, the relevant options to control the b
194194
--[no]amp: Whether to enable AMP ops.(default: 'false')
195195
--[no]amp_fastmath: Whether to enable AMP fasthmath ops.(default: 'false')
196196
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
197-
--[no]do_eval: Whether to run eval on the dev set.(default: 'false')
198-
--[no]do_train: Whether to run training.(default: 'false')
197+
--[no]do_eval: Whether to run evaluation on the dev set.(default: 'false')
198+
--[no]do_train: Whether to run training.(default: 'false')
199199
--eval_batch_size: Total batch size for eval.(default: '8')(an integer)
200200
--[no]fastmath: Whether to enable loss scaler for fasthmath ops.(default: 'false')
201201
--[no]horovod: Whether to use Horovod for multi-gpu runs(default: 'false')
@@ -207,7 +207,7 @@ Aside from options to set hyperparameters, the relevant options to control the b
207207
Aside from options to set hyperparameters, some relevant options to control the behaviour of the run_squad.py script are:
208208
```bash
209209
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
210-
--[no]do_predict: Whether to run eval on the dev set. (default: 'false')
210+
--[no]do_predict: Whether to run evaluation on the dev set. (default: 'false')
211211
--[no]do_train: Whether to run training. (default: 'false')
212212
--learning_rate: The initial learning rate for Adam.(default: '5e-06')(a number)
213213
--max_answer_length: The maximum length of an answer that can be generated. This is needed because the start and end predictions are not conditioned on one another.(default: '30')(an integer)
@@ -234,15 +234,13 @@ Pre-training is performed using the `run_pretraining.py` script along with param
234234

235235

236236
The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch using the Wikipedia and Book corpus datasets as training data. By default, the training script:
237-
- Assumes training batch size of 14
238-
- Assumes evaluation batch size of 8
239-
- Assumes learning rate of 1e-4
240-
- Assumes precision of fp16_xla (fp16 math JIT compiled with XLA)
241-
- Assumes you want to run on 8 GPUs
242-
- Assumes 10,000 warmup steps
243-
- Assumes 1144000 training steps
244-
- Assumes checkpoints should be saved every 5000 steps
245-
- Assumes you do want to create a log file for all the output
237+
- Runs on 8 GPUs with training batch size of 14 and evaluation batch size of 8 per GPU.
238+
- Has FP16 precision enabled.
239+
- Is XLA enabled.
240+
- Runs for 1144000 steps with 10000 warm-up steps.
241+
- Saves a checkpoint every 5000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
242+
- Creates the log file containing all the output.
243+
- Evaluates the model at the end of training. To skip evaluation, modify `--do_eval` to `False`.
246244

247245
These parameters will train Wikipedia + BooksCorpus to reasonable accuracy on a DGX1 with 32GB V100 cards. If you want to match google’s best results from the BERT paper, you should either train for twice as many steps (2,288,000 steps) on a DGX1, or train on 16 GPUs on a DGX2. The DGX2 having 16 GPUs will be able to fit a batch size twice as large as a DGX1 (224 vs 112), hence the DGX2 can finish in half as many steps.
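
The batch-size arithmetic in the paragraph above can be checked with a short sketch. It assumes plain data parallelism, where the global batch size is the per-GPU batch size times the number of GPUs. The function names are illustrative and not part of the repository.

```python
# Per-GPU training batch size, taken from the script defaults described above.
PER_GPU_BATCH = 14


def global_batch_size(num_gpus, per_gpu_batch=PER_GPU_BATCH):
    """Global batch size when every GPU processes its own batch each step."""
    return num_gpus * per_gpu_batch


def steps_for_same_samples(reference_steps, reference_gpus, target_gpus):
    """Steps needed on target_gpus to see as many samples as the reference run."""
    total_samples = reference_steps * global_batch_size(reference_gpus)
    return total_samples // global_batch_size(target_gpus)


if __name__ == "__main__":
    print("DGX-1 global batch:", global_batch_size(8))
    print("DGX-2 global batch:", global_batch_size(16))
    # 2,288,000 steps on 8 GPUs cover the same number of samples
    # as this many steps on 16 GPUs.
    print("DGX-2 steps:", steps_for_same_samples(2_288_000, 8, 16))
```

Under these assumptions, doubling the GPU count doubles the global batch size, so the same number of training samples is consumed in half as many steps.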
248246

@@ -253,7 +251,7 @@ run_pretraining.sh <node_type> <training_batch_size> <eval_batch_size> <learning
253251
```
254252

255253
Where:
256-
- <training_batch_size> Batch size varies with <precision>, larger batch sizes run more efficiently, but require more memory.
254+
- <training_batch_size> is per-gpu batch size used for training. Batch size varies with <precision>, larger batch sizes run more efficiently, but require more memory.
257255

258256
- <eval_batch_size> per-gpu batch size used for evaluation after training.<learning_rate> Default rate of 1e-4 is good for global batch size 256.
259257

@@ -297,16 +295,16 @@ Trains BERT-large from scratch on a single DGX-2 using FP16 arithmetic. This wil
297295
Fine tuning is performed using the `run_squad.py` script along with parameters defined in `scripts/run_squad.sh`.
298296

299297
The `run_squad.sh` script trains a model and performs evaluation on the SQuaD v1.1 dataset. By default, the training script:
300-
- Uses 8 GPUs and batch size of 10 on each GPU
301-
- Has FP16 precision enabled
302-
- Is XLA enabled
303-
- Runs for 2 epochs
298+
- Uses 8 GPUs and batch size of 10 on each GPU.
299+
- Has FP16 precision enabled.
300+
- Is XLA enabled.
301+
- Runs for 2 epochs.
304302
- Saves a checkpoint every 1000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
305-
- Evaluation is done at the end of training. To skip eval, modify `--do_predict` to `False`.
303+
- Evaluation is done at the end of training. To skip evaluation, modify `--do_predict` to `False`.
306304

307305
This script outputs checkpoints to the `/results` directory, by default, inside the container. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file. The training log contains information about:
308-
- Loss for final step
309-
- Train and eval performance
306+
- Loss for the final step
307+
- Training and evaluation performance
310308
- F1 and exact match score on the Dev Set of SQuaD after evaluation.
311309

312310
The summary after training is printed in the following format:
@@ -347,12 +345,12 @@ Inference on a fine tuned Question Answering system is performed using the `run_
347345
The `run_squad_inference.sh` script trains a model and performs evaluation on the SQuaD v1.1 dataset. By default, the inferencing script:
348346
- Has FP16 precision enabled
349347
- Is XLA enabled
350-
- Does eval on latest checkpoint present in `/results` with a batch size of 8
348+
- Evaluates the latest checkpoint present in `/results` with a batch size of 8
351349

352350
This script outputs predictions file to `/results/predictions.json` and computes F1 score and exact match score using SQuaD's `evaluate-v1.1.py`. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file.
353351

354352
The output log contains information about:
355-
- Eval performance
353+
- Evaluation performance
356354
- F1 and exact match score on the Dev Set of SQuaD after evaluation.
357355

358356
The summary after inference is printed in the following format:
@@ -412,14 +410,14 @@ Our results were obtained by running batch sizes up to 3x GPUs on a 16GB V100 an
412410
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
413411

414412

415-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
413+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
416414
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
417415
| 1 | 2 | 7.41 |11.86|1.6 |1.0 |1.0 |
418416
| 4 | 2 |23.699|35.34|1.49|3.2 |2.98|
419417
| 8 | 2 |44.29 |64.96|1.47|5.98|5.48|
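
The derived columns in the table above appear to follow directly from the two throughput columns. The sketch below recomputes them under that assumption: speed-up is FP16 throughput divided by FP32 throughput, and weak scaling is throughput divided by the single-GPU throughput at the same precision. The helper name is illustrative.

```python
# (num_gpus, fp32_sentences_per_sec, fp16_sentences_per_sec), copied from the table above.
ROWS = [
    (1, 7.41, 11.86),
    (4, 23.699, 35.34),
    (8, 44.29, 64.96),
]


def derived_columns(rows):
    """Recompute speed-up and weak-scaling factors relative to the first (1-GPU) row."""
    _, base_fp32, base_fp16 = rows[0]
    results = []
    for gpus, fp32, fp16 in rows:
        results.append({
            "gpus": gpus,
            "speedup": round(fp16 / fp32, 2),
            "scaling_fp32": round(fp32 / base_fp32, 2),
            "scaling_fp16": round(fp16 / base_fp16, 2),
        })
    return results


if __name__ == "__main__":
    for row in derived_columns(ROWS):
        print(row)
```

Dividing a weak-scaling factor by the GPU count gives the scaling efficiency, which makes runs on different GPU counts easier to compare.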
420418

421419

422-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
420+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
423421
|:---:|:---:|:-----:|:-----:|:---:|:---:|:----:|
424422
| 1 | 3 | - |14.86| - | - |1.0 |
425423
| 4 | 3 | - |44.17| - | - |2.97|
@@ -433,14 +431,14 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
433431
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epochs.
434432

435433

436-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
434+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
437435
|---|---|-----|-----|----|----|----|
438436
| 1 | 4 | 8.55|18.14|2.12|1.0 |1.0 |
439437
| 4 | 4 |32.13|52.85|1.64|3.76|2.91|
440438
| 8 | 4 |62.83|95.28|1.51|7.35|5.25|
441439

442440

443-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
441+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
444442
|---|---|-----|-------|---|---|----|
445443
| 1 | 10| - | 27.69 | - | - |1.0 |
446444
| 4 | 10| - | 85.193| - | - |3.07|
@@ -455,15 +453,15 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
455453
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
456454

457455

458-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
456+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
459457
|---|---|------|------|----|-----|----|
460458
| 1| 4 | 8.80| 17.43|1.98| 1.0 |1.0 |
461459
| 4| 4 | 33.22| 56.87|1.71| 3.78|3.26|
462460
| 8| 4 | 64.46|100.58|1.56| 7.33|5.77|
463461
| 16| 4 |117.83|162.29|1.38|13.39|9.31|
464462

465463

466-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
464+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with FP32** | **Multi-gpu weak scaling with FP16** |
467465
|---|---|---|------|---|---|----|
468466
| 1| 10| - | 28.72| - | - |1.0 |
469467
| 4| 10| - | 92.73| - | - |3.22|
@@ -479,7 +477,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
479477
#### NVIDIA DGX-1 16G (1x V100 16G)
480478
Our results were obtained by running the `scripts/run_squad_inference.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
481479

482-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
480+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
483481
|---|---|-----|------|----|
484482
| 1 | 8 |41.04|112.55|2.74|
485483

@@ -489,7 +487,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
489487
#### NVIDIA DGX-1 32G (1x V100 32G)
490488
Our results were obtained by running the `scripts/run_squad_inference.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
491489

492-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
490+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
493491
|---|---|-----|------|----|
494492
| 1 | 8 |36.78|118.54|3.22|
495493

@@ -498,7 +496,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
498496
#### NVIDIA DGX-2 32G (1x V100 32G)
499497
Our results were obtained by running the `scripts/run_squad_inference.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
500498

501-
| **Number of GPUs** | **Batch size per GPU** | **FP 32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
499+
| **Number of GPUs** | **Batch size per GPU** | **FP32 sentences/sec** | **FP16 sentences/sec** | **Speedup** |
502500
|---|---|-----|------|----|
503501
| 1 | 8 |33.95|108.45|3.19|
504502
