- Wikipedia - `data/wikipedia_corpus/final_tfrecords_sharded`
@@ -194,8 +194,8 @@ Aside from options to set hyperparameters, the relevant options to control the b
194
194
--[no]amp: Whether to enable AMP ops.(default: 'false')
195
195
--[no]amp_fastmath: Whether to enable AMP fastmath ops.(default: 'false')
196
196
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
197
-
--[no]do_eval: Whether to run eval on the dev set.(default: 'false')
198
-
--[no]do_train: Whether to run training.(default: 'false')
197
+
--[no]do_eval: Whether to run evaluation on the dev set.(default: 'false')
198
+
--[no]do_train: Whether to run training.(default: 'false')
199
199
--eval_batch_size: Total batch size for eval.(default: '8')(an integer)
200
200
--[no]fastmath: Whether to enable loss scaler for fastmath ops.(default: 'false')
201
201
--[no]horovod: Whether to use Horovod for multi-gpu runs(default: 'false')
@@ -207,7 +207,7 @@ Aside from options to set hyperparameters, the relevant options to control the b
207
207
Aside from options to set hyperparameters, some relevant options to control the behaviour of the run_squad.py script are:
208
208
```bash
209
209
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
210
-
--[no]do_predict: Whether to run eval on the dev set. (default: 'false')
210
+
--[no]do_predict: Whether to run evaluation on the dev set. (default: 'false')
211
211
--[no]do_train: Whether to run training. (default: 'false')
212
212
--learning_rate: The initial learning rate for Adam.(default: '5e-06')(a number)
213
213
--max_answer_length: The maximum length of an answer that can be generated. This is needed because the start and end predictions are not conditioned on one another.(default: '30')(an integer)
@@ -234,15 +234,13 @@ Pre-training is performed using the `run_pretraining.py` script along with param
234
234
235
235
236
236
The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch using the Wikipedia and Book corpus datasets as training data. By default, the training script:
237
-
- Assumes training batch size of 14
238
-
- Assumes evaluation batch size of 8
239
-
- Assumes learning rate of 1e-4
240
-
- Assumes precision of fp16_xla (fp16 math JIT compiled with XLA)
241
-
- Assumes you want to run on 8 GPUs
242
-
- Assumes 10,000 warmup steps
243
-
- Assumes 1144000 training steps
244
-
- Assumes checkpoints should be saved every 5000 steps
245
-
- Assumes you do want to create a log file for all the output
237
+
- Runs on 8 GPUs with training batch size of 14 and evaluation batch size of 8 per GPU.
238
+
- Has FP16 precision enabled.
239
+
- Is XLA enabled.
240
+
- Runs for 1144000 steps with 10000 warm-up steps.
241
+
- Saves a checkpoint every 5000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
242
+
- Creates the log file containing all the output.
243
+
- Evaluates the model at the end of training. To skip evaluation, modify `--do_eval` to `False`.
246
244
247
245
These parameters train BERT on Wikipedia + BooksCorpus to reasonable accuracy on a DGX-1 with 32GB V100 cards. To match Google's best results from the BERT paper, either train for twice as many steps (2,288,000 steps) on a DGX-1, or train on 16 GPUs on a DGX-2. With 16 GPUs, the DGX-2 fits a global batch size twice as large as the DGX-1 (224 vs. 112), so it can finish in half as many steps.
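The batch-size arithmetic in the paragraph above can be checked with a short sketch. It only uses numbers quoted in this README (per-GPU batch size 14, 8 or 16 GPUs, 1,144,000 steps); the function and variable names are illustrative and are not part of `run_pretraining.sh`. It assumes the goal is to process the same total number of training samples on both systems.

```python
# Values quoted in the README; all names here are illustrative only.
PER_GPU_BATCH = 14
DGX1_GPUS = 8
DGX2_GPUS = 16
DEFAULT_STEPS = 1_144_000


def global_batch(num_gpus: int, per_gpu_batch: int = PER_GPU_BATCH) -> int:
    """Global batch size = number of GPUs x per-GPU batch size."""
    return num_gpus * per_gpu_batch


def steps_for_same_samples(ref_steps: int, ref_gpus: int, new_gpus: int) -> int:
    """Steps needed on new_gpus to see as many samples as ref_steps on ref_gpus."""
    total_samples = ref_steps * global_batch(ref_gpus)
    return total_samples // global_batch(new_gpus)


dgx1_batch = global_batch(DGX1_GPUS)
dgx2_batch = global_batch(DGX2_GPUS)
# Assumption: 1,144,000 steps on 16 GPUs is the reference run.
dgx1_steps = steps_for_same_samples(DEFAULT_STEPS, DGX2_GPUS, DGX1_GPUS)

print(f"DGX-1 global batch: {dgx1_batch}")
print(f"DGX-2 global batch: {dgx2_batch}")
print(f"DGX-1 steps to match DGX-2 at {DEFAULT_STEPS} steps: {dgx1_steps}")
```

Doubling the GPU count at a fixed per-GPU batch size doubles the global batch, which is why the step count halves.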
- <training_batch_size> Batch size varies with <precision>, larger batch sizes run more efficiently, but require more memory.
254
+
- <training_batch_size> is the per-gpu batch size used for training. Batch size varies with <precision>; larger batch sizes run more efficiently, but require more memory.
257
255
258
256
- <eval_batch_size> per-gpu batch size used for evaluation after training.
- <learning_rate> Default rate of 1e-4 is good for global batch size 256.
259
257
@@ -297,16 +295,16 @@ Trains BERT-large from scratch on a single DGX-2 using FP16 arithmetic. This wil
297
295
Fine tuning is performed using the `run_squad.py` script along with parameters defined in `scripts/run_squad.sh`.
298
296
299
297
The `run_squad.sh` script trains a model and performs evaluation on the SQuAD v1.1 dataset. By default, the training script:
300
-
- Uses 8 GPUs and batch size of 10 on each GPU
301
-
- Has FP16 precision enabled
302
-
- Is XLA enabled
303
-
- Runs for 2 epochs
298
+
- Uses 8 GPUs and batch size of 10 on each GPU.
299
+
- Has FP16 precision enabled.
300
+
- Is XLA enabled.
301
+
- Runs for 2 epochs.
304
302
- Saves a checkpoint every 1000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
305
-
- Evaluation is done at the end of training. To skip eval, modify `--do_predict` to `False`.
303
+
- Evaluation is done at the end of training. To skip evaluation, modify `--do_predict` to `False`.
306
304
307
305
By default, this script outputs checkpoints to the `/results` directory inside the container. The mount point of `/results` can be changed in the `scripts/docker/launch.sh` file. The training log contains information about:
308
-
- Loss for final step
309
-
-Train and eval performance
306
+
- Loss for the final step
307
+
- Training and evaluation performance
310
308
- F1 and exact match score on the Dev Set of SQuAD after evaluation.
311
309
312
310
The summary after training is printed in the following format:
@@ -347,12 +345,12 @@ Inference on a fine tuned Question Answering system is performed using the `run_
347
345
The `run_squad_inference.sh` script runs inference with a fine-tuned model and performs evaluation on the SQuAD v1.1 dataset. By default, the inferencing script:
348
346
- Has FP16 precision enabled
349
347
- Is XLA enabled
350
-
-Does eval on latest checkpoint present in `/results` with a batch size of 8
348
+
- Evaluates the latest checkpoint present in `/results` with a batch size of 8
351
349
352
350
This script outputs a predictions file to `/results/predictions.json` and computes F1 score and exact match score using SQuAD's `evaluate-v1.1.py`. The mount point of `/results` can be changed in the `scripts/docker/launch.sh` file.
353
351
354
352
The output log contains information about:
355
-
-Eval performance
353
+
- Evaluation performance
356
354
- F1 and exact match score on the Dev Set of SQuAD after evaluation.
357
355
358
356
The summary after inference is printed in the following format:
@@ -412,14 +410,14 @@ Our results were obtained by running batch sizes up to 3x GPUs on a 16GB V100 an
412
410
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
413
411
414
412
415
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
413
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
420
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
423
421
|:---:|:---:|:-----:|:-----:|:---:|:---:|:----:|
424
422
| 1 | 3 | - |14.86| - | - |1.0 |
425
423
| 4 | 3 | - |44.17| - | - |2.97|
@@ -433,14 +431,14 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
433
431
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
434
432
435
433
436
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
434
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
437
435
|---|---|-----|-----|----|----|----|
438
436
| 1 | 4 | 8.55|18.14|2.12|1.0 |1.0 |
439
437
| 4 | 4 |32.13|52.85|1.64|3.76|2.91|
440
438
| 8 | 4 |62.83|95.28|1.51|7.35|5.25|
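The derived columns in the table above appear to follow from the two throughput columns. The sketch below recomputes them under assumptions that the README does not state explicitly: the mixed-precision speed-up is taken as FP16 throughput divided by FP32 throughput, and weak scaling as throughput on N GPUs divided by single-GPU throughput at the same precision. The function and variable names are illustrative.

```python
# Throughput in sentences/sec, copied from the DGX-1 32G table (batch size 4 per GPU).
# Mapping: number of GPUs -> (FP32 throughput, FP16 throughput)
rows = {
    1: (8.55, 18.14),
    4: (32.13, 52.85),
    8: (62.83, 95.28),
}


def derived_metrics(throughput):
    """Recompute speed-up and weak-scaling factors from raw throughput."""
    base_fp32, base_fp16 = throughput[1]  # single-GPU baseline
    result = {}
    for gpus, (fp32, fp16) in throughput.items():
        result[gpus] = {
            "speedup": fp16 / fp32,            # assumed definition: FP16 / FP32
            "scaling_fp32": fp32 / base_fp32,  # assumed definition: N-GPU / 1-GPU
            "scaling_fp16": fp16 / base_fp16,
        }
    return result


metrics = derived_metrics(rows)
for gpus, m in metrics.items():
    print(
        f"{gpus} GPU(s): speed-up {m['speedup']:.2f}, "
        f"FP32 scaling {m['scaling_fp32']:.2f}, "
        f"FP16 scaling {m['scaling_fp16']:.2f}"
    )
```

Under these definitions the recomputed values agree with the table to within 0.01, with small differences attributable to rounding of the published throughput numbers.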
441
439
442
440
443
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
441
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
444
442
|---|---|-----|-------|---|---|----|
445
443
| 1 | 10| - | 27.69 | - | - |1.0 |
446
444
| 4 | 10| - | 85.193| - | - |3.07|
@@ -455,15 +453,15 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
455
453
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance numbers (in sentences per second) were averaged over an entire training epoch.
456
454
457
455
458
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
456
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
459
457
|---|---|------|------|----|-----|----|
460
458
| 1| 4 | 8.80| 17.43|1.98| 1.0 |1.0 |
461
459
| 4| 4 | 33.22| 56.87|1.71| 3.78|3.26|
462
460
| 8| 4 | 64.46|100.58|1.56| 7.33|5.77|
463
461
| 16| 4 |117.83|162.29|1.38|13.39|9.31|
464
462
465
463
466
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
464
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speed-up with mixed precision**|**Multi-gpu weak scaling with FP32**|**Multi-gpu weak scaling with FP16**|
467
465
|---|---|---|------|---|---|----|
468
466
| 1| 10| - | 28.72| - | - |1.0 |
469
467
| 4| 10| - | 92.73| - | - |3.22|
@@ -479,7 +477,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
479
477
#### NVIDIA DGX-1 16G (1x V100 16G)
480
478
Our results were obtained by running the `scripts/run_squad_inference.sh` inferencing script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU. Performance numbers (in sentences per second) were averaged over an entire training epoch.
481
479
482
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|
480
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|
483
481
|---|---|-----|------|----|
484
482
| 1 | 8 |41.04|112.55|2.74|
485
483
@@ -489,7 +487,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
489
487
#### NVIDIA DGX-1 32G (1x V100 32G)
490
488
Our results were obtained by running the `scripts/run_squad_inference.sh` inferencing script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPU. Performance numbers (in sentences per second) were averaged over an entire training epoch.
491
489
492
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|
490
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|
493
491
|---|---|-----|------|----|
494
492
| 1 | 8 |36.78|118.54|3.22|
495
493
@@ -498,7 +496,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
498
496
#### NVIDIA DGX-2 32G (1x V100 32G)
499
497
Our results were obtained by running the `scripts/run_squad_inference.sh` inferencing script in the TensorFlow 19.03-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPU. Performance numbers (in sentences per second) were averaged over an entire training epoch.
500
498
501
-
|**Number of GPUs**|**Batch size per GPU**|**FP 32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|
499
+
|**Number of GPUs**|**Batch size per GPU**|**FP32 sentences/sec**|**FP16 sentences/sec**|**Speedup**|