Commit 3d59216

sharathts authored and szmigacz committed
[BERT] [PyTorch] Data prep fix (NVIDIA#171)
* add dgx1-16g and dgx2 specific pretraining instructions
* fix typo in readme
* fix data prep and reflect changes in pretraining
* remove .ide files
* remove data files
* Point to right SQUAD location
* remove garbage [[]]
* default accumulation in fp32
* remove ide files
* fix phase2 DATADIR path
* remove readme in data folder
1 parent b6fb9aa commit 3d59216

30 files changed

Lines changed: 50 additions & 78 deletions

PyTorch/LanguageModeling/BERT/.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ data/sharded/
 data/hdf5/
 vocab/
 results/
+checkpoints/*

PyTorch/LanguageModeling/BERT/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ __pycache__/
 #Data
 data/*/*/
 data/*/*.zip
+data/*

 # Distribution / packaging
 .Python

PyTorch/LanguageModeling/BERT/Dockerfile

100644 → 100755
File mode changed.

PyTorch/LanguageModeling/BERT/LICENSE

100644 → 100755
File mode changed.

PyTorch/LanguageModeling/BERT/NOTICE

100644 → 100755
File mode changed.

PyTorch/LanguageModeling/BERT/README.md

100644 → 100755
Lines changed: 2 additions & 6 deletions
@@ -199,19 +199,15 @@ If you want to use a pretrained checkpoint, visit [NGC](https://ngc.nvidia.com/c

 4. Start an interactive session in the NGC container to run training/inference.

-`bash scripts/docker/launch.sh <DATA_DIR> <VOCAB_DIR> <CHECKPOINT_DIR> <RESULTS_DIR>`
-
-`<DATA_DIR>` - Path to `data` folder in the cloned repository. This directory contains scripts needed to download datasets and where the data will be downloaded.
-
-`<VOCAB_DIR>` - Path to `vocab` folder in the cloned repository. This is the vocabulary with which BERT checkpoint is pretrained.
+`bash scripts/docker/launch.sh <CHECKPOINT_DIR> <RESULTS_DIR>`

 `<CHECKPOINT_DIR>` - Path to folder containing the downloaded pretrained checkpoint from step 2 for fine-tuning.

 `<RESULTS_DIR>` - Path to folder where logs and checkpoints will be saved.

 The above paths present on the local machine get mounted to predefined locations in the container.

-`data` and `vocab` are a part of `.dockerignore` in order to provide the user the ability to mount datasets of choice and not necessarily the ones downloaded by the script below. In this case, `<DATA_DIR>` points to users corpus. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
+`data` and `vocab.txt` are downloaded in `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.

 5. Download and preprocess the dataset.

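The README hunk above reduces the `launch.sh` call from four arguments to two. As a minimal sketch of the new invocation, the snippet below assembles the command from two host paths. The concrete paths (`checkpoints` and `results` under the current directory) are illustrative assumptions, not values taken from the commit, and the snippet only prints the command rather than starting a container.

```shell
# Hypothetical host paths; substitute your own locations.
CHECKPOINT_DIR="${PWD}/checkpoints"   # pretrained checkpoint downloaded in step 2
RESULTS_DIR="${PWD}/results"          # logs and new checkpoints are saved here

# Create the directories if they do not exist yet.
mkdir -p "${CHECKPOINT_DIR}" "${RESULTS_DIR}"

# New two-argument form: <DATA_DIR> and <VOCAB_DIR> are no longer passed.
LAUNCH_CMD="bash scripts/docker/launch.sh ${CHECKPOINT_DIR} ${RESULTS_DIR}"
echo "${LAUNCH_CMD}"

# To actually start the container, run from the BERT directory of the cloned repository:
# bash scripts/docker/launch.sh "${CHECKPOINT_DIR}" "${RESULTS_DIR}"
```

Building the command as a plain string keeps the sketch portable across POSIX shells, but it assumes the paths contain no spaces; the commented-out direct call quotes each argument and is the safer form in practice.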

PyTorch/LanguageModeling/BERT/bert_config.json

100644 → 100755
File mode changed.

PyTorch/LanguageModeling/BERT/bind_pyt.py

100644 → 100755
File mode changed.

PyTorch/LanguageModeling/BERT/checkpoints/.gitkeep

Whitespace-only changes.

PyTorch/LanguageModeling/BERT/create_pretraining_data.py

100644 → 100755
File mode changed.
