Commit 0498875

[BERT/PyT][BERT/TF] Switch back to the original server for data download
* update - wiki download
1 parent 2b0daf3 commit 0498875

4 files changed

Lines changed: 6 additions & 6 deletions

PyTorch/LanguageModeling/BERT/README.md

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `/workspace/bert/data/create_datasets_from_start.sh wiki_books`
 
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download is complete. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 
 6. Start pretraining.
 
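The failure mode described in the note above (a truncated `wikicorpus_en.xml.bz2` being mistaken for a finished download) can be guarded against before extraction. As a minimal sketch, not part of the repository's scripts, a helper that streams through the `.bz2` archive to verify it decompresses cleanly might look like:

```python
import bz2


def is_complete_bz2(path, chunk_size=1 << 20):
    """Return True if the .bz2 file at `path` decompresses to the end.

    A partially downloaded dump raises EOFError (or OSError for corrupt
    data) before the end-of-stream marker is reached.
    """
    try:
        with bz2.open(path, "rb") as f:
            # Read and discard decompressed chunks until EOF.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        return False
```

If this check fails, remove the file and re-run the download script, as the note advises. Streaming through the full dump takes a while, but it is far cheaper than a failed extraction partway through preprocessing.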

PyTorch/LanguageModeling/BERT/data/WikiDownloader.py

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ def __init__(self, language, save_path):
         self.language = language
         # Use a mirror from https://dumps.wikimedia.org/mirrors.html if the below links do not work
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
 
         self.output_files = {
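The comment kept in the hunk above points users at the Wikimedia mirror list when the canonical host is unavailable. As an illustrative sketch (this helper is not in the repository), switching a dump URL to a mirror only requires rewriting the host portion of the URL:

```python
from urllib.parse import urlparse, urlunparse


def with_mirror(url, mirror_host):
    """Rewrite a Wikimedia dump URL to point at a mirror, keeping the path."""
    # urlparse returns a named tuple; _replace swaps out the netloc field.
    return urlunparse(urlparse(url)._replace(netloc=mirror_host))
```

For example, `with_mirror(download_urls['en'], 'dumps.wikimedia.your.org')` reproduces the `your.org` mirror URL that this commit switches away from.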

TensorFlow/LanguageModeling/BERT/README.md

Lines changed: 1 addition & 1 deletion
@@ -269,7 +269,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `bash scripts/data_download.sh wiki_books`
 
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download is complete. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 
 4. Download the pretrained models from NGC.
 

TensorFlow/LanguageModeling/BERT/data/WikiDownloader.py

Lines changed: 2 additions & 2 deletions
@@ -26,8 +26,8 @@ def __init__(self, language, save_path):
 
         self.language = language
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
 
         self.output_files = {
