Commit 0498875

[BERT/PyT][BERT/TF] Switch back to the original server for data download
* update - wiki download
1 parent 2b0daf3 commit 0498875

4 files changed

Lines changed: 6 additions & 6 deletions

PyTorch/LanguageModeling/BERT/README.md

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `/workspace/bert/data/create_datasets_from_start.sh wiki_books`
 
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download is complete. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 
 6. Start pretraining.
 
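The failure mode described in the note above (a truncated `wikicorpus_en.xml.bz2` being mistaken for a finished download) can be guarded against before extraction. As a minimal sketch, not part of the repository's scripts, a helper that streams through the `.bz2` archive to verify it decompresses cleanly might look like:

```python
import bz2


def is_complete_bz2(path, chunk_size=1 << 20):
    """Return True if the .bz2 file at `path` decompresses to the end.

    A partially downloaded dump raises EOFError (or OSError for corrupt
    data) before the end-of-stream marker is reached.
    """
    try:
        with bz2.open(path, "rb") as f:
            # Read and discard decompressed chunks until EOF.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        return False
```

If this check fails, remove the file and re-run the download script, as the note advises. Streaming through the full dump takes a while, but it is far cheaper than a failed extraction partway through preprocessing.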

PyTorch/LanguageModeling/BERT/data/WikiDownloader.py

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ def __init__(self, language, save_path):
         self.language = language
         # Use a mirror from https://dumps.wikimedia.org/mirrors.html if the below links do not work
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
 
         self.output_files = {
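The comment kept in the hunk above points users at the Wikimedia mirror list when the canonical host is unavailable. As an illustrative sketch (this helper is not in the repository), switching a dump URL to a mirror only requires rewriting the host portion of the URL:

```python
from urllib.parse import urlparse, urlunparse


def with_mirror(url, mirror_host):
    """Rewrite a Wikimedia dump URL to point at a mirror, keeping the path."""
    # urlparse returns a named tuple; _replace swaps out the netloc field.
    return urlunparse(urlparse(url)._replace(netloc=mirror_host))
```

For example, `with_mirror(download_urls['en'], 'dumps.wikimedia.your.org')` reproduces the `your.org` mirror URL that this commit switches away from.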

TensorFlow/LanguageModeling/BERT/README.md

Lines changed: 1 addition & 1 deletion
@@ -269,7 +269,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `bash scripts/data_download.sh wiki_books`
 
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download is complete. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 
 4. Download the pretrained models from NGC.
 

TensorFlow/LanguageModeling/BERT/data/WikiDownloader.py

Lines changed: 2 additions & 2 deletions
@@ -26,8 +26,8 @@ def __init__(self, language, save_path):
 
         self.language = language
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
 
         self.output_files = {
