Support nvidia bert dataset by tjruwase · Pull Request #27 · deepspeedai/DeepSpeedExamples

tjruwase · 2020-07-21T23:08:58Z

Enable bing_bert training on Nvidia dataset

jeffra

This is super exciting, thank you Tunji!! :)

jeffra · 2020-07-23T22:30:07Z

Oh, can we actually do one more thing? Can you add a small snippet of text on how to use this, pointing to the NVIDIA instructions for getting the data with their scripts? You can add it here: https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/README.md

tjruwase · 2020-07-24T05:41:33Z

Done.

piyushghai · 2020-07-24T16:54:26Z

+    "bert_token_file": "bert-large-uncased",
+    "bert_model_file": "bert-large-uncased",
+    "bert_model_config": {
+        "vocab_size_or_config_json_file": 119547,


Curious how this vocab_size is 119k with NVIDIA's dataset, whereas in NVIDIA's own config it is 30k.
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/bert_config.json#L12

tjruwase · 2020-07-25

This is a good question. In reality, vocab_size_or_config_json_file is not used for computing the vocabulary size. Instead, the vocabulary size is computed from the tokenizer, as seen here. For confirmation, you can examine the log for the output of this print statement, which should show the following:

worker-0: VOCAB SIZE: 30528
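The gap between the configured 119547 and the logged 30528 can be illustrated with a small sketch. It rests on two assumptions that this thread does not confirm: that the size is taken from the number of entries in the tokenizer vocabulary (30522 for bert-large-uncased), and that the result is rounded up to a multiple of 8. The helper name `padded_vocab_size` is hypothetical and is not taken from the repository.

```python
def padded_vocab_size(tokenizer_vocab_len: int, multiple: int = 8) -> int:
    """Round a vocabulary length up to the next multiple (assumed here to be 8)."""
    remainder = tokenizer_vocab_len % multiple
    if remainder == 0:
        return tokenizer_vocab_len
    return tokenizer_vocab_len + (multiple - remainder)


# bert-large-uncased ships a 30522-entry WordPiece vocabulary.
BERT_LARGE_UNCASED_VOCAB_LEN = 30522

# Value from the JSON config; deliberately unused in this sketch,
# mirroring the explanation that it does not drive the vocabulary size.
configured_value = 119547

vocab_size = padded_vocab_size(BERT_LARGE_UNCASED_VOCAB_LEN)
print("VOCAB SIZE:", vocab_size)
```

Under these assumptions the printed value matches the log line above, regardless of what vocab_size_or_config_json_file is set to.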

Support nvidia bert dataset

6a845f1

tjruwase requested review from RezaYazdaniAminabadi, ShadenSmith, eltonzheng, jeffra and minjiaz July 21, 2020 23:08

tjruwase added 2 commits July 21, 2020 23:11

Format fixes

d051873

E2E run of Nvidia Data with SQUAD 90.6 F1

58fed1e

jeffra approved these changes Jul 23, 2020

tjruwase added 3 commits July 23, 2020 22:42

Minor fixes

330cbf8

Update README

edfe9bf

Update README

f8078fe

tjruwase merged commit 47766e0 into master Jul 24, 2020

tjruwase mentioned this pull request Jul 24, 2020

Bing BERT #8

Open

tjruwase linked an issue Jul 24, 2020 that may be closed by this pull request

Bing BERT #8

Open

piyushghai reviewed Jul 24, 2020

conglongli mentioned this pull request Aug 13, 2020

fix bing bert validation issue #30

Merged

tjruwase mentioned this pull request Aug 21, 2020

How to reproduce BERT perf results in deepspeed blog deepspeedai/DeepSpeed#272

Open

hwchen2017 pushed a commit that referenced this pull request Jun 8, 2025

Support nvidia bert dataset (#27)

e039033

* Support nvidia bert dataset
* Format fixes
* E2E run of Nvidia Data with SQUAD 90.6 F1
* Minor fixes
* Update README
* Update README

tjruwase merged 6 commits into master from olruwase/nvidia_bert_dataset