Support nvidia bert dataset #27

Merged
tjruwase merged 6 commits into master from olruwase/nvidia_bert_dataset
Jul 24, 2020

Conversation

@tjruwase

Contributor

Enable bing_bert training on Nvidia dataset

@jeffra jeffra left a comment

Contributor

This is super exciting, thank you Tunji!! :)

@jeffra

jeffra commented Jul 23, 2020

Contributor

Oh can we actually do one more thing? Can you add a small snippet of text on how to use this and pointing to the nvidia instructions to get the data using their scripts? You can add it here: https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/README.md

@tjruwase

Contributor Author

> Oh can we actually do one more thing? Can you add a small snippet of text on how to use this and pointing to the nvidia instructions to get the data using their scripts? You can add it here: https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/README.md

Done.

@tjruwase tjruwase merged commit 47766e0 into master Jul 24, 2020
@tjruwase tjruwase mentioned this pull request Jul 24, 2020
@tjruwase tjruwase linked an issue Jul 24, 2020 that may be closed by this pull request
"bert_token_file": "bert-large-uncased",
"bert_model_file": "bert-large-uncased",
"bert_model_config": {
"vocab_size_or_config_json_file": 119547,

Curious why this vocab_size is 119k with NVIDIA's dataset, whereas in NVIDIA's own config it is 30k:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/bert_config.json#L12

Contributor Author

This is a good question. In reality, vocab_size_or_config_json_file is not used for computing the vocabulary size. Instead, the vocabulary size is computed from the tokenizer, as seen here. To confirm, you can examine the log for the output of this print statement, which shows the following:

worker-0: VOCAB SIZE: 30528
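
For readers following the thread, here is a minimal sketch of how a logged size of 30528 could arise from the tokenizer rather than from the config value. It rests on two assumptions not stated in the comment above: that the `bert-large-uncased` tokenizer has 30522 entries, and that the training code rounds the tokenizer length up to a multiple of 8. The function name `padded_vocab_size` is hypothetical and is not taken from the repository.

```python
def padded_vocab_size(tokenizer_vocab_len: int, multiple: int = 8) -> int:
    """Round a tokenizer vocabulary length up to the next multiple.

    Hypothetical helper: the rounding rule is an assumption that would
    explain the logged value, not a confirmed copy of the training code.
    """
    remainder = tokenizer_vocab_len % multiple
    if remainder == 0:
        return tokenizer_vocab_len
    return tokenizer_vocab_len + (multiple - remainder)


# Value from the config snippet under review; per the reply above it is
# not used for computing the vocabulary size.
config_vocab_size = 119547

# Assumed length of the bert-large-uncased tokenizer vocabulary.
tokenizer_vocab_len = 30522

vocab_size = padded_vocab_size(tokenizer_vocab_len)
print("VOCAB SIZE:", vocab_size)  # prints "VOCAB SIZE: 30528"
```

Under these assumptions the result depends only on the tokenizer length, so the 119547 in the config has no effect on the reported size.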

hwchen2017 pushed a commit that referenced this pull request Jun 8, 2025
* Support nvidia bert dataset

* Format fixes

* E2E run of Nvidia Data with SQUAD 90.6 F1

* Minor fixes

* Update README

* Update README

Development

Successfully merging this pull request may close these issues.

Bing BERT

3 participants