Support nvidia bert dataset #27
jeffra
left a comment
This is super exciting, thank you Tunji!! :)
Oh, can we actually do one more thing? Can you add a short snippet of text on how to use this, with a pointer to the NVIDIA instructions for getting the data using their scripts? You can add it here: https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/README.md
Done.
"bert_token_file": "bert-large-uncased",
"bert_model_file": "bert-large-uncased",
"bert_model_config": {
    "vocab_size_or_config_json_file": 119547,
Curious how this vocab_size is 119k with NVIDIA's dataset, whereas in NVIDIA's own config it is 30k:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/bert_config.json#L12
This is a good question. In reality, vocab_size_or_config_json_file is not used to compute the vocabulary size. Instead, the vocabulary size is computed from the tokenizer. For confirmation, you can examine the log for the output of the corresponding print statement, which shows the following:
worker-0: VOCAB SIZE: 30528
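To make the 30528 figure concrete: the bert-large-uncased WordPiece vocabulary has 30522 entries, and 30528 is the next multiple of 8 above that. The sketch below assumes the training code rounds the tokenizer's vocabulary length up to a multiple of 8; that rounding rule and the helper name `pad_vocab_size` are illustrative assumptions, not code taken from this PR.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 8) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`.

    Hypothetical helper: the padding-to-8 rule is an assumption that
    happens to match the logged value, not a quote from the repository.
    """
    remainder = vocab_size % multiple
    if remainder != 0:
        vocab_size += multiple - remainder
    return vocab_size


# Number of entries in the bert-large-uncased WordPiece vocabulary.
TOKENIZER_VOCAB_ENTRIES = 30522

# The value in the JSON config (119547) plays no part in this computation.
print("VOCAB SIZE:", pad_vocab_size(TOKENIZER_VOCAB_ENTRIES))
```

Under that assumption, the value 119547 in the config never enters the calculation, which is consistent with the explanation above.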
* Support nvidia bert dataset
* Format fixes
* E2E run of Nvidia Data with SQUAD 90.6 F1
* Minor fixes
* Update README
* Update README
Enable bing_bert training on Nvidia dataset