Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 13 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,17 @@
# research-BertBigCode
Exploration of BERT-like models trained on The Stack
# BERT pre-training on The Stack
Exploration of BERT-like models trained on The Stack.

- Code used to train [StarEncoder](https://huggingface.co/bigcode/starencoder).
- StarEncoder was fine-tuned for PII detection to pre-process the data used to train [StarCoder](https://arxiv.org/abs/2305.06161)

**Work in progress. Currently training on the subsample of The Stack.**
- This repo also contains functionality to train encoders with contrastive objectives.

[Project information.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)
- [More details.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)


## To run locally:
## To launch pre-training:

After installing requirements, training can be launched via the example launcher script:

```
./launcher.sh
Expand All @@ -18,4 +22,7 @@ Exploration of BERT-like models trained on The Stack
- ```--train_data_name``` can be used to use to set the training dataset.

- Hyperparamaters can be changed in ```exp_configs.py```.
- The tokenizer to be used is treated as a hyperparameter and then must also be set in ```exp_configs.py```
- The tokenizer to be used is treated as a hyperparameter and then must also be set in ```exp_configs.py```.
- alpha is used to weigh the BERT losses (NSP+MLM) and the contrastive objective.
- Setting alpha to 1 corresponds to the standard BERT objective.
- Token masking probabilities are set as separate hyperparameters, one for MLM and another one for the contrastive loss.