bigcode-project · joaomonteirof · May 30, 2023 · May 30, 2023
diff --git a/README.md b/README.md
@@ -1,13 +1,17 @@
-# research-BertBigCode
-Exploration of BERT-like models trained on The Stack
+# BERT pre-training on The Stack
+Exploration of BERT-like models trained on The Stack.
 
+- Code used to train [StarEncoder](https://huggingface.co/bigcode/starencoder).
+  - StarEncoder was fine-tuned for PII detection to pre-process the data used to train [StarCoder](https://arxiv.org/abs/2305.06161).
 
-**Work in progress. Currently training on the subsample of The Stack.**
+- This repo also contains functionality to train encoders with contrastive objectives.
 
-[Project information.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)
+- [More details.](https://docs.google.com/document/d/1gjf7Y2Ek64xSyl8HE3GoK1kxDgsV8kjy-9pyIBkR-RQ/edit?usp=sharing)
 
 
-## To run locally:
+## To launch pre-training:
+
+After installing requirements, training can be launched via the example launcher script:
 
 ```
 ./launcher.sh
@@ -18,4 +22,7 @@ Exploration of BERT-like models trained on The Stack
 - ```--train_data_name``` can be used to set the training dataset.
 
 - Hyperparameters can be changed in ```exp_configs.py```.
-  - The tokenizer to be used is treated as a hyperparameter and then must also be set in ```exp_configs.py```
+  - The tokenizer is treated as a hyperparameter and must therefore also be set in ```exp_configs.py```.
+  - ```alpha``` weights the BERT losses (NSP+MLM) against the contrastive objective.
+    - Setting ```alpha``` to 1 corresponds to the standard BERT objective.
+  - Token masking probabilities are set as separate hyperparameters, one for MLM and another one for the contrastive loss.
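
The role of alpha described above can be illustrated with a minimal sketch. It assumes a convex combination of the two terms, which is consistent with alpha = 1 recovering the standard BERT objective but is not confirmed by this diff; `combine_losses` and its argument names are hypothetical and do not come from the repository.

```python
def combine_losses(mlm_loss, nsp_loss, contrastive_loss, alpha):
    """Hypothetical helper: weight the BERT losses against the contrastive loss.

    Assumption (not taken from the repository): the total loss is
    alpha * (MLM + NSP) + (1 - alpha) * contrastive.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    bert_loss = mlm_loss + nsp_loss  # standard BERT objective
    return alpha * bert_loss + (1.0 - alpha) * contrastive_loss


# alpha = 1 keeps only the BERT terms; alpha = 0 keeps only the contrastive term.
bert_only = combine_losses(2.0, 0.5, 4.0, alpha=1.0)
contrastive_only = combine_losses(2.0, 0.5, 4.0, alpha=0.0)
```

Under this assumption, intermediate values of alpha interpolate between the two objectives; the actual weighting used in training is set in ```exp_configs.py```.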