This folder provides code for analyzing code datasets:
- Near deduplication using MinHash and LSH
- Data decontamination from HumanEval and MBPP evaluation benchmarks
- Python data analysis:
  - Natural language distribution in comments/docstrings
  - Detection of configuration and test files (also applicable to languages other than Python)
  - Estimation of the number of files that can be successfully compiled
- Comment-to-code ratio: analysis notebook for filtering based on the ratio of comments in a file. The filtering code is available at bigcode-dataset/preprocessing
- Stars filtering: analysis notebook for filtering files based on their number of stars. The filtering code is available at bigcode-dataset/preprocessing
- PII Redaction: moved to bigcode-dataset/preprocessing
  - PII detection of emails, IP addresses and secret keys
  - PII anonymization
  - Pipeline evaluation on an annotated benchmark
- Preprocessing: moved to bigcode-dataset/preprocessing
  - Code for data filtering based on line length, percentage of alphanumeric characters, comment-to-code ratio, and stars
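
For readers unfamiliar with the near-deduplication item above, the idea behind MinHash and LSH can be sketched as follows. This is a minimal standard-library illustration, not the code shipped in this folder: the function names, the token-based shingling, and the values of `NUM_PERM`, `BANDS`, `SHINGLE_SIZE` and `threshold` are all assumptions chosen for the example.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# All parameters below are illustrative assumptions, not repository settings.
NUM_PERM = 128     # signature length (number of seeded hash functions)
BANDS = 32         # number of LSH bands, i.e. 4 rows per band
SHINGLE_SIZE = 5   # number of tokens per shingle


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-token shingles of a document."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum over all shingles."""
    shingle_set = shingles(text)
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def candidate_pairs(signatures, bands=BANDS):
    """LSH banding: documents sharing any identical band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.85):
    """Return candidate pairs whose estimated similarity reaches the threshold."""
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    return {
        pair for pair in candidate_pairs(sigs)
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    }
```

`near_duplicates` takes a dict mapping document ids to file contents and returns a set of id pairs. Because the sketch tokenizes on word characters only, two files that differ just in whitespace produce identical signatures. The banding step keeps the comparison cheap: only documents that collide in at least one band are compared, instead of every pair in the dataset.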