Decontamination

This directory contains several scripts for decontamination of the data.

Exact prompt matching: find_substrings.py
Near matching: minhash.py

Near Matching with MinHash and LSH

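This is similar to the near-deduplication script data_analysis/near-deduplication/minhash_deduplication_alt.py, with one modification: the index is built from the benchmark datasets instead of from the dataset itself. Each record of the dataset is then queried against that index, so a hit means the record closely resembles a benchmark entry.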
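The idea can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the code in minhash.py: the helper names (shingles, minhash_signature, build_index, query), the word 3-gram shingling, the signature length, and the band count are all hypothetical choices made for brevity.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64              # assumed signature length
BANDS = 16                 # assumed number of LSH bands
ROWS = NUM_PERM // BANDS   # rows per band (4 here)


def shingles(text, n=3):
    """Return the set of word n-grams in text (whole text if shorter than n)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """Keep the minimum hash per seed over the shingle set."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_PERM))


def band_keys(signature):
    """Split the signature into bands; each band is one LSH bucket key."""
    for b in range(BANDS):
        yield (b, signature[b * ROWS:(b + 1) * ROWS])


def build_index(benchmark_texts):
    """Index the BENCHMARK texts (not the dataset being checked)."""
    index = defaultdict(set)
    for i, text in enumerate(benchmark_texts):
        for key in band_keys(minhash_signature(text)):
            index[key].add(i)
    return index


def query(index, text):
    """Return ids of benchmark entries sharing at least one band with text."""
    hits = set()
    for key in band_keys(minhash_signature(text)):
        hits |= index.get(key, set())
    return hits


benchmark = [
    "def add(a, b): return a + b",
    "print hello world to the console now",
]
index = build_index(benchmark)
candidates = query(index, "def add(a, b): return a + b")
```

A record that is an exact copy of a benchmark entry always lands in the same buckets. Near copies collide with a probability that grows with their Jaccard similarity, which is why candidates are usually re-checked against a similarity threshold before being dropped.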

Usage:

Update the script to include any benchmark you want to check against in DATASETS_TO_CHECK. Be sure to create a global variable for the index using the same name in that config. Benchmark columns should be of type string or sequence of strings, so that they can be concatenated.
Then run the script:

pip install -r requirements_minhash.txt
# Quick example
python minhash.py \
  --dataset codeparrot/codeparrot-clean-valid \
  --split train \
  --column content \
  --cache-dir .cache \
  --verbose
# Check parameters with the help message
python minhash.py --help
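
The column-type requirement from the usage notes can be illustrated with a small helper. This is only a sketch: concat_columns and the example record are hypothetical, and the actual schema of DATASETS_TO_CHECK in minhash.py is not shown here. It assumes each selected column holds either a single string or a sequence of strings.

```python
def concat_columns(record, columns, sep=" "):
    """Join the selected columns of one benchmark record into a single text.

    Hypothetical helper: a column may be a string or a sequence of strings.
    """
    parts = []
    for name in columns:
        value = record[name]
        if isinstance(value, str):
            parts.append(value)
        else:
            # Assumed to be a sequence of strings, e.g. a list of test cases.
            parts.extend(value)
    return sep.join(parts)


# Made-up record for illustration only.
record = {"prompt": "Write a function add.", "tests": ["assert add(1, 2) == 3"]}
text = concat_columns(record, ["prompt", "tests"])
```

Columns of any other type (integers, nested dicts) would need to be converted to strings first, which is why the restriction exists.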
