data_analysis/near_dedublication_toolkit/README.md (4 additions, 4 deletions)
@@ -1,15 +1,15 @@
## Near deduplication with min hash lsh index
-Code taken from https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/near-deduplication and changed to be run on distributed Dask cluster and independent nodes with max RAM usage as for now under 600GB but with potential to be reduced even more if to distribute minhash lsh index itself in step 2 and to divide minhash cluster buckets additionally according to cluster element count, not just cluster number in step 3.
+Code taken from https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/near-deduplication and changed to be run on distributed Dask cluster and independent nodes with max RAM usage as for now under 600GB but with potential to be reduced even more if we distribute minhash lsh index itself in step 2 and divide minhash cluster buckets additionally according to cluster element count, not just cluster number in step 3.
-Right now the code is in 5 steps which are to be run manually each. However, pipeline can be fully automated either by reducing memory requirements as described above or if to implement dask worker nodes with different resources on toolkit (dask itself allows it)
+Right now the code is in 5 steps which are to be run manually each. However, pipeline can be fully automated either by reducing memory requirements as described above or if we implement dask worker nodes with different resources on toolkit (dask itself allows it)
- `cfg.py` contains configuration
- `util.py` contains most of the code taken from the above repo
- `[step_number]_*.py` contains step code
- `[step_number]_*.child.yaml` contains resources of a worker node if a step is a Dask cluster step
### 1 step
-Compute min hashes for all data and stores them to files. Runs as Dask cluster on toolkit.
+Computes min hashes for all data and stores them to files. Runs as Dask cluster on toolkit.
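
Reviewer note: for readers unfamiliar with the min hash step, here is a minimal sketch of what such a signature looks like. It is not the code in `util.py` or in the upstream repo. It uses only the standard library, and the names `NUM_PERM`, `shingles`, `min_hash` and `estimate_jaccard`, the 3-token shingle size and the salted BLAKE2b hash are all assumptions made for illustration.

```python
import hashlib
import re

NUM_PERM = 64  # hypothetical signature length, not taken from cfg.py


def shingles(text, k=3):
    """Split text into word tokens and return the set of k-token shingles."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def min_hash(text, num_perm=NUM_PERM):
    """Return a min hash signature: one minimum per salted hash function."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        # Each seed acts as a different hash function (assumption: salted BLAKE2b).
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            (
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for s in items
            ),
            default=2**64 - 1,  # sentinel for documents with no tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures of this kind are small and fixed-size, which is what makes it practical to store them to files per document and index them in a later step.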
```
make text2code_dataset/dataset/postprocessing/near_dedup/1_get_min_hashes.run-slim-dask MORE_JOB_ARGS="--data snow.code_llm.data:/data"
-Computes pairwise Jaccard similarity within clusters and identifies near duplicates to remove. Rearranges row to remove form per cluster to per data files as in source file split and removes them from source files. Runs as dask task graph on dask cluster. Currently need manual re-run for one cluster bucket for html as it is too big to fit to worker memory (see notes for previous step)
+Computes pairwise Jaccard similarity within clusters and identifies near duplicates to remove. Rearranges row to remove from per cluster to per data files as in source file split and removes them from source files. Runs as dask task graph on dask cluster. Currently need manual re-run for one cluster bucket for html as it is too big to fit to worker memory (see notes for previous step)
```
make text2code_dataset/dataset/postprocessing/near_dedup/4_remove_near_duplicates_by_clusters.run-slim-dask MORE_JOB_ARGS="--data snow.code_llm.data:/data"
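
Reviewer note: the pairwise Jaccard pass and the per-cluster to per-file rearrangement described in the step 4 text above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `4_remove_near_duplicates_by_clusters.py`. The row identifier shape `(file_name, row_index)`, the function names, and the `0.85` threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def duplicates_to_remove(cluster, threshold=0.85):
    """Pick rows to drop from one cluster.

    cluster maps a row id, assumed here to be (file_name, row_index),
    to that row's token set. The first row of each near-duplicate pair
    in sorted id order is kept and the later one is marked for removal.
    """
    removed = set()
    for a, b in combinations(sorted(cluster), 2):
        if a in removed or b in removed:
            continue  # skip pairs involving a row already marked for removal
        if jaccard(cluster[a], cluster[b]) >= threshold:
            removed.add(b)
    return removed


def group_by_file(removed):
    """Rearrange rows to remove from per cluster to per source file."""
    per_file = defaultdict(list)
    for file_name, row_index in sorted(removed):
        per_file[file_name].append(row_index)
    return dict(per_file)
```

The pairwise loop is quadratic in cluster size, which is consistent with the note in the diff that one very large html cluster bucket does not fit in worker memory and needs a manual re-run.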