Commit 0178c92

fix: readd typo fixes

Author: denisko
Parent: 755a3a6

1 file changed

Lines changed: 4 additions & 4 deletions

data_analysis/near_dedublication_toolkit/README.md

@@ -1,15 +1,15 @@
 ## Near deduplication with min hash lsh index
-Code taken from https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/near-deduplication and changed to be run on distributed Dask cluster and independent nodes with max RAM usage as for now under 600GB but with potential to be reduced even more if to distribute minhash lsh index itself in step 2 and to divide minhash cluster buckets additionally according to cluster element count, not just cluster number in step 3.
+Code taken from https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/near-deduplication and changed to be run on distributed Dask cluster and independent nodes with max RAM usage as for now under 600GB but with potential to be reduced even more if we distribute minhash lsh index itself in step 2 and divide minhash cluster buckets additionally according to cluster element count, not just cluster number in step 3.
 
-Right now the code is in 5 steps which are to be run manually each. However, pipeline can be fully automated either by reducing memory requirements as described above or if to implement dask worker nodes with different resources on toolkit (dask itself allows it)
+Right now the code is in 5 steps which are to be run manually each. However, pipeline can be fully automated either by reducing memory requirements as described above or if we implement dask worker nodes with different resources on toolkit (dask itself allows it)
 
 - `cfg.py` contains configuration
 - `util.py` contains most of the code taken from the above repo
 - `[step_number]_*.py` contains step code
 - `[step_number]_*.child.yaml` contains resources of a worker node if a step is a Dask cluster step
 
 ### 1 step
-Compute min hashes for all data and stores them to files. Runs as Dask cluster on toolkit.
+Computes min hashes for all data and stores them to files. Runs as Dask cluster on toolkit.
 ```
 make text2code_dataset/dataset/postprocessing/near_dedup/1_get_min_hashes.run-slim-dask MORE_JOB_ARGS="--data snow.code_llm.data:/data"
 ```
@@ -28,7 +28,7 @@ python text2code_dataset/dataset/postprocessing/near_dedup/3_launchlocal_group_d
 ```
 
 ### 4 step
-Computes pairwise Jaccard similarity within clusters and identifies near duplicates to remove. Rearranges row to remove form per cluster to per data files as in source file split and removes them from source files. Runs as dask task graph on dask cluster. Currently need manual re-run for one cluster bucket for html as it is too big to fit to worker memory (see notes for previous step)
+Computes pairwise Jaccard similarity within clusters and identifies near duplicates to remove. Rearranges row to remove from per cluster to per data files as in source file split and removes them from source files. Runs as dask task graph on dask cluster. Currently need manual re-run for one cluster bucket for html as it is too big to fit to worker memory (see notes for previous step)
 ```
 make text2code_dataset/dataset/postprocessing/near_dedup/4_remove_near_duplicates_by_clusters.run-slim-dask MORE_JOB_ARGS="--data snow.code_llm.data:/data"
 ```
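
Note on the README text in this diff: steps 1 and 4 describe computing min hashes and then pairwise Jaccard similarity within each cluster. The sketch below illustrates those two ideas in plain Python. It is not the toolkit's code: the function names, the word-level shingling, the 64-permutation signature, and the 0.85 threshold are all assumptions chosen for illustration, and the real steps run on Dask with code from `util.py`.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-grams of a document (hypothetical tokenization)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """MinHash signature: for each salted hash function keep the minimum value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates_in_cluster(docs, threshold=0.85):
    """Given {doc_id: text} for one cluster, return the ids to remove.

    The first document (in sorted id order) of each near-duplicate pair is kept.
    The threshold is an assumed value, not one taken from the toolkit.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    to_remove = set()
    for a, b in combinations(sorted(sets), 2):
        if a in to_remove or b in to_remove:
            continue
        if jaccard(sets[a], sets[b]) >= threshold:
            to_remove.add(b)
    return to_remove
```

In a sketch like this, the signatures would be what an LSH index buckets into candidate clusters, while the exact `jaccard` check is applied only inside a cluster, which keeps the quadratic pairwise comparison limited to small groups.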
