--- a/data_analysis/near-deduplication/README.md
+++ b/data_analysis/near-deduplication/README.md
@@ -8,7 +8,7 @@ Code for running near-deduplication with MinHash and LSH indexing
 pip install -r requirements.txt
 ````
 
-Login to be able to be able to push the dataset to the hub after deduplication and clone your huggingface-hub repositories:
+Login to be able to push the dataset to the hub after deduplication and clone your huggingface-hub repositories:
 
 ````
 huggingface-cli login
@@ -31,7 +31,7 @@ python near_deduplicate.py \
 --text_column content
 ````
 
-To make just a test run with a subset of the data set `test_run` argument to True.
+To make just a test run with a subset of the data, set `test_run` argument to True.
 
 The first time you load the dataset might be slow if it is large, but the data is saved in the cache thanks to `datasets`, and the subsequent calls will be fast.
 
@@ -57,11 +57,11 @@ This is for the alternative script that is designed for single-machine setup.
 
 ##### Scaling
 
-To understand the limitation of current deduplication implementation, it is important to have an idea of how each step in the pipleine affects the overall time:
-1. Minhashing is fast, but it takes loner for long documents. Hashing scales with both the number of cores and single core performance (clock speed, for example). With `datasets`'s caching, it also does not require much memory.
-2. Indexing is basically putting minhash signatures into different buckets. This is one bottleneck in this pipleine. In an ideal situation where MapReduce is seamlessly integrated with other parts, it can be further improved with distributed buckets.
+To understand the limitation of current deduplication implementation, it is important to have an idea of how each step in the pipeline affects the overall time:
+1. Minhashing is fast, but it takes longer for long documents. Hashing scales with both the number of cores and single core performance (clock speed, for example). With `datasets`s caching, it also does not require much memory.
+2. Indexing is basically putting minhash signatures into different buckets. This is one bottleneck in this pipeline. In an ideal situation where MapReduce is seamlessly integrated with other parts, it can be further improved with distributed buckets.
 3. Depending on how you look at duplicates, querying can be easily created by iterating the buckets or iterating the simhashes.
-4. Depending on how you decide to group duplicates, you can build a graph and then do connected component analysis or use simple algorithm like union-find.
+4. Depending on how you decide to group duplicates, you can build a graph and then do connected component analysis or use a simple algorithm like union-find.
 5. What to do with a group of duplicates is also a widely open question. We opt to keep one document within a group/cluster in this case.
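
For readers of this diff, the five steps in the Scaling list can be sketched end to end in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the code in `near_deduplicate.py`: the names `shingles`, `minhash`, `build_buckets`, `UnionFind`, and `deduplicate`, the constants `NUM_PERM` and `BANDS`, and the word-trigram shingling are all hypothetical choices. It also treats every bucket collision as a duplicate, whereas a real pipeline would likely verify candidates against a similarity threshold.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64              # assumed signature length (hypothetical value)
BANDS = 16                 # assumed number of LSH bands (hypothetical value)
ROWS = NUM_PERM // BANDS   # rows per band


def shingles(text, k=3):
    """Split a document into a set of word k-grams."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(text):
    """Step 1: one minimum hash value per salted hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def build_buckets(signatures):
    """Step 2: put each band of each signature into a bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band in range(BANDS):
            key = (band, sig[band * ROWS:(band + 1) * ROWS])
            buckets[key].append(doc_id)
    return buckets


class UnionFind:
    """Step 4: group duplicates without building an explicit graph."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # The smaller index becomes the root, so each cluster is
            # represented by its first document.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def deduplicate(docs):
    """Steps 3 and 5: iterate the buckets, then keep one document per cluster."""
    signatures = [minhash(doc) for doc in docs]
    clusters = UnionFind(len(docs))
    for doc_ids in build_buckets(signatures).values():
        for other in doc_ids[1:]:
            clusters.union(doc_ids[0], other)
    return [i for i in range(len(docs)) if clusters.find(i) == i]
```

Calling `deduplicate(docs)` returns the indices of the documents to keep. The sketch runs in a single process with an in-memory dictionary, which is the part the list identifies as a bottleneck; distributing the buckets would mean sharding on the bucket key.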