Add local pHash deduplication utility by vinay553 · Pull Request #462 · scaleapi/nucleus-python-client

vinay553 · 2026-06-04T15:33:07Z

Summary

Adds a local pHash-based deduplication utility for Nucleus SDK users who already have a large list of DatasetItem objects or rows from items_and_annotation_generator().

Exposes nucleus.deduplicate_by_phash(...) and LocalDeduplicationResult.
Accepts either DatasetItem objects or generator rows with an "item" DatasetItem.
Preserves Nucleus threshold semantics for 0..64.
Uses fast exact-pHash handling for threshold == 0, a single-item fast path for threshold == 64, an exact chunked Hamming-neighbor index for 1 <= threshold <= 10, and a linear exact fallback for thresholds above 10.
Returns the surviving original objects, surviving dataset items, reference IDs, and DeduplicationStats.

Writeup: Local pHash Deduplication

Baseline Problem

The logical algorithm is greedy deduplication. Sort rows deterministically, then keep a candidate only if no previously kept pHash is within the requested Hamming distance threshold.

kept = []
for row in sorted_rows:
    if not any(hamming(row.phash, kept_hash) <= threshold for kept_hash in kept):
        kept.append(row)

This is exact but effectively quadratic for mostly distinct inputs, because each new candidate may scan every pHash kept so far.

Technique 1: Fail Fast and Normalize Once

Before the dedup pass starts, each input is normalized into an internal record with the original object, its DatasetItem, a stable tie-break ID, and the pHash parsed as a 64-bit integer. Threshold validation and pHash validation happen up front, so a long local run fails before doing expensive work if any pHash is missing or malformed.

Technique 2: Integer pHashes and Integer Hamming Distance

Backend pHashes arrive as 64-character binary strings. The utility parses each pHash once with int(phash, 2), then computes Hamming distance with:

(candidate_phash_value ^ kept_phash_value).bit_count()

This avoids per-character string comparison and enables sorting by (phash_value, stable_id).

Technique 3: Threshold-Specific Fast Paths

Threshold case	Implementation	Reason
`threshold == 0`	Keep the first row per exact integer pHash.	Hamming distance must be exactly zero, so only exact pHash duplicates are removed.
`threshold == 64`	Keep the first sorted item.	Any two 64-bit hashes are within distance 64.
`1 <= threshold <= 10`	Use the exact chunked Hamming index.	This is the expected practical range and keeps chunk radius small.
`threshold > 10`	Use a linear greedy scan.	Larger radii make chunk neighborhoods dense enough that the index becomes less attractive.

Technique 4: Chunking With the Pigeonhole Principle

The accelerated path splits each 64-bit pHash into six chunks:

INDEX_CHUNK_BITS = (11, 11, 11, 11, 10, 10)

For threshold t, the per-chunk radius is floor(t / 6). If two pHashes have total Hamming distance at most t, at least one of the six chunks must be within that per-chunk radius. Otherwise every chunk would exceed the radius, forcing the total Hamming distance above t.

For threshold=10, the per-chunk radius is 1. One partition performs 4 * 12 + 2 * 11 = 70 dictionary lookups: each 11-bit chunk checks the exact value plus 11 one-bit variants, and each 10-bit chunk checks the exact value plus 10 one-bit variants.

Technique 5: Two Partitions With Cheap Rotation

A single chunk partition can produce many false candidates. The utility uses two exact partitions:

Partition 0 splits the original 64-bit pHash into six contiguous chunks.
Partition 1 rotates the pHash right by 8 bits, then splits into the same chunk widths.

The rotation is cheap:

((phash_value >> 8) | (phash_value << 56)) & ((1 << 64) - 1)

The pigeonhole argument applies independently to both partitions, so true duplicates cannot be missed. A candidate must be found by both partitions before the exact 64-bit Hamming check runs.

Technique 6: Reusable Candidate Marks Instead of Set Intersections

The permanent state remains simple: kept_hashes stores pHashes that survived deduplication, and chunk dictionaries point chunk values back to positions in kept_hashes. Those positions are called kept indexes.

Conceptually, each query intersects two candidate sets:

partition_0_candidates = {17, 31, 44, 91, ...}
partition_1_candidates = {3, 17, 88, 91, ...}
intersection = {17, 91, ...}

The first implementation literally allocated Python set() objects and intersected them per candidate. The current implementation uses a reusable bytearray with one mark byte per kept pHash:

# Partition 0
candidate_marks[17] = 1
candidate_marks[31] = 1
candidate_marks[44] = 1
candidate_marks[91] = 1

# Partition 1
if candidate_marks[17] == 1: exact_check(17)
if candidate_marks[88] == 1: exact_check(88)  # skipped; mark is zero
if candidate_marks[91] == 1: exact_check(91)

After each query, only touched indexes are cleared. This computes the same intersection while avoiding large temporary set allocations.

Real Dataset Profile

I profiled this on an internal real production dataset with 278,587 rows. Because this repository is public, the dataset identifier and name are intentionally omitted from the PR body. The API key was not included in the report or the PR.

The profiler exported rows with items_and_annotation_generator(), held the exported rows in memory, and then timed only deduplicate_by_phash(...) so export/API time and local dedup CPU time are separated.

Input rows	Export time	Missing pHashes	Malformed pHashes	Duplicate reference IDs
278,587	127.04s	0	0	0

Local dedup timing only:

Threshold	Kept	Removed	Removed %	Dedup time	Peak tracemalloc	Process max RSS
0	277,164	1,423	0.51%	1.86s	47.90 MB	1.41 GB
5	252,084	26,503	9.51%	50.79s	66.94 MB	1.46 GB
10	194,776	83,811	30.08%	396.58s	59.02 MB	1.46 GB

The practical read: the expensive exact local dedup case took about 6m 36.6s for 278,587 real rows. If a user exports the dataset and runs threshold 10, the observed end-to-end time for this dataset would be roughly 127.04s + 396.58s = 523.62s, or about 8m 43.6s.

That is a reasonable runtime for an offline local utility, especially because the implementation is deterministic, exact, requires no server-side job, and validates pHashes before starting the expensive pass.

Validation

NUCLEUS_PYTEST_API_KEY=dummy .venv/bin/python -m pytest tests/test_local_deduplication.py -q
Real dataset profile using the local SDK editable install in .venv.

Focused tests cover row preservation, DatasetItem inputs, threshold 0, threshold 64, invalid thresholds, malformed pHashes, unsupported input shapes, custom sort keys, and parity against a simple linear baseline for thresholds 1, 5, and 10.

Greptile Summary

This PR adds a fully offline deduplicate_by_phash() utility and LocalDeduplicationResult to the Nucleus SDK. It requires no API key, accepts DatasetItem objects or generator rows, and dispatches to an exact-match path (threshold 0), a two-partition chunked Hamming index (thresholds 1–10), a linear greedy fallback (thresholds 11–63), or a keep-one path (threshold 64).

nucleus/local_deduplication.py: New module implementing the full pipeline — pHash validation, stable-id tie-breaking via _tie_break_key, the _HammingIndex with pigeonhole-based candidate filtering and reusable bytearray marks, and the three dedup paths.
tests/test_local_deduplication.py: Covers all dispatch paths, custom/integer/None sort keys, ordinal fallback, missing reference IDs, and parity against a linear baseline for thresholds 1, 5, 10, 11, and 20.
nucleus/__init__.py: Re-exports deduplicate_by_phash and LocalDeduplicationResult from the public namespace.

Confidence Score: 5/5

Safe to merge — the new utility is fully offline and additive, with no changes to existing API paths or data models.

The algorithm is provably correct: the pigeonhole argument holds for both partitions independently, the mark-based candidate intersection correctly resets state after every query, and all three dedup paths produce results consistent with the linear baseline. The previous issues around string-encoded sort keys and empty-string reference IDs are both resolved. The type annotation for sort_key's return type is the only gap.

No files require special attention.

Important Files Changed

Filename	Overview
nucleus/local_deduplication.py	New file: implements the full local pHash dedup pipeline — exact path, two-partition Hamming index, and linear fallback — with correct mark-based candidate intersection and clean threshold dispatch.
tests/test_local_deduplication.py	New file: comprehensive test suite covering all fast-paths, sort-key variants, error cases, and a parity baseline for thresholds 1, 5, 10, 11, and 20 (index and linear-scan paths).
nucleus/init.py	Exports LocalDeduplicationResult and deduplicate_by_phash from the public nucleus namespace.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[deduplicate_by_phash] --> B[_validate_threshold]
    B --> C[_normalize_records]
    C --> D{empty?}
    D -- yes --> E[return empty result]
    D -- no --> F[sort by phash_value, stable_id]
    F --> G{threshold?}
    G -- 0 --> H[_deduplicate_exact]
    G -- 64 --> I[keep records 0]
    G -- 1..10 --> J[_deduplicate_with_index]
    G -- 11..63 --> K[_deduplicate_with_linear_scan]
    H --> L[LocalDeduplicationResult]
    I --> L
    J --> L
    K --> L

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nucleus/local_deduplication.py:162
The `sort_key` callable's return type is annotated as `Union[str, int]`, but `_tie_break_key` explicitly handles a `None` return, and `test_deduplicate_by_phash_preserves_ordinal_fallback_order` relies on it with `sort_key=lambda row: None`. A caller reading the signature has no indication that returning `None` is valid and falls back to ordinal ordering — the annotation should reflect this.

```suggestion
    sort_key: Optional[Callable[[InputT], Optional[Union[str, int]]]] = None,
```

_{Reviews (8): Last reviewed commit: "Run pre-commit formatting" | Re-trigger Greptile}

vinay553 · 2026-06-04T15:44:56Z

@greptile-apps review this pr

edwinpav · 2026-06-04T19:25:05Z

@greptileai review

edwinpav

(meant to click approve)

Wowza what an optimization. Took me a while to understand the logic behind it. Only left some small nit comments.

I'd suggest adding a comment like my explanation below within the code to help future readers.

Let me know if my understanding is correct (had ai format the wording):

We can safely skip most images without comparing against them, thanks to the pigeonhole principle. Chunks are defined by bit position, so we always compare a new pHash against a kept one chunk-by-chunk at the same positions — chunk 0 against chunk 0, chunk 1 against chunk 1, and so on. The principle applies to the positions where two hashes disagree: if two images are true duplicates, they differ in at most 10 of the 64 bits. Because the 6 chunks don't overlap and together cover all 64 bits, the total difference is just the sum of the per-chunk differences. So if every chunk differed by 2 or more bits, the total would be at least 2 × 6 = 12 — but a duplicate has at most 10 differences, and you can't have ≤10 and ≥12 at once. That contradiction forces at least one chunk down to at most floor(10/6) = 1 differing bit. In other words, for any real duplicate, there is guaranteed to be at least one chunk where the two hashes are within 1 bit of each other.

So we index every kept image by all 6 of its chunk values, and for each new pHash we split it into 6 chunks and probe all of them — each at its exact value plus every value within chunk_radius = 1 bit (the variant masks). We check all 6 because pigeonhole guarantees some chunk is close but not which one. Any kept image found in those buckets is a candidate, which we then confirm with an exact Hamming-distance check on the full hash (sharing one near-chunk is necessary but not sufficient, since the other chunks could still differ a lot). If none of the 6 chunks produces a match, the image is guaranteed not to be a duplicate — because a real duplicate would have been forced to leave at least one near-matching chunk behind.

The chunk lookups are dictionary (hash-map) lookups, which jump directly to the drawer for a given chunk value without scanning the other drawers. So for each new pHash we do a fixed number of cheap lookups (6 chunks × the variant masks, a constant that doesn't grow with the dataset), and those lookups hand us only the small set of kept items that share a near-chunk — the candidates. We run the expensive full-hash Hamming check only on those candidates, never on the rest. Every kept item filed under unrelated chunk values is never touched at all. Naive dedup is O(n) comparisons per new item (compare against all n kept items), so the whole run is O(n²); the index replaces that per-item scan with a handful of constant-time lookups plus a few exact checks on candidates, so in practice each item costs roughly O(1) instead of O(n) — turning an O(n²) algorithm into something close to O(n). The pigeonhole guarantee is what makes this safe: since a real duplicate is forced to share a near-chunk, the items we skip are provably not duplicates, so ignoring them costs us no correctness.

edwinpav · 2026-06-04T20:06:49Z

+    )
+
+
+def _validate_threshold(threshold: int) -> None:


is this validation not also happening in the original dedup sdk path? either way, can it be shared between both?

I left this local-only for now. The existing async Dataset.deduplicate() path relies on backend/job validation today, and the existing integration tests expect NucleusAPIError/JobError for invalid thresholds. Adding shared client-side validation there would change that public error behavior. This local helper needs the upfront check because it does not make an API call and should fail before doing CPU-heavy local work.

edwinpav · 2026-06-04T20:08:38Z

+        phash = dataset_item.phash
+        if phash is None or not PHASH_REGEX.fullmatch(phash):
+            ref_id = dataset_item.reference_id
+            ref_msg = f" with reference_id {ref_id!r}" if ref_id else ""


nit: would it be useful to include the dataset_item_id as well?

I did not add this because the supported local input shape only exposes a DatasetItem, and the SDK currently discards the backend item ID when it builds DatasetItem objects from export/API payloads. reference_id is the only reliable identifier available here; adding a best-effort dataset_item_id field would usually be absent for the exact generator rows this utility is intended to consume.

edwinpav · 2026-06-04T20:10:26Z

+    if not records:
+        return LocalDeduplicationResult(
+            unique=[],
+            unique_dataset_items=[],
+            unique_reference_ids=[],
+            stats=DeduplicationStats(
+                threshold=threshold,
+                original_count=0,
+                deduplicated_count=0,
+            ),
+        )


nbd but might be slightly more efficient to place this above the sort on 179? if i understand correctly the sort won't drop any records

Done, the empty-input return now happens immediately after normalization and before the sort.

edwinpav · 2026-06-04T20:12:31Z

+            for _ in range(PARTITION_COUNT)
+        ]
+
+    def add(self, phash_value: int) -> None:


nit: should this also be _add

edwinpav · 2026-06-04T20:12:41Z

+                index = self._indexes[partition_index][chunk_index]
+                index.setdefault(chunk_value, []).append(kept_index)
+
+    def find_duplicate_index(self, phash_value: int) -> Optional[int]:


same as #462 (comment)

vinay553 · 2026-06-05T16:49:33Z

@edwinpav I think your explanation is pretty good. To be pedantic, it's not necessarily most and the runtime stuff is handwavy and really can only be validated empirically. Some other key points:
The size of the chunks is key. I picked 10 as a reasonable upper limit for the threshold. So given 10 we needed to keep the chunks exactly this size so the floor(10/6) = 1, because if it was floor(10/N) > 1, the amount of adjustments scale exponentially.
The other thing is the bit rotation to narrow candidates down further.

edwinpav · 2026-06-05T17:08:13Z

@edwinpav I think your explanation is pretty good. To be pedantic, it's not necessarily most and the runtime stuff is handwavy and really can only be validated empirically. Some other key points: The size of the chunks is key. I picked 10 as a reasonable upper limit for the threshold. So given 10 we needed to keep the chunks exactly this size so the floor(10/6) = 1, because if it was floor(10/N) > 1, the amount of adjustments scale exponentially. The other thing is the bit rotation to narrow candidates down further.

My understanding was that the bit rotation is because since the floor(10/n) = 1, we need to rotate each candidate chunk by 1 bit in each position to check if its 1 bit difference but any stores ones in that chunk. Or is it also/just for something else?

edwinpav · 2026-06-05T17:13:40Z

also @vinay553 if u follow this and run poetry run pre-commit install it'll run all lint checks for you locally on each commit so you can fix them before ci/cd fails on it

vinay553 · 2026-06-05T19:49:39Z

@edwinpav ah I wouldn't call that rotating, that's just applying a mask. rotation is when you rotate the bits like 1234 -> 2341. The reason is that any potential duplicate candidate would need to pass the chunk test both times before we do the full 64-bit check, reducing then number of candidates.

edwinpav · 2026-06-05T19:51:06Z

@edwinpav ah I wouldn't call that rotating, that's just applying a mask. rotation is when you rotate the bits like 1234 -> 2341. The reason is that any potential duplicate candidate would need to pass the chunk test both times before we do the full 64-bit check, reducing then number of candidates.

I see, why is the rotation safe? ie. we guaranteed that it wont be a duplicate if they dont match?

vinay553 · 2026-06-08T02:19:23Z

@edwinpav ah I wouldn't call that rotating, that's just applying a mask. rotation is when you rotate the bits like 1234 -> 2341. The reason is that any potential duplicate candidate would need to pass the chunk test both times before we do the full 64-bit check, reducing then number of candidates.

I see, why is the rotation safe? ie. we guaranteed that it wont be a duplicate if they dont match?

We can talk abt the algo in our 1:1

Add local pHash deduplication utility

e196bda

vinay553 changed the title ~~[codex] Add local pHash deduplication utility~~ Add local pHash deduplication utility Jun 4, 2026

vinay553 marked this pull request as ready for review June 4, 2026 15:41

greptile-apps Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread nucleus/local_deduplication.py Outdated

greptile-apps Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread nucleus/local_deduplication.py Outdated

Comment thread tests/test_local_deduplication.py

Comment thread nucleus/local_deduplication.py

vinay553 added 2 commits June 4, 2026 13:07

Address local dedup review feedback

88e711a

Fix local dedup pylint issues

c3d0726

edwinpav self-requested a review June 4, 2026 19:44

edwinpav reviewed Jun 4, 2026

View reviewed changes

edwinpav approved these changes Jun 4, 2026

View reviewed changes

Address dedup review nits and type annotations

a7042b0

Clarify rotated partition purpose

4654cf3

Run pre-commit formatting

a0409c0

vinay553 merged commit c003c84 into master Jun 5, 2026
8 checks passed

vinay553 deleted the codex/local-phash-dedup branch June 5, 2026 21:27

vinay553 mentioned this pull request Jun 8, 2026

Release v0.18.4 #463

Merged

Conversation

vinay553 commented Jun 4, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Writeup: Local pHash Deduplication

Baseline Problem

Technique 1: Fail Fast and Normalize Once

Technique 2: Integer pHashes and Integer Hamming Distance

Technique 3: Threshold-Specific Fast Paths

Technique 4: Chunking With the Pigeonhole Principle

Technique 5: Two Partitions With Cheap Rotation

Technique 6: Reusable Candidate Marks Instead of Set Intersections

Real Dataset Profile

Validation

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

vinay553 commented Jun 4, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edwinpav commented Jun 4, 2026

Uh oh!

edwinpav left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinay553 commented Jun 5, 2026

Uh oh!

edwinpav commented Jun 5, 2026

Uh oh!

edwinpav commented Jun 5, 2026

Uh oh!

vinay553 commented Jun 5, 2026

Uh oh!

edwinpav commented Jun 5, 2026

Uh oh!

Uh oh!

vinay553 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vinay553 commented Jun 4, 2026 •

edited by greptile-apps Bot

Loading

edwinpav left a comment •

edited

Loading