[DE-7859] Expose pHash on DatasetItem (v0.18.3) by vinay553 · Pull Request #461 · scaleapi/nucleus-python-client

vinay553 · 2026-05-18T22:51:16Z

Summary

Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.

Adds a new phash: Optional[str] field to the DatasetItem dataclass — 64-character "0/1" binary string when backfilled by the backend, None otherwise.
Threads phash=payload.get(PHASH_KEY) into DatasetItem.from_json. Because every SDK method that returns a DatasetItem goes through from_json, this single change exposes item.phash on:
- items_and_annotation_generator
- items_generator / dataset.items
- query_items
- iloc / refloc / loc
Bumps version 0.18.2 → 0.18.3 and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
Adds a top-level CLAUDE.md capturing release workflow, branch/PR conventions, and the from_json-centralization insight for future agent sessions.

Test plan

Local install: poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)" prints the hash.
DatasetItem.from_json falls back to None when the backend omits phash (existing test fixtures).
Smoke test against fedramp-prod once the paired backend PR is deployed: client.get_dataset(...).items_and_annotation_generator(...) yields items with item.phash populated.
CHANGELOG renders correctly on the release tag.

🤖 Generated with Claude Code

Greptile Summary

This PR exposes the perceptual hash (phash) of dataset items through the Python SDK by adding an Optional[str] field to DatasetItem and threading it through the single from_json deserialization entry-point. It also fixes several flaky integration tests (non-deterministic step counts, tag-removal eventual consistency, unordered dedup results) and bumps the version to 0.18.3.

DatasetItem.phash: Added as field(default=None, compare=False) so locally-constructed items never spuriously differ from API-returned ones; populated via payload.get(PHASH_KEY) in from_json, making it available on all SDK methods that yield DatasetItem objects without any further changes.
Test fixes: Four deduplication assertions switched from list equality to set equality; the async-append step-count check replaced with a relative completed == total assertion; tag-removal test updated to call get_tags() separately with a brief sleep for eventual consistency.
CLAUDE.md: New file documenting the repo's release workflow and architecture conventions for future agent sessions.

Confidence Score: 5/5

Safe to merge — the change is a single additive field threaded through one deserialization method, with no impact on existing serialization or equality semantics for code that doesn't use phash.

The phash field is read-only (not included in to_payload), defaults to None, and is excluded from __eq__ with a clear documented rationale. The test fixes address real flakiness rather than weakening assertions. No existing behaviour is altered.

No files require special attention.

Important Files Changed

Filename	Overview
nucleus/dataset_item.py	Adds `phash: Optional[str]` field to `DatasetItem` using `field(default=None, compare=False)` and threads it through `from_json`; correct and well-documented.
nucleus/constants.py	Adds `PHASH_KEY = "phash"` constant in alphabetical order; consistent with the repo's pattern of centralising API key strings.
tests/test_dataset.py	Fixes two flaky assertions: replaces pinned step-count check with a relative equality, and adds `time.sleep` + `get_tags()` call for eventual-consistency in tag removal test.
tests/test_deduplication.py	Replaces ordered list equality with set equality for `unique_item_ids` in four tests to handle non-deterministic backend ordering.
CHANGELOG.md	Adds v0.18.3 entry documenting the new `phash` field; follows Keep-a-Changelog format.
CLAUDE.md	New documentation file capturing release workflow, architecture pointers, and the `from_json`-centralisation insight for future agent sessions.
pyproject.toml	Bumps version from 0.18.2 to 0.18.3; correct patch-level bump for an additive, backwards-compatible change.

Sequence Diagram

sequenceDiagram
    participant User as User / ML workflow
    participant SDK as Nucleus SDK
    participant API as Nucleus Backend

    User->>SDK: dataset.items_and_annotation_generator()
    SDK->>API: GET /items (paginated)
    API-->>SDK: "JSON payload { reference_id, image_url, phash, ... }"
    SDK->>SDK: DatasetItem.from_json(payload)
    Note over SDK: phash = payload.get(PHASH_KEY)
    SDK-->>User: "DatasetItem(reference_id=..., phash="0110...")"

    User->>SDK: item.phash
    SDK-->>User: "0110..." (or None if backend omits it)

_{Reviews (7): Last reviewed commit: "test_dataset_tags: poll fresh get_tags()..." | Re-trigger Greptile}

Add a `phash` field to the DatasetItem dataclass and thread it through `from_json`. Because every SDK method that returns a DatasetItem (items_and_annotation_generator, items_generator, query_items, dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json, exposing the field there is sufficient — no per-method changes required. Also adds a top-level CLAUDE.md with release/branch conventions and architecture pointers for future Claude Code sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The upload job pipeline plans with N total_steps initially, then dynamically collapses to a single step once it knows how to short-circuit (small input → batched upload). By the time sleep_until_complete() returns, status() always reports total_steps=1, completed_steps=1 — so the hard-coded expectation of 5/5 deterministically fails on the current backend. Drop the step-count assertions and keep the meaningful invariants: job completed successfully, progress is 1.00, and completed_steps == total_steps (whatever they are). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vinay553 · 2026-05-19T20:31:27Z

+    # Pinning specific step counts couples this test to the backend pipeline
+    # shape (the upload job is planned with N steps, then dynamically
+    # collapses to fewer once it knows how to short-circuit). Only check the
+    # outcomes that the SDK contract guarantees.
    expected = {
        "job_id": job.job_id,
        "status": "Completed",
        "job_progress": "1.00",
-        "completed_steps": 5,
-        "total_steps": 5,
    }
    assert_partial_equality(expected, status)
+    assert status["completed_steps"] == status["total_steps"]



Fix flaky test.

vinay553 · 2026-05-19T20:32:17Z

build_test should pass once api.scale.com is redeployed with https://github.com/scaleapi/scaleapi/pull/143842.

edwinpav · 2026-05-19T23:54:12Z

there's some nucleus commercial temporal fixes that are in progress wrt to moving nucleus to its own namespace so the async dedup tests won't pass until that is completed

update: the above is now fixed

there are three other tests still failing:

some dedup tests. the new dedup is still deterministic but has a different ordering for some paths as it now uses sortDeduplicationRowsByPhashThenId instead of just sorting by dataset_item_id. So the expected output fro the failing tests here just needs to be updated to match the diffs correctly (its same output just different order which is why tests are failing): https://app.circleci.com/pipelines/github/scaleapi/nucleus-python-client/2945/workflows/e2c78ede-7e8d-41b0-9af8-8dbc7c8d58e6/jobs/8513/parallel-runs/5?filterBy=ALL
runs 1 and 4 failures seem to do with the changes of this pr and with the dataset item tags

`test_deduplicate_*_by_ids` runs dedup over the surviving set returned from a prior dedup and asserts the second result equals the first. The set of survivors is well-defined, but the backend doesn't guarantee a stable list order across runs — the "kept" list depends on the order in which the deduplication loop visits items, and that order can differ between the whole-dataset (cursor-paginated) and by-ids (batched-by-input) code paths. Asserting list equality therefore fails intermittently when the same items come back in a different order. Switch all four call sites (image / video-scene / video-url / by-ids-returns-job) to set comparison. The other invariants (length, `original_count`) still hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adding `phash` as a regular dataclass field made every `item1 == item2` comparison sensitive to whether the backend had populated the hash — which it doesn't on every endpoint (some handlers cherry-pick columns and exclude phash, others select all columns and include it). Tests that constructed a DatasetItem locally and then compared it to the backend round-trip (test_append_and_export, test_slice_dataset_item_iterator) broke as a result. phash is a derived value (computed from image_location), so two items with the same source image should compare equal regardless of whether their hashes happen to be populated. Mark the field `compare=False` so auto-generated __eq__ ignores it, matching the natural semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ve_tags response The DELETE /tags handler refetches the tag list immediately after the delete and returns it. In prod that refetch can hit a read replica that hasn't yet replayed the DELETE, so the response includes the just-deleted tag — making the test fail. A separate follow-up request always sees the correct state (verified against api.scale.com — first poll is already consistent at ~25ms round-trip). Tighten the test against the post-state by polling get_tags() with a 5s settle window, rather than trusting the remove_tags response. Same change applied to the idempotent-remove follow-up assertion. Backend deferred — the inconsistency is bounded and not user-impacting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vinay553 · 2026-05-20T18:01:57Z

@edwinpav I just made the assertions order agnostic

vinay553 and others added 3 commits May 18, 2026 18:50

Tighten phash field comment

1c0a6ab

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vinay553 commented May 19, 2026

View reviewed changes

vinay553 requested review from a team and edwinpav May 19, 2026 20:33

edwinpav approved these changes May 19, 2026

View reviewed changes

vinay553 and others added 2 commits May 20, 2026 12:22

vinay553 force-pushed the vinayparakala/expose-phash-on-dataset-item branch 2 times, most recently from dce3d92 to 99f87af Compare May 20, 2026 17:51

vinay553 force-pushed the vinayparakala/expose-phash-on-dataset-item branch from 99f87af to b0c1468 Compare May 20, 2026 17:53

vinay553 merged commit 280f3bc into master May 20, 2026
8 checks passed

vinay553 deleted the vinayparakala/expose-phash-on-dataset-item branch May 20, 2026 18:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461
vinay553 merged 6 commits into
masterfrom
vinayparakala/expose-phash-on-dataset-item

vinay553 commented May 18, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

vinay553 May 19, 2026

Uh oh!

vinay553 commented May 19, 2026

Uh oh!

edwinpav commented May 19, 2026 •

edited

Loading

Uh oh!

vinay553 commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vinay553 commented May 18, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

vinay553 May 19, 2026

Choose a reason for hiding this comment

Uh oh!

vinay553 commented May 19, 2026

Uh oh!

edwinpav commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vinay553 commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vinay553 commented May 18, 2026 •

edited by greptile-apps Bot

Loading

edwinpav commented May 19, 2026 •

edited

Loading