Skip to content

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461

Merged
vinay553 merged 6 commits into
masterfrom
vinayparakala/expose-phash-on-dataset-item
May 20, 2026
Merged

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461
vinay553 merged 6 commits into
masterfrom
vinayparakala/expose-phash-on-dataset-item

Conversation

@vinay553
Copy link
Copy Markdown
Contributor

@vinay553 vinay553 commented May 18, 2026

Summary

Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.

  • Adds a new phash: Optional[str] field to the DatasetItem dataclass — 64-character "0/1" binary string when backfilled by the backend, None otherwise.
  • Threads phash=payload.get(PHASH_KEY) into DatasetItem.from_json. Because every SDK method that returns a DatasetItem goes through from_json, this single change exposes item.phash on:
    • items_and_annotation_generator
    • items_generator / dataset.items
    • query_items
    • iloc / refloc / loc
  • Bumps version 0.18.2 → 0.18.3 and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
  • Adds a top-level CLAUDE.md capturing release workflow, branch/PR conventions, and the from_json-centralization insight for future agent sessions.

Test plan

  • Local install: poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)" prints the hash.
  • DatasetItem.from_json falls back to None when the backend omits phash (existing test fixtures).
  • Smoke test against fedramp-prod once the paired backend PR is deployed: client.get_dataset(...).items_and_annotation_generator(...) yields items with item.phash populated.
  • CHANGELOG renders correctly on the release tag.

🤖 Generated with Claude Code

Greptile Summary

This PR exposes the perceptual hash (phash) of dataset items through the Python SDK by adding an Optional[str] field to DatasetItem and threading it through the single from_json deserialization entry-point. It also fixes several flaky integration tests (non-deterministic step counts, tag-removal eventual consistency, unordered dedup results) and bumps the version to 0.18.3.

  • DatasetItem.phash: Added as field(default=None, compare=False) so locally-constructed items never spuriously differ from API-returned ones; populated via payload.get(PHASH_KEY) in from_json, making it available on all SDK methods that yield DatasetItem objects without any further changes.
  • Test fixes: Four deduplication assertions switched from list equality to set equality; the async-append step-count check replaced with a relative completed == total assertion; tag-removal test updated to call get_tags() separately with a brief sleep for eventual consistency.
  • CLAUDE.md: New file documenting the repo's release workflow and architecture conventions for future agent sessions.

Confidence Score: 5/5

Safe to merge — the change is a single additive field threaded through one deserialization method, with no impact on existing serialization or equality semantics for code that doesn't use phash.

The phash field is read-only (not included in to_payload), defaults to None, and is excluded from __eq__ with a clear documented rationale. The test fixes address real flakiness rather than weakening assertions. No existing behaviour is altered.

No files require special attention.

Important Files Changed

Filename Overview
nucleus/dataset_item.py Adds phash: Optional[str] field to DatasetItem using field(default=None, compare=False) and threads it through from_json; correct and well-documented.
nucleus/constants.py Adds PHASH_KEY = "phash" constant in alphabetical order; consistent with the repo's pattern of centralising API key strings.
tests/test_dataset.py Fixes two flaky assertions: replaces pinned step-count check with a relative equality, and adds time.sleep + get_tags() call for eventual-consistency in tag removal test.
tests/test_deduplication.py Replaces ordered list equality with set equality for unique_item_ids in four tests to handle non-deterministic backend ordering.
CHANGELOG.md Adds v0.18.3 entry documenting the new phash field; follows Keep-a-Changelog format.
CLAUDE.md New documentation file capturing release workflow, architecture pointers, and the from_json-centralisation insight for future agent sessions.
pyproject.toml Bumps version from 0.18.2 to 0.18.3; correct patch-level bump for an additive, backwards-compatible change.

Sequence Diagram

sequenceDiagram
    participant User as User / ML workflow
    participant SDK as Nucleus SDK
    participant API as Nucleus Backend

    User->>SDK: dataset.items_and_annotation_generator()
    SDK->>API: GET /items (paginated)
    API-->>SDK: "JSON payload { reference_id, image_url, phash, ... }"
    SDK->>SDK: DatasetItem.from_json(payload)
    Note over SDK: phash = payload.get(PHASH_KEY)
    SDK-->>User: "DatasetItem(reference_id=..., phash="0110...")"

    User->>SDK: item.phash
    SDK-->>User: "0110..." (or None if backend omits it)
Loading

Reviews (7): Last reviewed commit: "test_dataset_tags: poll fresh get_tags()..." | Re-trigger Greptile

vinay553 and others added 3 commits May 18, 2026 18:50
Add a `phash` field to the DatasetItem dataclass and thread it through
`from_json`. Because every SDK method that returns a DatasetItem
(items_and_annotation_generator, items_generator, query_items,
dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json,
exposing the field there is sufficient — no per-method changes required.

Also adds a top-level CLAUDE.md with release/branch conventions and
architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The upload job pipeline plans with N total_steps initially, then
dynamically collapses to a single step once it knows how to
short-circuit (small input → batched upload). By the time
sleep_until_complete() returns, status() always reports total_steps=1,
completed_steps=1 — so the hard-coded expectation of 5/5 deterministically
fails on the current backend.

Drop the step-count assertions and keep the meaningful invariants:
job completed successfully, progress is 1.00, and
completed_steps == total_steps (whatever they are).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/test_dataset.py
Comment on lines +335 to 346
# Pinning specific step counts couples this test to the backend pipeline
# shape (the upload job is planned with N steps, then dynamically
# collapses to fewer once it knows how to short-circuit). Only check the
# outcomes that the SDK contract guarantees.
expected = {
"job_id": job.job_id,
"status": "Completed",
"job_progress": "1.00",
"completed_steps": 5,
"total_steps": 5,
}
assert_partial_equality(expected, status)
assert status["completed_steps"] == status["total_steps"]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix flaky test.

@vinay553
Copy link
Copy Markdown
Contributor Author

build_test should pass once api.scale.com is redeployed with https://github.com/scaleapi/scaleapi/pull/143842.

@vinay553 vinay553 requested review from a team and edwinpav May 19, 2026 20:33
@edwinpav
Copy link
Copy Markdown
Contributor

edwinpav commented May 19, 2026

there's some nucleus commercial temporal fixes that are in progress wrt to moving nucleus to its own namespace so the async dedup tests won't pass until that is completed

update: the above is now fixed

there are three other tests still failing:

  • some dedup tests. the new dedup is still deterministic but has a different ordering for some paths as it now uses sortDeduplicationRowsByPhashThenId instead of just sorting by dataset_item_id. So the expected output fro the failing tests here just needs to be updated to match the diffs correctly (its same output just different order which is why tests are failing): https://app.circleci.com/pipelines/github/scaleapi/nucleus-python-client/2945/workflows/e2c78ede-7e8d-41b0-9af8-8dbc7c8d58e6/jobs/8513/parallel-runs/5?filterBy=ALL

  • runs 1 and 4 failures seem to do with the changes of this pr and with the dataset item tags

vinay553 and others added 2 commits May 20, 2026 12:22
`test_deduplicate_*_by_ids` runs dedup over the surviving set returned
from a prior dedup and asserts the second result equals the first.
The set of survivors is well-defined, but the backend doesn't guarantee
a stable list order across runs — the "kept" list depends on the order
in which the deduplication loop visits items, and that order can differ
between the whole-dataset (cursor-paginated) and by-ids (batched-by-input)
code paths.

Asserting list equality therefore fails intermittently when the same
items come back in a different order. Switch all four call sites
(image / video-scene / video-url / by-ids-returns-job) to set comparison.
The other invariants (length, `original_count`) still hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adding `phash` as a regular dataclass field made every `item1 == item2`
comparison sensitive to whether the backend had populated the hash —
which it doesn't on every endpoint (some handlers cherry-pick columns
and exclude phash, others select all columns and include it). Tests
that constructed a DatasetItem locally and then compared it to the
backend round-trip (test_append_and_export, test_slice_dataset_item_iterator)
broke as a result.

phash is a derived value (computed from image_location), so two items
with the same source image should compare equal regardless of whether
their hashes happen to be populated. Mark the field `compare=False` so
auto-generated __eq__ ignores it, matching the natural semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vinay553 vinay553 force-pushed the vinayparakala/expose-phash-on-dataset-item branch 2 times, most recently from dce3d92 to 99f87af Compare May 20, 2026 17:51
…ve_tags response

The DELETE /tags handler refetches the tag list immediately after the
delete and returns it. In prod that refetch can hit a read replica that
hasn't yet replayed the DELETE, so the response includes the just-deleted
tag — making the test fail. A separate follow-up request always sees the
correct state (verified against api.scale.com — first poll is already
consistent at ~25ms round-trip).

Tighten the test against the post-state by polling get_tags() with a 5s
settle window, rather than trusting the remove_tags response. Same change
applied to the idempotent-remove follow-up assertion. Backend deferred —
the inconsistency is bounded and not user-impacting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vinay553 vinay553 force-pushed the vinayparakala/expose-phash-on-dataset-item branch from 99f87af to b0c1468 Compare May 20, 2026 17:53
@vinay553
Copy link
Copy Markdown
Contributor Author

@edwinpav I just made the assertions order agnostic

@vinay553 vinay553 merged commit 280f3bc into master May 20, 2026
8 checks passed
@vinay553 vinay553 deleted the vinayparakala/expose-phash-on-dataset-item branch May 20, 2026 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants