[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461
Conversation
Add a `phash` field to the DatasetItem dataclass and thread it through `from_json`. Because every SDK method that returns a DatasetItem (items_and_annotation_generator, items_generator, query_items, dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json, exposing the field there is sufficient — no per-method changes required. Also adds a top-level CLAUDE.md with release/branch conventions and architecture pointers for future Claude Code sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The upload job pipeline plans with N total_steps initially, then dynamically collapses to a single step once it knows how to short-circuit (small input → batched upload). By the time sleep_until_complete() returns, status() always reports total_steps=1, completed_steps=1 — so the hard-coded expectation of 5/5 deterministically fails on the current backend. Drop the step-count assertions and keep the meaningful invariants: job completed successfully, progress is 1.00, and completed_steps == total_steps (whatever they are). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| # Pinning specific step counts couples this test to the backend pipeline | ||
| # shape (the upload job is planned with N steps, then dynamically | ||
| # collapses to fewer once it knows how to short-circuit). Only check the | ||
| # outcomes that the SDK contract guarantees. | ||
| expected = { | ||
| "job_id": job.job_id, | ||
| "status": "Completed", | ||
| "job_progress": "1.00", | ||
| "completed_steps": 5, | ||
| "total_steps": 5, | ||
| } | ||
| assert_partial_equality(expected, status) | ||
| assert status["completed_steps"] == status["total_steps"] | ||
|
|
|
build_test should pass once api.scale.com is redeployed with https://github.com/scaleapi/scaleapi/pull/143842. |
|
there's some nucleus commercial temporal fixes that are in progress wrt to moving nucleus to its own namespace so the async dedup tests won't pass until that is completed update: the above is now fixed there are three other tests still failing:
|
`test_deduplicate_*_by_ids` runs dedup over the surviving set returned from a prior dedup and asserts the second result equals the first. The set of survivors is well-defined, but the backend doesn't guarantee a stable list order across runs — the "kept" list depends on the order in which the deduplication loop visits items, and that order can differ between the whole-dataset (cursor-paginated) and by-ids (batched-by-input) code paths. Asserting list equality therefore fails intermittently when the same items come back in a different order. Switch all four call sites (image / video-scene / video-url / by-ids-returns-job) to set comparison. The other invariants (length, `original_count`) still hold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adding `phash` as a regular dataclass field made every `item1 == item2` comparison sensitive to whether the backend had populated the hash — which it doesn't on every endpoint (some handlers cherry-pick columns and exclude phash, others select all columns and include it). Tests that constructed a DatasetItem locally and then compared it to the backend round-trip (test_append_and_export, test_slice_dataset_item_iterator) broke as a result. phash is a derived value (computed from image_location), so two items with the same source image should compare equal regardless of whether their hashes happen to be populated. Mark the field `compare=False` so auto-generated __eq__ ignores it, matching the natural semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dce3d92 to
99f87af
Compare
…ve_tags response The DELETE /tags handler refetches the tag list immediately after the delete and returns it. In prod that refetch can hit a read replica that hasn't yet replayed the DELETE, so the response includes the just-deleted tag — making the test fail. A separate follow-up request always sees the correct state (verified against api.scale.com — first poll is already consistent at ~25ms round-trip). Tighten the test against the post-state by polling get_tags() with a 5s settle window, rather than trusting the remove_tags response. Same change applied to the idempotent-remove follow-up assertion. Backend deferred — the inconsistency is bounded and not user-impacting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
99f87af to
b0c1468
Compare
|
@edwinpav I just made the assertions order agnostic |
Summary
Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.
phash: Optional[str]field to theDatasetItemdataclass — 64-character "0/1" binary string when backfilled by the backend,Noneotherwise.phash=payload.get(PHASH_KEY)intoDatasetItem.from_json. Because every SDK method that returns aDatasetItemgoes throughfrom_json, this single change exposesitem.phashon:items_and_annotation_generatoritems_generator/dataset.itemsquery_itemsiloc/refloc/loc0.18.2 → 0.18.3and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.CLAUDE.mdcapturing release workflow, branch/PR conventions, and thefrom_json-centralization insight for future agent sessions.Test plan
poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)"prints the hash.DatasetItem.from_jsonfalls back toNonewhen the backend omitsphash(existing test fixtures).client.get_dataset(...).items_and_annotation_generator(...)yields items withitem.phashpopulated.🤖 Generated with Claude Code
Greptile Summary
This PR exposes the perceptual hash (
phash) of dataset items through the Python SDK by adding anOptional[str]field toDatasetItemand threading it through the singlefrom_jsondeserialization entry-point. It also fixes several flaky integration tests (non-deterministic step counts, tag-removal eventual consistency, unordered dedup results) and bumps the version to 0.18.3.DatasetItem.phash: Added asfield(default=None, compare=False)so locally-constructed items never spuriously differ from API-returned ones; populated viapayload.get(PHASH_KEY)infrom_json, making it available on all SDK methods that yieldDatasetItemobjects without any further changes.completed == totalassertion; tag-removal test updated to callget_tags()separately with a brief sleep for eventual consistency.CLAUDE.md: New file documenting the repo's release workflow and architecture conventions for future agent sessions.Confidence Score: 5/5
Safe to merge — the change is a single additive field threaded through one deserialization method, with no impact on existing serialization or equality semantics for code that doesn't use
phash.The
phashfield is read-only (not included into_payload), defaults toNone, and is excluded from__eq__with a clear documented rationale. The test fixes address real flakiness rather than weakening assertions. No existing behaviour is altered.No files require special attention.
Important Files Changed
phash: Optional[str]field toDatasetItemusingfield(default=None, compare=False)and threads it throughfrom_json; correct and well-documented.PHASH_KEY = "phash"constant in alphabetical order; consistent with the repo's pattern of centralising API key strings.time.sleep+get_tags()call for eventual-consistency in tag removal test.unique_item_idsin four tests to handle non-deterministic backend ordering.phashfield; follows Keep-a-Changelog format.from_json-centralisation insight for future agent sessions.Sequence Diagram
sequenceDiagram participant User as User / ML workflow participant SDK as Nucleus SDK participant API as Nucleus Backend User->>SDK: dataset.items_and_annotation_generator() SDK->>API: GET /items (paginated) API-->>SDK: "JSON payload { reference_id, image_url, phash, ... }" SDK->>SDK: DatasetItem.from_json(payload) Note over SDK: phash = payload.get(PHASH_KEY) SDK-->>User: "DatasetItem(reference_id=..., phash="0110...")" User->>SDK: item.phash SDK-->>User: "0110..." (or None if backend omits it)Reviews (7): Last reviewed commit: "test_dataset_tags: poll fresh get_tags()..." | Re-trigger Greptile