[DE-7859] Expose pHash on DatasetItem (v0.18.3) #461
Open
vinay553 wants to merge 1 commit into
Add a `phash` field to the DatasetItem dataclass and thread it through `from_json`. Because every SDK method that returns a DatasetItem (items_and_annotation_generator, items_generator, query_items, dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json, exposing the field there is sufficient; no per-method changes are required.

Also adds a top-level CLAUDE.md with release/branch conventions and architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
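The centralization argument in the commit message can be illustrated with a minimal sketch. `ItemSketch` below is a hypothetical stand-in, not the SDK's `DatasetItem`; only the `phash` field, the `PHASH_KEY` constant, and the `payload.get(PHASH_KEY)` call are taken from this PR, and the other fields are assumed for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

PHASH_KEY = "phash"  # payload key added by this PR


@dataclass
class ItemSketch:
    """Hypothetical stand-in for DatasetItem (not the real SDK class)."""

    reference_id: str
    image_url: Optional[str] = None
    phash: Optional[str] = None  # backend-computed; None when not backfilled

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "ItemSketch":
        # dict.get returns None when the key is absent, so payloads
        # without a hash still deserialize.
        return cls(
            reference_id=payload["reference_id"],
            image_url=payload.get("image_url"),
            phash=payload.get(PHASH_KEY),
        )


with_hash = ItemSketch.from_json(
    {"reference_id": "r", "image_url": "x.jpg", "phash": "1" * 64}
)
without_hash = ItemSketch.from_json({"reference_id": "r2", "image_url": "y.jpg"})
```

Any caller that builds items through the single `from_json` entry point picks up the new attribute without further changes, which is the property the commit relies on.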
Summary
Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.
- Adds a `phash: Optional[str]` field to the `DatasetItem` dataclass: a 64-character "0/1" binary string when backfilled by the backend, `None` otherwise.
- Threads `phash=payload.get(PHASH_KEY)` into `DatasetItem.from_json`. Because every SDK method that returns a `DatasetItem` goes through `from_json`, this single change exposes `item.phash` on:
  - `items_and_annotation_generator`
  - `items_generator` / `dataset.items`
  - `query_items`
  - `iloc` / `refloc` / `loc`
- Bumps the version `0.18.2 → 0.18.3` and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
- Adds `CLAUDE.md` capturing release workflow, branch/PR conventions, and the `from_json`-centralization insight for future agent sessions.

The backend changes that actually populate `phash` on the responses ship in a paired scaleapi PR (DE-7859 / [scaleapi link TBD]). Endpoints that already `select *` from `dataset_item` (`itemsPage`, `datasetItems`) get phash for free; the export/scene paths get an explicit column add.

Test plan

- `poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)"` prints the hash.
- `DatasetItem.from_json` falls back to `None` when the backend omits `phash` (existing test fixtures).
- `client.get_dataset(...).items_and_annotation_generator(...)` yields items with `item.phash` populated.

🤖 Generated with Claude Code
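For the dedup and near-duplicate use case named in the summary, one common way to compare two perceptual hashes is the Hamming distance between their bit strings. The sketch below assumes the 64-character "0/1" format described above; the helper names and the `max_distance=4` threshold are illustrative choices, not part of the SDK or this PR.

```python
from typing import List, Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("pHash strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def near_duplicate_pairs(
    hashes: List[Tuple[str, Optional[str]]],
    max_distance: int = 4,  # assumed threshold; tune per dataset
) -> List[Tuple[str, str, int]]:
    """Return (ref_a, ref_b, distance) for pairs within max_distance.

    Items whose hash is None (not yet backfilled) are skipped.
    The pairwise loop is quadratic, so this suits small batches only.
    """
    known = [(ref, h) for ref, h in hashes if h is not None]
    pairs = []
    for i, (ref_a, hash_a) in enumerate(known):
        for ref_b, hash_b in known[i + 1:]:
            distance = hamming_distance(hash_a, hash_b)
            if distance <= max_distance:
                pairs.append((ref_a, ref_b, distance))
    return pairs


base = "01" * 32             # 64-bit example hash
almost = "11" + base[2:]     # differs from base in one bit
different = "10" * 32        # differs from base in every bit
pairs = near_duplicate_pairs(
    [("a", base), ("b", almost), ("c", different), ("d", None)]
)
```

In practice the `(reference_id, phash)` tuples would come from items yielded by the SDK, for example `(item.reference_id, item.phash)`.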
Greptile Summary
This PR exposes the backend-computed perceptual hash (`phash`) on `DatasetItem` by adding a single `Optional[str]` field and wiring it through the existing `from_json` deserialization entry point, which all item-returning SDK methods already use. The version is bumped to `0.18.3` with a matching changelog entry.

- `nucleus/dataset_item.py`: Adds `phash: Optional[str] = None` to the dataclass and `phash=payload.get(PHASH_KEY)` to `from_json`; `to_payload` correctly omits it since the field is backend-computed and read-only.
- `nucleus/constants.py`: Adds `PHASH_KEY = "phash"` in alphabetical order, consistent with the project's convention of centralizing all payload key strings as constants.
- `CLAUDE.md`: New agent-facing documentation capturing release workflow, architecture pointers, and the `from_json`-centralization pattern.

Confidence Score: 4/5

Safe to merge: the change is a pure field addition on a read-only deserialization path with no impact on upload or existing behavior.

The implementation is minimal and correct: `from_json` uses `.get()` so missing `phash` keys default to `None`, `to_payload` is untouched so uploads are unaffected, and the constant follows the established naming convention. The only gap is that the class docstring doesn't document the new field, leaving users to discover it from source or the changelog.

`nucleus/dataset_item.py`: the class docstring should be updated to document `phash`, its format, and its read-only nature.

Important Files Changed

| File | Overview |
| --- | --- |
| `nucleus/dataset_item.py` | Adds `phash: Optional[str]` field to `DatasetItem` dataclass and populates it via `from_json`; `to_payload` intentionally omits it since it is backend-computed. Class docstring is not updated to document the new field. |
| `nucleus/constants.py` | Adds `PHASH_KEY = "phash"` constant in alphabetical order; straightforward and correct. |
| Changelog | Entry for the new `phash` field. |
| `CLAUDE.md` | Agent-facing documentation, including the `from_json`-centralization pattern; no code impact. |
| Version metadata | Bump from `0.18.2` to `0.18.3`; correct patch bump for a backward-compatible field addition. |

Sequence Diagram
    sequenceDiagram
        participant U as User / ML workflow
        participant SDK as Nucleus SDK
        participant API as Nucleus Backend
        U->>SDK: dataset.items / items_and_annotation_generator() / query_items() / iloc()
        SDK->>API: "GET /v1/nucleus/dataset/{id}/items (or similar)"
        API-->>SDK: JSON payload (includes "phash" when backfilled)
        SDK->>SDK: DatasetItem.from_json(payload)
        Note over SDK: phash = payload.get("phash")<br/>→ 64-char "0/1" string or None
        SDK-->>U: "DatasetItem(phash="010110…" | None)"
        U->>U: item.phash ← dedup / near-dup detection
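The review's point that `to_payload` leaves `phash` out, keeping the field read-only from the client's side, can be sketched as follows. `UploadableItemSketch` and its payload shape are hypothetical; the real `DatasetItem.to_payload` is not shown in this PR and may serialize different keys.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class UploadableItemSketch:
    """Hypothetical stand-in; the real to_payload may differ."""

    reference_id: str
    image_url: str
    phash: Optional[str] = None  # populated only when reading from the backend

    def to_payload(self) -> Dict[str, Any]:
        # phash is deliberately not serialized: the backend computes it,
        # so a client-supplied value would have no meaning on upload.
        return {
            "reference_id": self.reference_id,
            "image_url": self.image_url,
        }


item = UploadableItemSketch("r", "x.jpg", phash="0" * 64)
payload = item.to_payload()
```

Because the upload path never mentions the field, an item fetched with a hash and re-uploaded sends the same payload as one without it.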
Reviews (1): Last reviewed commit: "[DE-7859] Expose pHash on DatasetItem (v..."