[DE-7859] Expose pHash on DatasetItem (v0.18.3) #461

Open
vinay553 wants to merge 1 commit into master from vinayparakala/expose-phash-on-dataset-item

@vinay553 vinay553 commented May 18, 2026

Summary

Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.

  • Adds a new phash: Optional[str] field to the DatasetItem dataclass — 64-character "0/1" binary string when backfilled by the backend, None otherwise.
  • Threads phash=payload.get(PHASH_KEY) into DatasetItem.from_json. Because every SDK method that returns a DatasetItem goes through from_json, this single change exposes item.phash on:
    • items_and_annotation_generator
    • items_generator / dataset.items
    • query_items
    • iloc / refloc / loc
  • Bumps version 0.18.2 → 0.18.3 and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
  • Adds a top-level CLAUDE.md capturing release workflow, branch/PR conventions, and the from_json-centralization insight for future agent sessions.

The backend changes that actually populate phash on the responses ship in a paired scaleapi PR (DE-7859 / [scaleapi link TBD]). Endpoints that already select * from dataset_item (itemsPage, datasetItems) get phash for free; the export/scene paths get an explicit column add.
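For the near-duplicate use case named in the summary, two 64-character "0/1" strings can be compared by Hamming distance (the number of differing bit positions). The following is a minimal sketch, not part of this PR: the helper names and the `max_distance` threshold are hypothetical, and the pairwise scan is quadratic, so it only suits small item sets.

```python
from itertools import combinations
from typing import Iterable, List, Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("pHash strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_duplicate_pairs(
    hashes: Iterable[Tuple[str, Optional[str]]],
    max_distance: int = 4,  # hypothetical threshold; tune per dataset
) -> List[Tuple[str, str, int]]:
    """Return (ref_a, ref_b, distance) for every pair within max_distance.

    Entries whose hash is None (not yet backfilled) are skipped.
    """
    known = [(ref, h) for ref, h in hashes if h is not None]
    pairs = []
    for (ref_a, hash_a), (ref_b, hash_b) in combinations(known, 2):
        distance = hamming_distance(hash_a, hash_b)
        if distance <= max_distance:
            pairs.append((ref_a, ref_b, distance))
    return pairs
```

Assuming `dataset.items` yields `DatasetItem` objects as listed above, the input could be built as `((item.reference_id, item.phash) for item in dataset.items)`.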

Test plan

  • Local install: poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)" prints the hash.
  • DatasetItem.from_json falls back to None when the backend omits phash (existing test fixtures).
  • Smoke test against fedramp-prod once the paired backend PR is deployed: client.get_dataset(...).items_and_annotation_generator(...) yields items with item.phash populated.
  • CHANGELOG renders correctly on the release tag.
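For the smoke test, one way to check that items come back with `phash` populated is to tally how many values are missing, well-formed, or malformed. This is a sketch under the format stated in the summary (64 characters of "0" or "1"); `is_valid_phash` and `phash_coverage` are hypothetical helpers, not SDK functions, and they accept any objects that expose a `phash` attribute.

```python
from collections import Counter
from typing import Any, Iterable

PHASH_LENGTH = 64  # per the PR description: 64-character "0/1" string


def is_valid_phash(value: Any) -> bool:
    """True if value is a 64-character string made only of '0' and '1'."""
    return (
        isinstance(value, str)
        and len(value) == PHASH_LENGTH
        and set(value) <= {"0", "1"}
    )


def phash_coverage(items: Iterable[Any]) -> Counter:
    """Tally items by phash status: 'valid', 'missing' (None), or 'malformed'."""
    counts: Counter = Counter()
    for item in items:
        phash = getattr(item, "phash", None)
        if phash is None:
            counts["missing"] += 1
        elif is_valid_phash(phash):
            counts["valid"] += 1
        else:
            counts["malformed"] += 1
    return counts
```

A nonzero `malformed` count would suggest the backend and the SDK disagree on the format; a high `missing` count is expected until the backfill completes.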

🤖 Generated with Claude Code

Greptile Summary

This PR exposes the backend-computed perceptual hash (phash) on DatasetItem by adding a single Optional[str] field and wiring it through the existing from_json deserialization entry point, which all item-returning SDK methods already use. The version is bumped to 0.18.3 with a matching changelog entry.

  • nucleus/dataset_item.py: Adds phash: Optional[str] = None to the dataclass and phash=payload.get(PHASH_KEY) to from_json; to_payload correctly omits it since the field is backend-computed and read-only.
  • nucleus/constants.py: Adds PHASH_KEY = "phash" in alphabetical order, consistent with the project's convention of centralizing all payload key strings as constants.
  • CLAUDE.md: New agent-facing documentation capturing release workflow, architecture pointers, and the from_json-centralization pattern.
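The pattern described above can be illustrated with a simplified stand-in. `ItemSketch` is hypothetical and is not the real `DatasetItem`, which has more fields and validation; the payload key names other than `PHASH_KEY` are taken from the one-liner in the test plan. The sketch only shows the two behaviors the review calls out: `from_json` reads the key with `.get()`, and `to_payload` leaves the field out.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

PHASH_KEY = "phash"
# Assumed key names, taken from the test-plan one-liner in the PR description.
REFERENCE_ID_KEY = "reference_id"
IMAGE_URL_KEY = "image_url"


@dataclass
class ItemSketch:
    """Hypothetical, stripped-down stand-in for DatasetItem."""

    reference_id: Optional[str] = None
    image_location: Optional[str] = None
    phash: Optional[str] = None  # backend-computed; never uploaded

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "ItemSketch":
        # .get() returns None when the backend omits the key,
        # so responses without phash still deserialize.
        return cls(
            reference_id=payload.get(REFERENCE_ID_KEY),
            image_location=payload.get(IMAGE_URL_KEY),
            phash=payload.get(PHASH_KEY),
        )

    def to_payload(self) -> Dict[str, Any]:
        # phash is intentionally left out of the upload payload.
        return {
            REFERENCE_ID_KEY: self.reference_id,
            IMAGE_URL_KEY: self.image_location,
        }
```

Because deserialization is the only place the field is set, any method that builds items through `from_json` picks it up without further changes.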

Confidence Score: 4/5

Safe to merge — the change is a pure field addition on a read-only deserialization path with no impact on upload or existing behavior.

The implementation is minimal and correct: from_json uses .get() so missing phash keys default to None, to_payload is untouched so uploads are unaffected, and the constant follows the established naming convention. The only gap is that the class docstring doesn't document the new field, leaving users to discover it from source or the changelog.

nucleus/dataset_item.py — the class docstring should be updated to document phash, its format, and its read-only nature.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nucleus/dataset_item.py | Adds phash: Optional[str] field to DatasetItem dataclass and populates it via from_json; to_payload intentionally omits it since it is backend-computed. Class docstring is not updated to document the new field. |
| nucleus/constants.py | Adds PHASH_KEY = "phash" constant in alphabetical order; straightforward and correct. |
| CHANGELOG.md | Adds v0.18.3 changelog entry following Keep-a-Changelog convention; entry accurately describes the exposed phash field. |
| CLAUDE.md | New agent-facing documentation file capturing release workflow, architecture pointers, and the from_json-centralization pattern; no code impact. |
| pyproject.toml | Bumps version from 0.18.2 to 0.18.3; correct patch bump for a backward-compatible field addition. |

Sequence Diagram

sequenceDiagram
    participant U as User / ML workflow
    participant SDK as Nucleus SDK
    participant API as Nucleus Backend

    U->>SDK: dataset.items / items_and_annotation_generator() / query_items() / iloc()
    SDK->>API: "GET /v1/nucleus/dataset/{id}/items (or similar)"
    API-->>SDK: JSON payload (includes "phash" when backfilled)
    SDK->>SDK: DatasetItem.from_json(payload)
    Note over SDK: phash = payload.get("phash")<br/>→ 64-char "0/1" string or None
    SDK-->>U: "DatasetItem(phash="010110…" | None)"
    U->>U: item.phash  ← dedup / near-dup detection

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nucleus/dataset_item.py:110-118
The `DatasetItem` class docstring documents `image_location`, `pointcloud_location`, `reference_id`, and `metadata` as constructor args, but the new `phash` field is not mentioned. Users reading the docstring won't know the field exists, its format (64-char "0/1" string), or that it is read-only (backend-populated, not sent in `to_payload`).

```suggestion
          Context Attachments may be provided to display the attachments side by side with the dataset
          item in the Detail View by specifying
          `{ "context_attachments": [ { "attachment": 'https://example.com/1' }, { "attachment": 'https://example.com/2' }, ... ] }`.

          .. todo ::
              Shorten this once we have a guide migrated for metadata, or maybe link
              from other places to here.

        phash (Optional[str]): Read-only. The perceptual hash of the item's
          image as a 64-character "0/1" binary string, populated by the Nucleus
          backend for items that have been pHash-backfilled. ``None`` for
          pointcloud items or items that have not yet been backfilled. This
          field is not sent to the API on upload (``to_payload`` omits it).

    """
```

Reviews (1): Last reviewed commit: "[DE-7859] Expose pHash on DatasetItem (v..."

Add a `phash` field to the DatasetItem dataclass and thread it through
`from_json`. Because every SDK method that returns a DatasetItem
(items_and_annotation_generator, items_generator, query_items,
dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json,
exposing the field there is sufficient — no per-method changes required.

Also adds a top-level CLAUDE.md with release/branch conventions and
architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>