diff --git a/CHANGELOG.md b/CHANGELOG.md
index 019af44e..d427a495 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,11 @@ All notable changes to the [Nucleus Python Client](https://github.com/scaleapi/n
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.18.3](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.3) - 2026-05-18
+
+### Added
+- `DatasetItem.phash` field exposing the 64-character "0/1" perceptual-hash string when populated by the Nucleus backend. Available on every SDK method that yields a `DatasetItem` (e.g. `items_and_annotation_generator`, `items_generator`, `query_items`, `dataset.items`, `iloc`/`refloc`/`loc`).
+
 ## [0.18.2](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.2) - 2026-05-08
 
 ### Added
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..472f256f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,52 @@
+# CLAUDE.md
+
+Notes for Claude Code when working in this repo (the Nucleus Python SDK).
+
+## What this repo is
+
+The official Python client for Nucleus. Wraps the `/v1/nucleus` REST endpoints on `scaleapi`. Distributed on PyPI as `scale-nucleus`.
+
+- Sources live under `nucleus/`.
+- Backend lives in the `scaleapi` repo at `server/src/routes/v1/select.ts` and `server/src/lib/select/api/`.
+- The default API base URL is `NUCLEUS_ENDPOINT = "https://api.scale.com/v1/nucleus"` (`nucleus/constants.py`). Override via the `endpoint=` kwarg or `NUCLEUS_ENDPOINT` env var (e.g. point at fedramp).
+
+## Release workflow
+
+Releases are version-numbered with [Semantic Versioning](https://semver.org/) and tracked in `CHANGELOG.md` using the [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
+
+When making a user-facing change, the convention (see PRs #459, #455) is:
+
+1. Bump `version = "..."` in `pyproject.toml` under `[tool.poetry]`. This is the single version source; there is no `__version__` in `nucleus/__init__.py`.
+   - Patch bump for additive, backwards-compatible changes (new fields, new methods).
+   - Minor bump for new features that change behaviour or remove deprecated paths.
+   - Major bump for breaking changes (Python version drops, sentinel removal, etc.).
+2. Prepend a `## [X.Y.Z](https://github.com/scaleapi/nucleus-python-client/releases/tag/vX.Y.Z) - YYYY-MM-DD` section to `CHANGELOG.md` with `### Added` / `### Changed` / `### Fixed` / `### Removed` subsections as appropriate.
+3. Commit the version bump + CHANGELOG entry alongside the code change in the same PR.
+
+Pure refactors / doc-only PRs (#456) sometimes skip the version bump. When in doubt, bump.
+
+## Branch and PR conventions
+
+- Branch naming: `/` (e.g. `vinayparakala/expose-phash-on-dataset-item`).
+- PR title commonly starts with the Linear ticket: `[DE-XXXX] ` (see `git log --oneline -20`).
+- PRs land via squash merge.
+
+## Architecture pointers
+
+- `nucleus/__init__.py`: `NucleusClient`, top-level operations.
+- `nucleus/dataset.py`: `Dataset` class. Most user-facing methods live here (item upload/fetch, generators, queries, slices, autotags, exports). Generators page through the backend via `nucleus/utils.py:paginate_generator`.
+- `nucleus/dataset_item.py`: `DatasetItem` dataclass. **`DatasetItem.from_json` is the single deserialization entry point** for items coming back from the API: every SDK method that returns a `DatasetItem` (generators, queries, `iloc`/`refloc`/`loc`, the `items` property) routes through it. To expose a new server-side field on items, add it to the dataclass + `from_json` and you're done on the SDK side.
+- `nucleus/utils.py`: `convert_export_payload` and `format_dataset_item_response` are the shared shapers used by the export and single-item endpoints. They wrap raw JSON into typed objects via the respective `from_json` classmethods.
+- `nucleus/constants.py`: All API payload keys are constants here. When adding a new field, add a `*_KEY` constant first and reference it from `from_json` / `to_payload` rather than inlining the string.
+- `nucleus/annotation.py`, `nucleus/prediction.py`: Annotation and prediction types. Each has its own `from_json` / `to_payload`.
+
+## Testing
+
+Run the suite from the repo root:
+
+```bash
+poetry install
+poetry run pytest tests
+```
+
+Many tests require a real `NUCLEUS_API_KEY` and hit the live API; use `pytest -k ` to scope. Pre-commit hooks (`.pre-commit-config.yaml`) run black, ruff, isort.
diff --git a/nucleus/constants.py b/nucleus/constants.py
index ebad94f5..b2503473 100644
--- a/nucleus/constants.py
+++ b/nucleus/constants.py
@@ -124,6 +124,7 @@
 OBJECT_IDS_KEY = "object_ids"
 P1_KEY = "p1"
 P2_KEY = "p2"
+PHASH_KEY = "phash"
 POINTS_KEY = "points"
 POINTCLOUD_KEY = "pointcloud"
 POINTCLOUD_LOCATION_KEY = "pointcloud_location"
diff --git a/nucleus/dataset_item.py b/nucleus/dataset_item.py
index 6b90f35a..45440b45 100644
--- a/nucleus/dataset_item.py
+++ b/nucleus/dataset_item.py
@@ -17,6 +17,7 @@
     INDEX_ID_KEY,
     METADATA_KEY,
     ORIGINAL_IMAGE_URL_KEY,
+    PHASH_KEY,
     POINTCLOUD_URL_KEY,
     PROCESSED_URL_KEY,
     REFERENCE_ID_KEY,
@@ -123,6 +124,10 @@ class DatasetItem:  # pylint: disable=R0902
     embedding_info: Optional[DatasetItemEmbeddingInfo] = None
     width: Optional[int] = None
     height: Optional[int] = None
+    # Perceptual hash of the underlying image as a 64-character "0/1" binary
+    # string. Populated by the Nucleus backend on items that have been pHash
+    # backfilled; None for pointcloud items or items without a backfilled hash.
+    phash: Optional[str] = None
 
     def __post_init__(self):
         assert self.reference_id is not None, "reference_id is required."
@@ -178,6 +183,7 @@ def from_json(cls, payload: dict):
             pointcloud_location=pointcloud_url,
             reference_id=payload.get(REFERENCE_ID_KEY),
             metadata=payload.get(METADATA_KEY, {}),
+            phash=payload.get(PHASH_KEY),
         )
 
     def local_file_exists(self):
diff --git a/pyproject.toml b/pyproject.toml
index 772decb2..dd07937e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -25,7 +25,7 @@ ignore = ["E501", "E741", "E731", "F401"] # Easy ignore for getting it running
 
 [tool.poetry]
 name = "scale-nucleus"
-version = "0.18.2"
+version = "0.18.3"
 description = "The official Python client library for Nucleus, the Data Platform for AI"
 license = "MIT"
 authors = ["Scale AI Nucleus Team "]
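
Reviewer note: the pattern this change follows (a `*_KEY` constant, an `Optional` dataclass field, and a `payload.get(...)` in `from_json`) can be illustrated with a minimal, self-contained sketch. `ItemSketch` and `hamming_distance` below are hypothetical stand-ins written for illustration, not part of the SDK; the value of `REFERENCE_ID_KEY` is assumed, and counting differing bits is only one common way to compare perceptual hashes.

```python
from dataclasses import dataclass
from typing import Optional

# PHASH_KEY matches the constant added in this diff. The value of
# REFERENCE_ID_KEY is an assumption made for this sketch.
REFERENCE_ID_KEY = "reference_id"
PHASH_KEY = "phash"


@dataclass
class ItemSketch:
    """Hypothetical, trimmed-down stand-in for DatasetItem."""

    reference_id: str
    phash: Optional[str] = None

    @classmethod
    def from_json(cls, payload: dict) -> "ItemSketch":
        # dict.get returns None when the key is absent, so items without
        # a backfilled hash still deserialize without error.
        return cls(
            reference_id=payload[REFERENCE_ID_KEY],
            phash=payload.get(PHASH_KEY),
        )


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length "0/1" strings differ."""
    if len(a) != len(b):
        raise ValueError("phash strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# One payload carries a 64-character hash, the other omits the key.
with_hash = ItemSketch.from_json({"reference_id": "img_1", "phash": "01" * 32})
without_hash = ItemSketch.from_json({"reference_id": "img_2"})
```

Because the field defaults to `None`, callers should check `item.phash is not None` before comparing hashes.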