Cosmos: Adds Per Partition Automatic Failover and Circuit Breaker Specs #3880

Merged
kundadebdatta merged 14 commits into release/azure_data_cosmos-previews from users/kundadebdatta/add_ppaf_spec_for_driver on Mar 16, 2026

Conversation

@kundadebdatta
Member

@kundadebdatta kundadebdatta commented Mar 7, 2026

Description

Introduces the Per-Partition Automatic Failover (PPAF) & Per-Partition Circuit Breaker (PPCB) design spec for azure_data_cosmos_driver.

This spec describes partition-level failover mechanisms that complement the existing account-level failover in the driver's 7-stage operation pipeline. Instead of marking an entire region unavailable when a single partition becomes unhealthy, only the affected partition is routed to an alternate region — preserving local latency for healthy partitions.
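The routing idea above can be illustrated with a minimal sketch. It is not taken from the spec: `RoutingSnapshot`, `partition_overrides`, and `resolve_region` are hypothetical names, and regions and partition key range IDs are modeled as plain strings. It only shows that a per-partition override, when present, takes precedence over the account-level region, while every other partition keeps its normal routing.

```rust
use std::collections::HashMap;

/// Hypothetical routing snapshot: one account-level region plus
/// per-partition overrides keyed by partition key range ID.
struct RoutingSnapshot {
    account_region: String,
    partition_overrides: HashMap<String, String>,
}

impl RoutingSnapshot {
    /// Returns the override region for an unhealthy partition if one is
    /// recorded; otherwise falls back to the account-level region.
    fn resolve_region(&self, pk_range_id: &str) -> &str {
        self.partition_overrides
            .get(pk_range_id)
            .map(String::as_str)
            .unwrap_or(self.account_region.as_str())
    }
}
```

With an override recorded for partition key range `"7"` only, a lookup for `"7"` yields the alternate region and a lookup for any other range still yields the account-level region.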

What's in the spec

  • Two complementary mechanisms:
    • PPAF — per-partition failover for writes on single-master accounts, triggered by 403/3, 503, 429/3092, 410
    • PPCB — per-partition circuit breaker for reads (any account) and writes on multi-master accounts, threshold-gated
  • Component design: PartitionEndpointState, PartitionFailoverEntry, PartitionFailoverConfig, all managed via the driver's existing lock-free CAS pattern (unlike the SDK, no RwLock<HashMap>)
  • Operation pipeline integration: How partition-level overrides plug into resolve_endpoint() (Stage 2), evaluate_transport_result() (Stage 5), and LocationStateStore::apply() (Stage 6)
  • Background failback loop: Periodic sweep that expires stale partition overrides, spawned via BackgroundTaskManager (Cosmos: Introduce BackgroundTaskManager To Spawn Background Tasks Using Tokio Runtime #3945)
  • Status code handling matrix: Complete mapping of HTTP status/sub-status codes to emitted LocationEffects
  • Configuration surface: All thresholds and intervals configurable via environment variables
  • Test coverage plan: Pure routing system tests, eligibility tests, circuit breaker counter tests, integration tests, and end-to-end operation loop tests
  • Prerequisites: Missing pieces that must be implemented (partition key range ID availability, ResourceType.is_partitioned(), env var reading, sync_account_properties integration)
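As a rough illustration of the split between the two mechanisms listed above, the sketch below encodes only what this description states: PPAF covers writes on single-master accounts, PPCB covers reads on any account and writes on multi-master accounts, and PPAF is triggered by 403/3, 503, 429/3092, and 410. The type and function names are hypothetical, the description gives no sub-status conditions for 503 and 410 (so none are assumed), and the spec's status code matrix and eligibility rules remain the authoritative source.

```rust
/// Which partition-level mechanism handles a request (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Mechanism {
    Ppaf,
    Ppcb,
}

/// Simplified operation classification (hypothetical type).
#[derive(Debug, Clone, Copy)]
enum OperationKind {
    Read,
    Write,
}

/// Writes on single-master accounts go through PPAF; reads on any account
/// and writes on multi-master accounts go through PPCB.
fn select_mechanism(op: OperationKind, multi_master: bool) -> Mechanism {
    match (op, multi_master) {
        (OperationKind::Write, false) => Mechanism::Ppaf,
        _ => Mechanism::Ppcb,
    }
}

/// PPAF trigger codes as summarized in the PR description:
/// 403/3, 503, 429/3092, 410. Sub-status is ignored for 503 and 410
/// because the description does not qualify them (an assumption).
fn is_ppaf_trigger(status: u16, sub_status: Option<u32>) -> bool {
    matches!(
        (status, sub_status),
        (403, Some(3)) | (429, Some(3092)) | (503, _) | (410, _)
    )
}
```

Resource-type eligibility, feature flags, and the PPCB thresholds are deliberately left out of this sketch.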

Key design decisions

| Decision | Rationale |
|----------|-----------|
| Immutable CAS snapshots (not `RwLock<HashMap>`) | Follows driver's existing lock-free pattern; eliminates reader/writer contention on hot path |
| Two separate maps (PPAF vs PPCB) | Avoids cross-contamination between single-master write failover and multi-master circuit breaker routing strategies |
| Plain counters (not `AtomicI32`) | Entire `PartitionEndpointState` is swapped atomically via CAS, so interior atomic counters are not needed |
| Failback sweeps both maps | Improvement over SDK which only sweeps PPCB; trivial with immutable-snapshot pattern |
| `BackgroundTaskManager` for failback loop | Provides abort-on-drop, panic safety, and graceful shutdown (#3945) |
| Acceptable CAS counter loss under contention | Delays threshold trigger by at most one failure; a better trade-off than introducing locks |
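To make the immutable-snapshot decisions more concrete, here is a minimal sketch of the two state transitions as pure functions. It is an illustration under assumptions, not the spec's design: `PartitionEntry`, `Snapshot`, `record_failure`, and `sweep_expired` are hypothetical names, the threshold and expiry values are parameters rather than the real configuration, and the atomic CAS that would publish the new snapshot is omitted. Each function reads an old snapshot and returns a new one, which is why plain counters are sufficient.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-partition state with a plain (non-atomic) counter.
#[derive(Clone, Debug, PartialEq)]
struct PartitionEntry {
    consecutive_failures: u32,
    /// Set once the failure threshold is reached and an override is active.
    overridden_since: Option<Instant>,
}

/// Hypothetical immutable snapshot keyed by partition key range ID.
type Snapshot = HashMap<String, PartitionEntry>;

/// Builds a new snapshot with one more failure recorded for `pk_range`.
/// The old snapshot is left untouched; the caller would publish the result
/// with a CAS and retry (or accept the loss) if another writer won.
fn record_failure(old: &Snapshot, pk_range: &str, threshold: u32, now: Instant) -> Snapshot {
    let mut next = old.clone();
    let entry = next.entry(pk_range.to_string()).or_insert(PartitionEntry {
        consecutive_failures: 0,
        overridden_since: None,
    });
    entry.consecutive_failures += 1;
    if entry.consecutive_failures >= threshold && entry.overridden_since.is_none() {
        entry.overridden_since = Some(now);
    }
    next
}

/// Builds a new snapshot without overrides older than `max_age`.
/// Entries that never tripped the threshold are kept as they are.
fn sweep_expired(old: &Snapshot, max_age: Duration, now: Instant) -> Snapshot {
    old.iter()
        .filter(|(_, e)| match e.overridden_since {
            Some(since) => now.saturating_duration_since(since) < max_age,
            None => true,
        })
        .map(|(k, e)| (k.clone(), e.clone()))
        .collect()
}
```

If two callers derive their update from the same old snapshot, only one swap can succeed, which corresponds to the counter loss under contention accepted in the table above. Because the sweep is just another pure function over a snapshot, running it over both maps adds little complexity.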

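Dependencies

  • #3945: BackgroundTaskManager (must merge first; the spec references it for failback loop spawning)

Files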

| File | Action |
|------|--------|
| `azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md` | New |

@github-actions github-actions bot added the Cosmos The azure_cosmos crate label Mar 7, 2026
@kundadebdatta kundadebdatta self-assigned this Mar 7, 2026
@kundadebdatta kundadebdatta moved this from Todo to In Progress in CosmosDB Go/Rust Crew Mar 7, 2026
@kundadebdatta kundadebdatta marked this pull request as ready for review March 10, 2026 23:08
@kundadebdatta kundadebdatta requested a review from a team as a code owner March 10, 2026 23:08
Copilot AI review requested due to automatic review settings March 10, 2026 23:08
Contributor

Copilot AI left a comment

Pull request overview

Adds a new design/specification document describing Per-Partition Automatic Failover (PPAF) and Per-Partition Circuit Breaker (PPCB) behavior in the azure_data_cosmos SDK, including eligibility rules, retry-policy integration, background failback, and configuration surface.

Changes:

  • Introduces a detailed markdown spec for PPAF/PPCB design and request flows.
  • Documents status/substatus handling, configuration flags/env vars, and interaction with account-level failover.
  • Provides proposed test coverage areas and sequence diagrams.

@FabianMeiswinkel
Member

Review Summary — PPAF & PPCB Specification

PR Intent: Adds a 1005-line design specification for Per-Partition Automatic Failover (PPAF) and Per-Partition Circuit Breaker (PPCB) mechanisms in the Cosmos DB Rust SDK.

Overall Assessment: The spec is well-structured and covers the architecture comprehensively. However, there are several correctness discrepancies between the spec and the actual implementation that could mislead anyone implementing the driver crate from this document. These should be resolved before merge.

Existing review comments: 5 (all from @copilot, all unresolved). My findings overlap with 3 of those; I have 9 new findings.

Key Issues

  • 3 Blocking: Status code/substatus mismatches and incorrect partition-marking claims that diverge from the actual implementation
  • 7 Recommendations: Missing struct fields, non-compiling sample code, misleading labels, TOCTOU race documentation gap
  • 3 Suggestions: Broken cross-reference, negative-duration edge case, missing test scenarios

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Contributor

Copilot AI commented Mar 11, 2026

@FabianMeiswinkel I've opened a new pull request, #3918, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Mar 11, 2026

@FabianMeiswinkel I've opened a new pull request, #3919, to work on those changes. Once the pull request is ready, I'll request review from you.

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

A few questions/comments; overall, I would like to better understand how this interacts with the DOP-based principles in the driver pipeline.

@github-project-automation github-project-automation bot moved this from In Progress to Changes Requested in CosmosDB Go/Rust Crew Mar 11, 2026
@github-actions

github-actions bot commented Mar 11, 2026

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure_data_cosmos

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

@tvaron3 tvaron3 dismissed their stale review March 13, 2026 19:32

Replacing with updated review

Member

@tvaron3 tvaron3 left a comment

PR Deep Review

Reviewed the PPAF/PPCB spec for correctness against the Java SDK implementation. The spec is thorough and well-structured, with the driver's CAS-based architecture clearly documented. Found 2 correctness issues in resource type eligibility checks, 1 completeness gap in the status code matrix, and 1 scalability note.

4 comments (2 blocking, 1 recommendation, 1 suggestion).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM except for a few blocking comments called out by @tvaron3 and @analogrelay. I am OK to sign off assuming these issues get addressed.

@kundadebdatta kundadebdatta requested a review from tvaron3 March 16, 2026 18:46
@kundadebdatta kundadebdatta enabled auto-merge (squash) March 16, 2026 21:28
Member

@tvaron3 tvaron3 left a comment

LGTM besides one small detail. The spec should have some notes around stored procedures, so that it is only applicable when they are targeted at documents, and the query plan should not be in the PPCB path. These aren't directly applicable now, but noting them could help avoid missing these changes once query plan or other resources are added to the SDK. This can be done in a follow-up PR.

@kundadebdatta kundadebdatta merged commit ebf2e86 into release/azure_data_cosmos-previews Mar 16, 2026
18 checks passed
@kundadebdatta kundadebdatta deleted the users/kundadebdatta/add_ppaf_spec_for_driver branch March 16, 2026 22:00
@github-project-automation github-project-automation bot moved this from Changes Requested to Done in CosmosDB Go/Rust Crew Mar 16, 2026
analogrelay pushed a commit to analogrelay/azure-sdk-for-rust that referenced this pull request Mar 17, 2026
…cs (Azure#3880)

## Description

Introduces the **Per-Partition Automatic Failover (PPAF) & Per-Partition
Circuit Breaker (PPCB)** design spec for `azure_data_cosmos_driver`.

This spec describes partition-level failover mechanisms that complement
the existing account-level failover in the driver's 7-stage operation
pipeline. Instead of marking an entire region unavailable when a single
partition becomes unhealthy, only the affected partition is routed to an
alternate region — preserving local latency for healthy partitions.

### What's in the spec

- **Two complementary mechanisms**:
- **PPAF** — per-partition failover for writes on single-master
accounts, triggered by 403/3, 503, 429/3092, 410
- **PPCB** — per-partition circuit breaker for reads (any account) and
writes on multi-master accounts, threshold-gated
- **Component design**: `PartitionEndpointState`,
`PartitionFailoverEntry`, `PartitionFailoverConfig` — all managed via
the driver's existing lock-free CAS pattern (no `RwLock<HashMap>` like
the SDK)
- **Operation pipeline integration**: How partition-level overrides plug
into `resolve_endpoint()` (Stage 2), `evaluate_transport_result()`
(Stage 5), and `LocationStateStore::apply()` (Stage 6)
- **Background failback loop**: Periodic sweep that expires stale
partition overrides, spawned via `BackgroundTaskManager` (Azure#3945)
- **Status code handling matrix**: Complete mapping of HTTP
status/sub-status codes to emitted `LocationEffect`s
- **Configuration surface**: All thresholds and intervals configurable
via environment variables
- **Test coverage plan**: Pure routing system tests, eligibility tests,
circuit breaker counter tests, integration tests, and end-to-end
operation loop tests
- **Prerequisites**: Missing pieces that must be implemented (partition
key range ID availability, `ResourceType.is_partitioned()`, env var
reading, `sync_account_properties` integration)

### Key design decisions

| Decision | Rationale |
|----------|-----------|
| Immutable CAS snapshots (not `RwLock<HashMap>`) | Follows driver's
existing lock-free pattern; eliminates reader/writer contention on hot
path |
| Two separate maps (PPAF vs PPCB) | Avoids cross-contamination between
single-master write failover and multi-master circuit breaker routing
strategies |
| Plain counters (not `AtomicI32`) | Entire `PartitionEndpointState` is
swapped atomically via CAS — no need for interior atomic counters |
| Failback sweeps both maps | Improvement over SDK which only sweeps
PPCB; trivial with immutable-snapshot pattern |
| `BackgroundTaskManager` for failback loop | Provides abort-on-drop,
panic safety, and graceful shutdown (Azure#3945) |
| Acceptable CAS counter loss under contention | Delays threshold
trigger by at most one failure — better trade-off than introducing locks
|

### Dependencies

- Azure#3945 — `BackgroundTaskManager` (must merge first; spec references it
for failback loop spawning)

### Files

| File | Action |
|------|--------|
| `azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md` |
**New** |
Labels

Cosmos The azure_cosmos crate

Projects

Status: Done

6 participants