Cosmos: Adds Per Partition Automatic Failover and Circuit Breaker Specs#3880
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new design/specification document describing Per-Partition Automatic Failover (PPAF) and Per-Partition Circuit Breaker (PPCB) behavior in the azure_data_cosmos SDK, including eligibility rules, retry-policy integration, background failback, and configuration surface.
Changes:
- Introduces a detailed markdown spec for PPAF/PPCB design and request flows.
- Documents status/substatus handling, configuration flags/env vars, and interaction with account-level failover.
- Provides proposed test coverage areas and sequence diagrams.
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
Review Summary — PPAF & PPCB SpecificationPR Intent: Adds a 1005-line design specification for Per-Partition Automatic Failover (PPAF) and Per-Partition Circuit Breaker (PPCB) mechanisms in the Cosmos DB Rust SDK. Overall Assessment: The spec is well-structured and covers the architecture comprehensively. However, there are several correctness discrepancies between the spec and the actual implementation that could mislead anyone implementing the driver crate from this document. These should be resolved before merge. Existing review comments: 5 (all from @copilot, all unresolved). My findings overlap with 3 of those; I have 9 new findings. Key Issues
|
|
@FabianMeiswinkel I've opened a new pull request, #3918, to work on those changes. Once the pull request is ready, I'll request review from you. |
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
|
@FabianMeiswinkel I've opened a new pull request, #3919, to work on those changes. Once the pull request is ready, I'll request review from you. |
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Few questions/comments and overall I would like to bette runderstand how this interacts with the DOP-based principles in the driver pipeline.
API Change CheckAPIView identified API level changes in this PR and created the following API reviews |
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
tvaron3
left a comment
There was a problem hiding this comment.
PR Deep Review
Reviewed the PPAF/PPCB spec for correctness against the Java SDK implementation. The spec is thorough and well-structured, with the driver's CAS-based architecture clearly documented. Found 2 correctness issues in resource type eligibility checks, 1 completeness gap in the status code matrix, and 1 scalability note.
4 comments (2 blocking, 1 recommendation, 1 suggestion).
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Show resolved
Hide resolved
…datta/add_ppaf_spec_for_driver
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md
Show resolved
Hide resolved
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
LGTM except for few blocking comments called out by @tvaron3 and @analogrelay - Ia m ok to sign-off assuming these issues get addressed
tvaron3
left a comment
There was a problem hiding this comment.
LGTM besides one small detail. The spec should have some notes around stored procedure so that it is only applicable when targeted to documents and query plan should not be in ppcb path. These aren't directly applicable now but it could help missing these changes once query plan or other resources are added to to the sdk. This can be done in follow up pr.
ebf2e86
into
release/azure_data_cosmos-previews
…cs (Azure#3880) ## Description Introduces the **Per-Partition Automatic Failover (PPAF) & Per-Partition Circuit Breaker (PPCB)** design spec for `azure_data_cosmos_driver`. This spec describes partition-level failover mechanisms that complement the existing account-level failover in the driver's 7-stage operation pipeline. Instead of marking an entire region unavailable when a single partition becomes unhealthy, only the affected partition is routed to an alternate region — preserving local latency for healthy partitions. ### What's in the spec - **Two complementary mechanisms**: - **PPAF** — per-partition failover for writes on single-master accounts, triggered by 403/3, 503, 429/3092, 410 - **PPCB** — per-partition circuit breaker for reads (any account) and writes on multi-master accounts, threshold-gated - **Component design**: `PartitionEndpointState`, `PartitionFailoverEntry`, `PartitionFailoverConfig` — all managed via the driver's existing lock-free CAS pattern (no `RwLock<HashMap>` like the SDK) - **Operation pipeline integration**: How partition-level overrides plug into `resolve_endpoint()` (Stage 2), `evaluate_transport_result()` (Stage 5), and `LocationStateStore::apply()` (Stage 6) - **Background failback loop**: Periodic sweep that expires stale partition overrides, spawned via `BackgroundTaskManager` (Azure#3945) - **Status code handling matrix**: Complete mapping of HTTP status/sub-status codes to emitted `LocationEffect`s - **Configuration surface**: All thresholds and intervals configurable via environment variables - **Test coverage plan**: Pure routing system tests, eligibility tests, circuit breaker counter tests, integration tests, and end-to-end operation loop tests - **Prerequisites**: Missing pieces that must be implemented (partition key range ID availability, `ResourceType.is_partitioned()`, env var reading, `sync_account_properties` integration) ### Key design decisions | Decision | Rationale | |----------|-----------| | Immutable CAS snapshots (not `RwLock<HashMap>`) | Follows driver's existing lock-free pattern; eliminates reader/writer contention on hot path | | Two separate maps (PPAF vs PPCB) | Avoids cross-contamination between single-master write failover and multi-master circuit breaker routing strategies | | Plain counters (not `AtomicI32`) | Entire `PartitionEndpointState` is swapped atomically via CAS — no need for interior atomic counters | | Failback sweeps both maps | Improvement over SDK which only sweeps PPCB; trivial with immutable-snapshot pattern | | `BackgroundTaskManager` for failback loop | Provides abort-on-drop, panic safety, and graceful shutdown (Azure#3945) | | Acceptable CAS counter loss under contention | Delays threshold trigger by at most one failure — better trade-off than introducing locks | ### Dependencies - Azure#3945 — `BackgroundTaskManager` (must merge first; spec references it for failback loop spawning) ### Files | File | Action | |------|--------| | `azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md` | **New** |
Description
Introduces the Per-Partition Automatic Failover (PPAF) & Per-Partition Circuit Breaker (PPCB) design spec for
azure_data_cosmos_driver.This spec describes partition-level failover mechanisms that complement the existing account-level failover in the driver's 7-stage operation pipeline. Instead of marking an entire region unavailable when a single partition becomes unhealthy, only the affected partition is routed to an alternate region — preserving local latency for healthy partitions.
What's in the spec
PartitionEndpointState,PartitionFailoverEntry,PartitionFailoverConfig— all managed via the driver's existing lock-free CAS pattern (noRwLock<HashMap>like the SDK)resolve_endpoint()(Stage 2),evaluate_transport_result()(Stage 5), andLocationStateStore::apply()(Stage 6)BackgroundTaskManager(Cosmos: IntroduceBackgroundTaskManagerTo Spawn Background Tasks Using Tokio Runtime #3945)LocationEffectsResourceType.is_partitioned(), env var reading,sync_account_propertiesintegration)Key design decisions
RwLock<HashMap>)AtomicI32)PartitionEndpointStateis swapped atomically via CAS — no need for interior atomic countersBackgroundTaskManagerfor failback loopDependencies
BackgroundTaskManagerTo Spawn Background Tasks Using Tokio Runtime #3945 —BackgroundTaskManager(must merge first; spec references it for failback loop spawning)Files
azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md