Cosmos: SDK to Driver migration Spec #4069
Description
After merging #4053, we have an outline of the work needed to migrate a single request to the driver; the same pattern can now be extrapolated to the remaining requests in the SDK.
This issue tracks that work. Below is the final version of the spec generated as part of that PR, which further breaks down the flow from the SDK to the driver.
SDK-to-Driver Cutover Guide
Overview
This document is the migration guide for routing all azure_data_cosmos SDK operations through the azure_data_cosmos_driver execution engine, replacing the legacy gateway pipeline path. The read_item cutover (PR #4053) is complete and serves as the reference pattern for all subsequent operations.
Background
The Cosmos SDK historically had two separate execution paths:
- Gateway pipeline (azure_data_cosmos): The SDK handled auth, routing, retries, and request construction via CosmosRequest → GatewayPipeline → HTTP.
- Driver (azure_data_cosmos_driver): A newer execution engine with its own transport, routing, and operation model (CosmosOperation + OperationOptions).
PR #4005 bridged the two worlds by having ContainerClient::new() call driver.resolve_container() for eager metadata resolution. PR #4053 took the next step by routing read_item through the driver, establishing the pattern documented here.
Goal
Make the SDK client a thin wrapper over the driver. The SDK translates public-facing types into driver concepts, delegates execution, and translates the response back. All real work (auth, routing, retries, transport) happens inside driver.execute_operation().
Current State
ContainerClient::read_item is the only operation routed through the driver. All other operations — item CRUD (create_item, delete_item, replace_item, upsert_item), queries (query_items, query_databases, query_containers), database/container CRUD, and throughput operations — still use the gateway pipeline. This document describes the established pattern and how to apply it to each remaining operation.
Architecture
Data Flow (Established by read_item)
User calls: container_client.read_item(pk, id, options)
                          │
                ┌─────────▼────────────┐
                │  SDK ContainerClient │
                └─────────┬────────────┘
                          │
      ┌───────────────────┼───────────────────┐
      │                   │                   │
PartitionKey         ItemOptions        ContainerRef
 (SDK type)          (SDK type)         (driver type,
      │                   │           stored on client)
      │                   │                   │
      ▼                   ▼                   ▼
into_driver_pk()  item_options_to_     ItemReference::
      │          operation_options()     from_name()
      │                   │                   │
      └───────────────────┼───────────────────┘
                          │
                ┌─────────▼──────────┐
                │ CosmosOperation::  │
                │   read_item()      │
                └─────────┬──────────┘
                          │
                ┌─────────▼───────────┐
                │  driver.execute_    │
                │  operation(op, opts)│
                │                     │
                │  (auth, routing,    │
                │   retries, HTTP)    │
                └─────────┬───────────┘
                          │
                ┌─────────▼───────────┐
                │  driver_response_   │
                │  to_cosmos_response │
                └─────────┬───────────┘
                          │
                ┌─────────▼───────────┐
                │  CosmosResponse<T>  │
                │  (SDK public type)  │
                └─────────────────────┘
Key Principle
The SDK's public API does not change. Each operation retains the same signature, return type, and observable behavior. This is a pure internal refactor.
Design Decision: Driver as Required Infrastructure
An alternative approach was explored where the driver is optional — stored as Option<Arc<CosmosDriver>> on CosmosClient, DatabaseClient, and ContainerClient. In that model, each operation checks at runtime whether a driver is available: if so, it takes the driver path; otherwise, it falls back to the legacy gateway pipeline. Container metadata resolution is also optional and failure is silently ignored.
This approach was rejected. The driver is required:
- CosmosClient stores Arc<CosmosDriver> (not Option).
- ContainerClient::new() eagerly resolves container metadata via the driver and returns Result; if resolution fails, the client cannot be created.
- Operations have a single codepath through the driver, with no gateway fallback.
Rationale
The purpose of this cutover is to validate that the driver can fully replace the gateway pipeline for each operation. A fallback path undermines that goal:
- Testability: If the driver path can silently fall back to the gateway, we can't be 100% sure that the driver path is exercised in production or tests. Failures would be hidden rather than surfaced.
- Correctness: A dual-codepath design requires maintaining behavioral parity between two implementations indefinitely. A single path is easier to reason about, test, and debug.
- Options fidelity: A fallback path tempts skipping the options translation (e.g., passing empty OperationOptions on the driver path), which silently drops user-configured session tokens, etags, and excluded regions.
- Response fidelity: A minimal fallback implementation may skip reconstructing response headers from the driver's typed response, causing callers to get None for request_charge(), session_token(), and etag().
The cutover is intentionally incremental — one operation at a time. Operations that haven't been cut over yet continue using the gateway pipeline naturally (they don't call the driver). This gives us the gradual rollout benefit without the complexity of runtime branching within a single operation.
Type Translation Patterns
The read_item cutover established the following type translation patterns. These same patterns apply to all subsequent operations.
PartitionKey (SDK → Driver)
The SDK and driver define separate PartitionKey types with identical structure but in different crates. Both represent a JSON array of typed values (string, number, bool, null).
Approach: Added into_driver_partition_key() on the SDK's PartitionKey that maps each InnerPartitionKeyValue variant to the driver's PartitionKeyValue.
Driver change required: Made PartitionKeyValue pub (was pub(crate)) so the SDK crate can construct Vec<PartitionKeyValue> for the conversion.
Future consideration: Once Ashley's options alignment work unifies these types, this conversion can be eliminated, and we can just use the Driver's definitions the way we did with the ContainerReference.
// SDK partition_key.rs
pub(crate) fn into_driver_partition_key(self) -> driver::PartitionKey {
    let driver_values: Vec<DriverPKV> = self.0.into_iter()
        .map(|v| match v.0 {
            InnerPartitionKeyValue::String(s) => DriverPKV::from(s),
            InnerPartitionKeyValue::Number(n) => DriverPKV::from(n),
            InnerPartitionKeyValue::Bool(b) => DriverPKV::from(b),
            InnerPartitionKeyValue::Null => DriverPKV::from(Option::<String>::None),
            // ...
        })
        .collect();
    DriverPK::from(driver_values)
}
ItemOptions → OperationOptions
The SDK's ItemOptions (item-scoped request options) maps to the driver's OperationOptions field-by-field. The types in each field differ between crates, so values are bridged via their string representations.
| SDK ItemOptions field | Driver OperationOptions | Conversion |
|---|---|---|
| session_token: Option<SessionToken> | .with_session_token() | DriverSessionToken::new(token.to_string()) |
| if_match_etag: Option<Etag> | .with_etag_condition() | Precondition::if_match(ETag::new(etag.to_string())) |
| custom_headers: HashMap<...> | .with_custom_headers() | Passed through directly (types are the same) |
| excluded_regions: Option<Vec<RegionName>> | .with_excluded_regions() | Region::new(name.to_string()) for each |
| content_response_on_write_enabled: bool | Ignored for reads | Driver always returns body for point reads |
content_response_on_write_enabled for write operations: The SDK defaults this to false, which causes write operations (create_item, replace_item, upsert_item) to send the Prefer: return=minimal header, suppressing the response body. When cutting over write operations, this must be translated to the driver's ContentResponseOnWrite enum (Enabled / Disabled) on OperationOptions. The driver has content_response_on_write: Option<ContentResponseOnWrite> which supports the same behavior. The mapping is:
- SDK false (default) → ContentResponseOnWrite::Disabled → sends Prefer: return=minimal
- SDK true → ContentResponseOnWrite::Enabled → omits the Prefer header, allowing the service to return the written item
For delete_item, the option is ignored by the Cosmos API (deleted items are never returned), but the SDK still applies the same header logic for consistency.
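The mapping above can be sketched in a few lines. This is a minimal illustration, not the bridge's actual code: the enum is a local stand-in for the driver's ContentResponseOnWrite, and both helper functions (content_response_from_flag, prefer_header) are hypothetical names introduced only for this example.

```rust
/// Local stand-in for the driver's `ContentResponseOnWrite` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContentResponseOnWrite {
    Enabled,
    Disabled,
}

/// Hypothetical helper: translate the SDK's boolean flag into the enum.
pub fn content_response_from_flag(enabled: bool) -> ContentResponseOnWrite {
    if enabled {
        ContentResponseOnWrite::Enabled
    } else {
        ContentResponseOnWrite::Disabled
    }
}

/// Hypothetical helper: the `Prefer` header value implied by each variant.
/// `None` means the header is omitted, so the service may return the item.
pub fn prefer_header(mode: ContentResponseOnWrite) -> Option<&'static str> {
    match mode {
        ContentResponseOnWrite::Enabled => None,
        ContentResponseOnWrite::Disabled => Some("return=minimal"),
    }
}
```

With the SDK default of false, prefer_header(content_response_from_flag(false)) yields Some("return=minimal"), matching the first bullet above.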
Driver change required: Added custom_headers support to OperationOptions (new field, setter, getter) and wired it into build_transport_request in operation_pipeline.rs. Custom headers may be removed in the future as we analyze which options are truly needed.
Response Bridge (Driver → SDK)
The driver returns an untyped CosmosResponse { body: Vec<u8>, headers: CosmosResponseHeaders, status: CosmosStatus }. The SDK returns a typed CosmosResponse<T> wrapping azure_core::Response<T>.
Approach: Reconstruct the SDK response from driver parts:
pub(crate) fn driver_response_to_cosmos_response<T>(
    driver_response: DriverResponse,
) -> CosmosResponse<T> {
    let status_code = driver_response.status().status_code();
    let headers = cosmos_response_headers_to_headers(driver_response.headers());
    let body = driver_response.into_body();
    let raw = RawResponse::from_bytes(status_code, headers, Bytes::from(body));
    let typed: Response<T> = raw.into();
    CosmosResponse::new(typed, None)
}
The header conversion maps each typed CosmosResponseHeaders field back to its raw header name/value pair (reverse of the driver's from_headers() parser).
Caveat: Only headers that the driver explicitly parses are preserved. The following 10 headers are converted: activity ID, request charge, session token, etag, continuation, item count, substatus, server duration (request duration ms), index metrics, and query metrics. Any other server headers are dropped. These 10 cover the standard Cosmos response metadata. This will likely be revisited when we verify which headers are actually needed.
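A minimal sketch of this reverse mapping, under stated assumptions: TypedHeaders is a simplified stand-in for the driver's CosmosResponseHeaders (plain String and f64 instead of newtypes such as ActivityId and RequestCharge), only four of the ten fields are modeled, and to_raw_headers is a hypothetical name. The header names follow the Cosmos DB REST API; the constants the driver actually uses may be defined differently.

```rust
/// Simplified stand-in for a subset of the driver's typed response headers.
#[derive(Debug, Default, Clone)]
pub struct TypedHeaders {
    pub activity_id: Option<String>,
    pub request_charge: Option<f64>,
    pub session_token: Option<String>,
    pub etag: Option<String>,
}

/// Hypothetical conversion: emit one raw (name, value) pair per field that
/// is present. Anything the struct does not model cannot be emitted, which
/// is why unparsed server headers do not survive the round trip.
pub fn to_raw_headers(h: &TypedHeaders) -> Vec<(&'static str, String)> {
    let mut raw = Vec::new();
    if let Some(v) = &h.activity_id {
        raw.push(("x-ms-activity-id", v.clone()));
    }
    if let Some(v) = h.request_charge {
        raw.push(("x-ms-request-charge", v.to_string()));
    }
    if let Some(v) = &h.session_token {
        raw.push(("x-ms-session-token", v.clone()));
    }
    if let Some(v) = &h.etag {
        raw.push(("etag", v.clone()));
    }
    raw
}
```

Absent fields produce no pair, so the SDK-side parser sees the same "missing header" it would have seen on the gateway path.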
CosmosRequest → Optional
The SDK's CosmosResponse<T> previously held the original CosmosRequest — a gateway pipeline concept with no driver equivalent. The driver uses CosmosOperation + OperationOptions instead, which are consumed during execution.
Decision: Made the request field Option<CosmosRequest>:
- Gateway-routed operations (all methods not yet cut over) continue setting Some(request).
- Driver-routed operations set None.
- The field is only accessed behind #[cfg(feature = "fault_injection")] and marked #[allow(dead_code)].
- A TODO comment marks it for removal once all operations are on the driver.
Structural Changes (from read_item cutover)
ContainerClient
The read_item cutover added two fields to ContainerClient so driver-routed operations can reach the driver at execution time:
pub struct ContainerClient {
    // ... existing fields ...
    driver: Arc<CosmosDriver>,          // retained from new()
    container_ref: ContainerReference,  // cloned before passing to ContainerConnection
}
Previously, the driver was discarded after new() and ContainerReference was buried inside ContainerConnection.
driver_bridge Module
Private module at src/driver_bridge.rs containing the shared conversion functions used by all driver-routed operations:
- driver_response_to_cosmos_response<T>(): response conversion
- item_options_to_operation_options(): options translation
- driver_response_headers_to_headers(): converts the driver's typed response headers (e.g., activity_id: Option<ActivityId>, request_charge: Option<RequestCharge>) into raw azure_core::Headers key-value pairs for the SDK response
This module is the shared foundation for all operation cutover. When cutting over create_item, delete_item, etc., reuse these same bridge functions.
Configuration and Options Flow
This section documents how SDK-level configuration reaches the driver, what's wired today, and what gaps need to be addressed as more operations are cut over.
Driver's Layered Options Model
The driver resolves per-operation configuration through a four-layer hierarchy, where each layer overrides the previous:
Environment variables (lowest priority)
↓
Runtime defaults (set on CosmosDriverRuntimeBuilder)
↓
Account/Driver defaults (set on CosmosDriver)
↓
Per-operation options (passed to execute_operation) ← highest priority
At execution time, OperationOptionsView::new(env, runtime, account, operation) resolves each field by walking the layers top-down. The driver's OperationOptions supports these fields:
| Field | Env var | Description |
|---|---|---|
| read_consistency_strategy | AZURE_COSMOS_READ_CONSISTENCY_STRATEGY | Session vs eventual reads |
| excluded_regions | none | Regions to avoid |
| content_response_on_write | AZURE_COSMOS_CONTENT_RESPONSE_ON_WRITE | Enabled/Disabled |
| throughput_control_group_name | none | Rate limiting group |
| end_to_end_latency_policy | none | Timeout management |
| max_failover_retry_count | AZURE_COSMOS_MAX_FAILOVER_RETRY_COUNT | Failover retry budget |
| endpoint_unavailability_ttl | none | Endpoint cooldown period |
| session_capturing_disabled | none | Disable session token capture |
| max_session_retry_count | AZURE_COSMOS_MAX_SESSION_RETRY_COUNT | Session retry budget |
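The layered resolution can be illustrated with a small sketch. It assumes each layer is a struct of optional fields and that a field resolves to the value from the highest-priority layer that sets it. LayerOptions and OptionsView are simplified stand-ins written for this example (only two fields are modeled); they are not the driver's OperationOptions or OperationOptionsView.

```rust
/// Stand-in for one layer of options; only two fields are modeled.
#[derive(Debug, Default, Clone)]
pub struct LayerOptions {
    pub max_failover_retry_count: Option<u32>,
    pub max_session_retry_count: Option<u32>,
}

/// Hypothetical view over the four layers, stored lowest priority first.
pub struct OptionsView<'a> {
    layers: [&'a LayerOptions; 4],
}

impl<'a> OptionsView<'a> {
    pub fn new(
        env: &'a LayerOptions,
        runtime: &'a LayerOptions,
        account: &'a LayerOptions,
        operation: &'a LayerOptions,
    ) -> Self {
        Self { layers: [env, runtime, account, operation] }
    }

    /// Walk from the operation layer down to the environment layer and
    /// return the first value that is set.
    fn resolve<T>(&self, field: impl Fn(&LayerOptions) -> Option<T>) -> Option<T> {
        self.layers.iter().rev().find_map(|layer| field(*layer))
    }

    pub fn max_failover_retry_count(&self) -> Option<u32> {
        self.resolve(|l| l.max_failover_retry_count)
    }

    pub fn max_session_retry_count(&self) -> Option<u32> {
        self.resolve(|l| l.max_session_retry_count)
    }
}
```

For example, if the environment layer sets max_failover_retry_count to 1 and the account layer sets it to 3, the view resolves to 3; a field set only in the environment layer still resolves to the environment value.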
Current Wiring (from read_item cutover)
CosmosClientBuilder::build() constructs the driver runtime with mostly defaults:
CosmosClientBuilder::build()
    │
    ├── SDK pipeline (GatewayPipeline, auth, etc.)
    │
    └── CosmosDriverRuntimeBuilder::new()
            │
            ├── .with_fault_injection_rules(...)  ← only SDK→driver config today
            │
            └── .build() → CosmosDriverRuntime
                    │
                    └── .get_or_create_driver(account, None) → CosmosDriver
Per-operation, read_item constructs OperationOptions fresh each call:
read_item(pk, id, options: ItemOptions)
    │
    └── item_options_to_operation_options(&options)
            │
            ├── excluded_regions → driver ExcludedRegions
            └── custom_headers → driver custom_headers
            (session_token and etag go on CosmosOperation, not OperationOptions)
The resulting OperationOptions is passed directly to driver.execute_operation(operation, driver_options) as the operation-layer options. There is no client-level base OperationOptions merged in.
Gaps to Address
The following SDK-level configuration is not wired into the driver today:
- CosmosClientOptions fields not forwarded to the driver runtime:
  - user_agent_suffix: the driver runtime builder has .with_user_agent_suffix() but the SDK doesn't call it
  - application_region: the SDK uses this for its own routing strategy but doesn't pass it to the driver runtime as a preferred region
  - custom_headers (client-level): the SDK's client-level custom headers are not set as runtime defaults; only per-operation custom headers are bridged
- No client-level base OperationOptions: The SDK doesn't set runtime-level or account-level operation defaults on the driver. This means driver features configured via CosmosDriverRuntimeBuilder::with_operation_options() (e.g., read_consistency_strategy, max_failover_retry_count) are only reachable through environment variables or per-call OperationOptions. If the SDK needs to expose these as CosmosClientOptions fields, it must wire them into the driver's runtime or account layer.
- content_response_on_write not bridged per-call: As noted in the type translation section, the SDK's content_response_on_write_enabled: bool needs to be translated to the driver's ContentResponseOnWrite enum in item_options_to_operation_options() for write operations.
- Connection pool and transport options: CosmosDriverRuntimeBuilder accepts .with_client_options(ClientOptions) and .with_connection_pool(ConnectionPoolOptions) but the SDK doesn't forward any transport configuration today.
- No way to inject a pre-configured driver: CosmosClientBuilder always creates its own CosmosDriverRuntime and CosmosDriver internally; there is no with_driver() method to accept a driver that the caller has already configured. This means:
  - Users who want fine-grained control over driver options (connection pool tuning, operation defaults, throughput control groups, workload/correlation IDs) have no way to set them through the SDK today.
  - Each CosmosClient creates its own CosmosDriverRuntime, duplicating background tasks, connection pools, and caches. There is already a TODO and tracking issue (#3908) noting that the runtime should be shared across clients targeting the same account.

  A with_driver(Arc<CosmosDriver>) method on CosmosClientBuilder would solve both problems. The CosmosDriver already holds an Arc<CosmosDriverRuntime> internally, so injecting a driver implicitly brings along all runtime-level configuration (connection pool, user agent, base operation options, throughput control groups). The driver itself carries account-level DriverOptions (including its own OperationOptions layer). This means the full options hierarchy is captured by a single Arc<CosmosDriver>:

      Arc<CosmosDriver>
      ├── DriverOptions
      │   ├── AccountReference (endpoint + credential)
      │   └── OperationOptions (account-level defaults)
      └── Arc<CosmosDriverRuntime>
          ├── ConnectionPoolOptions
          ├── UserAgent / WorkloadId / CorrelationId
          ├── OperationOptions (runtime-level defaults)
          ├── ThroughputControlGroupRegistry
          └── env OperationOptions (from environment variables)

  Usage:

      // Configure runtime-level options (shared across accounts/clients)
      let runtime = CosmosDriverRuntimeBuilder::new()
          .with_user_agent_suffix("my-app")
          .with_connection_pool(pool_opts)
          .with_operation_options(runtime_base_opts)
          .build()
          .await?;
      // Configure account-level options
      let driver_opts = DriverOptions::builder(account)
          .with_operation_options(account_level_opts)
          .build();
      let driver = runtime.get_or_create_driver(account, Some(driver_opts)).await?;
      // Pass the fully-configured driver into the SDK client
      let client = CosmosClient::builder()
          .with_driver(driver)
          .build(endpoint, credential, None)
          .await?;

  When with_driver() is set, build() should skip creating its own runtime and driver, and use the provided one directly. The SDK still creates its own gateway pipeline for operations not yet cut over. This also naturally solves the runtime-sharing problem: multiple CosmosClient instances can share the same CosmosDriverRuntime by passing in drivers created from the same runtime.
These gaps don't affect read_item correctness (the driver's defaults and env var fallbacks work), but they will matter as more operations are cut over and users expect SDK-level configuration to propagate to the driver.
How to Cut Over an Operation
To cut over another item operation (e.g., create_item), follow this template:
- Build the operation: Use the appropriate CosmosOperation::* factory method (e.g., CosmosOperation::create_item(container_ref, pk)).
- Attach the body: For write operations, serialize the item to bytes and call .with_body(bytes) on the operation.
- Wire session token and etag: These live on CosmosOperation, not OperationOptions. Set them inline before executing:

      if let Some(session_token) = options.session_token() {
          operation = operation.with_session_token(session_token.to_string());
      }
      if let Some(etag) = options.if_match_etag() {
          operation = operation.with_precondition(
              Precondition::if_match(ETag::new(etag.to_string())),
          );
      }

  This is separate from the bridge function because Ashley's options alignment (Cosmos: Options Alignment Step 1 of 2 - Align Driver Options with spec #4055) moved session token and etag to CosmosOperation (the operation itself carries per-request state, while OperationOptions carries cross-cutting config).
- Translate options: Reuse item_options_to_operation_options() from driver_bridge.rs. This handles excluded_regions and custom_headers.
- Execute: Call self.driver.execute_operation(operation, driver_options).await?.
- Bridge response: Call driver_response_to_cosmos_response(driver_response) to get a CosmosResponse<T>, then wrap it in the appropriate public response type (e.g., ItemResponse::new(cosmos_response) for item operations, ResourceResponse::new(cosmos_response) for resource operations).
The public method signature should not change.
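The six steps can be read end to end in the following sketch. Every type here is a simplified stand-in defined for the example (Operation, OperationOptions, Driver, EchoDriver, ItemResponse are not the real SDK or driver types), and execute_operation is synchronous so the sketch runs without an async runtime, whereas the real call is awaited. It is meant only to show the ordering of the steps, assuming a create_item cutover.

```rust
#[derive(Debug, Default)]
pub struct ItemOptions {
    pub session_token: Option<String>,
    pub if_match_etag: Option<String>,
    pub excluded_regions: Vec<String>,
}

#[derive(Debug, Default)]
pub struct Operation {
    pub kind: &'static str,
    pub body: Option<Vec<u8>>,
    pub session_token: Option<String>,
    pub if_match: Option<String>,
}

#[derive(Debug, Default)]
pub struct OperationOptions {
    pub excluded_regions: Vec<String>,
}

pub struct DriverResponse {
    pub status: u16,
    pub body: Vec<u8>,
}

pub struct ItemResponse {
    pub status: u16,
    pub body: Vec<u8>,
}

/// Stand-in for the driver; the real `execute_operation` is async.
pub trait Driver {
    fn execute_operation(&self, op: Operation, opts: OperationOptions)
        -> Result<DriverResponse, String>;
}

/// Test double that echoes the request body back with status 201.
pub struct EchoDriver;

impl Driver for EchoDriver {
    fn execute_operation(&self, op: Operation, _opts: OperationOptions)
        -> Result<DriverResponse, String> {
        match op.body {
            Some(body) => Ok(DriverResponse { status: 201, body }),
            None => Err("missing body".to_string()),
        }
    }
}

pub fn create_item(driver: &dyn Driver, item_json: &str, options: &ItemOptions)
    -> Result<ItemResponse, String> {
    // Steps 1-2: build the operation and attach the serialized body.
    let mut operation = Operation {
        kind: "create_item",
        body: Some(item_json.as_bytes().to_vec()),
        ..Default::default()
    };
    // Step 3: session token and etag live on the operation itself.
    if let Some(token) = &options.session_token {
        operation.session_token = Some(token.clone());
    }
    if let Some(etag) = &options.if_match_etag {
        operation.if_match = Some(etag.clone());
    }
    // Step 4: translate the cross-cutting options.
    let driver_options = OperationOptions {
        excluded_regions: options.excluded_regions.clone(),
    };
    // Step 5: execute through the driver.
    let driver_response = driver.execute_operation(operation, driver_options)?;
    // Step 6: bridge the driver response into the public wrapper type.
    Ok(ItemResponse { status: driver_response.status, body: driver_response.body })
}
```

Keeping the driver behind a trait in the sketch is a choice made for testability of the example; the document itself stores a concrete Arc<CosmosDriver> on the client.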
Response Type Wrapping
PR #3960 introduced dedicated public response types that wrap the internal CosmosResponse<T>. When cutting over an operation, use the appropriate wrapper:
| Public Type | Used For | Extra Fields |
|---|---|---|
| ItemResponse<T> | create/read/replace/upsert/delete item | etag() |
| ResourceResponse<T> | create/read/delete database/container | none |
| BatchResponse | transactional batch | etag() |
| QueryFeedPage<T> | query operations | index_metrics(), query_metrics() |
CosmosResponse<T> is now pub(crate). The bridge function driver_response_to_cosmos_response() returns CosmosResponse<T>, and the caller wraps it:
// In read_item:
Ok(ItemResponse::new(
    crate::driver_bridge::driver_response_to_cosmos_response(driver_response),
))
// In a future create_container:
Ok(ResourceResponse::new(
    crate::driver_bridge::driver_response_to_cosmos_response(driver_response),
))
CosmosResponse has two constructors:
- new(response, request): for gateway-routed operations (has a CosmosRequest)
- from_response(response): for driver-routed operations (no CosmosRequest, sets request: None)
Both constructors parse CosmosResponseHeaders from the raw HTTP headers and build CosmosDiagnostics (activity ID, server duration) automatically. The bridge's driver_response_headers_to_headers() ensures the driver's typed headers are converted back to raw headers so the SDK's parsing works correctly.
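A minimal sketch of the two-constructor shape, assuming both paths share one private builder that parses headers. The types are simplified stand-ins (the real CosmosResponse<T> is generic and wraps azure_core::Response<T>), and the private build helper plus the request_url and activity_id accessors are hypothetical names used only to show the behavior.

```rust
/// Stand-in for the gateway pipeline's request type.
#[derive(Debug, Clone, PartialEq)]
pub struct CosmosRequest {
    pub url: String,
}

/// Stand-in for a raw HTTP response.
#[derive(Debug, Clone, PartialEq)]
pub struct RawResponse {
    pub status: u16,
    pub headers: Vec<(String, String)>,
}

#[derive(Debug)]
pub struct CosmosResponse {
    response: RawResponse,
    request: Option<CosmosRequest>,
    activity_id: Option<String>,
}

impl CosmosResponse {
    /// Gateway-routed path: the original request is retained.
    pub fn new(response: RawResponse, request: CosmosRequest) -> Self {
        Self::build(response, Some(request))
    }

    /// Driver-routed path: there is no `CosmosRequest` to retain.
    pub fn from_response(response: RawResponse) -> Self {
        Self::build(response, None)
    }

    /// Shared header parsing, so both paths expose the same metadata.
    fn build(response: RawResponse, request: Option<CosmosRequest>) -> Self {
        let activity_id = response
            .headers
            .iter()
            .find(|(name, _)| name.eq_ignore_ascii_case("x-ms-activity-id"))
            .map(|(_, value)| value.clone());
        Self { response, request, activity_id }
    }

    pub fn status(&self) -> u16 {
        self.response.status
    }

    pub fn request_url(&self) -> Option<&str> {
        self.request.as_ref().map(|r| r.url.as_str())
    }

    pub fn activity_id(&self) -> Option<&str> {
        self.activity_id.as_deref()
    }
}
```

Because parsing happens in one place, a response built by either constructor reports the same activity ID; only request_url() differs.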
request_url() and Fault Injection Tests
ItemResponse::request_url() returns Option<Url> — None for driver-routed operations, Some(url) for gateway-routed operations. Other response types (ResourceResponse, BatchResponse) return Url directly since they are always gateway-routed.
For fault injection tests that verify failover endpoints:
- Gateway-routed operations: use .request_url().expect("...")
- Driver-routed operations: use if let Some(url) = response.request_url() { ... }
This means failover endpoint assertions are silently skipped for driver-routed reads. Once driver diagnostics expose the effective endpoint (tracked as future work), these assertions should be restored.
Driver Response Does Not Expose the Effective Endpoint
The driver's CosmosResponse returns the response body, headers, and status, but the endpoint (URL or region) that ultimately served the request is not exposed through the SDK's public API. This information is critical for:
- Failover verification tests — asserting that a request was routed to the expected region after a fault-triggered failover
- Diagnostics and observability — understanding which region served a request for debugging and performance analysis
The gateway pipeline tracked this via CosmosRequest (which held the final URL). The driver handles routing internally in the operation pipeline (resolve_endpoint → RoutingDecision) but does not propagate the resolved endpoint back through the SDK response.
Important: The driver already captures this information in its DiagnosticsContext. Each RequestDiagnostics entry records region: Option<Region> and endpoint: String for every attempt (including retries and failovers), along with ExecutionContext (Initial, Retry, RegionFailover, TransportRetry). The gap is not in the driver itself — it's in the SDK's CosmosDiagnostics type, which currently only exposes activity_id() and server_duration_ms() and does not forward the driver's per-request diagnostics.
Future work: The SDK's CosmosDiagnostics should expose the driver's RequestDiagnostics (or a subset of it) so that tests and users can inspect routing decisions. Once this is done:
- Failover tests can assert response.diagnostics().requests().last().region() instead of response.request_url().host_str()
- The request_url() → Option workaround can be removed entirely
- Users get richer observability than a single URL: they see every attempt, region, and retry context
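One possible shape for that future surface is sketched below. The field names (region, endpoint) and the ExecutionContext variants are taken from the description of the driver's RequestDiagnostics above; everything else, including the record and final_region methods and the use of plain String for the region, is a hypothetical stand-in and not an existing SDK API.

```rust
/// Mirrors the execution contexts listed for the driver's diagnostics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionContext {
    Initial,
    Retry,
    RegionFailover,
    TransportRetry,
}

/// Stand-in for one per-attempt diagnostics entry.
#[derive(Debug, Clone, PartialEq)]
pub struct RequestDiagnostics {
    pub region: Option<String>,
    pub endpoint: String,
    pub context: ExecutionContext,
}

/// Hypothetical SDK-side diagnostics that forwards every attempt.
#[derive(Debug, Default)]
pub struct CosmosDiagnostics {
    requests: Vec<RequestDiagnostics>,
}

impl CosmosDiagnostics {
    pub fn record(&mut self, attempt: RequestDiagnostics) {
        self.requests.push(attempt);
    }

    /// All attempts in the order they were made.
    pub fn requests(&self) -> &[RequestDiagnostics] {
        &self.requests
    }

    /// Region of the final attempt, i.e. the one that served the request.
    pub fn final_region(&self) -> Option<&str> {
        self.requests.last().and_then(|r| r.region.as_deref())
    }
}
```

A failover test could then assert on final_region() and on the context of the last entry, rather than comparing URL hosts.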
Tests with skipped endpoint assertions (these should be restored once the driver exposes the effective endpoint):
| Test File | Test Name | What it verifies |
|---|---|---|
| cosmos_items.rs | assert_response helper (all item tests) | Endpoint matches expected host |
| cosmos_fault_injection.rs | fault_injection_429_retry_with_hit_limit | Endpoint matches hub |
| cosmos_multi_write_retry_policies.rs | read_cross_region_retry_on_408 | Failover to satellite region |
| cosmos_multi_write_retry_policies.rs | read_cross_region_retry_on_500 | Failover to satellite region |
| cosmos_multi_write_fault_injection.rs | fault_injection_read_unaffected_by_create_rule | Endpoint matches hub |
| cosmos_multi_write_fault_injection.rs | fault_injection_read_region_retry_503 | Failover to satellite region |
| cosmos_multi_write_fault_injection.rs | fault_injection_read_session_retry_404_1002 | Failover to hub region |
| cosmos_multi_write_fault_injection.rs | fault_injection_read_connection_error_failover | Failover to satellite region |
| cosmos_multi_write_fault_injection.rs | fault_injection_read_response_timeout_retries_to_satellite | Failover to satellite region |
| cosmos_multi_write_fault_injection.rs | fault_injection_connection_error_local_retry_succeeds | Stays on hub (no failover) |
Files Changed in read_item Cutover (PR #4053)
This section records the files modified by the initial read_item cutover for reference. Subsequent operation cutover PRs will touch a subset of these (primarily container_client.rs and potentially driver_bridge.rs).
| File | Change |
|---|---|
| azure_data_cosmos_driver/src/options/operation_options.rs | Added custom_headers field + setter/getter |
| azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs | Wired custom headers into request construction |
| azure_data_cosmos_driver/src/models/partition_key.rs | Made PartitionKeyValue pub |
| azure_data_cosmos_driver/src/models/mod.rs | Re-exported PartitionKeyValue |
| azure_data_cosmos/src/driver_bridge.rs | New: shared conversion module |
| azure_data_cosmos/src/clients/container_client.rs | Added driver/container_ref fields; rewrote read_item |
| azure_data_cosmos/src/models/cosmos_response.rs | Made request field optional |
| azure_data_cosmos/src/partition_key.rs | Added into_driver_partition_key() |
| azure_data_cosmos/src/options/mod.rs | Added pub(crate) accessors for bridge |
| azure_data_cosmos/src/pipeline/mod.rs | Updated CosmosResponse::new call site |
| azure_data_cosmos/src/lib.rs | Registered mod driver_bridge |
Remaining Work
- Options alignment: Ashley is working on aligning SDK options with the driver's options model (PR Cosmos: Options Alignment Step 1 of 2 - Align Driver Options with spec #4055). Once complete, the ItemOptions → OperationOptions translation may simplify or become unnecessary.
- PartitionKey unification: The dual PartitionKey types and into_driver_partition_key() conversion should be eliminated once the types are unified.
- CosmosRequest removal: Once all operations are routed through the driver, the Option<CosmosRequest> field on CosmosResponse<T> can be removed entirely.
- custom_headers review: The custom_headers field on OperationOptions was added for feature parity. It may be removed as we analyze which options are truly needed at the driver level.
- Remaining item operations: create_item, delete_item, replace_item, upsert_item, and query operations (query_items) need to be cut over following the pattern above. Note that query_items currently uses QueryExecutor + gateway pipeline and has a fundamentally different flow (pagination via QueryFeedPage) that will need special bridge logic.
- Database and container CRUD operations: The following operations on CosmosClient, DatabaseClient, and ContainerClient are still gateway-routed and will need to be cut over:
  - CosmosClient: create_database, query_databases
  - DatabaseClient: read (database), create_container, query_containers, delete (database), read_throughput, replace_throughput
  - ContainerClient: read (container), replace (container), delete (container), read_throughput, replace_throughput

  These use ResourceResponse (not ItemResponse) and may have different options types, but the bridge pattern (build operation → translate options → execute → bridge response) should apply similarly. Throughput operations go through OffersClient, which uses QueryExecutor for reads and CosmosRequest for writes.
Fault Injection Wiring
The read_item cutover required connecting the SDK's fault injection system to the driver's. This section documents the established wiring so that future operation cutover PRs do not need to repeat it.
Problem (Resolved)
The SDK and driver each have their own fault injection module (azure_data_cosmos::fault_injection and azure_data_cosmos_driver::fault_injection). They define parallel but separate types (FaultInjectionRule, FaultInjectionCondition, FaultInjectionResult, etc.) with identical variants but different Rust types. Prior to the read_item cutover, only the gateway pipeline received fault injection rules — the driver was built without them.
Solution: Rule Translation with Shared State (Established)
The bridge module (driver_bridge.rs) includes sdk_fi_rules_to_driver_fi_rules(), which translates SDK fault injection rules into driver fault injection rules. The translation covers:
- FaultOperationType: variant-by-variant match (identical variant names)
- FaultInjectionErrorType: variant-by-variant match
- FaultInjectionCondition: RegionName → Region, operation type and container ID mapped directly
- FaultInjectionResult: Duration → Option<Duration>, probability copied
- Timing fields: start_time: Instant → Option<Instant>, end_time and hit_limit copied
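The variant-by-variant translation can be sketched with two parallel modules. This is an illustration only: the sdk and driver modules, the two-variant enums (ReadItem is mentioned in this document; CreateItem is added purely for the example), the reduced Condition structs, and the translate_* function names are all stand-ins rather than the real fault injection types.

```rust
pub mod sdk {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum FaultOperationType {
        ReadItem,
        CreateItem,
    }

    #[derive(Debug, Clone)]
    pub struct Condition {
        pub operation: Option<FaultOperationType>,
        pub region: Option<String>, // stands in for RegionName
    }
}

pub mod driver {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum FaultOperationType {
        ReadItem,
        CreateItem,
    }

    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct Region(pub String);

    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct Condition {
        pub operation: Option<FaultOperationType>,
        pub region: Option<Region>,
    }
}

/// Exhaustive match: adding a variant on one side without the other
/// becomes a compile error rather than a silent mismatch.
pub fn translate_operation(op: sdk::FaultOperationType) -> driver::FaultOperationType {
    match op {
        sdk::FaultOperationType::ReadItem => driver::FaultOperationType::ReadItem,
        sdk::FaultOperationType::CreateItem => driver::FaultOperationType::CreateItem,
    }
}

/// Field-by-field translation of a condition, wrapping the region name
/// in the driver's own region type.
pub fn translate_condition(c: &sdk::Condition) -> driver::Condition {
    driver::Condition {
        operation: c.operation.map(translate_operation),
        region: c.region.clone().map(driver::Region),
    }
}
```

Writing the match without a wildcard arm is the design point: the compiler then flags any drift between the two enums.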
Shared Mutable State
SDK FaultInjectionRule has enabled: Arc<AtomicBool> and hit_count: Arc<AtomicU32> that tests mutate at runtime (.disable(), .enable(), .hit_count()). The driver's FaultInjectionRuleBuilder accepts external Arcs via with_shared_state(), so both the SDK gateway path and the driver path reference the same atomic state. This means:
- Calling .disable() on the SDK rule also disables it in the driver
- Hit counts are shared: both paths increment the same counter
- Tests that toggle rules or assert hit counts work correctly across both paths
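The shared-state behavior relies on both rules holding clones of the same Arc. A minimal sketch using only the standard library follows; SdkRule and DriverRule are stand-ins for the two FaultInjectionRule types, and try_apply is a hypothetical method representing one rule evaluation on the driver path.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;

/// Stand-in for the SDK rule, which owns the shared atomics.
pub struct SdkRule {
    enabled: Arc<AtomicBool>,
    hit_count: Arc<AtomicU32>,
}

impl SdkRule {
    pub fn new() -> Self {
        Self {
            enabled: Arc::new(AtomicBool::new(true)),
            hit_count: Arc::new(AtomicU32::new(0)),
        }
    }

    pub fn disable(&self) {
        self.enabled.store(false, Ordering::SeqCst);
    }

    pub fn enable(&self) {
        self.enabled.store(true, Ordering::SeqCst);
    }

    pub fn hit_count(&self) -> u32 {
        self.hit_count.load(Ordering::SeqCst)
    }
}

/// Stand-in for the driver rule, which holds clones of the same `Arc`s.
pub struct DriverRule {
    enabled: Arc<AtomicBool>,
    hit_count: Arc<AtomicU32>,
}

impl DriverRule {
    /// Analogous to the `with_shared_state()` hook described above.
    pub fn with_shared_state(sdk: &SdkRule) -> Self {
        Self {
            enabled: Arc::clone(&sdk.enabled),
            hit_count: Arc::clone(&sdk.hit_count),
        }
    }

    /// Counts a hit and returns true only while the rule is enabled.
    pub fn try_apply(&self) -> bool {
        if self.enabled.load(Ordering::SeqCst) {
            self.hit_count.fetch_add(1, Ordering::SeqCst);
            true
        } else {
            false
        }
    }
}
```

A hit recorded through the driver rule is immediately visible from the SDK rule's hit_count(), and disabling the SDK rule stops the driver rule from applying.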
Wiring in CosmosClientBuilder
In CosmosClientBuilder::build():
- Before the FaultInjectionClientBuilder is consumed for the gateway transport, rules() extracts a reference to the SDK rules
- sdk_fi_rules_to_driver_fi_rules() translates them to driver rules with shared state
- The translated rules are passed to CosmosDriverRuntimeBuilder::with_fault_injection_rules()
- The SDK's fault_injection Cargo feature now forwards to the driver's fault_injection feature
Test Patterns for Subsequent Cutover
When cutting over additional operations, no additional fault injection wiring is needed — it was handled once at the CosmosClientBuilder level in PR #4053. However, tests need to account for two behavioral differences:
request_url() returns None for driver-routed operations:
// Gateway-routed operations return Some(url)
// Driver-routed operations return None
if let Some(url) = response.request_url() {
    assert_eq!(url.host_str().unwrap(), expected_endpoint);
}
Hit-count asymmetry between gateway and driver paths:
The driver retries certain errors internally (e.g., 500 on reads triggers up to 3 failover retries). Each retry attempt evaluates fault injection rules independently, so a single SDK-level read_item call can consume up to 4 fault injection hits (initial + 3 retries). In contrast, the gateway path typically consumes 1 hit per SDK call.
When writing hit_limit-based tests for driver-routed operations, multiply the expected hits per call by the driver's retry budget:
// Each read_item call consumes up to 4 hits (1 initial + 3 failover retries).
// For 2 calls to fail: 2 × 4 = 8 hits.
let rule = FaultInjectionRuleBuilder::new("test", error)
    .with_hit_limit(8) // not 2 or 4
    .build();
This asymmetry will disappear once all operations are driver-routed, since there will be only one hit-counting path.
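The arithmetic behind the hit limit can be checked with a small model. It assumes every attempt hits the rule until the limit is exhausted and that a call fails only if all of its attempts hit; both functions are hypothetical helpers for reasoning about tests, not part of the SDK or driver.

```rust
/// Worst-case hits needed for `calls` SDK-level calls to fail when each
/// call makes one initial attempt plus up to `max_retries` retries.
pub fn required_hit_limit(calls: u32, max_retries: u32) -> u32 {
    calls * (1 + max_retries)
}

/// Simulate how many of `calls` SDK-level calls fail for a given limit.
/// A call that runs out of hits partway through succeeds on a later
/// attempt and uses up whatever hits were left.
pub fn failing_calls(hit_limit: u32, max_retries: u32, calls: u32) -> u32 {
    let attempts_per_call = 1 + max_retries;
    let mut remaining = hit_limit;
    let mut failed = 0;
    for _ in 0..calls {
        if remaining >= attempts_per_call {
            remaining -= attempts_per_call;
            failed += 1;
        } else {
            remaining = 0;
        }
    }
    failed
}
```

Under this model, a limit of 2 with 3 retries fails no calls and a limit of 4 fails only one, which is why the example above uses 8 for two failing calls.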
custom_response Translation (Not Yet Implemented)
Translation of CustomResponse (synthetic HTTP responses) is not yet implemented. None of the current tests use custom responses for ReadItem operations. When needed, the bridge function should be extended to translate CustomResponse fields (status_code, headers, body).
Consolidating to Driver Fault Injection (After Full Cutover)
The current dual-system architecture (SDK fault injection + driver fault injection + translation bridge) exists only because the cutover is incremental — some operations still go through the gateway while others go through the driver. Once all operations are routed through the driver (see Post-Cutover Cleanup Checklist):
- Drop azure_data_cosmos::fault_injection: the SDK's HTTP-client-level fault interception module becomes unreachable. Delete the entire src/fault_injection/ directory.
- Re-export driver types: the SDK re-exports the driver's fault injection types directly:
  #[cfg(feature = "fault_injection")]
  pub use azure_data_cosmos_driver::fault_injection;
- Remove the translation layer: sdk_fi_rules_to_driver_fi_rules() in driver_bridge.rs and the shared_enabled()/shared_hit_count() accessors on the SDK rule are no longer needed.
- Simplify CosmosClientBuilder: with_fault_injection() accepts Vec<Arc<driver::FaultInjectionRule>> directly and passes them to CosmosDriverRuntimeBuilder::with_fault_injection_rules(). No translation, no cloning, no intermediary builder.
- Update tests: tests construct driver FaultInjectionRule directly (same builders, same API) instead of SDK rules.
At that point the SDK has no fault injection logic of its own — it's a pass-through to the driver, matching the overall "SDK as thin wrapper" goal. The driver is the single source of truth for all transport-related concerns including fault injection.
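The pass-through described in the list above could look roughly like the following. All three types are simplified stand-ins declared locally so the sketch compiles on its own. The id field and into_runtime_builder() are hypothetical, and the real builders have more state and different construction paths.

```rust
use std::sync::Arc;

/// Stand-in for the driver's FaultInjectionRule (hypothetical `id` field).
#[derive(Debug)]
struct FaultInjectionRule {
    id: String,
}

/// Stand-in for the driver runtime builder.
#[derive(Default)]
struct CosmosDriverRuntimeBuilder {
    rules: Vec<Arc<FaultInjectionRule>>,
}

impl CosmosDriverRuntimeBuilder {
    fn with_fault_injection_rules(mut self, rules: Vec<Arc<FaultInjectionRule>>) -> Self {
        self.rules = rules;
        self
    }
}

/// Stand-in for the SDK client builder after full cutover.
#[derive(Default)]
struct CosmosClientBuilder {
    fault_rules: Vec<Arc<FaultInjectionRule>>,
}

impl CosmosClientBuilder {
    /// Accepts driver rules directly; nothing is translated or deep-copied.
    fn with_fault_injection(mut self, rules: Vec<Arc<FaultInjectionRule>>) -> Self {
        self.fault_rules = rules;
        self
    }

    /// Hypothetical hand-off step: forwards the same Arc handles to the driver.
    fn into_runtime_builder(self) -> CosmosDriverRuntimeBuilder {
        CosmosDriverRuntimeBuilder::default().with_fault_injection_rules(self.fault_rules)
    }
}
```

Because the Arc handles are forwarded unchanged, a test that keeps its own handle to a rule observes the same rule instance the driver evaluates, which is what removes the need for the shared-state accessors.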
Post-Cutover Cleanup Checklist
The cutover introduces several interim artifacts that exist only because the migration is incremental. This section consolidates what can be removed and when, organized by trigger.
After options alignment completes
These can be cleaned up once Ashley's options alignment work (PR #4055 and follow-ups) unifies SDK and driver option/model types:
- Remove into_driver_partition_key(): the SDK's PartitionKey type and the driver's PartitionKey type should be unified. Once they are, the variant-by-variant mapping in partition_key.rs and the pub visibility change on PartitionKeyValue are no longer needed.
- Simplify item_options_to_operation_options(): if ItemOptions and OperationOptions converge, the field-by-field bridge in driver_bridge.rs can be reduced or eliminated.
After SDK diagnostics exposes driver RequestDiagnostics
These can be cleaned up once CosmosDiagnostics forwards the driver's per-request diagnostics (region, endpoint, execution context):
- Remove request_url() → Option workaround: ItemResponse::request_url() currently returns Option<Url> because driver-routed operations have no CosmosRequest. Once diagnostics expose the effective endpoint, this method can be removed entirely in favor of response.diagnostics().requests().last().region().
- Restore skipped endpoint assertions in tests: the tests listed in the "Tests with skipped endpoint assertions" table currently use if let Some(url) = response.request_url(), silently skipping the assertion for driver-routed operations. These should be rewritten to assert against the diagnostics endpoint.
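A diagnostics-based assertion might be structured as in the sketch below. The diagnostics surface is not finalized, so both structs, their fields, and the two helper functions are hypothetical stand-ins. Only the requests().last() access pattern is taken from the text above.

```rust
/// Hypothetical stand-in for the driver's per-request diagnostics.
struct RequestDiagnostics {
    region: String,
    endpoint: String,
}

/// Hypothetical stand-in for the SDK's CosmosDiagnostics.
struct CosmosDiagnostics {
    requests: Vec<RequestDiagnostics>,
}

impl CosmosDiagnostics {
    /// One entry per attempt, in the order the attempts were made.
    fn requests(&self) -> &[RequestDiagnostics] {
        &self.requests
    }
}

/// Endpoint that served the final attempt, or None if nothing was recorded.
fn effective_endpoint(diagnostics: &CosmosDiagnostics) -> Option<&str> {
    diagnostics.requests().last().map(|r| r.endpoint.as_str())
}

/// Region that served the final attempt, or None if nothing was recorded.
fn effective_region(diagnostics: &CosmosDiagnostics) -> Option<&str> {
    diagnostics.requests().last().map(|r| r.region.as_str())
}
```

Unlike the if let Some(url) pattern, a test written against helpers like these fails when no attempt was recorded instead of silently skipping the check.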
After all item operations are driver-routed
These can be cleaned up once create_item, delete_item, replace_item, upsert_item, and query_items are all routed through the driver:
- Remove hit-count multiplier in fault injection tests: tests currently multiply hit_limit by 4 to account for the driver's internal retry budget (e.g., hit_limit(8) instead of hit_limit(2)). Once all item operations go through the driver, there is only one hit-counting path and the expected counts become straightforward.
- Remove ContainerConnection: the ContainerConnection type in handler/container_connection.rs wraps GatewayPipeline for item operations. Once all item operations use the driver, this type is no longer needed for item routing (it may still be needed for container-level CRUD until those are also cut over).
After all operations are driver-routed (full cutover)
These require every operation — item CRUD, database/container CRUD, query, and throughput — to be routed through the driver:
- Remove Option<CosmosRequest> from CosmosResponse<T>: the request field was made optional to support the interim state where some operations have a CosmosRequest and others don't. Once no operation produces a CosmosRequest, the field can be removed entirely.
- Remove CosmosRequest and CosmosResponse::new(response, request): the gateway-oriented constructor and the CosmosRequest type itself become dead code. Only CosmosResponse::from_response(response) remains.
- Remove GatewayPipeline: the entire gateway execution path (pipeline/mod.rs, CosmosRequest builder, auth/routing/retry logic in the SDK) is superseded by the driver. This is the largest single cleanup.
- Remove ContainerConnection: if not already removed after item cutover, it can be fully deleted now.
- Remove QueryExecutor: the gateway-based query executor is replaced by driver query execution. query_items and query_databases/query_containers all go through the driver.
- Remove OffersClient: throughput read/replace operations currently route through this gateway-based helper. Once throughput operations go through the driver, this client is dead code.
- Drop azure_data_cosmos::fault_injection module: delete the entire src/fault_injection/ directory. The SDK re-exports the driver's fault injection types directly.
- Remove sdk_fi_rules_to_driver_fi_rules() and shared-state accessors: the translation bridge in driver_bridge.rs and shared_enabled()/shared_hit_count() on the SDK rule are no longer needed.
- Simplify CosmosClientBuilder: with_fault_injection() accepts driver FaultInjectionRule directly. No translation, no intermediary builder.
- Update all fault injection tests: tests construct driver FaultInjectionRule directly instead of SDK rules.
- Review custom_headers on OperationOptions: this was added for feature parity with the gateway path. Determine whether it's still needed or can be removed at the driver level.
- Clean up driver_bridge.rs: after all the above, this module should contain only driver_response_to_cosmos_response() and driver_response_headers_to_headers() (the response bridge). All options translation and fault injection translation code is gone.
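For context on the first two items in this list, the interim response shape can be pictured as follows. This is a simplified sketch, not the SDK's actual definition: the real types wrap an HTTP response and a Url, while this version uses a plain body value and a String so that it compiles on its own.

```rust
/// Simplified stand-in for the gateway-era request record.
#[derive(Debug, Clone, PartialEq)]
struct CosmosRequest {
    url: String,
}

/// Simplified stand-in for the interim response wrapper.
#[derive(Debug)]
struct CosmosResponse<T> {
    body: T,
    /// Some(..) for gateway-routed operations, None for driver-routed ones.
    request: Option<CosmosRequest>,
}

impl<T> CosmosResponse<T> {
    /// Gateway-oriented constructor; becomes dead code after full cutover.
    fn new(body: T, request: CosmosRequest) -> Self {
        Self { body, request: Some(request) }
    }

    /// Driver-oriented constructor; the only one expected to remain.
    fn from_response(body: T) -> Self {
        Self { body, request: None }
    }

    /// Returns None whenever no CosmosRequest was recorded.
    fn request_url(&self) -> Option<&str> {
        self.request.as_ref().map(|r| r.url.as_str())
    }
}
```

Once the request field is removed, request_url() has no data to return, which is why its removal is tied to diagnostics exposing the effective endpoint.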