[Internal] Add dynamic timeout escalation spec#3871
tvaron3 wants to merge 14 commits into Azure:release/azure_data_cosmos-previews
Conversation
Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sdk/cosmos/azure_data_cosmos_driver/docs/DYNAMIC_TIMEOUTS_SPEC.md
Resolve 3 of 4 open questions based on review feedback:
- Commit to the Context-based approach (Option A) for delivering per-attempt timeouts, removing Options B/C. No changes needed in azure_core or typespec_client_core.
- Decide that hedged attempts always start at tier 0 of the timeout ladder, since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext for observability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
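The tier-0 rule for hedged attempts can be sketched as a tiny helper (the function name and signature are illustrative, not from the spec):

```rust
/// Hedged attempts target independent regional endpoints, so they
/// restart at tier 0 of the timeout ladder rather than inheriting
/// the primary attempt's escalated tier. (Illustrative sketch.)
fn ladder_tier(is_hedged_attempt: bool, timeout_retry_count: usize) -> usize {
    if is_hedged_attempt {
        0
    } else {
        timeout_retry_count
    }
}

fn main() {
    // A primary attempt on its second timeout retry uses tier 2.
    assert_eq!(ladder_tier(false, 2), 2);
    // A hedged attempt spawned at the same point starts back at tier 0.
    assert_eq!(ladder_tier(true, 2), 0);
}
```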
Update the dynamic timeout spec to reflect the new pipeline architecture from PR Azure#3887 (Driver Transport step 02):
- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage loop in operation_pipeline.rs
- Change timeout delivery from the Context type-map to a TransportRequest field, matching the new pipeline data flow
- Update the goal from "increase retry count" to "leverage existing retry budget", since max_failover_retries already defaults to 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Propose adding an optional per-request timeout field to typespec_client_core::http::Request with set_timeout()/timeout() methods. The reqwest HttpClient implementation applies it via reqwest::RequestBuilder::timeout(). For connection timeout escalation, the reqwest client is built with the maximum ladder value (5s) and the per-request timeout provides the tighter overall bound on each attempt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all connection timeout escalation from the spec. Connection timeouts remain static (configured via ConnectionPoolOptions). Only request timeouts are escalated per retry attempt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection timeouts use a failure-rate adaptive model instead of per-retry escalation:
- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors the connection failure rate
- On sustained failures, create new HttpClient instances with a 5s connect_timeout (a one-time, persistent transition)
- No azure_core changes needed — entirely internal to the sharded transport

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
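The one-time transition described in this commit can be sketched with lock-free atomics; the struct, method names, and the >3-consecutive-failures threshold are illustrative, drawn from the figures discussed in this PR rather than from the merged code:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::time::Duration;

/// Sketch of the failure-rate adaptive connect-timeout model:
/// after a sustained run of connection failures, the transport
/// transitions once (persistently) from the 1s initial connect
/// timeout to 5s. Names and threshold are illustrative.
struct AdaptiveConnectTimeout {
    escalated: AtomicBool,
    consecutive_failures: AtomicU32,
}

const FAILURE_THRESHOLD: u32 = 3;

impl AdaptiveConnectTimeout {
    fn new() -> Self {
        Self {
            escalated: AtomicBool::new(false),
            consecutive_failures: AtomicU32::new(0),
        }
    }

    fn record_success(&self) {
        self.consecutive_failures.store(0, Ordering::Relaxed);
    }

    /// Returns true if this failure triggered the one-time escalation.
    fn record_failure(&self) -> bool {
        let failures = self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1;
        failures > FAILURE_THRESHOLD && !self.escalated.swap(true, Ordering::Relaxed)
    }

    fn connect_timeout(&self) -> Duration {
        if self.escalated.load(Ordering::Relaxed) {
            Duration::from_secs(5) // persistent escalated value
        } else {
            Duration::from_secs(1) // fast initial value
        }
    }
}

fn main() {
    let state = AdaptiveConnectTimeout::new();
    state.record_success();
    assert_eq!(state.connect_timeout(), Duration::from_secs(1));
    for _ in 0..3 {
        state.record_failure();
    }
    // Threshold is strictly "more than 3", so no escalation yet.
    assert_eq!(state.connect_timeout(), Duration::from_secs(1));
    // The 4th consecutive failure triggers the one-time transition.
    assert!(state.record_failure());
    assert_eq!(state.connect_timeout(), Duration::from_secs(5));
}
```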
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts; fix off-by-one semantics with an exclusive range (3 = 3 total attempts)
- primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor; implement Drop to cancel the background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/mark_unavailable calls to avoid lost concurrent updates
- Update the PR description to match the current spec (no env vars, non-blocking startup, no optional blocking)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…om/Azure/azure-sdk-for-rust into tvaron3/dynamicTimeouts
tvaron3
left a comment
PR Deep Review — §10 Adaptive Connection Timeout
Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).
- Remove the direct mode non-goal (won't be implemented)
- Resolve the connection failure threshold: >3 consecutive failures per endpoint triggers the 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for immediate reclamation by the health sweep
- Explicitly state the per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ShardedHttpTransport is no longer "future integration" — it exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix the HttpClientConfig connect_timeout claim — connect_timeout is sourced from ConnectionPoolOptions::max_connect_timeout() inside the factory build() method, not as a config field
- Reference the existing per-request timeout mechanism in the transport pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix section numbering (missing §9)
- Add rationale for the 65s tier-3 jump
- Add a cross-reference for ladder saturation in §5
- Add the typespec_client_core cross-crate PR strategy
- Clarify adaptive connection timeout: idempotent transition, per-endpoint aggregation, no-fallback rationale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major architectural changes to align with codebase patterns:
- Move timeout escalation to the transport pipeline level, indexed by transport-level timeout retries (not operation failover)
- Use a deadline-based pattern (Instant) instead of adding a Duration field to TransportRequest — feeds into the existing remaining_request_timeout() mechanism
- Remove the TimeoutLadder struct; use a free function + constants matching DOP style
- Remove the AttemptTimeoutDiagnostics struct; record a flat field directly on RequestDiagnostics
- Clarify that the sleep-race is the enforcement mechanism; Request::set_timeout is the long-term replacement (never both active)
- Use AtomicBool/AtomicU32 for adaptive connection timeout state, matching the lock-free hot-path pattern
- Reference PipelineType for ladder selection
- Explicit clamping order: pool bounds first, deadline last
- Add a note that the min_* timeout fields are unused placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
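The deadline-based pattern and the clamping order from this commit (pool bounds first, deadline last) can be sketched as follows; the function name and parameter names are illustrative, not taken from the driver:

```rust
use std::time::{Duration, Instant};

/// Sketch of the clamping order described in the commit message:
/// 1. Start from the ladder value for this attempt.
/// 2. Clamp it into the pool's [min, max] bounds.
/// 3. Finally clamp to whatever remains before the end-to-end deadline.
/// Names are illustrative; the driver feeds the result into its
/// existing remaining-timeout mechanism.
fn effective_attempt_timeout(
    ladder_value: Duration,
    pool_min: Duration,
    pool_max: Duration,
    deadline: Instant,
    now: Instant,
) -> Duration {
    let bounded = ladder_value.clamp(pool_min, pool_max);
    let remaining = deadline.saturating_duration_since(now);
    bounded.min(remaining)
}

fn main() {
    let now = Instant::now();
    // Tier-2 ladder value (65s) is clamped by a hypothetical 30s pool
    // max, then by the 8s remaining before the end-to-end deadline.
    let t = effective_attempt_timeout(
        Duration::from_secs(65),
        Duration::from_secs(1),
        Duration::from_secs(30),
        now + Duration::from_secs(8),
        now,
    );
    assert_eq!(t, Duration::from_secs(8));
}
```

Using a deadline (`Instant`) rather than passing a `Duration` along keeps each retry attempt's budget consistent with the overall operation deadline, no matter how much time earlier attempts consumed.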
Pull request overview
Adds an internal design specification for dynamic timeout escalation and adaptive connection timeout behavior in azure_data_cosmos_driver, intended to improve retry success rates while preserving fast-path latency.
Changes:
- Introduces a spec for per-timeout ladders (data plane: 6s→10s→65s; metadata: 5s→10s→20s) applied only on timeout-driven retries.
- Defines the intended clamping order between ladder values, `ConnectionPoolOptions` bounds, and end-to-end deadlines.
- Proposes an adaptive connect-timeout model (1s→5s) tied to sustained connection failures.
…iews' into tvaron3/dynamicTimeouts
- Fix the metadata min timeout max bound: 65s → 6s (matches code)
- Reconcile the adaptive 1s initial value with the existing 5s default: 1s is a new internal initial value that transitions to max_connect_timeout
- Reference Step 6 (PR Azure#3957, now merged) as the introducing PR for ShardedHttpTransport
- Pull latest from release/azure_data_cosmos-previews

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> #### Required Changes
>
> **In `typespec_client_core`** — Add an optional per-request timeout to `Request`:
All client and client-method customization is done via the client options and client method options, the same as in every Azure SDK language. You need to add a field either to your client methods or to azure_core::http::ClientMethodOptions. Customers should never have to interact with Request. And even if this were for client method implementations, that's what Context is for: add a type to it that permeates all the way through the request pipeline. There's no need to modify Request itself.
Your pipeline should still be similar to the core one. I appreciate why it was necessary and @analogrelay and I discussed this a while ago. I didn't expect it to go off completely on its own.
You'd have to define a field on ClientMethodOptions or your own variant of that. We were trying to avoid deeply nesting, but if you have your own pipeline anyway (which can still look similar to the core pipeline) you can have your own ClientMethodOptions as well.
This affects the public API which is an architectural concern. Per all languages guidelines, SDKs are idiomatic but consistent. Using Cosmos should feel similar (within reason) to any other service crate.
> Your pipeline should still be similar to the core one.
I recognize that, but there are some pretty significant differences here:
- Cosmos has its own diagnostics payloads used across all SDKs (we still integrate with OTel and the azure_core/tracing stack, but there is much more beyond that)
- The Cosmos backend needs user-agent reporting in a different way (prefix instead of suffix, IIRC)
- We need to deal with HTTP/2 limitations on the backend (the 20-streams-per-connection limit means we have to manage multiple connections)
- We have partition-level failover, regional failover, circuit breakers, background cache refreshes, and more that just doesn't fit cleanly into the pipeline as a Policy.
- Cosmos "Operations" often involve several HTTP requests with significant shared context. Cross-partition queries, ReadMany, etc.
- We have practical experience with HTTP proxies degrading the Cosmos DB availability guarantees, and need to control their use
- Custom HTTP headers don't work the way customers would expect (they aren't guaranteed to be honoured by the backend when using the newer binary protocol)
We can aim for some consistency here, for sure. We're trying to follow guidelines as closely as we can, but I really want to avoid giving customers the false impression of consistency. Where there is true consistency, we should strive to align.
I think this is all mostly aligned with what we talked about yesterday, but I just wanted to include it here for posterity.
> When a Cosmos DB request times out and is retried, using the same timeout value for the retry is often suboptimal. Transient network issues or momentary server load spikes may cause an initial
This is why the Retry-After header exists and is used by every Azure SDK language. Why does that not work here? Do you need a client-driven alternative?
Retry-After only applies if we receive a response. This is for scenarios where the request times out due to the server doing work that takes longer than expected.
Correct, we need to be able to trigger things like hedging where we allow the client to try making a fresh request. Our retry policy goes quite far beyond Retry-After.
```rust
const DATAPLANE_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

const METADATA_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

// Inside execute_transport_pipeline():
let ladder = match pipeline_type {
    PipelineType::DataPlane => DATAPLANE_REQUEST_TIMEOUT_LADDER,
    PipelineType::Metadata => METADATA_REQUEST_TIMEOUT_LADDER,
};
let mut timeout_retry_count = 0_usize;
```
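Per the spec's ladder-saturation note (§5), the retry count can exceed the ladder length, in which case the attempt stays at the final tier. A minimal self-contained sketch of that lookup (the function name `timeout_for_attempt` is illustrative, not from the PR):

```rust
use std::time::Duration;

const DATAPLANE_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

/// Returns the per-attempt timeout for a given timeout-retry count,
/// saturating at the last ladder tier once the ladder is exhausted.
fn timeout_for_attempt(ladder: &[Duration], timeout_retry_count: usize) -> Duration {
    ladder[timeout_retry_count.min(ladder.len() - 1)]
}

fn main() {
    assert_eq!(
        timeout_for_attempt(DATAPLANE_REQUEST_TIMEOUT_LADDER, 0),
        Duration::from_secs(6)
    );
    assert_eq!(
        timeout_for_attempt(DATAPLANE_REQUEST_TIMEOUT_LADDER, 2),
        Duration::from_secs(65)
    );
    // Beyond the ladder, stay at the final tier.
    assert_eq!(
        timeout_for_attempt(DATAPLANE_REQUEST_TIMEOUT_LADDER, 7),
        Duration::from_secs(65)
    );
}
```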
This whole thing would better be implemented as a custom RetryPolicy Cosmos can add when it creates the Pipeline in its client constructor. You can't remove the built-in RetryPolicy, but you can pass RetryMode::none() instead (this was easier than any default removal options or exclusion options we came up with for Rust).
We use our own pipeline due to the unique constraints for availability and latency. This is following our current architecture.
Between this, region routing, hedging, HTTP/2, custom protocols, etc. I just don't think the azure_core pipeline abstractions will help us much here, and trying to rearchitect the abstractions so they can work for us seems counter-productive here.
Replace the typespec_client_core/azure_core dependency with a driver-internal approach using reqwest::RequestBuilder::timeout() directly per request. No cross-crate changes needed. Key changes:
- Remove all typespec_client_core Request/ClientMethodOptions proposals — the driver controls reqwest directly via HttpClientFactory
- Enforce the per-attempt timeout via reqwest::RequestBuilder::timeout() instead of an azure_core::sleep() race
- Add an options hierarchy section showing how users control timeout bounds (ConnectionPoolOptions) and the e2e deadline (RuntimeOptions)
- Add a thin client (Gateway 2.0) future ladder note: 6s, 6s, 10s
- Fix markdown lint: table alignment, heading levels, list formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dynamic Timeout Spec
Adds `DYNAMIC_TIMEOUTS_SPEC.md` to `sdk/cosmos/azure_data_cosmos_driver/docs/` describing a design for escalating request timeouts on transport retries and adaptive connection timeout tuning.

Motivation
When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.
Key Design Decisions
- `DatabaseAccount` pattern)
- `max_failover_retries` defaults to 3, which naturally maps to the 3-tier ladder. No retry count change needed.
- `ConnectionPoolOptions` min/max bounds still clamp the effective values
- (`ShardedHttpTransport`)
- `typespec_client_core` change: proposes adding `timeout: Option<Duration>` to `Request` for per-request timeout overrides (requires a separate cross-team PR)
The Java SDK implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.
Deferred