
[Internal] Add dynamic timeout escalation spec#3871

Draft
tvaron3 wants to merge 14 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation

Member

@tvaron3 tvaron3 commented Mar 5, 2026

Dynamic Timeout Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing a design for escalating request timeouts on transport retries and adaptive connection timeout tuning.

Motivation

When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts: 6s → 10s → 65s (escalating per retry attempt)
  • Metadata request timeouts: 5s → 10s → 20s (aligned with Java SDK DatabaseAccount pattern)
  • Existing retry budget is sufficient: max_failover_retries defaults to 3, which naturally maps to the 3-tier ladder. No retry count change needed.
  • Fixed defaults: the escalation ladder is not user-configurable; existing ConnectionPoolOptions min/max bounds still clamp the effective values
  • E2E deadline: end-to-end operation deadline still takes precedence over per-attempt ladder values
  • Adaptive connection timeout: starts at 1s, escalates to 5s on >3 consecutive connection failures per endpoint (one-time persistent transition in ShardedHttpTransport)
  • typespec_client_core change: proposes adding timeout: Option<Duration> to Request for per-request timeout overrides (requires separate cross-team PR)
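
The ladder selection and clamping rules above can be expressed as a small selection function. This is a minimal sketch under the spec's stated clamping order (pool bounds first, deadline last); `attempt_timeout` and its parameter names are illustrative, not the driver's actual API:

```rust
use std::time::Duration;

// Ladder values from the spec; everything else here is illustrative.
const DATAPLANE_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

/// Pick the per-attempt timeout: take the ladder value (saturating at
/// the last tier), clamp to the pool min/max bounds first, then cap by
/// the remaining end-to-end deadline (the deadline always wins).
fn attempt_timeout(
    retry_count: usize,
    min_bound: Duration,
    max_bound: Duration,
    remaining_deadline: Duration,
) -> Duration {
    let tier = retry_count.min(DATAPLANE_LADDER.len() - 1);
    DATAPLANE_LADDER[tier]
        .clamp(min_bound, max_bound)
        .min(remaining_deadline)
}

fn main() {
    let (min, max) = (Duration::from_secs(1), Duration::from_secs(65));
    // First attempt uses the 6s tier.
    assert_eq!(
        attempt_timeout(0, min, max, Duration::from_secs(30)),
        Duration::from_secs(6)
    );
    // Retries past the last tier saturate at 65s, but the remaining
    // end-to-end deadline caps the effective value.
    assert_eq!(
        attempt_timeout(5, min, max, Duration::from_secs(30)),
        Duration::from_secs(30)
    );
}
```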

Cross-SDK Alignment

The Java SDK implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Deferred

  • Query plan timeout escalation (not yet implemented)

Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 linked an issue Mar 9, 2026 that may be closed by this pull request
tvaron3 and others added 2 commits March 9, 2026 12:07
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 3 commits March 9, 2026 13:20
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
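
The one-time adaptive transition described above might look roughly like this lock-free sketch using std atomics (the spec later mentions an AtomicBool/AtomicU32 pattern); `EndpointConnectState` and its methods are assumed names, not the actual ShardedHttpTransport internals:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::time::Duration;

// Illustrative per-endpoint state; names are assumptions, not the
// real driver types.
struct EndpointConnectState {
    consecutive_failures: AtomicU32,
    escalated: AtomicBool,
}

impl EndpointConnectState {
    const THRESHOLD: u32 = 3;

    fn new() -> Self {
        Self {
            consecutive_failures: AtomicU32::new(0),
            escalated: AtomicBool::new(false),
        }
    }

    /// Record a connection failure; returns true exactly once, when the
    /// count first exceeds the threshold (idempotent transition).
    fn record_failure(&self) -> bool {
        let n = self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1;
        n > Self::THRESHOLD
            && self
                .escalated
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
    }

    /// A success resets the consecutive-failure streak.
    fn record_success(&self) {
        self.consecutive_failures.store(0, Ordering::Relaxed);
    }

    /// Effective connect timeout: 1s initially, 5s after escalation.
    fn connect_timeout(&self) -> Duration {
        if self.escalated.load(Ordering::Acquire) {
            Duration::from_secs(5)
        } else {
            Duration::from_secs(1)
        }
    }
}

fn main() {
    let state = EndpointConnectState::new();
    assert_eq!(state.connect_timeout(), Duration::from_secs(1));
    state.record_success();
    for _ in 0..3 {
        assert!(!state.record_failure());
    }
    // The 4th consecutive failure (>3) triggers the one-time transition.
    assert!(state.record_failure());
    assert_eq!(state.connect_timeout(), Duration::from_secs(5));
    // Further failures never re-trigger it.
    assert!(!state.record_failure());
}
```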
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 9, 2026
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts, fix off-by-one
  semantics with exclusive range (3 = 3 total attempts)
  primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement
  Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/
  mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars,
  non-blocking startup, no optional blocking)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member Author

@tvaron3 tvaron3 left a comment


PR Deep Review — §10 Adaptive Connection Timeout

Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).

tvaron3 and others added 4 commits March 16, 2026 12:23
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix section numbering (missing §9)
- Add rationale for 65s tier-3 jump
- Add cross-reference for ladder saturation in §5
- Add typespec_client_core cross-crate PR strategy
- Clarify adaptive connection timeout: idempotent transition,
  per-endpoint aggregation, no-fallback rationale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major architectural changes to align with codebase patterns:

- Move timeout escalation to transport pipeline level, indexed
  by transport-level timeout retries (not operation failover)
- Use deadline-based pattern (Instant) instead of adding a
  Duration field to TransportRequest — feeds into existing
  remaining_request_timeout() mechanism
- Remove TimeoutLadder struct; use free function + constants
  matching DOP style
- Remove AttemptTimeoutDiagnostics struct; record flat field
  directly on RequestDiagnostics
- Clarify sleep-race is enforcement mechanism; Request::set_timeout
  is long-term replacement (never both active)
- Use AtomicBool/AtomicU32 for adaptive connection timeout state
  matching lock-free hot-path pattern
- Reference PipelineType for ladder selection
- Explicit clamping order: pool bounds first, deadline last
- Add note that min_* timeout fields are unused placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
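
The deadline-based pattern above can be sketched as follows, assuming the pipeline converts each ladder value into an absolute `Instant` that then feeds a remaining-timeout check; the function names here are illustrative stand-ins for the driver's existing `remaining_request_timeout()` mechanism:

```rust
use std::time::{Duration, Instant};

/// Per-attempt deadline: now + ladder value, but the end-to-end
/// deadline always takes precedence when it is earlier.
fn attempt_deadline(
    now: Instant,
    ladder_value: Duration,
    e2e_deadline: Option<Instant>,
) -> Instant {
    let per_attempt = now + ladder_value;
    match e2e_deadline {
        Some(e2e) if e2e < per_attempt => e2e,
        _ => per_attempt,
    }
}

/// Remaining time before the deadline, or zero if already past it.
fn remaining(deadline: Instant, now: Instant) -> Duration {
    deadline.saturating_duration_since(now)
}

fn main() {
    let now = Instant::now();
    let e2e = now + Duration::from_secs(8);
    // A 10s ladder value is capped by the 8s end-to-end deadline.
    let d = attempt_deadline(now, Duration::from_secs(10), Some(e2e));
    assert_eq!(d, e2e);
    assert_eq!(remaining(d, now), Duration::from_secs(8));
    // Without an e2e deadline, the ladder value stands alone.
    let d2 = attempt_deadline(now, Duration::from_secs(6), None);
    assert_eq!(remaining(d2, now), Duration::from_secs(6));
}
```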
@tvaron3 tvaron3 marked this pull request as ready for review March 23, 2026 05:37
@tvaron3 tvaron3 requested a review from a team as a code owner March 23, 2026 05:37
Copilot AI review requested due to automatic review settings March 23, 2026 05:37
Contributor

Copilot AI left a comment


Pull request overview

Adds an internal design specification for dynamic timeout escalation and adaptive connection timeout behavior in azure_data_cosmos_driver, intended to improve retry success rates while preserving fast-path latency.

Changes:

  • Introduces a spec for per-pipeline timeout ladders (data plane: 6s→10s→65s; metadata: 5s→10s→20s) applied only on timeout-driven retries.
  • Defines intended clamping order between ladder values, ConnectionPoolOptions bounds, and end-to-end deadlines.
  • Proposes an adaptive connect-timeout model (1s→5s) tied to sustained connection failures.

tvaron3 and others added 2 commits March 23, 2026 10:02
- Fix metadata min timeout max bound: 65s → 6s (matches code)
- Reconcile adaptive 1s initial vs existing 5s default: 1s is
  a new internal initial value, transitions to max_connect_timeout
- Reference Step 6 (PR Azure#3957, now merged) as the introducing PR
  for ShardedHttpTransport
- Pull latest from release/azure_data_cosmos-previews

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

#### Required Changes

**In `typespec_client_core`** — Add an optional per-request timeout to `Request`:
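
A sketch of that proposal (later dropped in favor of a driver-internal approach), reconstructed from the commit description above — an optional per-request timeout with `set_timeout()`/`timeout()` accessors. The field layout is a placeholder, not the crate's real `Request` type:

```rust
use std::time::Duration;

// Placeholder stand-in for typespec_client_core::http::Request;
// only the proposed timeout field and accessors are shown.
pub struct Request {
    timeout: Option<Duration>,
    // ...other fields elided...
}

impl Request {
    /// Set a per-request timeout override for this attempt.
    pub fn set_timeout(&mut self, timeout: Duration) {
        self.timeout = Some(timeout);
    }

    /// The per-request timeout override, if one was set.
    pub fn timeout(&self) -> Option<Duration> {
        self.timeout
    }
}

fn main() {
    let mut req = Request { timeout: None };
    assert_eq!(req.timeout(), None);
    req.set_timeout(Duration::from_secs(6));
    assert_eq!(req.timeout(), Some(Duration::from_secs(6)));
}
```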
Member


All client and client-method customization is done via the client options and client method options, same as in every Azure SDK language. You need to add a field to either your client methods or azure_core::http::ClientMethodOptions. Customers should never have to interact with Request. And even if this were for client method implementations, that's what Context is for: add a type to it that permeates all the way through the request pipeline. There's no need to modify Request itself.

Member


Your pipeline should still be similar to the core one. I appreciate why it was necessary and @analogrelay and I discussed this a while ago. I didn't expect it to go off completely on its own.

You'd have to define a field on ClientMethodOptions or your own variant of that. We were trying to avoid deeply nesting, but if you have your own pipeline anyway (which can still look similar to the core pipeline) you can have your own ClientMethodOptions as well.

This affects the public API, which is an architectural concern. Per the guidelines for all languages, SDKs are idiomatic but consistent. Using Cosmos should feel similar (within reason) to any other service crate.

Member

@analogrelay analogrelay Mar 26, 2026


Your pipeline should still be similar to the core one.

I recognize that, but there are some pretty significant differences here:

  • Cosmos has its own diagnostics payloads used across all SDKs (we still integrate with OTel and the azure_core/tracing stack, but there is much more beyond that)
  • The Cosmos backend needs user-agent reporting in a different way (prefix instead of suffix, IIRC)
  • We need to deal with HTTP/2 limitations on the backend (the 20-streams-per-connection limit means we have to manage multiple connections)
  • We have partition-level failover, regional failover, circuit breakers, background cache refreshes, and more that just doesn't fit cleanly into the pipeline as Policy.
  • Cosmos "Operations" often involve several HTTP requests with significant shared context. Cross-partition queries, ReadMany, etc.
  • We have practical experience with HTTP proxies degrading the Cosmos DB availability guarantees, and need to control their use
  • Custom HTTP headers don't work the way customers would expect (they aren't guaranteed to be honoured by the backend when using the newer binary protocol)

We can aim for some consistency here, for sure. We're trying to follow guidelines as closely as we can, but I really want to avoid giving customers the false impression of consistency. Where there is true consistency, we should strive to align.

I think this is all mostly aligned with what we talked about yesterday, but I just wanted to include it here for posterity.

Comment on lines +29 to +30
When a Cosmos DB request times out and is retried, using the same timeout value for the retry is
often suboptimal. Transient network issues or momentary server load spikes may cause an initial
Member


This is why the Retry-After header exists and is used by every Azure SDK language. Why does that not work here? Do you need a client-driven alternative?

Member Author


Retry-After only helps if we receive a response. This is for scenarios where the request times out due to the server doing some work that takes longer than expected.

Member


Correct, we need to be able to trigger things like hedging where we allow the client to try making a fresh request. Our retry policy goes quite far beyond Retry-After.

Comment on lines +214 to +231
```rust
const DATAPLANE_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

const METADATA_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

// Inside execute_transport_pipeline():
let ladder = match pipeline_type {
    PipelineType::DataPlane => DATAPLANE_REQUEST_TIMEOUT_LADDER,
    PipelineType::Metadata => METADATA_REQUEST_TIMEOUT_LADDER,
};
let mut timeout_retry_count = 0_usize;
```
Member


This whole thing would be better implemented as a custom RetryPolicy that Cosmos can add when it creates the Pipeline in its client constructor. You can't remove the built-in RetryPolicy, but you can pass RetryMode::none() instead (this was easier than any of the default-removal or exclusion options we came up with for Rust).

Member Author


We use our own pipeline because of Cosmos's unique availability and latency constraints. This follows our current architecture.

Member


Between this, region routing, hedging, HTTP/2, custom protocols, etc. I just don't think the azure_core pipeline abstractions will help us much here, and trying to rearchitect the abstractions so they can work for us seems counter-productive here.

@github-project-automation github-project-automation bot moved this from Todo to Changes Requested in CosmosDB Go/Rust Crew Mar 23, 2026
Replace typespec_client_core/azure_core dependency with
driver-internal approach using reqwest::RequestBuilder::timeout()
directly per request. No cross-crate changes needed.

Key changes:
- Remove all typespec_client_core Request/ClientMethodOptions
  proposals — driver controls reqwest directly via HttpClientFactory
- Enforce per-attempt timeout via reqwest::RequestBuilder::timeout()
  instead of azure_core::sleep() race
- Add options hierarchy section showing how users control timeout
  bounds (ConnectionPoolOptions) and e2e deadline (RuntimeOptions)
- Add thin client (Gateway 2.0) future ladder note: 6s, 6s, 10s
- Fix markdown lint: table alignment, heading levels, list formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@analogrelay analogrelay requested a review from heaths March 26, 2026 18:33
@tvaron3 tvaron3 marked this pull request as draft March 29, 2026 05:19

Labels

Cosmos The azure_cosmos crate

Projects

Status: Changes Requested

Development

Successfully merging this pull request may close these issues.

Cosmos: Add Support for Gateway Timeouts

4 participants