
[Internal] Add dynamic timeout escalation spec#3871

Draft
tvaron3 wants to merge 14 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation

Member

@tvaron3 tvaron3 commented Mar 5, 2026

Dynamic Timeout Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing a design for escalating request timeouts on transport retries and adaptive connection timeout tuning.

Motivation

When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts: 6s → 10s → 65s (escalating per retry attempt)
  • Metadata request timeouts: 5s → 10s → 20s (aligned with Java SDK DatabaseAccount pattern)
  • Existing retry budget is sufficient: max_failover_retries defaults to 3, which naturally maps to the 3-tier ladder. No retry count change needed.
  • Fixed defaults: the escalation ladder is not user-configurable; existing ConnectionPoolOptions min/max bounds still clamp the effective values
  • E2E deadline: end-to-end operation deadline still takes precedence over per-attempt ladder values
  • Adaptive connection timeout: starts at 1s, escalates to 5s on >3 consecutive connection failures per endpoint (one-time persistent transition in ShardedHttpTransport)
  • typespec_client_core change: proposes adding timeout: Option<Duration> to Request for per-request timeout overrides (requires separate cross-team PR)
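
The ladder selection and clamping rules above can be expressed as a small selection function. This is a minimal sketch under the spec's stated clamping order (pool bounds first, deadline last); `attempt_timeout` and its parameter names are illustrative, not the driver's actual API:

```rust
use std::time::Duration;

// Ladder values from the spec; everything else here is illustrative.
const DATAPLANE_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

/// Pick the per-attempt timeout: take the ladder value (saturating at
/// the last tier), clamp to the pool min/max bounds first, then cap by
/// the remaining end-to-end deadline (the deadline always wins).
fn attempt_timeout(
    retry_count: usize,
    min_bound: Duration,
    max_bound: Duration,
    remaining_deadline: Duration,
) -> Duration {
    let tier = retry_count.min(DATAPLANE_LADDER.len() - 1);
    DATAPLANE_LADDER[tier]
        .clamp(min_bound, max_bound)
        .min(remaining_deadline)
}

fn main() {
    let (min, max) = (Duration::from_secs(1), Duration::from_secs(65));
    // First attempt uses the 6s tier.
    assert_eq!(
        attempt_timeout(0, min, max, Duration::from_secs(30)),
        Duration::from_secs(6)
    );
    // Retries past the last tier saturate at 65s, but the remaining
    // end-to-end deadline caps the effective value.
    assert_eq!(
        attempt_timeout(5, min, max, Duration::from_secs(30)),
        Duration::from_secs(30)
    );
}
```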

Cross-SDK Alignment

The Java SDK implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Deferred

  • Query plan timeout escalation (not yet implemented)

Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 linked an issue Mar 9, 2026 that may be closed by this pull request
tvaron3 and others added 2 commits March 9, 2026 12:07
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 3 commits March 9, 2026 13:20
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
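
The one-time adaptive transition described above might look roughly like this lock-free sketch using std atomics (the spec later mentions an AtomicBool/AtomicU32 pattern); `EndpointConnectState` and its methods are assumed names, not the actual ShardedHttpTransport internals:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::time::Duration;

// Illustrative per-endpoint state; names are assumptions, not the
// real driver types.
struct EndpointConnectState {
    consecutive_failures: AtomicU32,
    escalated: AtomicBool,
}

impl EndpointConnectState {
    const THRESHOLD: u32 = 3;

    fn new() -> Self {
        Self {
            consecutive_failures: AtomicU32::new(0),
            escalated: AtomicBool::new(false),
        }
    }

    /// Record a connection failure; returns true exactly once, when the
    /// count first exceeds the threshold (idempotent transition).
    fn record_failure(&self) -> bool {
        let n = self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1;
        n > Self::THRESHOLD
            && self
                .escalated
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
    }

    /// A success resets the consecutive-failure streak.
    fn record_success(&self) {
        self.consecutive_failures.store(0, Ordering::Relaxed);
    }

    /// Effective connect timeout: 1s initially, 5s after escalation.
    fn connect_timeout(&self) -> Duration {
        if self.escalated.load(Ordering::Acquire) {
            Duration::from_secs(5)
        } else {
            Duration::from_secs(1)
        }
    }
}

fn main() {
    let state = EndpointConnectState::new();
    assert_eq!(state.connect_timeout(), Duration::from_secs(1));
    state.record_success();
    for _ in 0..3 {
        assert!(!state.record_failure());
    }
    // The 4th consecutive failure (>3) triggers the one-time transition.
    assert!(state.record_failure());
    assert_eq!(state.connect_timeout(), Duration::from_secs(5));
    // Further failures never re-trigger it.
    assert!(!state.record_failure());
}
```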
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 9, 2026
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts, fix off-by-one
  semantics with exclusive range (3 = 3 total attempts)
  primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement
  Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/
  mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars,
  non-blocking startup, no optional blocking)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member Author

@tvaron3 tvaron3 left a comment


PR Deep Review — §10 Adaptive Connection Timeout

Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).

tvaron3 and others added 4 commits March 16, 2026 12:23
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix section numbering (missing §9)
- Add rationale for 65s tier-3 jump
- Add cross-reference for ladder saturation in §5
- Add typespec_client_core cross-crate PR strategy
- Clarify adaptive connection timeout: idempotent transition,
  per-endpoint aggregation, no-fallback rationale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major architectural changes to align with codebase patterns:

- Move timeout escalation to transport pipeline level, indexed
  by transport-level timeout retries (not operation failover)
- Use deadline-based pattern (Instant) instead of adding a
  Duration field to TransportRequest — feeds into existing
  remaining_request_timeout() mechanism
- Remove TimeoutLadder struct; use free function + constants
  matching DOP style
- Remove AttemptTimeoutDiagnostics struct; record flat field
  directly on RequestDiagnostics
- Clarify sleep-race is enforcement mechanism; Request::set_timeout
  is long-term replacement (never both active)
- Use AtomicBool/AtomicU32 for adaptive connection timeout state
  matching lock-free hot-path pattern
- Reference PipelineType for ladder selection
- Explicit clamping order: pool bounds first, deadline last
- Add note that min_* timeout fields are unused placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
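
The deadline-based pattern above can be sketched as follows, assuming the pipeline converts each ladder value into an absolute `Instant` that then feeds a remaining-timeout check; the function names here are illustrative stand-ins for the driver's existing `remaining_request_timeout()` mechanism:

```rust
use std::time::{Duration, Instant};

/// Per-attempt deadline: now + ladder value, but the end-to-end
/// deadline always takes precedence when it is earlier.
fn attempt_deadline(
    now: Instant,
    ladder_value: Duration,
    e2e_deadline: Option<Instant>,
) -> Instant {
    let per_attempt = now + ladder_value;
    match e2e_deadline {
        Some(e2e) if e2e < per_attempt => e2e,
        _ => per_attempt,
    }
}

/// Remaining time before the deadline, or zero if already past it.
fn remaining(deadline: Instant, now: Instant) -> Duration {
    deadline.saturating_duration_since(now)
}

fn main() {
    let now = Instant::now();
    let e2e = now + Duration::from_secs(8);
    // A 10s ladder value is capped by the 8s end-to-end deadline.
    let d = attempt_deadline(now, Duration::from_secs(10), Some(e2e));
    assert_eq!(d, e2e);
    assert_eq!(remaining(d, now), Duration::from_secs(8));
    // Without an e2e deadline, the ladder value stands alone.
    let d2 = attempt_deadline(now, Duration::from_secs(6), None);
    assert_eq!(remaining(d2, now), Duration::from_secs(6));
}
```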
@tvaron3 tvaron3 marked this pull request as ready for review March 23, 2026 05:37
@tvaron3 tvaron3 requested a review from a team as a code owner March 23, 2026 05:37
Copilot AI review requested due to automatic review settings March 23, 2026 05:37
Contributor

Copilot AI left a comment


Pull request overview

Adds an internal design specification for dynamic timeout escalation and adaptive connection timeout behavior in azure_data_cosmos_driver, intended to improve retry success rates while preserving fast-path latency.

Changes:

  • Introduces a spec for per-pipeline timeout ladders (data plane: 6s→10s→65s; metadata: 5s→10s→20s) applied only on timeout-driven retries.
  • Defines intended clamping order between ladder values, ConnectionPoolOptions bounds, and end-to-end deadlines.
  • Proposes an adaptive connect-timeout model (1s→5s) tied to sustained connection failures.

tvaron3 and others added 2 commits March 23, 2026 10:02
- Fix metadata min timeout max bound: 65s → 6s (matches code)
- Reconcile adaptive 1s initial vs existing 5s default: 1s is
  a new internal initial value, transitions to max_connect_timeout
- Reference Step 6 (PR Azure#3957, now merged) as the introducing PR
  for ShardedHttpTransport
- Pull latest from release/azure_data_cosmos-previews

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

#### Required Changes

**In `typespec_client_core`** — Add an optional per-request timeout to `Request`:
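
A sketch of that proposal (later dropped in favor of a driver-internal approach), reconstructed from the commit description above — an optional per-request timeout with `set_timeout()`/`timeout()` accessors. The field layout is a placeholder, not the crate's real `Request` type:

```rust
use std::time::Duration;

// Placeholder stand-in for typespec_client_core::http::Request;
// only the proposed timeout field and accessors are shown.
pub struct Request {
    timeout: Option<Duration>,
    // ...other fields elided...
}

impl Request {
    /// Set a per-request timeout override for this attempt.
    pub fn set_timeout(&mut self, timeout: Duration) {
        self.timeout = Some(timeout);
    }

    /// The per-request timeout override, if one was set.
    pub fn timeout(&self) -> Option<Duration> {
        self.timeout
    }
}

fn main() {
    let mut req = Request { timeout: None };
    assert_eq!(req.timeout(), None);
    req.set_timeout(Duration::from_secs(6));
    assert_eq!(req.timeout(), Some(Duration::from_secs(6)));
}
```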
Member


All client and client-method customization is done via the client options and client method options, same as in every Azure SDK language. You need to add a field to either your client methods or azure_core::http::ClientMethodOptions. Customers should never have to interact with Request. And even if this were for client method implementations, that's what Context is for: add a type to it that permeates all the way through the request pipeline. There's no need to modify Request itself.

Member


Your pipeline should still be similar to the core one. I appreciate why it was necessary and @analogrelay and I discussed this a while ago. I didn't expect it to go off completely on its own.

You'd have to define a field on ClientMethodOptions or your own variant of that. We were trying to avoid deeply nesting, but if you have your own pipeline anyway (which can still look similar to the core pipeline) you can have your own ClientMethodOptions as well.

This affects the public API, which is an architectural concern. Per the guidelines for all languages, SDKs are idiomatic but consistent. Using Cosmos should feel similar (within reason) to any other service crate.

Member

@analogrelay analogrelay Mar 26, 2026


Your pipeline should still be similar to the core one.

I recognize that, but there are some pretty significant differences here:

  • Cosmos has its own diagnostics payloads used across all SDKs (we still integrate with OTel and the azure_core/tracing stack, but there is much more beyond that)
  • The Cosmos backend needs user-agent reporting in a different way (prefix instead of suffix, IIRC)
  • We need to deal with HTTP/2 limitations on the backend (the 20-streams-per-connection limit means we have to manage multiple connections)
  • We have partition-level failover, regional failover, circuit breakers, background cache refreshes, and more that just doesn't fit cleanly into the pipeline as Policy.
  • Cosmos "Operations" often involve several HTTP requests with significant shared context. Cross-partition queries, ReadMany, etc.
  • We have practical experience with HTTP proxies degrading the Cosmos DB availability guarantees, and need to control their use
  • Custom HTTP headers don't work the way customers would expect (they aren't guaranteed to be honoured by the backend when using the newer binary protocol)

We can aim for some consistency here, for sure. We're trying to follow guidelines as closely as we can, but I really want to avoid giving customers the false impression of consistency. Where there is true consistency, we should strive to align.

I think this is all mostly aligned with what we talked about yesterday, but I just wanted to include it here for posterity.

Comment on lines +29 to +30
When a Cosmos DB request times out and is retried, using the same timeout value for the retry is
often suboptimal. Transient network issues or momentary server load spikes may cause an initial
Member


This is why the Retry-After header exists and is used by every Azure SDK language. Why does that not work here? Do you need a client-driven alternative?

Member Author


Retry-After only helps if we receive a response. This is for scenarios where the request times out due to the server doing some work that takes longer than expected.

Member


Correct, we need to be able to trigger things like hedging where we allow the client to try making a fresh request. Our retry policy goes quite far beyond Retry-After.

Comment on lines +214 to +231
```rust
const DATAPLANE_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];

const METADATA_REQUEST_TIMEOUT_LADDER: &[Duration] = &[
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

// Inside execute_transport_pipeline():
let ladder = match pipeline_type {
    PipelineType::DataPlane => DATAPLANE_REQUEST_TIMEOUT_LADDER,
    PipelineType::Metadata => METADATA_REQUEST_TIMEOUT_LADDER,
};
let mut timeout_retry_count = 0_usize;
```
Member


This whole thing would be better implemented as a custom RetryPolicy that Cosmos can add when it creates the Pipeline in its client constructor. You can't remove the built-in RetryPolicy, but you can pass RetryMode::none() instead (this was easier than any of the default-removal or exclusion options we came up with for Rust).

Member Author


We use our own pipeline because of Cosmos's unique availability and latency constraints. This follows our current architecture.

Member


Between this, region routing, hedging, HTTP/2, custom protocols, etc. I just don't think the azure_core pipeline abstractions will help us much here, and trying to rearchitect the abstractions so they can work for us seems counter-productive here.

@github-project-automation github-project-automation bot moved this from Todo to Changes Requested in CosmosDB Go/Rust Crew Mar 23, 2026
Replace typespec_client_core/azure_core dependency with
driver-internal approach using reqwest::RequestBuilder::timeout()
directly per request. No cross-crate changes needed.

Key changes:
- Remove all typespec_client_core Request/ClientMethodOptions
  proposals — driver controls reqwest directly via HttpClientFactory
- Enforce per-attempt timeout via reqwest::RequestBuilder::timeout()
  instead of azure_core::sleep() race
- Add options hierarchy section showing how users control timeout
  bounds (ConnectionPoolOptions) and e2e deadline (RuntimeOptions)
- Add thin client (Gateway 2.0) future ladder note: 6s, 6s, 10s
- Fix markdown lint: table alignment, heading levels, list formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@analogrelay analogrelay requested a review from heaths March 26, 2026 18:33
@tvaron3 tvaron3 marked this pull request as draft March 29, 2026 05:19

Labels

Cosmos The azure_cosmos crate

Projects

Status: Changes Requested

Development

Successfully merging this pull request may close these issues.

Cosmos: Add Support for Gateway Timeouts

4 participants