[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout #5689
Meghana-Palaparthi wants to merge 11 commits into master from
Conversation
internal class DistributedTransactionCommitter
{
    private const int MaxRetryAttempts = 3;
Is there a timeout for each retry? Is it the same for all retries?
@sdkReviewAgent
sdkReviewAgent | Status: ⏳ Queued. Review requested by @xinlian12. I'll start shortly.

sdkReviewAgent | Status: 🔍 Reviewing. I'm reviewing this PR now. I'll post my findings as comments when done.
    CancellationToken cancellationToken)
{
    int attempt = 0;
    while (true)
🟡 Recommendation: Unbounded retry loop needs a safety ceiling
This while (true) loop retries indefinitely — the only exits are success, a non-retriable error, or CancellationToken cancellation. Every other retry policy in this SDK enforces a hard bound:
| Policy | Max |
|---|---|
| ClientRetryPolicy | 120 |
| ResourceThrottleRetryPolicy | 9 + 30s cumulative |
| BulkExecutionRetryPolicy | 10 |
| WebExceptionRetryPolicy | 30s window |
The public API CommitTransactionAsync(CancellationToken cancellationToken = default) means a caller can invoke this with CancellationToken.None. If the server keeps returning isRetriable: true (e.g. during a prolonged service issue), the task never completes — a silent hang that's extremely hard to diagnose.
I understand this is intentional per commit d547817fa ("Change to unbounded retries") and that idempotent transactions can safely retry. However, even a generous safety ceiling (e.g. 30–120 attempts, consistent with ClientRetryPolicy) would prevent pathological hangs without limiting real-world usage. A final trace error before giving up would also aid diagnostics.
At minimum, consider documenting in the public API that callers must provide a meaningful cancellation token with a timeout.
📎 Validated against SDK retry conventions (BackoffRetryUtility, ClientRetryPolicy, ResourceThrottleRetryPolicy) — all use bounded retries.
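To make the recommendation concrete, here is a minimal sketch of a bounded retry loop with capped exponential backoff and jitter. This is illustrative Python, not the SDK's C#; the ceiling value, `RetriableCommitError`, and all names are assumptions chosen to mirror the 120-attempt bound mentioned above.

```python
import random
import time

# Hypothetical ceiling, mirroring ClientRetryPolicy's 120-attempt bound.
MAX_RETRY_ATTEMPTS = 120

class RetriableCommitError(Exception):
    """Stands in for a server response with isRetriable=true or an HTTP 408."""

def commit_with_ceiling(op, base_delay=0.05, max_delay=1.0, sleep=time.sleep):
    """Retry op() until success, a non-retriable error, or the attempt ceiling.

    Uses exponential backoff with jitter, capped at max_delay.
    """
    for attempt in range(MAX_RETRY_ATTEMPTS):
        try:
            return op()
        except RetriableCommitError as err:
            if attempt + 1 == MAX_RETRY_ATTEMPTS:
                # A final error before giving up aids diagnostics
                # (instead of a silent hang).
                raise RuntimeError(
                    f"commit failed after {MAX_RETRY_ATTEMPTS} attempts") from err
            cap = min(max_delay, base_delay * (2 ** min(attempt, 6)))
            sleep(random.uniform(cap / 2, cap))
```

On the caller side, the C# equivalent of "a meaningful cancellation token" would be passing a token from a timeout-scoped `CancellationTokenSource` rather than `CancellationToken.None`, so even an unbounded loop cannot hang indefinitely.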
catch (CosmosException cosmosEx) when (
    !cancellationToken.IsCancellationRequested
    && cosmosEx.StatusCode == HttpStatusCode.RequestTimeout)
{
    DefaultTrace.TraceWarning(
        $"Distributed transaction commit timed out (attempt {attempt + 1}). " +
        $"Retrying with idempotency token {serverRequest.IdempotencyToken}.");
    await Task.Delay(this.GetRetryDelay(attempt), cancellationToken);
    attempt++;
    continue;
}
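The catch block above retries a timed-out commit with the same idempotency token, which is what makes the retry safe: the server can deduplicate replays. A toy Python sketch of that server-side contract (all names are illustrative, not the SDK's):

```python
class IdempotentCommitServer:
    """Toy server that deduplicates commits by idempotency token, so a client
    whose request timed out can safely resend the same commit without the
    transaction being applied twice."""

    def __init__(self):
        self._committed = {}
        self.apply_count = 0  # how many times the transaction was actually applied

    def commit(self, idempotency_token, payload):
        if idempotency_token not in self._committed:
            self.apply_count += 1  # applied exactly once per token
            self._committed[idempotency_token] = f"committed:{payload}"
        # A replay of the same token returns the original result.
        return self._committed[idempotency_token]
```

This is why a client-side timeout (where the server may or may not have applied the commit) is retriable at all.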
if (!response.IsSuccessStatusCode
    && (response.IsRetriable || response.StatusCode == HttpStatusCode.RequestTimeout))
🟡 Recommendation: Potential double-retry amplification on HTTP 408
The outer DTS loop retries on 408 both via the CosmosException catch (line 84) and the response status check (line 95). However, ExecuteCommitAsync calls ProcessResourceOperationStreamAsync, which flows through RetryHandler → ClientRetryPolicy. ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync also handles 408 by marking endpoints unavailable and retrying with failover.
This creates two nested retry loops for 408:
- Inner (pipeline): Up to 120 retries for endpoint failover
- Outer (DTS): Unbounded retries
Each outer retry invocation spins up a fresh inner retry cycle, so a persistent 408 condition could generate 120 × N total HTTP requests.
The isRetriable JSON flag is clearly DTS-specific and justified at this layer. For 408 specifically, could you confirm whether the gateway-mode pipeline (UseGatewayMode = true) already handles 408 retries at the transport level? If so, removing the explicit 408 check from the outer loop (relying only on isRetriable) would avoid amplification. If gateway mode bypasses transport-level 408 retries, then this outer check is necessary — a code comment explaining why would help future maintainers.
📎 Traced through RequestHandler → RetryHandler → ClientRetryPolicy pipeline.
I traced through the pipeline and the retry amplification doesn't occur.
ClientRetryPolicy does not retry 408. In ShouldRetryInternalAsync, the 408 block only calls TryMarkEndpointUnavailableForPkRange and then falls through with no return. It ends up causing AbstractRetryHandler to return the ResponseMessage(408) straight to the caller — no inner retry.
So the outer DTS retry is the only retry for 408. Each outer attempt = exactly one inner pipeline call.
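The arithmetic behind this exchange can be sketched in a couple of lines of Python. If the inner pipeline retried 408s, each outer attempt would fan out into many transport calls; per the trace above, the inner retry count for 408 is zero, so each outer attempt maps to exactly one request. (The function and numbers are illustrative, not measured.)

```python
def transport_calls(outer_attempts, inner_retries_per_408):
    """Total transport-level requests when an outer retry loop wraps an inner
    pipeline that may itself retry 408s: each outer attempt issues one initial
    call plus any inner retries."""
    return outer_attempts * (1 + inner_retries_per_408)
```

With the traced behavior (`inner_retries_per_408 = 0`), 10 outer attempts cost 10 requests; under the feared amplification (`inner_retries_per_408 = 120`), the same 10 outer attempts would cost 1,210.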
Description
This pull request introduces robust retry logic for distributed transaction commits and improves the handling and parsing of distributed transaction responses in the Cosmos DB SDK. The changes ensure that commit operations are retried safely in the event of timeouts or retriable errors, enhance diagnostics, and make response parsing more resilient. Additionally, the request and response classes are refactored for safer stream handling and improved reliability.
Distributed transaction commit improvements:

DistributedTransactionCommitterResponse parsing and diagnostics enhancements:
- Parses the `isRetriable` and `serverDiagnostics` fields, and improves resilience to partial JSON parsing failures.
- Adds an `IsRetriable` property to `DistributedTransactionResponse` and ensures it is correctly populated from server responses.

Request stream handling improvements:
- Refactors `DistributedTransactionServerRequest` to use a pre-serialized byte array for the request body, enabling safe creation of new memory streams for each retry and preventing disposal issues.

Reliability and correctness fixes:
- Updates `DistributedTransactionResponse.DistributedTransactionOperationResult` to throw explicit exceptions on failure.

Miscellaneous:
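The stream-handling change is worth a quick illustration: a stream that has been consumed (or disposed) cannot be replayed on a retry, so the request holds pre-serialized bytes and mints a fresh stream per attempt. A minimal Python sketch of that pattern (names are illustrative, not the SDK's):

```python
import io

class ServerRequestSketch:
    """Holds the request body as pre-serialized bytes so every retry gets a
    fresh, independent stream. Reusing one stream across retries would replay
    an already-consumed (or disposed) stream and send an empty body."""

    def __init__(self, json_body: str):
        self._body = json_body.encode("utf-8")

    def new_body_stream(self) -> io.BytesIO:
        # Each call returns a new stream over the same immutable bytes.
        return io.BytesIO(self._body)
```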