[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout #5689
Meghana-Palaparthi wants to merge 11 commits into master from
Conversation
internal class DistributedTransactionCommitter
{
    private const int MaxRetryAttempts = 3;
Is there a timeout for each retry? Is it the same for all retries?
@sdkReviewAgent
sdkReviewAgent | Status: ⏳ Queued. Review requested by @xinlian12. I'll start shortly.

sdkReviewAgent | Status: 🔍 Reviewing. I'm reviewing this PR now. I'll post my findings as comments when done.
    CancellationToken cancellationToken)
{
    int attempt = 0;
    while (true)
🟡 Recommendation: Unbounded retry loop needs a safety ceiling
This while (true) loop retries indefinitely — the only exits are success, a non-retriable error, or CancellationToken cancellation. Every other retry policy in this SDK enforces a hard bound:
| Policy | Max |
|---|---|
| ClientRetryPolicy | 120 |
| ResourceThrottleRetryPolicy | 9 + 30s cumulative |
| BulkExecutionRetryPolicy | 10 |
| WebExceptionRetryPolicy | 30s window |
The public API CommitTransactionAsync(CancellationToken cancellationToken = default) means a caller can invoke this with CancellationToken.None. If the server keeps returning isRetriable: true (e.g. during a prolonged service issue), the task never completes — a silent hang that's extremely hard to diagnose.
I understand this is intentional per commit d547817fa ("Change to unbounded retries") and that idempotent transactions can safely retry. However, even a generous safety ceiling (e.g. 30–120 attempts, consistent with ClientRetryPolicy) would prevent pathological hangs without limiting real-world usage. A final trace error before giving up would also aid diagnostics.
At minimum, consider documenting in the public API that callers must provide a meaningful cancellation token with a timeout.
📎 Validated against SDK retry conventions (BackoffRetryUtility, ClientRetryPolicy, ResourceThrottleRetryPolicy) — all use bounded retries.
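To make the recommendation concrete, here is a minimal sketch of a bounded retry loop with capped exponential backoff and jitter. This is illustrative Python, not the SDK's C#; the ceiling value, `RetriableCommitError`, and all names are assumptions chosen to mirror the 120-attempt bound mentioned above.

```python
import random
import time

# Hypothetical ceiling, mirroring ClientRetryPolicy's 120-attempt bound.
MAX_RETRY_ATTEMPTS = 120

class RetriableCommitError(Exception):
    """Stands in for a server response with isRetriable=true or an HTTP 408."""

def commit_with_ceiling(op, base_delay=0.05, max_delay=1.0, sleep=time.sleep):
    """Retry op() until success, a non-retriable error, or the attempt ceiling.

    Uses exponential backoff with jitter, capped at max_delay.
    """
    for attempt in range(MAX_RETRY_ATTEMPTS):
        try:
            return op()
        except RetriableCommitError as err:
            if attempt + 1 == MAX_RETRY_ATTEMPTS:
                # A final error before giving up aids diagnostics
                # (instead of a silent hang).
                raise RuntimeError(
                    f"commit failed after {MAX_RETRY_ATTEMPTS} attempts") from err
            cap = min(max_delay, base_delay * (2 ** min(attempt, 6)))
            sleep(random.uniform(cap / 2, cap))
```

On the caller side, the C# equivalent of "a meaningful cancellation token" would be passing a token from a timeout-scoped `CancellationTokenSource` rather than `CancellationToken.None`, so even an unbounded loop cannot hang indefinitely.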
catch (CosmosException cosmosEx) when (
    !cancellationToken.IsCancellationRequested
    && cosmosEx.StatusCode == HttpStatusCode.RequestTimeout)
{
    DefaultTrace.TraceWarning(
        $"Distributed transaction commit timed out (attempt {attempt + 1}). " +
        $"Retrying with idempotency token {serverRequest.IdempotencyToken}.");
    await Task.Delay(this.GetRetryDelay(attempt), cancellationToken);
    attempt++;
    continue;
}
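The catch block above retries a timed-out commit with the same idempotency token, which is what makes the retry safe: the server can deduplicate replays. A toy Python sketch of that server-side contract (all names are illustrative, not the SDK's):

```python
class IdempotentCommitServer:
    """Toy server that deduplicates commits by idempotency token, so a client
    whose request timed out can safely resend the same commit without the
    transaction being applied twice."""

    def __init__(self):
        self._committed = {}
        self.apply_count = 0  # how many times the transaction was actually applied

    def commit(self, idempotency_token, payload):
        if idempotency_token not in self._committed:
            self.apply_count += 1  # applied exactly once per token
            self._committed[idempotency_token] = f"committed:{payload}"
        # A replay of the same token returns the original result.
        return self._committed[idempotency_token]
```

This is why a client-side timeout (where the server may or may not have applied the commit) is retriable at all.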
if (!response.IsSuccessStatusCode
    && (response.IsRetriable || response.StatusCode == HttpStatusCode.RequestTimeout))
🟡 Recommendation: Potential double-retry amplification on HTTP 408
The outer DTS loop retries on 408 both via the CosmosException catch (line 84) and the response status check (line 95). However, ExecuteCommitAsync calls ProcessResourceOperationStreamAsync, which flows through RetryHandler → ClientRetryPolicy. ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync also handles 408 by marking endpoints unavailable and retrying with failover.
This creates two nested retry loops for 408:
- Inner (pipeline): Up to 120 retries for endpoint failover
- Outer (DTS): Unbounded retries
Each outer retry invocation spins up a fresh inner retry cycle, so a persistent 408 condition could generate 120 × N total HTTP requests.
The isRetriable JSON flag is clearly DTS-specific and justified at this layer. For 408 specifically, could you confirm whether the gateway-mode pipeline (UseGatewayMode = true) already handles 408 retries at the transport level? If so, removing the explicit 408 check from the outer loop (relying only on isRetriable) would avoid amplification. If gateway mode bypasses transport-level 408 retries, then this outer check is necessary — a code comment explaining why would help future maintainers.
📎 Traced through RequestHandler → RetryHandler → ClientRetryPolicy pipeline.
I traced through the pipeline and the retry amplification doesn't occur.
ClientRetryPolicy does not retry 408. In ShouldRetryInternalAsync, the 408 block only calls TryMarkEndpointUnavailableForPkRange and then falls through with no return. It ends up causing AbstractRetryHandler to return the ResponseMessage(408) straight to the caller — no inner retry.
So the outer DTS retry is the only retry for 408. Each outer attempt = exactly one inner pipeline call.
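The arithmetic behind this exchange can be sketched in a couple of lines of Python. If the inner pipeline retried 408s, each outer attempt would fan out into many transport calls; per the trace above, the inner retry count for 408 is zero, so each outer attempt maps to exactly one request. (The function and numbers are illustrative, not measured.)

```python
def transport_calls(outer_attempts, inner_retries_per_408):
    """Total transport-level requests when an outer retry loop wraps an inner
    pipeline that may itself retry 408s: each outer attempt issues one initial
    call plus any inner retries."""
    return outer_attempts * (1 + inner_retries_per_408)
```

With the traced behavior (`inner_retries_per_408 = 0`), 10 outer attempts cost 10 requests; under the feared amplification (`inner_retries_per_408 = 120`), the same 10 outer attempts would cost 1,210.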
Description
This pull request introduces robust retry logic for distributed transaction commits and improves the handling and parsing of distributed transaction responses in the Cosmos DB SDK. The changes ensure that commit operations are retried safely in the event of timeouts or retriable errors, enhance diagnostics, and make response parsing more resilient. Additionally, the request and response classes are refactored for safer stream handling and improved reliability.
Distributed transaction commit improvements:

DistributedTransactionCommitterResponse parsing and diagnostics enhancements:
- Parses the `isRetriable` and `serverDiagnostics` fields, and improves resilience to partial JSON parsing failures.
- Adds an `IsRetriable` property to `DistributedTransactionResponse` and ensures it is correctly populated from server responses.

Request stream handling improvements:
- Refactors `DistributedTransactionServerRequest` to use a pre-serialized byte array for the request body, enabling safe creation of new memory streams for each retry and preventing disposal issues.

Reliability and correctness fixes:
- Updates `DistributedTransactionResponse.DistributedTransactionOperationResult` to throw explicit exceptions on failure.

Miscellaneous:
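The stream-handling change is worth a quick illustration: a stream that has been consumed (or disposed) cannot be replayed on a retry, so the request holds pre-serialized bytes and mints a fresh stream per attempt. A minimal Python sketch of that pattern (names are illustrative, not the SDK's):

```python
import io

class ServerRequestSketch:
    """Holds the request body as pre-serialized bytes so every retry gets a
    fresh, independent stream. Reusing one stream across retries would replay
    an already-consumed (or disposed) stream and send an empty body."""

    def __init__(self, json_body: str):
        self._body = json_body.encode("utf-8")

    def new_body_stream(self) -> io.BytesIO:
        # Each call returns a new stream over the same immutable bytes.
        return io.BytesIO(self._body)
```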