
[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout#5689

Open
Meghana-Palaparthi wants to merge 11 commits into master from
users/Meghana-Palaparthi/DTS_timeout_handling

Conversation

@Meghana-Palaparthi
Contributor

Description

This pull request introduces robust retry logic for distributed transaction commits and improves the handling and parsing of distributed transaction responses in the Cosmos DB SDK. The changes ensure that commit operations are retried safely in the event of timeouts or retriable errors, enhance diagnostics, and make response parsing more resilient. Additionally, the request and response classes are refactored for safer stream handling and improved reliability.

Distributed transaction commit improvements:

  • Added exponential backoff retry logic for distributed transaction commits, specifically handling timeouts and retriable errors with idempotency token support in DistributedTransactionCommitter.
  • Improved error handling to distinguish between cancellation and other exceptions during commit attempts.
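
The exponential backoff described above can be sketched as follows. This is a minimal illustration in Python, not the SDK's actual C# implementation; the function name `backoff_delay` and the specific base/cap values are assumptions.

```python
import random

def backoff_delay(attempt: int, base_ms: int = 100, max_ms: int = 5000) -> int:
    """Exponential backoff with full jitter: the cap doubles each
    attempt (bounded by max_ms), and a random delay in [0, cap]
    is drawn to avoid synchronized retry storms."""
    cap = min(max_ms, base_ms * (2 ** attempt))
    return random.randint(0, cap)
```

Full jitter (randomizing over the whole interval rather than adding a small random offset) is a common choice because it spreads concurrent retries most evenly.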

Response parsing and diagnostics enhancements:

  • Enhanced distributed transaction response parsing to extract isRetriable and serverDiagnostics fields, and improved resilience to partial JSON parsing failures.
  • Added the IsRetriable property to DistributedTransactionResponse and ensured it is correctly populated from server responses.
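
Resilient extraction of these fields might look like the following Python sketch. The field names isRetriable and serverDiagnostics come from the PR description; the function name and fallback behavior are assumptions for illustration.

```python
import json

def parse_commit_response(payload: str) -> dict:
    """Best-effort parse: a malformed payload or missing fields fall
    back to safe defaults instead of failing the whole response."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return {"isRetriable": False, "serverDiagnostics": None}
    return {
        "isRetriable": bool(doc.get("isRetriable", False)),
        "serverDiagnostics": doc.get("serverDiagnostics"),
    }
```

Defaulting isRetriable to False on a parse failure is the conservative choice: an unparseable response should not trigger further retries on its own.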

Request stream handling improvements:

  • Refactored DistributedTransactionServerRequest to use a pre-serialized byte array for the request body, enabling safe creation of new memory streams for each retry and preventing disposal issues.
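
The pre-serialized-body approach can be illustrated like this (a Python sketch: `io.BytesIO` stands in for the SDK's MemoryStream, and the class and method names are hypothetical):

```python
import io

class CommitRequest:
    """Holds the request body as immutable bytes so that each retry
    can open a fresh, independently disposable stream."""

    def __init__(self, body: bytes):
        self._body = body  # serialized once, reused across all retries

    def open_body_stream(self) -> io.BytesIO:
        # A new stream per attempt: closing one attempt's stream
        # cannot invalidate the next attempt's read.
        return io.BytesIO(self._body)
```

Keeping the bytes and minting a new stream per attempt avoids the classic bug where the transport disposes the shared request stream after the first send, leaving later retries with a closed stream.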

Reliability and correctness fixes:

  • Ensured proper disposal checks in enumerator and count properties of DistributedTransactionResponse.
  • Improved deserialization error handling in DistributedTransactionOperationResult to throw explicit exceptions on failure.

Miscellaneous:

  • Minor cleanup and refactoring for resource URI handling and idempotency token extraction.

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [✓] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber


internal class DistributedTransactionCommitter
{
private const int MaxRetryAttempts = 3;
Contributor


Is there a timeout for each retry? Is it the same for all retries?

@xinlian12
Member

@sdkReviewAgent

@xinlian12
Member

sdkReviewAgent | Status: ⏳ Queued

Review requested by @xinlian12. I'll start shortly.

@xinlian12
Member

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

CancellationToken cancellationToken)
{
int attempt = 0;
while (true)
Member


🟡 Recommendation: Unbounded retry loop needs a safety ceiling

This while (true) loop retries indefinitely — the only exits are success, a non-retriable error, or CancellationToken cancellation. Every other retry policy in this SDK enforces a hard bound:

Policy Max
ClientRetryPolicy 120
ResourceThrottleRetryPolicy 9 + 30s cumulative
BulkExecutionRetryPolicy 10
WebExceptionRetryPolicy 30s window

The public API CommitTransactionAsync(CancellationToken cancellationToken = default) means a caller can invoke this with CancellationToken.None. If the server keeps returning isRetriable: true (e.g. during a prolonged service issue), the task never completes — a silent hang that's extremely hard to diagnose.

I understand this is intentional per commit d547817fa ("Change to unbounded retries") and that idempotent transactions can safely retry. However, even a generous safety ceiling (e.g. 30–120 attempts, consistent with ClientRetryPolicy) would prevent pathological hangs without limiting real-world usage. A final trace error before giving up would also aid diagnostics.

At minimum, consider documenting in the public API that callers must provide a meaningful cancellation token with a timeout.

📎 Validated against SDK retry conventions (BackoffRetryUtility, ClientRetryPolicy, ResourceThrottleRetryPolicy) — all use bounded retries.
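
The suggested safety ceiling could look roughly like this (illustrative Python, not the SDK's C# code; `MAX_ATTEMPTS`, `try_commit`, and `is_retriable` are assumed names):

```python
MAX_ATTEMPTS = 120  # aligned with ClientRetryPolicy's ceiling

def commit_with_ceiling(try_commit, is_retriable):
    """Retry until success, a non-retriable result, or the ceiling.

    try_commit(attempt) performs one commit attempt and returns its
    result; is_retriable(result) decides whether to try again.
    """
    for attempt in range(MAX_ATTEMPTS):
        result = try_commit(attempt)
        if not is_retriable(result):
            return result
    # Surface a final, diagnosable failure instead of hanging forever.
    raise TimeoutError(f"commit still retriable after {MAX_ATTEMPTS} attempts")
```

The key difference from `while (true)` is the final raised error: a caller with `CancellationToken.None` gets a loud, traceable failure rather than a silent hang.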

Comment on lines +82 to +95
catch (CosmosException cosmosEx) when (
!cancellationToken.IsCancellationRequested
&& cosmosEx.StatusCode == HttpStatusCode.RequestTimeout)
{
DefaultTrace.TraceWarning(
$"Distributed transaction commit timed out (attempt {attempt + 1}). " +
$"Retrying with idempotency token {serverRequest.IdempotencyToken}.");
await Task.Delay(this.GetRetryDelay(attempt), cancellationToken);
attempt++;
continue;
}

if (!response.IsSuccessStatusCode
&& (response.IsRetriable || response.StatusCode == HttpStatusCode.RequestTimeout))
Member


🟡 Recommendation: Potential double-retry amplification on HTTP 408

The outer DTS loop retries on 408 both via the CosmosException catch (line 84) and the response status check (line 95). However, ExecuteCommitAsync calls ProcessResourceOperationStreamAsync, which flows through the RetryHandler → ClientRetryPolicy pipeline. ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync also handles 408 by marking endpoints unavailable and retrying with failover.

This creates two nested retry loops for 408:

  • Inner (pipeline): Up to 120 retries for endpoint failover
  • Outer (DTS): Unbounded retries

Each outer retry invocation spins up a fresh inner retry cycle, so a persistent 408 condition could generate 120 × N total HTTP requests.

The isRetriable JSON flag is clearly DTS-specific and justified at this layer. For 408 specifically, could you confirm whether the gateway-mode pipeline (UseGatewayMode = true) already handles 408 retries at the transport level? If so, removing the explicit 408 check from the outer loop (relying only on isRetriable) would avoid amplification. If gateway mode bypasses transport-level 408 retries, then this outer check is necessary — a code comment explaining why would help future maintainers.

📎 Traced through the RequestHandler → RetryHandler → ClientRetryPolicy pipeline.

Contributor Author


I traced through the pipeline and the retry amplification doesn't occur.

ClientRetryPolicy does not retry 408. In ShouldRetryInternalAsync, the 408 block only calls TryMarkEndpointUnavailableForPkRange and then falls through with no return. It ends up causing AbstractRetryHandler to return the ResponseMessage(408) straight to the caller — no inner retry.

So the outer DTS retry is the only retry for 408. Each outer attempt = exactly one inner pipeline call.

