[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request #4031

@tvaron3

Description

Bug Report

Summary

PR #4005 changed the pk_range_cache key from container name to collection RID. However, the code that constructs the URL for fetching partition key ranges still uses .item(collection_rid) instead of .item_by_rid(collection_rid). This causes the RID to be URL-encoded (the trailing = becomes %3D), resulting in a 404 from Cosmos DB on every partition key range fetch.

Because try_lookup silently swallows errors via Ok(routing_map.ok()) and AsyncCache does not cache errors, this failure repeats on every single request, causing:

  1. 1.6M extra 404 requests/hour observed on a benchmark account after deploying the change
  2. Write lock contention in AsyncCache as every concurrent operation serializes through the failed fetch path
  3. Loss of client-side partition key routing — the gateway must route all requests instead
  4. ~7% throughput regression observed in continuous benchmarks (110M → 102M requests/hour)

Root Cause (3-step chain)

Step 1: Wrong URL encoding — .item() vs .item_by_rid()

In partition_key_range_cache.rs, get_routing_map_for_collection():

let pk_range_link = self
    .database_link                       // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid)                // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);

.item() calls LinkSegment::new(), which URL-encodes the value. For RIDs like pLLZAIuPigw=, the trailing = is encoded as %3D:

dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges  ← Cosmos DB returns 404

The code should instead use .item_by_rid(), which calls LinkSegment::identity() and leaves the value unencoded:

dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
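The encoding difference can be illustrated with a minimal sketch. This is not the SDK's actual LinkSegment code; percent_encode here is a hypothetical stand-in that encodes everything outside RFC 3986's unreserved set, the way a path-segment encoder would:

```rust
// Hypothetical illustration of the bug, not the SDK's LinkSegment implementation.
// Percent-encoding a base64-style RID mangles its trailing `=` padding.
fn percent_encode(s: &str) -> String {
    s.bytes()
        .map(|b| match b {
            // RFC 3986 unreserved characters pass through untouched.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            // Everything else (including `=`) becomes %XX.
            _ => format!("%{:02X}", b),
        })
        .collect()
}

fn main() {
    let rid = "pLLZAIuPigw=";
    // .item()-style path segment: the `=` becomes %3D, and Cosmos DB returns 404.
    assert_eq!(percent_encode(rid), "pLLZAIuPigw%3D");
    println!("buggy path:   dbs/perfdb/colls/{}/pkranges", percent_encode(rid));
    // .item_by_rid()-style segment: the RID passes through verbatim.
    println!("correct path: dbs/perfdb/colls/{}/pkranges", rid);
}
```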

Step 2: Error silently swallowed

In partition_key_range_cache.rs line 147, try_lookup():

Ok(routing_map.ok())  // Converts Err(404) → Ok(None), invisible to caller

The caller in container_connection.rs sees Ok(None) and skips the routing block entirely:

let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
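The swallowing comes down to Result::ok(), which maps any Err into None. A minimal sketch with hypothetical types (the real try_lookup works on a routing map, not a &str) shows why the caller cannot tell a failed fetch apart from a genuinely absent routing map:

```rust
// Hypothetical stand-in for try_lookup: the error code plays the role of an
// HTTP status, the &str plays the role of the routing map.
fn try_lookup(fetch: Result<&'static str, u16>) -> Result<Option<&'static str>, u16> {
    // Result::ok() converts Err(_) to None, so the error never reaches the caller.
    Ok(fetch.ok())
}

fn main() {
    // A 404 from the service becomes Ok(None): indistinguishable from "no map".
    assert_eq!(try_lookup(Err(404)), Ok(None));
    // A successful fetch still surfaces normally.
    assert_eq!(try_lookup(Ok("routing-map")), Ok(Some("routing-map")));
}
```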

Step 3: Errors not cached → retried on every request

AsyncCache::get() only caches successful values. When compute() returns Err, the error propagates and the cache remains empty. Every subsequent request:

  1. Read lock → cache miss
  2. Acquire write lock (serializes all concurrent operations on the same key)
  3. HTTP request to Cosmos DB → 404
  4. Error propagated, cache stays empty
  5. Error swallowed as Ok(None)
  6. Routing bypassed
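The retry-on-every-request behavior can be sketched with a hypothetical, synchronous cache (the real AsyncCache is async and takes a write lock around the compute; this sketch only models "errors are never stored"):

```rust
use std::collections::HashMap;

// Hypothetical sketch of a success-only cache, not the real AsyncCache.
struct Cache {
    map: HashMap<String, String>,
    fetch_attempts: u32,
}

impl Cache {
    fn get(&mut self, key: &str, compute: impl Fn() -> Result<String, u16>) -> Result<String, u16> {
        if let Some(v) = self.map.get(key) {
            return Ok(v.clone()); // cache hit: no fetch
        }
        // In the real AsyncCache this path runs under a write lock,
        // serializing every concurrent operation on the same key.
        self.fetch_attempts += 1;
        let v = compute()?; // Err propagates; nothing is cached
        self.map.insert(key.to_string(), v.clone());
        Ok(v)
    }
}

fn main() {
    let mut cache = Cache { map: HashMap::new(), fetch_attempts: 0 };
    for _ in 0..3 {
        // Every request repeats the failing 404 fetch; the cache never fills.
        assert_eq!(cache.get("pLLZAIuPigw=", || Err(404)), Err(404));
    }
    assert_eq!(cache.fetch_attempts, 3);
}
```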

Evidence from Benchmarks

Continuous benchmark on cosmos-perf-rg (4 pods, concurrency=100, 400K RU/s):

| Hour (UTC)    | 404 Count     | Notes                           |
|---------------|---------------|---------------------------------|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background               |
| 18:02         | 1,645,604     | After deploying commit 98d01c8  |

Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side.

Suggested Fix

// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);

Additionally, consider:

  • Logging errors in try_lookup before swallowing them, to make silent failures visible
  • Adding negative caching (or a backoff) in AsyncCache to avoid retrying failed fetches on every request
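The negative-caching idea can be sketched as follows. This is a hypothetical design sketch (fixed backoff, one key), not a proposal for the actual AsyncCache API:

```rust
use std::time::{Duration, Instant};

// Hypothetical negative-cache sketch: remember the last failure and skip
// re-fetching until a backoff window has elapsed.
struct NegativeCache {
    backoff: Duration,
    last_failure: Option<Instant>,
}

impl NegativeCache {
    fn should_fetch(&self, now: Instant) -> bool {
        match self.last_failure {
            // Within the backoff window, don't hammer the service with repeat 404s.
            Some(failed_at) => now.duration_since(failed_at) >= self.backoff,
            None => true,
        }
    }

    fn record_failure(&mut self, now: Instant) {
        self.last_failure = Some(now);
    }
}

fn main() {
    let mut nc = NegativeCache { backoff: Duration::from_secs(5), last_failure: None };
    let t0 = Instant::now();
    assert!(nc.should_fetch(t0)); // no failure recorded yet
    nc.record_failure(t0);
    assert!(!nc.should_fetch(t0 + Duration::from_secs(1))); // still backing off
    assert!(nc.should_fetch(t0 + Duration::from_secs(6)));  // window elapsed
}
```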

Affected Version

Commit 98d01c8 on release/azure_data_cosmos-previews branch (PR #4005).

Metadata

Labels

  • Cosmos — The azure_cosmos crate
  • bug — This issue requires a change to an existing behavior in the product in order to be resolved.

Status

Done