[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request #4031
## Bug Report

### Summary
PR #4005 changed the `pk_range_cache` key from container name to collection RID. However, the code that constructs the URL for fetching partition key ranges still uses `.item(collection_rid)` instead of `.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded (e.g., `=` → `%3D`), so Cosmos DB returns a 404 on every partition key range fetch attempt.
Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())` and `AsyncCache` does not cache errors, this failure repeats on every single request, causing:

- ~1.6M extra 404 requests/hour observed on a benchmark account after deploying the change
- Write lock contention in `AsyncCache`, as every concurrent operation serializes through the failed fetch path
- Loss of client-side partition key routing; the gateway must route all requests instead
- ~7% throughput regression observed in continuous benchmarks (110M → 102M requests/hour)
### Root Cause (3-step chain)

#### Step 1: Wrong URL encoding: `.item()` vs `.item_by_rid()`

In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`:
```rust
let pk_range_link = self
    .database_link // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid) // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);
```

`.item()` calls `LinkSegment::new()`, which URL-encodes the value. RIDs like `pLLZAIuPigw=` get the `=` encoded to `%3D`:

```
dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges   ← Cosmos DB returns 404
```

It should use `.item_by_rid()`, which calls `LinkSegment::identity()` (no encoding):

```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges     ← correct
```
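The encoding difference can be sketched in a few lines of plain Rust. This is not the SDK's `LinkSegment` code; `encoded_segment` is a hypothetical stand-in for what `.item()` effectively does to the `=` in a RID:

```rust
/// Hypothetical helper mimicking the percent-encoding that `.item()`
/// applies to a path segment (only '=' matters for Cosmos RIDs here).
fn encoded_segment(rid: &str) -> String {
    rid.replace('=', "%3D")
}

fn main() {
    let rid = "pLLZAIuPigw=";

    // What `.item(collection_rid)` effectively produces: Cosmos returns 404.
    let encoded = format!("dbs/perfdb/colls/{}/pkranges", encoded_segment(rid));
    assert_eq!(encoded, "dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges");

    // What `.item_by_rid(collection_rid)` produces: the RID passed through verbatim.
    let raw = format!("dbs/perfdb/colls/{}/pkranges", rid);
    assert_eq!(raw, "dbs/perfdb/colls/pLLZAIuPigw=/pkranges");
}
```

Cosmos DB treats the RID as an opaque identity segment, so the percent-encoded form simply names a collection that does not exist.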
#### Step 2: Error silently swallowed

In `partition_key_range_cache.rs` line 147, `try_lookup()`:
```rust
Ok(routing_map.ok()) // Converts Err(404) → Ok(None), invisible to caller
```

The caller in `container_connection.rs` sees `Ok(None)` and skips the routing block entirely:

```rust
let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
```

#### Step 3: Errors not cached → retried on every request
`AsyncCache::get()` only caches successful values. When `compute()` returns `Err`, the error propagates and the cache remains empty. Every subsequent request:

- Read lock → cache miss
- Acquire write lock (serializes all concurrent operations on the same key)
- HTTP request to Cosmos DB → 404
- Error propagated, cache stays empty
- Error swallowed as `Ok(None)`
- Routing bypassed
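The swallow-and-retry chain is easy to reproduce in isolation. The sketch below uses simplified stand-in types (not the SDK's `try_lookup` signature) to show how `Ok(result.ok())` makes a hard 404 indistinguishable from a genuine "nothing cached yet":

```rust
#[derive(Debug)]
struct HttpError(u16);

/// Stand-in for the partition key range fetch; `fail` simulates the 404.
fn fetch_routing_map(fail: bool) -> Result<String, HttpError> {
    if fail {
        Err(HttpError(404))
    } else {
        Ok("routing-map".to_string())
    }
}

/// Mirrors the problematic pattern: `.ok()` discards the error,
/// so Err(404) and a real absence both surface as Ok(None).
fn try_lookup(fail: bool) -> Result<Option<String>, HttpError> {
    let routing_map = fetch_routing_map(fail);
    Ok(routing_map.ok())
}

fn main() {
    // The caller sees Ok(None) and silently skips client-side routing:
    assert_eq!(try_lookup(true).unwrap(), None);
    assert_eq!(try_lookup(false).unwrap(), Some("routing-map".to_string()));
}
```

Because the caller cannot distinguish the two cases, there is no signal that would prompt a log line, a retry policy, or an alert; the failure only shows up indirectly as 404 volume and lock contention.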
### Evidence from Benchmarks

Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K RU/s):
| Hour (UTC) | 404 Count | Notes |
|---|---|---|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background |
| 18:02 | 1,645,604 | After deploying commit 98d01c8 |
Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side.
### Suggested Fix
```diff
 // partition_key_range_cache.rs, get_routing_map_for_collection()
 let pk_range_link = self
     .database_link
     .feed(ResourceType::Containers)
-    .item(collection_rid)
+    .item_by_rid(collection_rid)
     .feed(ResourceType::PartitionKeyRanges);
```

Additionally, consider:

- Logging errors in `try_lookup` before swallowing them, to make silent failures visible
- Adding negative caching (or a backoff) in `AsyncCache` to avoid retrying failed fetches on every request
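As a rough illustration of the negative-caching suggestion, here is a minimal single-threaded sketch. It is not the SDK's `AsyncCache` (names and structure are invented for this example); it just shows the idea of recording a failure with a backoff deadline so a persistent 404 does not become one network fetch per request:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A cache entry is either a good value or a record that fetching
/// failed recently, valid until the backoff deadline passes.
enum Entry<V> {
    Value(V),
    FailedUntil(Instant),
}

struct BackoffCache<V> {
    entries: HashMap<String, Entry<V>>,
    backoff: Duration,
}

impl<V: Clone> BackoffCache<V> {
    fn new(backoff: Duration) -> Self {
        Self { entries: HashMap::new(), backoff }
    }

    /// Return the cached value or call `fetch`; on failure, store a
    /// negative entry so callers back off instead of re-fetching.
    fn get(&mut self, key: &str, fetch: impl FnOnce() -> Result<V, String>) -> Result<V, String> {
        match self.entries.get(key) {
            Some(Entry::Value(v)) => return Ok(v.clone()),
            Some(Entry::FailedUntil(deadline)) if Instant::now() < *deadline => {
                return Err("backing off after recent failure".to_string());
            }
            _ => {}
        }
        match fetch() {
            Ok(v) => {
                self.entries.insert(key.to_string(), Entry::Value(v.clone()));
                Ok(v)
            }
            Err(e) => {
                let deadline = Instant::now() + self.backoff;
                self.entries.insert(key.to_string(), Entry::FailedUntil(deadline));
                Err(e)
            }
        }
    }
}

fn main() {
    let mut cache: BackoffCache<String> = BackoffCache::new(Duration::from_secs(5));
    let mut fetches = 0;
    for _ in 0..3 {
        let _ = cache.get("rid", || {
            fetches += 1;
            Err("404".to_string())
        });
    }
    // Only the first attempt reached the "network"; the rest short-circuited.
    assert_eq!(fetches, 1);
}
```

A real version would also need an expiry on positive entries and async-aware locking, but even this shape would have capped the observed 1.6M extra 404s/hour at roughly one fetch per backoff window per key.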
### Affected Version
Commit 98d01c8 on release/azure_data_cosmos-previews branch (PR #4005).