Skip to content

Commit b6def98

Browse files
Copilottvaron3
andcommitted
Fix pk_range_cache to use .item_by_rid() for correct URL fetching (Azure#4032)
Fixes two issues in `sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`: In `get_routing_map_for_collection()`, changed `.item(collection_rid)` to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404 on every partition key range fetch. The `.item_by_rid()` method uses `LinkSegment::identity()` (no encoding), producing the correct URL: ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that was silently swallowing errors. Routing map fetch failures now emit a warning with the `collection_rid` and error details, making silent failures visible in diagnostics. Added three unit tests in `partition_key_range_cache.rs` that directly verify the URL construction behavior: - `pkranges_link_rid_with_equals_is_not_encoded` — verifies `item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` → `dbs/perfdb/colls/pLLZAIuPigw=/pkranges`) - `pkranges_link_item_encodes_equals_incorrectly` — documents the bug: `item()` encodes `=` to `%3D`, producing a path that causes 404s - `pkranges_link_rid_with_plus_is_not_encoded` — verifies `item_by_rid()` also preserves `+` and `/` in base64 RIDs PR Azure#4005 changed the `pk_range_cache` key from container name to collection RID. The URL construction code was not updated to use `.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every pkranges fetch. Because errors were silently swallowed and `AsyncCache` does not cache errors, this failed on every single request, resulting in write lock contention and loss of client-side partition key routing (~7% throughput regression observed in benchmarks). <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> ---- *This section details on the original issue you should resolve* <issue_title>[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request</issue_title> <issue_description>## Bug Report PR [Azure#4005](Azure#4005) changed the `pk_range_cache` key from container **name** to collection **RID**. However, the code that constructs the URL for fetching partition key ranges still uses `.item(collection_rid)` instead of `.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded (e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every partition key range fetch attempt. Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())` and `AsyncCache` does not cache errors, this failure repeats on **every single request**, causing: 1. **1.6M extra 404 requests/hour** observed on a benchmark account after deploying the change 2. **Write lock contention** in `AsyncCache` as every concurrent operation serializes through the failed fetch path 3. **Loss of client-side partition key routing** — the gateway must route all requests instead 4. **~7% throughput regression** observed in continuous benchmarks (110M → 102M requests/hour) In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`: ```rust let pk_range_link = self .database_link // dbs/perfdb .feed(ResourceType::Containers) .item(collection_rid) // ← BUG: .item() URL-encodes the RID .feed(ResourceType::PartitionKeyRanges); ``` `.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs like `pLLZAIuPigw=` get the `=` encoded to `%3D`: ``` dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges ← Cosmos DB returns 404 ``` Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no encoding): ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` In `partition_key_range_cache.rs` line 147, `try_lookup()`: ```rust Ok(routing_map.ok()) // Converts Err(404) → Ok(None), invisible to caller ``` The caller in `container_connection.rs` sees `Ok(None)` and skips the routing block entirely: ```rust let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?; if let Some(routing_map) = routing_map { // SKIPPED — no client-side partition key range resolution } ``` `AsyncCache::get()` only caches successful values. When `compute()` returns `Err`, the error propagates and the cache remains empty. Every subsequent request: 1. Read lock → cache miss 2. Acquire **write lock** (serializes all concurrent operations on the same key) 3. HTTP request to Cosmos DB → **404** 4. Error propagated, cache stays empty 5. Error swallowed as `Ok(None)` 6. Routing bypassed Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K RU/s): | Hour (UTC) | 404 Count | Notes | |---|---|---| | 13:02 – 17:02 | 3,500 – 6,300 | Normal background | | **18:02** | **1,645,604** | After deploying commit `98d01c8` | Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side. ```diff // partition_key_range_cache.rs, get_routing_map_for_collection() let pk_range_link = self .database_link .feed(ResourceType::Containers) - .item(collection_rid) + .item_by_rid(collection_rid) .feed(ResourceType::PartitionKeyRanges); ``` Additionally, consider: - Logging errors in `try_lookup` before swallowing them, to make silent failures visible - Adding negative caching (or a backoff) in `AsyncCache` to avoid retrying failed fetches on every request Commit [`98d01c8`](Azure@98d01c8) on `release/azure_data_cosmos-previews` branch (PR Azure#4005). </issue_description> <agent_instructions> @copilot fix this issue. The target branch is release/azure_data_cosmos-previews. Required changes: 1. Primary fix — In sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs, in get_routing_map_for_collection(), change .item(collection_rid) to .item_by_rid(collection_rid) so the collection RID is not URL-encoded when constructing the pkranges resource link. 2. Add error logging in try_lookup — Before the .ok() on the last line of try_lookup() in the same file, add a tracing::warn! that logs when the routing map fetch fails, including the collection_rid and the error. This ensures silent failures are visible in diagnostics. Example: let routing_map = self.routing_map_cache.get(/* ... */).await; if let Err(ref e) = routing_map { tracing::warn!( collection_rid, ... </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes Azure#4031 <!-- START COPILOT CODING AGENT TIPS --> --- 📍 Connect Copilot coding agent with [Jira](https://gh.io/cca-jira-docs), [Azure Boards](https://gh.io/cca-azure-boards-docs) or [Linear](https://gh.io/cca-linear-docs) to delegate work to Copilot in one click without leaving your project management tool. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
1 parent e40336b commit b6def98

File tree

2 files changed

+67
-1
lines changed

2 files changed

+67
-1
lines changed

sdk/cosmos/.cspell.json

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@
7373
"Kusto",
7474
"libazurecosmos",
7575
"linearizability",
76+
"LLZA",
7677
"makefiles",
7778
"MEMORYSTATUSEX",
7879
"moka",
@@ -105,6 +106,7 @@
105106
"perfdb",
106107
"perfresults",
107108
"PIDS",
109+
"Pigw",
108110
"Pkrange",
109111
"PKRANGE",
110112
"pkranges",

sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs

Lines changed: 65 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -144,6 +144,13 @@ impl PartitionKeyRangeCache {
144144
)
145145
.await;
146146

147+
if let Err(ref e) = routing_map {
148+
tracing::warn!(
149+
collection_rid,
150+
error = %e,
151+
"Failed to fetch routing map for collection"
152+
);
153+
}
147154
Ok(routing_map.ok())
148155
}
149156

@@ -189,7 +196,7 @@ impl PartitionKeyRangeCache {
189196
let pk_range_link = self
190197
.database_link
191198
.feed(ResourceType::Containers)
192-
.item(collection_rid)
199+
.item_by_rid(collection_rid)
193200
.feed(ResourceType::PartitionKeyRanges);
194201
let response = self
195202
.execute_partition_key_range_read_change_feed(
@@ -673,4 +680,61 @@ mod tests {
673680
assert!(range.target_throughput.is_some());
674681
assert_eq!(range.target_throughput.unwrap(), 1000.0);
675682
}
683+
684+
// Tests verifying that the pkranges resource link uses item_by_rid() so that
685+
// collection RIDs (which are base64-encoded and can contain '=', '+', '/') are
686+
// not URL-percent-encoded. Using item() would encode '=' to '%3D', causing 404s.
687+
688+
#[test]
689+
fn pkranges_link_rid_with_equals_is_not_encoded() {
690+
// RIDs like "pLLZAIuPigw=" contain '=' which item() would encode to '%3D'.
691+
// item_by_rid() must preserve it as-is.
692+
let collection_rid = "pLLZAIuPigw=";
693+
let database_link = ResourceLink::root(ResourceType::Databases).item("perfdb");
694+
let pk_range_link = database_link
695+
.feed(ResourceType::Containers)
696+
.item_by_rid(collection_rid)
697+
.feed(ResourceType::PartitionKeyRanges);
698+
699+
// Correct: '=' preserved, not encoded to '%3D'
700+
assert_eq!(
701+
"dbs/perfdb/colls/pLLZAIuPigw=/pkranges",
702+
pk_range_link.path()
703+
);
704+
}
705+
706+
#[test]
707+
fn pkranges_link_item_encodes_equals_incorrectly() {
708+
// Demonstrates the bug: item() URL-encodes '=' to '%3D', producing a path
709+
// that Cosmos DB cannot find (404).
710+
let collection_rid = "pLLZAIuPigw=";
711+
let database_link = ResourceLink::root(ResourceType::Databases).item("perfdb");
712+
let pk_range_link_wrong = database_link
713+
.feed(ResourceType::Containers)
714+
.item(collection_rid)
715+
.feed(ResourceType::PartitionKeyRanges);
716+
717+
// Wrong: '=' is encoded to '%3D', causing 404 from Cosmos DB
718+
assert!(
719+
pk_range_link_wrong.path().contains("%3D"),
720+
"item() should URL-encode '=' to '%3D'"
721+
);
722+
assert_eq!(
723+
"dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges",
724+
pk_range_link_wrong.path()
725+
);
726+
}
727+
728+
#[test]
729+
fn pkranges_link_rid_with_plus_is_not_encoded() {
730+
// RIDs may also contain '+' (base64 char). item_by_rid() must preserve it.
731+
let collection_rid = "AB+CD/EF==";
732+
let database_link = ResourceLink::root(ResourceType::Databases).item("mydb");
733+
let pk_range_link = database_link
734+
.feed(ResourceType::Containers)
735+
.item_by_rid(collection_rid)
736+
.feed(ResourceType::PartitionKeyRanges);
737+
738+
assert_eq!("dbs/mydb/colls/AB+CD/EF==/pkranges", pk_range_link.path());
739+
}
676740
}

0 commit comments

Comments
 (0)