Commit b6def98
Fix pk_range_cache to use .item_by_rid() for correct URL fetching (Azure#4032)
Fixes two issues in
`sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`:
In `get_routing_map_for_collection()`, changed `.item(collection_rid)`
to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the
value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being
encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404
on every partition key range fetch. The `.item_by_rid()` method uses
`LinkSegment::identity()` (no encoding), producing the correct URL:
```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct
```
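The difference between the two link constructions can be illustrated with a minimal, self-contained sketch. The helpers `encode_segment`, `identity_segment`, and `pkranges_path` are hypothetical stand-ins written for this illustration; they are not the SDK's `LinkSegment` implementation, and the exact character set the SDK encodes may differ.

```rust
/// Hypothetical stand-in for an encoding segment constructor: percent-encodes
/// every byte outside the RFC 3986 unreserved set.
fn encode_segment(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Hypothetical stand-in for an identity segment constructor: the value is
/// used verbatim, with no encoding applied.
fn identity_segment(value: &str) -> String {
    value.to_string()
}

/// Builds `dbs/{database}/colls/{segment}/pkranges` from a prepared segment.
fn pkranges_path(database: &str, collection_segment: &str) -> String {
    format!("dbs/{}/colls/{}/pkranges", database, collection_segment)
}
```

Passing `encode_segment("pLLZAIuPigw=")` to `pkranges_path` yields the `%3D` form that triggered the 404s, while `identity_segment` leaves the trailing `=` intact.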
Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that
was silently swallowing errors. Routing map fetch failures now emit a
warning with the `collection_rid` and error details, making silent
failures visible in diagnostics.
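A rough sketch of the log-then-swallow pattern follows. It assumes a generic `Result` and uses a hypothetical `log` callback in place of `tracing::warn!` so the example stays dependency-free; the function name `ok_with_warning` is likewise invented for illustration and does not appear in the SDK.

```rust
use std::fmt::Display;

/// Converts a fetch result into an `Option`, reporting any error first.
/// `log` is a hypothetical hook standing in for `tracing::warn!`.
fn ok_with_warning<T, E: Display>(
    collection_rid: &str,
    result: Result<T, E>,
    mut log: impl FnMut(String),
) -> Option<T> {
    // Report the failure before `.ok()` discards the error value.
    if let Err(ref e) = result {
        log(format!(
            "routing map fetch failed: collection_rid={}, error={}",
            collection_rid, e
        ));
    }
    result.ok()
}
```

The caller still receives `None` on failure, so behavior is unchanged; the only difference is that the error is recorded before it is dropped.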
Added three unit tests in `partition_key_range_cache.rs` that directly
verify the URL construction behavior:
- `pkranges_link_rid_with_equals_is_not_encoded` — verifies
`item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` →
`dbs/perfdb/colls/pLLZAIuPigw=/pkranges`)
- `pkranges_link_item_encodes_equals_incorrectly` — documents the bug:
`item()` encodes `=` to `%3D`, producing a path that causes 404s
- `pkranges_link_rid_with_plus_is_not_encoded` — verifies
`item_by_rid()` also preserves `+` and `/` in base64 RIDs
PR Azure#4005 changed the `pk_range_cache` key from container name to
collection RID. The URL construction code was not updated to use
`.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every
pkranges fetch. Because errors were silently swallowed and `AsyncCache`
does not cache errors, this failed on every single request, resulting in
write lock contention and loss of client-side partition key routing (~7%
throughput regression observed in benchmarks).
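To see why the failure repeated on every request, consider a simplified model of a cache that stores only successful values. `SuccessOnlyCache` below is a hypothetical, single-threaded stand-in written for this illustration; it is not the SDK's `AsyncCache` and omits locking and async behavior entirely.

```rust
use std::collections::HashMap;

/// Hypothetical, single-threaded stand-in for a cache that stores only
/// successful values. It counts how often the fetch closure is invoked.
struct SuccessOnlyCache {
    entries: HashMap<String, String>,
    fetches: usize,
}

impl SuccessOnlyCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), fetches: 0 }
    }

    /// Returns the cached value, or runs `compute` and stores the value
    /// only when `compute` succeeds. Errors leave the cache empty.
    fn get<F>(&mut self, key: &str, compute: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if let Some(value) = self.entries.get(key) {
            return Ok(value.clone());
        }
        self.fetches += 1;
        let value = compute()?; // on Err, nothing is inserted
        self.entries.insert(key.to_string(), value.clone());
        Ok(value)
    }
}
```

With a `compute` closure that always fails, every call to `get` increments `fetches`; with one that succeeds, only the first call does. In the real client each of those repeated fetches was an HTTP round trip taken under a write lock.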
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
----
*This section details the original issue you should resolve*
<issue_title>[Cosmos] pk_range_cache uses .item() instead of
.item_by_rid() causing silent 404s on every request</issue_title>
<issue_description>## Bug Report
PR Azure#4005
changed the `pk_range_cache` key from container **name** to collection
**RID**. However, the code that constructs the URL for fetching
partition key ranges still uses `.item(collection_rid)` instead of
`.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded
(e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every
partition key range fetch attempt.
Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())`
and `AsyncCache` does not cache errors, this failure repeats on **every
single request**, causing:
1. **1.6M extra 404 requests/hour** observed on a benchmark account
after deploying the change
2. **Write lock contention** in `AsyncCache` as every concurrent
operation serializes through the failed fetch path
3. **Loss of client-side partition key routing** — the gateway must
route all requests instead
4. **~7% throughput regression** observed in continuous benchmarks (110M
→ 102M requests/hour)
In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`:
```rust
let pk_range_link = self
    .database_link // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid) // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);
```
`.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs
like `pLLZAIuPigw=` get the `=` encoded to `%3D`:
```
dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges ← Cosmos DB returns 404
```
Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no
encoding):
```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct
```
In `partition_key_range_cache.rs` line 147, `try_lookup()`:
```rust
Ok(routing_map.ok()) // Converts Err(404) → Ok(None), invisible to caller
```
The caller in `container_connection.rs` sees `Ok(None)` and skips the
routing block entirely:
```rust
let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
```
`AsyncCache::get()` only caches successful values. When `compute()`
returns `Err`, the error propagates and the cache remains empty. Every
subsequent request:
1. Read lock → cache miss
2. Acquire **write lock** (serializes all concurrent operations on the
same key)
3. HTTP request to Cosmos DB → **404**
4. Error propagated, cache stays empty
5. Error swallowed as `Ok(None)`
6. Routing bypassed
Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K
RU/s):
| Hour (UTC) | 404 Count | Notes |
|---|---|---|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background |
| **18:02** | **1,645,604** | After deploying commit `98d01c8` |
Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression).
Server-side latency actually decreased (fewer effective requests
reaching the service), confirming the bottleneck is client-side.
```diff
// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);
```
Additionally, consider:
- Logging errors in `try_lookup` before swallowing them, to make silent
failures visible
- Adding negative caching (or a backoff) in `AsyncCache` to avoid
retrying failed fetches on every request
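
The negative-caching suggestion could look roughly like the sketch below. Everything here is hypothetical: `NegativeCache`, its methods, and the `retry_after` parameter are invented for illustration and do not exist in `AsyncCache`. The current time is passed in explicitly so the logic can be exercised deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical negative cache: remembers when a key last failed and
/// suppresses retries until `retry_after` has elapsed.
struct NegativeCache {
    retry_after: Duration,
    failed_at: HashMap<String, Instant>,
}

impl NegativeCache {
    fn new(retry_after: Duration) -> Self {
        Self { retry_after, failed_at: HashMap::new() }
    }

    /// Records a failed fetch for `key` at time `now`.
    fn record_failure(&mut self, key: &str, now: Instant) {
        self.failed_at.insert(key.to_string(), now);
    }

    /// Clears any recorded failure after a successful fetch.
    fn record_success(&mut self, key: &str) {
        self.failed_at.remove(key);
    }

    /// Returns true when a fetch for `key` may be attempted at time `now`.
    fn should_attempt(&self, key: &str, now: Instant) -> bool {
        match self.failed_at.get(key) {
            Some(failed) => now.duration_since(*failed) >= self.retry_after,
            None => true,
        }
    }
}
```

A caller would check `should_attempt` before issuing the fetch and call `record_failure` or `record_success` afterwards, which bounds the number of failing requests per key to one per `retry_after` window.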
Commit `98d01c8` on the `release/azure_data_cosmos-previews` branch (PR
Azure#4005).
</issue_description>
<agent_instructions>
@copilot fix this issue. The target branch is
release/azure_data_cosmos-previews.
Required changes:
1. Primary fix — In
sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs,
in get_routing_map_for_collection(), change .item(collection_rid) to
.item_by_rid(collection_rid) so the collection
RID is not URL-encoded when constructing the pkranges resource link.
2. Add error logging in try_lookup — Before the .ok() on the last line
of try_lookup() in the same file, add a tracing::warn! that logs when
the routing map fetch fails, including the collection_rid and the
error. This ensures silent failures are visible in diagnostics. Example:
let routing_map = self.routing_map_cache.get(/* ... */).await;
if let Err(ref e) = routing_map {
tracing::warn!(
collection_rid,
...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
- Fixes Azure#4031
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>

1 parent e40336b commit b6def98
File tree: 2 files changed (+67, -1) under sdk/cosmos, including azure_data_cosmos/src/routing. One file has 2 additions; the other has 65 additions and 1 deletion.