[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request #4031

@tvaron3

Description

Bug Report

Summary

PR #4005 changed the pk_range_cache key from container name to collection RID. However, the code that constructs the URL for fetching partition key ranges still uses .item(collection_rid) instead of .item_by_rid(collection_rid). This causes the RID to be URL-encoded (the trailing = becomes %3D), resulting in a 404 from Cosmos DB on every partition key range fetch.

Because try_lookup silently swallows errors via Ok(routing_map.ok()) and AsyncCache does not cache errors, this failure repeats on every single request, causing:

  1. 1.6M extra 404 requests/hour observed on a benchmark account after deploying the change
  2. Write lock contention in AsyncCache as every concurrent operation serializes through the failed fetch path
  3. Loss of client-side partition key routing — the gateway must route all requests instead
  4. ~7% throughput regression observed in continuous benchmarks (110M → 102M requests/hour)

Root Cause (3-step chain)

Step 1: Wrong URL encoding — .item() vs .item_by_rid()

In partition_key_range_cache.rs, get_routing_map_for_collection():

let pk_range_link = self
    .database_link                       // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid)                // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);

.item() calls LinkSegment::new(), which URL-encodes the value. For RIDs like pLLZAIuPigw=, the trailing = is encoded as %3D:

dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges  ← Cosmos DB returns 404

The code should instead use .item_by_rid(), which calls LinkSegment::identity() and leaves the value unencoded:

dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
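The encoding difference can be illustrated with a minimal sketch. This is not the SDK's actual LinkSegment code; percent_encode here is a hypothetical stand-in that encodes everything outside RFC 3986's unreserved set, the way a path-segment encoder would:

```rust
// Hypothetical illustration of the bug, not the SDK's LinkSegment implementation.
// Percent-encoding a base64-style RID mangles its trailing `=` padding.
fn percent_encode(s: &str) -> String {
    s.bytes()
        .map(|b| match b {
            // RFC 3986 unreserved characters pass through untouched.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            // Everything else (including `=`) becomes %XX.
            _ => format!("%{:02X}", b),
        })
        .collect()
}

fn main() {
    let rid = "pLLZAIuPigw=";
    // .item()-style path segment: the `=` becomes %3D, and Cosmos DB returns 404.
    assert_eq!(percent_encode(rid), "pLLZAIuPigw%3D");
    println!("buggy path:   dbs/perfdb/colls/{}/pkranges", percent_encode(rid));
    // .item_by_rid()-style segment: the RID passes through verbatim.
    println!("correct path: dbs/perfdb/colls/{}/pkranges", rid);
}
```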

Step 2: Error silently swallowed

In partition_key_range_cache.rs line 147, try_lookup():

Ok(routing_map.ok())  // Converts Err(404) → Ok(None), invisible to caller

The caller in container_connection.rs sees Ok(None) and skips the routing block entirely:

let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
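The swallowing comes down to Result::ok(), which maps any Err into None. A minimal sketch with hypothetical types (the real try_lookup works on a routing map, not a &str) shows why the caller cannot tell a failed fetch apart from a genuinely absent routing map:

```rust
// Hypothetical stand-in for try_lookup: the error code plays the role of an
// HTTP status, the &str plays the role of the routing map.
fn try_lookup(fetch: Result<&'static str, u16>) -> Result<Option<&'static str>, u16> {
    // Result::ok() converts Err(_) to None, so the error never reaches the caller.
    Ok(fetch.ok())
}

fn main() {
    // A 404 from the service becomes Ok(None): indistinguishable from "no map".
    assert_eq!(try_lookup(Err(404)), Ok(None));
    // A successful fetch still surfaces normally.
    assert_eq!(try_lookup(Ok("routing-map")), Ok(Some("routing-map")));
}
```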

Step 3: Errors not cached → retried on every request

AsyncCache::get() only caches successful values. When compute() returns Err, the error propagates and the cache remains empty. Every subsequent request:

  1. Read lock → cache miss
  2. Acquire write lock (serializes all concurrent operations on the same key)
  3. HTTP request to Cosmos DB → 404
  4. Error propagated, cache stays empty
  5. Error swallowed as Ok(None)
  6. Routing bypassed
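The retry-on-every-request behavior can be sketched with a hypothetical, synchronous cache (the real AsyncCache is async and takes a write lock around the compute; this sketch only models "errors are never stored"):

```rust
use std::collections::HashMap;

// Hypothetical sketch of a success-only cache, not the real AsyncCache.
struct Cache {
    map: HashMap<String, String>,
    fetch_attempts: u32,
}

impl Cache {
    fn get(&mut self, key: &str, compute: impl Fn() -> Result<String, u16>) -> Result<String, u16> {
        if let Some(v) = self.map.get(key) {
            return Ok(v.clone()); // cache hit: no fetch
        }
        // In the real AsyncCache this path runs under a write lock,
        // serializing every concurrent operation on the same key.
        self.fetch_attempts += 1;
        let v = compute()?; // Err propagates; nothing is cached
        self.map.insert(key.to_string(), v.clone());
        Ok(v)
    }
}

fn main() {
    let mut cache = Cache { map: HashMap::new(), fetch_attempts: 0 };
    for _ in 0..3 {
        // Every request repeats the failing 404 fetch; the cache never fills.
        assert_eq!(cache.get("pLLZAIuPigw=", || Err(404)), Err(404));
    }
    assert_eq!(cache.fetch_attempts, 3);
}
```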

Evidence from Benchmarks

Continuous benchmark on cosmos-perf-rg (4 pods, concurrency=100, 400K RU/s):

| Hour (UTC)    | 404 Count     | Notes                           |
|---------------|---------------|---------------------------------|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background               |
| 18:02         | 1,645,604     | After deploying commit 98d01c8  |

Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side.

Suggested Fix

// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);

Additionally, consider:

  • Logging errors in try_lookup before swallowing them, to make silent failures visible
  • Adding negative caching (or a backoff) in AsyncCache to avoid retrying failed fetches on every request
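The negative-caching idea can be sketched as follows. This is a hypothetical design sketch (fixed backoff, one key), not a proposal for the actual AsyncCache API:

```rust
use std::time::{Duration, Instant};

// Hypothetical negative-cache sketch: remember the last failure and skip
// re-fetching until a backoff window has elapsed.
struct NegativeCache {
    backoff: Duration,
    last_failure: Option<Instant>,
}

impl NegativeCache {
    fn should_fetch(&self, now: Instant) -> bool {
        match self.last_failure {
            // Within the backoff window, don't hammer the service with repeat 404s.
            Some(failed_at) => now.duration_since(failed_at) >= self.backoff,
            None => true,
        }
    }

    fn record_failure(&mut self, now: Instant) {
        self.last_failure = Some(now);
    }
}

fn main() {
    let mut nc = NegativeCache { backoff: Duration::from_secs(5), last_failure: None };
    let t0 = Instant::now();
    assert!(nc.should_fetch(t0)); // no failure recorded yet
    nc.record_failure(t0);
    assert!(!nc.should_fetch(t0 + Duration::from_secs(1))); // still backing off
    assert!(nc.should_fetch(t0 + Duration::from_secs(6)));  // window elapsed
}
```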

Affected Version

Commit 98d01c8 on release/azure_data_cosmos-previews branch (PR #4005).

Metadata

Labels

  • Cosmos — The azure_cosmos crate
  • bug — This issue requires a change to an existing behavior in the product in order to be resolved.

Status

Done