PartitionKeyRangeCache uses database RID instead of container RID in pkranges requests #5734
Description
Summary
The .NET CosmosDB SDK's PartitionKeyRangeCache intermittently sends partition key range requests using the database RID as both dbId and collId in the URL, while the authorization signature is computed using the real container RID.
Symptom
The SDK sends requests like:
GET /dbs/{dbRid}/colls/{dbRid}/pkranges
Where the same 4-byte database RID (e.g., vCHWYA==) appears in both the dbId and collId positions. However, the Authorization header is signed using the actual 8-byte container RID (e.g., vCHWYNa97mg=, lowercased in the string-to-sign).
This causes authentication failures on servers that derive the resource link from the URL path, because the URL and the signature disagree on the resource identity.
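To make the disagreement concrete, here is a minimal Go sketch of how a server might recompute a master-key signature from the URL-derived resource link. It assumes the documented string-to-sign layout (lowercased verb, lowercased resource type, resource link, lowercased date, and an empty line, each newline-terminated) signed with HMAC-SHA256. The `sign` helper and the key are hypothetical; this is neither SDK code nor the emulator's actual implementation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// sign is a hypothetical helper: it builds the string-to-sign and returns the
// base64-encoded HMAC-SHA256 signature computed with the base64-decoded key.
func sign(keyB64, verb, resType, resLink, date string) (string, error) {
	key, err := base64.StdEncoding.DecodeString(keyB64)
	if err != nil {
		return "", err
	}
	payload := strings.ToLower(verb) + "\n" +
		strings.ToLower(resType) + "\n" +
		resLink + "\n" +
		strings.ToLower(date) + "\n" +
		"\n"
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil)), nil
}

func main() {
	// Hypothetical key for illustration only, not a real account key.
	key := base64.StdEncoding.EncodeToString([]byte("hypothetical-master-key"))
	date := "Fri, 03 Apr 2026 21:48:17 GMT"

	// What the server derives from the URL (database RID in the collId slot).
	fromURL, err := sign(key, "GET", "pkranges", "onmh3w==", date)
	if err != nil {
		fmt.Println("sign error:", err)
		return
	}
	// What the client signed (lowercased container RID).
	fromClient, err := sign(key, "GET", "pkranges", "onmh3yeuq40=", date)
	if err != nil {
		fmt.Println("sign error:", err)
		return
	}
	fmt.Println("signatures match:", fromURL == fromClient)
}
```

Because the resource link is part of the signed payload, any server that takes the link from the URL path will compute a different signature than the client sent.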
How I discovered this bug
I wrote a simplified Cosmos DB emulator for local development. While testing my C# code against it, I noticed the SDK sending an incorrect GET pkranges request, e.g. GET /dbs/xEuq5w==/colls/xEuq5w==/pkranges
Here is the raw log from my emulator. It shows two attempts to create a container and then read a document. In the first attempt the SDK sends /dbs/Onmh3w==/colls/Onmh3w==/pkranges, which results in a 401; that is a repro of the issue. In the second attempt the SDK sends the correct request, /dbs/Onmh3w==/colls/Onmh398Cf90=/pkranges.
2026/04/04 08:48:10.196113 Starting HTTPS server on :8081
2026/04/04 08:48:17.392320 AUTH OK GET / resType="" failedLinks=[]
2026/04/04 08:48:17.392320 GET / → 200 (0s)
2026/04/04 08:48:17.488475 AUTH OK DELETE /dbs/RaceConditionTestDb resType="dbs" failedLinks=[]
2026/04/04 08:48:17.488475 DELETE /dbs/RaceConditionTestDb → 404 (0s)
2026/04/04 08:48:17.521252 AUTH OK POST /dbs resType="dbs" failedLinks=[]
2026/04/04 08:48:17.521876 POST /dbs → 201 (624.1µs)
2026/04/04 08:48:17.538465 AUTH OK POST /dbs/RaceConditionTestDb/colls resType="colls" failedLinks=[]
2026/04/04 08:48:17.539331 POST /dbs/RaceConditionTestDb/colls → 201 (866.2µs)
2026/04/04 08:48:17.566253 AUTH OK GET /dbs/RaceConditionTestDb/colls/TestContainer-1 resType="colls" failedLinks=[]
2026/04/04 08:48:17.567485 GET /dbs/RaceConditionTestDb/colls/TestContainer-1 → 200 (1.2316ms)
2026/04/04 08:48:17.577545 RACE-DIAG: pkranges auth failed with URL-derived ridLink="onmh3w==" err=invalid signature (verb="get" resType="pkranges" resLink="onmh3w==" date="Fri, 03 Apr 2026 21:48:17 GMT" clientSig="XZNl7H/gAkPaxVvgU9ifocn52V6tHC3VyPWwyhSDCas=" expectedSig="lwkEsdLUsktlO/9eLIUAzAlk8pQhzj/0DoGPIfKhacA=")
2026/04/04 08:48:17.577545 RACE-DIAG: *** RACE CONDITION DETECTED ***
2026/04/04 08:48:17.578051 RACE-DIAG: URL collId: Onmh3w== (database RID)
2026/04/04 08:48:17.578051 RACE-DIAG: Signed with: onmh3yeuq40= (container "TestContainer-1" RID)
2026/04/04 08:48:17.578051 RACE-DIAG: The SDK signed with a container RID that differs from the URL path
2026/04/04 08:48:17.578652 AUTH FAILED GET /dbs/Onmh3w==/colls/Onmh3w==/pkranges resType="pkranges" failedLinks=[onmh3w==]
2026/04/04 08:48:17.578652 GET /dbs/Onmh3w==/colls/Onmh3w==/pkranges → 401 (3.0583ms)
2026/04/04 08:48:17.598356 AUTH OK DELETE /dbs/RaceConditionTestDb/colls/TestContainer-1 resType="colls" failedLinks=[]
2026/04/04 08:48:17.598356 DELETE /dbs/RaceConditionTestDb/colls/TestContainer-1 → 204 (0s)
2026/04/04 08:48:17.599668 AUTH OK POST /dbs/RaceConditionTestDb/colls resType="colls" failedLinks=[]
2026/04/04 08:48:17.599668 POST /dbs/RaceConditionTestDb/colls → 201 (0s)
2026/04/04 08:48:17.600173 AUTH OK GET /dbs/RaceConditionTestDb/colls/TestContainer-2 resType="colls" failedLinks=[]
2026/04/04 08:48:17.600173 GET /dbs/RaceConditionTestDb/colls/TestContainer-2 → 200 (0s)
2026/04/04 08:48:17.600869 AUTH OK GET /dbs/Onmh3w==/colls/Onmh398Cf90=/pkranges resType="pkranges" failedLinks=[]
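The "Signed with" diagnostic in the log above identifies which RID the client actually signed. One way a test server could recover that, sketched here in Go under stated assumptions, is to recompute the signature for every known RID and compare each against the client's signature. The helper names, the key, and the candidate list are hypothetical, and the emulator's real implementation may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// sign computes a master-key style signature over the assumed string-to-sign.
func sign(key []byte, verb, resType, resLink, date string) string {
	payload := strings.ToLower(verb) + "\n" + strings.ToLower(resType) + "\n" +
		resLink + "\n" + strings.ToLower(date) + "\n\n"
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

// findSignedRid returns the first candidate RID whose lowercased form, when
// signed, reproduces clientSig. It reports false if no candidate matches.
func findSignedRid(key []byte, verb, resType, date, clientSig string, rids []string) (string, bool) {
	for _, rid := range rids {
		expected := sign(key, verb, resType, strings.ToLower(rid), date)
		if hmac.Equal([]byte(expected), []byte(clientSig)) {
			return rid, true
		}
	}
	return "", false
}

func main() {
	key := []byte("hypothetical-master-key") // not a real account key
	date := "Fri, 03 Apr 2026 21:48:17 GMT"

	// Simulate a client that signed with the container RID.
	clientSig := sign(key, "GET", "pkranges", "onmh3yeuq40=", date)

	// Candidates: the database RID plus container RIDs seen in the log.
	// "onmh3yeuq40=" appears only in lowercased form in the log.
	candidates := []string{"Onmh3w==", "onmh3yeuq40=", "Onmh398Cf90="}

	rid, ok := findSignedRid(key, "GET", "pkranges", date, clientSig, candidates)
	fmt.Println("signed with:", rid, "found:", ok)
}
```

This kind of search is only practical in a local emulator that already knows the key and every RID; it is a diagnostic aid, not something a production server would do.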
Repro
- Create a new database
- Create a container in that database
- Immediately read documents in the container
- Observe that the pkranges request sometimes uses the database RID instead of the container RID in the URL
I pushed my C# repro code to a GitHub repo: https://github.com/renshao/cosmosdb-lite/tree/race-repro
To use it:
- Clone the above repo and check out the `race-repro` branch
- In the repo root dir, run `go build`, then `.\cosmosdb-lite.exe` to launch the emulator
- cd into `RaceConditionRepro` and run `dotnet run`
- Watch the output from both windows; the repro program stops after 3 race conditions have been observed.
Expected behavior
Immediately after a container is created, the SDK always sends the correct pkranges request to the server.
Actual behavior
Intermittently, the SDK sends incorrect GET requests for pkranges, e.g. /dbs/xEuq5w==/colls/xEuq5w==/pkranges, where the database RID is used in place of the container RID.
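A test server can flag this pattern directly from the request path. The Go sketch below is a hypothetical check, not part of the SDK or of the emulator as published: it splits a `/dbs/{dbId}/colls/{collId}/pkranges` path and reports when both identifiers are identical. It is only meaningful for RID-based paths, since a name-based path may legitimately use the same name for a database and a container.

```go
package main

import (
	"fmt"
	"strings"
)

// pkrangesIDs extracts dbId and collId from a /dbs/{dbId}/colls/{collId}/pkranges
// path. ok is false when the path does not have that exact shape.
func pkrangesIDs(path string) (dbID, collID string, ok bool) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) != 5 || parts[0] != "dbs" || parts[2] != "colls" || parts[4] != "pkranges" {
		return "", "", false
	}
	return parts[1], parts[3], true
}

// hasRepeatedRid reports whether a pkranges path carries the same identifier
// in both the dbId and collId positions.
func hasRepeatedRid(path string) bool {
	dbID, collID, ok := pkrangesIDs(path)
	return ok && dbID == collID
}

func main() {
	for _, p := range []string{
		"/dbs/Onmh3w==/colls/Onmh3w==/pkranges",     // failing request from the log
		"/dbs/Onmh3w==/colls/Onmh398Cf90=/pkranges", // correct request from the log
	} {
		fmt.Printf("%s repeated=%v\n", p, hasRepeatedRid(p))
	}
}
```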
Affected Version
- `Microsoft.Azure.Cosmos` SDK v3.58.0 (.NET)
- Windows 11
Root Cause Analysis
Call chain
- `GatewayStoreModel.TryResolvePartitionKeyRangeAsync` (GatewayStoreModel.cs:429-507)
  - Calls `clientCollectionCache.ResolveCollectionAsync(...)` to get container info
  - Passes `collection.ResourceId` to `PartitionKeyRangeCache.TryLookupAsync()`
- `CollectionCache.ResolveCollectionAsync` (CollectionCache.cs:128-195)
  - For RID-based paths, parses the request's `ResourceId` via `ResourceId.Parse()`
  - Extracts `resourceIdParsed.DocumentCollectionId.ToString()` as the collection RID
  - For name-based paths, sets `request.ResourceId = collectionInfo.ResourceId`
- `PartitionKeyRangeCache.ExecutePartitionKeyRangeReadChangeFeedAsync` (PartitionKeyRangeCache.cs:269-326)
  - Creates the HTTP request using the received `collectionRid` for both path segments:

        DocumentServiceRequest.Create(
            OperationType.ReadFeed,
            collectionRid, // ← used as BOTH dbId and collId in URL
            ResourceType.PartitionKeyRange,
            AuthorizationTokenType.PrimaryMasterKey,
            headers)

  - The auth signature is computed from `DocumentServiceRequest.ResourceAddress`, which is the same `collectionRid` value
Suspicious code path
CollectionCache.RefreshAsync (CollectionCache.cs:309-345) uses request.RequestContext.ResolvedCollectionRid as a placeholder value. If this RID is stale or was never properly set (e.g., the container was just created and the cache hasn't been populated yet), the wrong RID can propagate through the system.
The PartitionKeyRangeCache blindly trusts whatever collectionRid it receives — it uses it for both the URL construction and the auth signature. If CollectionCache returns the database RID as collection.ResourceId before the cache is fully warmed, the pkranges request will use the database RID in the URL.
Evidence
- Database RID: `vCHWYA==` (4 bytes: `[188, 33, 214, 96]`)
- Container RID: `vCHWYNa97mg=` (8 bytes: `[188, 33, 214, 96, 214, 189, 238, 104]`)
- The container RID correctly starts with the database RID bytes (hierarchical structure)
- The pkranges URL used `vCHWYA==` (database RID) in both positions
- The auth signature was computed using the container RID (matched via brute-force verification against all containers in the database)
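The byte-level evidence can be checked with a few lines of Go. This is a minimal sketch with hypothetical helper names. It assumes RIDs are base64 strings in which `-` may stand in for `/`; that substitution is an assumption and does not affect the two RIDs quoted here.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeRid decodes a RID string into raw bytes.
// Assumption: '-' is used in place of '/' in RIDs, so it is mapped back first.
func decodeRid(rid string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(strings.ReplaceAll(rid, "-", "/"))
}

// isContainerOf reports whether containerRid is an 8-byte RID whose first
// 4 bytes equal the 4-byte databaseRid.
func isContainerOf(containerRid, databaseRid string) (bool, error) {
	c, err := decodeRid(containerRid)
	if err != nil {
		return false, err
	}
	d, err := decodeRid(databaseRid)
	if err != nil {
		return false, err
	}
	return len(d) == 4 && len(c) == 8 && bytes.HasPrefix(c, d), nil
}

func main() {
	for _, rid := range []string{"vCHWYA==", "vCHWYNa97mg="} {
		b, err := decodeRid(rid)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s -> %d bytes %v\n", rid, len(b), b)
	}
	ok, err := isContainerOf("vCHWYNa97mg=", "vCHWYA==")
	fmt.Println("container RID extends database RID:", ok, err)
}
```

A length check like this is also a cheap way to tell that a value in the collId position is really a database RID: it decodes to 4 bytes instead of 8.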
Key files
| File | Lines | Description |
|---|---|---|
| src/Routing/PartitionKeyRangeCache.cs | 269-326 | URL construction using collectionRid for both path segments |
| src/Routing/PartitionKeyRangeCache.cs | 121-137 | TryLookupAsync passes collectionRid to routing map |
| src/Routing/CollectionCache.cs | 128-195 | ResolveCollectionAsync resolves container RID |
| src/Routing/CollectionCache.cs | 309-345 | RefreshAsync: can reuse stale ResolvedCollectionRid |
| src/GatewayStoreModel.cs | 429-507 | TryResolvePartitionKeyRangeAsync bridges cache → pkranges |
| src/Authorization/AuthorizationTokenProviderMasterKey.cs | 60-112 | Signs using request.ResourceAddress directly |