[Encryption]: Avoiding contention on EncryptionSettingForProperty. #5641
Description
Client-Side Encryption: OperationCanceledException from Global Semaphore Contention
Summary
Client-side encryption operations throw OperationCanceledException under concurrent load. Threads queue on a global SemaphoreSlim(1,1) while one thread holds it and makes synchronous Key Vault HTTP calls. When a queued thread's cancellation token fires during that wait, the wait throws. The root cause is three cascading design issues in Microsoft.Azure.Cosmos.Encryption.
Symptom
System.OperationCanceledException: The operation was canceled.
at System.Threading.CancellationToken.ThrowOperationCanceledException()
at System.Threading.SemaphoreSlim.WaitUntilCountOrTimeoutAsync(...)
at Microsoft.Azure.Cosmos.Encryption.EncryptionSettingForProperty.BuildProtectedDataEncryptionKeyAsync(...)
at Microsoft.Azure.Cosmos.Encryption.EncryptionSettingForProperty.BuildEncryptionAlgorithmForSettingAsync(...)
...
at Microsoft.Azure.Cosmos.Encryption.EncryptionFeedIterator.ReadNextAsync(...)
The exception fires at SemaphoreSlim.WaitUntilCountOrTimeoutAsync — a pure in-memory wait, not I/O. The customer's threads are NOT doing I/O when the exception fires. They are queued behind another thread that is.
Root Cause
Three cascading issues combine to create the problem:
Issue 1: Global SemaphoreSlim(1, 1)
File: EncryptionCosmosClient.cs line 20
internal static readonly SemaphoreSlim EncryptionKeyCacheSemaphore = new SemaphoreSlim(1, 1);
- static: shared across ALL EncryptionCosmosClient instances in the app domain
- Count = 1: only one thread at a time
- Guards ALL traffic: even instant cache hits must acquire this semaphore
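To make the contrast with a single global gate concrete, here is a minimal sketch of a cache whose hits never touch a semaphore and whose misses are serialized per key only. It is not the SDK's code and not a proposed patch: `PerKeyGatedCache`, `GetOrAddAsync`, and the `factory` delegate are hypothetical names, and the sketch leaves out TTL expiry and eviction.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; not part of Microsoft.Azure.Cosmos.Encryption.
public sealed class PerKeyGatedCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> values =
        new ConcurrentDictionary<string, TValue>();

    // One gate per key instead of one gate for the whole process.
    private readonly ConcurrentDictionary<string, SemaphoreSlim> gates =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public async Task<TValue> GetOrAddAsync(
        string key,
        Func<Task<TValue>> factory,
        CancellationToken cancellationToken)
    {
        // Fast path: a cache hit returns without acquiring any semaphore.
        if (this.values.TryGetValue(key, out TValue cached))
        {
            return cached;
        }

        // Slow path: only callers that miss on the SAME key wait on each other.
        SemaphoreSlim gate = this.gates.GetOrAdd(key, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync(cancellationToken);
        try
        {
            // Re-check: another caller may have populated the entry while we waited.
            if (this.values.TryGetValue(key, out cached))
            {
                return cached;
            }

            TValue created = await factory();
            this.values[key] = created;
            return created;
        }
        finally
        {
            gate.Release();
        }
    }
}
```

With this shape, a slow factory for one key delays only callers that miss on that same key. Callers that hit the cache, or that miss on a different key, do not queue behind it.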
Issue 2: DEK Byte Cache Disabled
File: EncryptionKeyStoreProviderImpl.cs line 34
this.DataEncryptionKeyCacheTimeToLive = TimeSpan.Zero; // cache DISABLED
The base class EncryptionKeyStoreProvider (from MDE) has a LocalCache<string, byte[]> with a default 2-hour TTL for unwrapped DEK bytes. This line disables it entirely, so every UnwrapKey call misses the DEK byte cache (L3 in the Cache Layer Analysis table) and calls Key Vault.
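The effect of a zero TTL can be illustrated with a small stand-in cache. This is a sketch under assumptions: `TtlCache` is a hypothetical class, not MDE's LocalCache, and it assumes (as the description above implies) that a non-positive TTL means the factory runs on every call.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a TTL cache; MDE's LocalCache internals are not shown here.
public sealed class TtlCache<TValue>
{
    private readonly ConcurrentDictionary<string, (TValue Value, DateTime ExpiresUtc)> entries =
        new ConcurrentDictionary<string, (TValue Value, DateTime ExpiresUtc)>();

    public TtlCache(TimeSpan timeToLive)
    {
        this.TimeToLive = timeToLive;
    }

    public TimeSpan TimeToLive { get; }

    public TValue GetOrCreate(string key, Func<TValue> factory)
    {
        // Assumption: a zero (or negative) TTL disables caching, so every call pays for the factory.
        if (this.TimeToLive <= TimeSpan.Zero)
        {
            return factory();
        }

        DateTime now = DateTime.UtcNow;
        if (this.entries.TryGetValue(key, out var entry) && entry.ExpiresUtc > now)
        {
            return entry.Value; // hit: the factory (the expensive unwrap) is skipped
        }

        TValue created = factory();
        this.entries[key] = (created, now + this.TimeToLive);
        return created;
    }
}
```

If the factory stands in for the Key Vault unwrap, a cache built with TimeSpan.Zero invokes it on every lookup, while one built with TimeSpan.FromHours(2) invokes it once per key until the entry expires.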
Issue 3: Two Synchronous Key Vault HTTP Calls Under the Semaphore
File: EncryptionKeyStoreProviderImpl.cs lines 53–57
byte[] UnWrapKeyCore()
{
    return this.keyEncryptionKeyResolver
        .Resolve(encryptionKeyId) // I/O #1: sync HTTP GET to Key Vault (resolve CMK → CryptographyClient)
        .UnwrapKey(algorithm, encryptedKey); // I/O #2: sync HTTP POST to Key Vault (RSA-OAEP unwrap)
}
Both calls use the synchronous overloads (not ResolveAsync/UnwrapKeyAsync) because MDE's ProtectedDataEncryptionKey constructor is synchronous: the unwrap runs inside the base class initializer chain, which is structurally impossible to make async. UnwrapKey must be a remote call because the RSA private key never leaves the Key Vault HSM.
Call chain on PDEK cache miss (inside the semaphore):
BuildProtectedDataEncryptionKeyAsync
  └─ await SemaphoreSlim.WaitAsync(-1, cancellationToken) ← other threads queue here
  └─ ProtectedDataEncryptionKey.GetOrCreate(...) ← sync
      └─ new ProtectedDataEncryptionKey(name, kek, wrappedDEK)
          └─ kek.DecryptEncryptionKey(wrappedDEK) ← sync
              └─ EncryptionKeyStoreProviderImpl.UnwrapKey(...)
                  └─ GetOrCreateDataEncryptionKey(hexKey, UnWrapKeyCore)
                      └─ TTL = 0 → ALWAYS executes UnWrapKeyCore
                          └─ Resolve(keyId) ← sync HTTP #1 (100–500ms)
                          └─ UnwrapKey(alg, key) ← sync HTTP #2 (100–5000ms)
Result: Semaphore held for 200ms–5s+ while all other threads are blocked.
Cache Layer Analysis
| Layer | What it caches | TTL | Effective? |
|---|---|---|---|
| L1: Algorithm (AeadAes256CbcHmac256EncryptionAlgorithm) | Encryption algorithm instance | None (new instance on every call) | No caching |
| L2: PDEK (ProtectedDataEncryptionKey.GetOrCreate) | Protected data encryption key (unwrapped DEK + KEK ref) | 1–2 hours | Yes, but gated behind the global semaphore |
| L3: DEK bytes (EncryptionKeyStoreProvider.GetOrCreateDataEncryptionKey) | Raw unwrapped AES-256 key bytes | TimeSpan.Zero | Disabled: always misses, always calls Key Vault |
- L2 PDEK hit: semaphore held for microseconds (in-memory dictionary lookup)
- L2 PDEK miss: semaphore held for 200ms–5s+ (two Key Vault HTTP calls)
- L3 could prevent Key Vault calls on L2 miss, but it's disabled
Concrete Failure Scenario
- PDEK cache TTL expires (every 1–2 hours)
- Thread A acquires semaphore → calls Resolve() + UnwrapKey() → synchronous Key Vault HTTP (100ms–5s+)
- Threads B, C, D queue on SemaphoreSlim.WaitAsync(-1, cancellationToken): pure in-memory wait, no I/O
- Key Vault is slow (throttling, network latency, cold start)
- CancellationToken fires on queued threads (request timeout, Change Feed Processor lease rebalance, etc.)
- → OperationCanceledException at SemaphoreSlim.WaitUntilCountOrTimeoutAsync
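The sequence above can be reproduced in isolation, without Cosmos DB or Key Vault. The following is a minimal sketch, not the SDK's code: `SemaphoreContentionDemo`, `Gate`, and `WaiterIsCanceledAsync` are hypothetical names, and `Thread.Sleep` stands in for the synchronous Key Vault calls made while the semaphore is held.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reproduction; names and timings are illustrative.
public static class SemaphoreContentionDemo
{
    // Stand-in for the global SemaphoreSlim(1, 1).
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(1, 1);

    // Returns true if the queued waiter was canceled while waiting for the gate.
    public static async Task<bool> WaiterIsCanceledAsync(TimeSpan holdTime, TimeSpan waiterTimeout)
    {
        var acquired = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // "Thread A": takes the gate, then blocks on simulated synchronous I/O.
        Task holder = Task.Run(() =>
        {
            Gate.Wait();
            try
            {
                acquired.SetResult(true);
                Thread.Sleep(holdTime); // stand-in for sync Resolve() + UnwrapKey()
            }
            finally
            {
                Gate.Release();
            }
        });

        await acquired.Task; // make sure the gate is held before the waiter queues

        // "Thread B": queues on the gate with a token that fires after waiterTimeout.
        bool canceled = false;
        using (var cts = new CancellationTokenSource(waiterTimeout))
        {
            try
            {
                await Gate.WaitAsync(-1, cts.Token); // in-memory wait, no I/O
                Gate.Release();
            }
            catch (OperationCanceledException)
            {
                canceled = true; // same exception type as in the reported stack trace
            }
        }

        await holder;
        return canceled;
    }
}
```

Calling `WaiterIsCanceledAsync` with a hold time well above the waiter timeout (for example 500 ms against 50 ms) takes the canceled path, even though the waiter itself performs no I/O. With a short hold time and a generous timeout, the waiter acquires the gate normally.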
References
Source files:
- EncryptionCosmosClient.cs: global static SemaphoreSlim(1,1) at line 20
- EncryptionKeyStoreProviderImpl.cs: TimeSpan.Zero at line 34; 2× sync Key Vault calls in UnWrapKeyCore at lines 53–57
- EncryptionSettingForProperty.cs: semaphore acquisition in BuildProtectedDataEncryptionKeyAsync
Tests:
- MdeEncryptionTests.cs: ValidateCachingOfProtectedDataEncryptionKey
Documentation:
External packages:
- Azure.Security.KeyVault.Keys: KeyResolver, CryptographyClient
- Azure.Core.Cryptography: IKeyEncryptionKeyResolver, IKeyEncryptionKey
- Microsoft.Data.Encryption.Cryptography (MDE): ProtectedDataEncryptionKey, KeyEncryptionKey, EncryptionKeyStoreProvider, LocalCache