
[Encryption]: Avoiding contention on EncryptionSettingForProperty. #5641

@jeet1995

Client-Side Encryption: OperationCanceledException from Global Semaphore Contention

Summary

Client-side encryption operations throw OperationCanceledException under concurrent load. Callers queue on a global SemaphoreSlim(1, 1) while a single thread holds it during synchronous Key Vault HTTP calls. When the cancellation tokens of the queued callers fire, the pending wait throws. The root cause is three cascading design issues in Microsoft.Azure.Cosmos.Encryption.

Symptom

System.OperationCanceledException: The operation was canceled.
  at System.Threading.CancellationToken.ThrowOperationCanceledException()
  at System.Threading.SemaphoreSlim.WaitUntilCountOrTimeoutAsync(...)
  at Microsoft.Azure.Cosmos.Encryption.EncryptionSettingForProperty.BuildProtectedDataEncryptionKeyAsync(...)
  at Microsoft.Azure.Cosmos.Encryption.EncryptionSettingForProperty.BuildEncryptionAlgorithmForSettingAsync(...)
  ...
  at Microsoft.Azure.Cosmos.Encryption.EncryptionFeedIterator.ReadNextAsync(...)

The exception fires at SemaphoreSlim.WaitUntilCountOrTimeoutAsync — a pure in-memory wait, not I/O. The customer's threads are NOT doing I/O when the exception fires. They are queued behind another thread that is.


Root Cause

Three cascading issues combine to create the problem:

Issue 1: Global SemaphoreSlim(1, 1)

File: EncryptionCosmosClient.cs line 20

internal static readonly SemaphoreSlim EncryptionKeyCacheSemaphore = new SemaphoreSlim(1, 1);
  • static — shared across ALL EncryptionCosmosClient instances in the app domain
  • Count = 1 — only one thread at a time
  • Guards ALL traffic — even instant cache hits must acquire this semaphore
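
To make the third bullet concrete, here is a minimal sketch of one common way to keep cache hits off a lock: check the cache first, and gate only the slow path, per key. It is an illustration of the contention pattern, not the library's code or a proposed patch. `PerKeyGate`, `GetOrAddAsync`, and `slowFactory` are hypothetical names, and entry expiry is left out for brevity.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of Microsoft.Azure.Cosmos.Encryption.
public sealed class PerKeyGate
{
    private readonly ConcurrentDictionary<string, byte[]> cache =
        new ConcurrentDictionary<string, byte[]>();
    private readonly ConcurrentDictionary<string, SemaphoreSlim> gates =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public async Task<byte[]> GetOrAddAsync(
        string keyId, Func<byte[]> slowFactory, CancellationToken cancellationToken)
    {
        // Fast path: a cache hit never touches a semaphore.
        if (this.cache.TryGetValue(keyId, out byte[] cached))
        {
            return cached;
        }

        // Slow path: only callers that need the same key wait on each other.
        SemaphoreSlim gate = this.gates.GetOrAdd(keyId, _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync(cancellationToken);
        try
        {
            // Re-check: another caller may have filled the cache while this one waited.
            if (this.cache.TryGetValue(keyId, out cached))
            {
                return cached;
            }

            byte[] value = slowFactory();   // stands in for the slow, blocking work
            this.cache[keyId] = value;
            return value;
        }
        finally
        {
            gate.Release();
        }
    }
}
```

With this shape, a second call for the same `keyId` returns the cached array without invoking `slowFactory` again, and a slow call for one key does not block callers that need a different key.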

Issue 2: DEK Byte Cache Disabled

File: EncryptionKeyStoreProviderImpl.cs line 34

this.DataEncryptionKeyCacheTimeToLive = TimeSpan.Zero;  // cache DISABLED

The base class EncryptionKeyStoreProvider (from MDE) has a LocalCache<string, byte[]> with a default 2-hour TTL for unwrapped DEK bytes. This line disables it entirely. Every UnwrapKey call therefore misses this DEK byte cache (layer L3 in the Cache Layer Analysis below) and calls Key Vault.

Issue 3: Two Synchronous Key Vault HTTP Calls Under the Semaphore

File: EncryptionKeyStoreProviderImpl.cs lines 53–57

byte[] UnWrapKeyCore()
{
    return this.keyEncryptionKeyResolver
        .Resolve(encryptionKeyId)              // I/O #1: sync HTTP GET to Key Vault (resolve CMK → CryptographyClient)
        .UnwrapKey(algorithm, encryptedKey);   // I/O #2: sync HTTP POST to Key Vault (RSA-OAEP unwrap)
}

Both use synchronous overloads (not ResolveAsync/UnwrapKeyAsync) because MDE's ProtectedDataEncryptionKey constructor is sync (base class initializer chain — structurally impossible to make async). UnwrapKey must be a remote call — the RSA private key never leaves the Key Vault HSM.

Call chain on PDEK cache miss (inside the semaphore):

BuildProtectedDataEncryptionKeyAsync
  └─ await SemaphoreSlim.WaitAsync(-1, cancellationToken)   ← other threads queue here
       └─ ProtectedDataEncryptionKey.GetOrCreate(...)        ← sync
            └─ new ProtectedDataEncryptionKey(name, kek, wrappedDEK)
                 └─ kek.DecryptEncryptionKey(wrappedDEK)     ← sync
                      └─ EncryptionKeyStoreProviderImpl.UnwrapKey(...)
                           └─ GetOrCreateDataEncryptionKey(hexKey, UnWrapKeyCore)
                                └─ TTL = 0 → ALWAYS executes UnWrapKeyCore
                                     └─ Resolve(keyId)       ← sync HTTP #1 (100–500ms)
                                     └─ UnwrapKey(alg, key)  ← sync HTTP #2 (100–5000ms)

Result: Semaphore held for 200ms–5s+ while all other threads are blocked.


Cache Layer Analysis

| Layer | What it caches | TTL | Effective? |
|---|---|---|---|
| L1: Algorithm (AeadAes256CbcHmac256EncryptionAlgorithm) | Encryption algorithm instance | None (new every call) | No caching |
| L2: PDEK (ProtectedDataEncryptionKey.GetOrCreate) | Protected data encryption key (unwrapped DEK + KEK ref) | 1–2 hours | Yes, but gated behind the global semaphore |
| L3: DEK bytes (EncryptionKeyStoreProvider.GetOrCreateDataEncryptionKey) | Raw unwrapped AES-256 key bytes | TimeSpan.Zero | Disabled: always misses, always calls Key Vault |

  • L2 PDEK hit: semaphore held for microseconds (in-memory dictionary lookup)
  • L2 PDEK miss: semaphore held for 200ms–5s+ (two Key Vault HTTP calls)
  • L3 could prevent Key Vault calls on L2 miss, but it's disabled
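
The effect of a zero TTL on L3 can be sketched with a small stand-in cache. This is not MDE's LocalCache implementation. `TtlCache` and its members are hypothetical names, and the behavior of skipping storage when the TTL is zero is assumed from the description above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for a TTL-based byte cache; not the MDE LocalCache.
public sealed class TtlCache
{
    private readonly ConcurrentDictionary<string, (byte[] Value, DateTime ExpiresUtc)> entries =
        new ConcurrentDictionary<string, (byte[] Value, DateTime ExpiresUtc)>();
    private readonly TimeSpan timeToLive;
    private int factoryCalls;

    public TtlCache(TimeSpan timeToLive)
    {
        this.timeToLive = timeToLive;
    }

    // Number of times the (simulated) remote unwrap was executed.
    public int FactoryCalls => this.factoryCalls;

    public byte[] GetOrCreate(string key, Func<byte[]> factory)
    {
        // With a zero TTL nothing is ever stored, so every lookup falls through to the factory.
        if (this.timeToLive > TimeSpan.Zero
            && this.entries.TryGetValue(key, out var entry)
            && entry.ExpiresUtc > DateTime.UtcNow)
        {
            return entry.Value;
        }

        Interlocked.Increment(ref this.factoryCalls);
        byte[] value = factory();   // stands in for the remote Key Vault unwrap
        if (this.timeToLive > TimeSpan.Zero)
        {
            this.entries[key] = (value, DateTime.UtcNow + this.timeToLive);
        }

        return value;
    }
}
```

Calling `GetOrCreate` three times for the same key runs the factory three times on `new TtlCache(TimeSpan.Zero)`, but only once on `new TtlCache(TimeSpan.FromHours(2))`.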

Concrete Failure Scenario

  1. PDEK cache TTL expires (every 1–2 hours)
  2. Thread A acquires semaphore → calls Resolve() + UnwrapKey() → two synchronous Key Vault HTTP calls (200ms–5s+ combined)
  3. Threads B, C, D queue on SemaphoreSlim.WaitAsync(-1, cancellationToken) — pure in-memory wait, no I/O
  4. Key Vault is slow (throttling, network latency, cold start)
  5. CancellationToken fires on queued threads (request timeout, Change Feed Processor lease rebalance, etc.)
  6. OperationCanceledException at SemaphoreSlim.WaitUntilCountOrTimeoutAsync
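
The steps above can be reproduced in isolation with a rough sketch like the following. It uses a local semaphore and `Thread.Sleep` as stand-ins for the global semaphore and the blocking Key Vault calls. `SemaphoreContentionRepro` and its method are hypothetical names, and nothing here calls the Cosmos or Key Vault SDKs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical repro for illustration; the semaphore and delays are stand-ins.
public static class SemaphoreContentionRepro
{
    // Count 1 means a single holder at a time, like the global semaphore.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(1, 1);

    // Returns true if the queued waiter was cancelled while the holder was still busy.
    public static async Task<bool> QueuedWaiterIsCancelledAsync(
        TimeSpan holdTime, TimeSpan waiterTimeout)
    {
        // "Thread A": take the semaphore and hold it across simulated blocking I/O.
        await Gate.WaitAsync();
        Task holder = Task.Run(() =>
        {
            try
            {
                Thread.Sleep(holdTime);   // simulated slow synchronous Key Vault calls
            }
            finally
            {
                Gate.Release();
            }
        });

        // "Thread B": queue on the same semaphore with a token that may fire first.
        using (var cts = new CancellationTokenSource(waiterTimeout))
        {
            bool cancelled = false;
            try
            {
                await Gate.WaitAsync(-1, cts.Token);
                Gate.Release();
            }
            catch (OperationCanceledException)
            {
                // The waiter gave up while queued; it never performed any I/O itself.
                cancelled = true;
            }

            await holder;
            return cancelled;
        }
    }
}
```

With a hold time of 500 ms and a waiter timeout of 50 ms, the waiter's token fires while it is still queued and the method returns true. If the timeout comfortably exceeds the hold time, the waiter acquires the semaphore normally and the method returns false.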
