Skip to content

[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644

Open
NaluTripician wants to merge 15 commits intomasterfrom
users/nalutripician/diagnostics-compaction-spec
Open

[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode#5644
NaluTripician wants to merge 15 commits intomasterfrom
users/nalutripician/diagnostics-compaction-spec

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented Feb 26, 2026

Summary

Adds a spec sheet for the diagnostics compaction feature that introduces a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded diagnostics output size during high-retry scenarios (429 throttling, transient failures, cross-region failovers).

Spec location: specs/diagnostics-compaction.md

Key Design Decisions

  • Summary mode groups requests by region, keeps first/last in full detail, deduplicates middle requests with aggregate stats (count, RU, min/max/P50/avg latency)
  • Default is Detailed — no behavioral change for existing users
  • Configuration: CosmosClientOptions.DiagnosticsVerbosity + per-request RequestOptions.DiagnosticsVerbosity override + env vars
  • Max summary size: 8KB default (4KB min), with truncated fallback
  • Compaction is serialization-only — in-memory ITrace tree is unchanged

Reference

Modeled after the Rust SDK's approach: Azure/azure-sdk-for-rust#3592

Work Items

# Work Item Description
1 Enum & Options Plumbing DiagnosticsVerbosity enum, properties on CosmosClientOptions/RequestOptions, env var support
2 Summary Computation Engine Trace tree walk, region grouping, first/last/aggregated groups
3 Serialization Integration Branch ToString() on verbosity, size enforcement, truncated fallback
4 Contract Updates Update ContractEnforcementTests baselines
5 Unit Tests 21 tests covering summary engine, verbosity flow, edge cases
6 Integration Tests Emulator-based tests for real operations with summary mode
7 Baseline Tests Golden-file JSONs for summary mode serialization stability

Please review the spec and leave feedback before implementation begins.

#5325

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@NaluTripician NaluTripician changed the title [Internal] Spec: CosmosDiagnostics Compaction — Summary Mode [Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode Feb 26, 2026
@NaluTripician NaluTripician self-assigned this Mar 3, 2026
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Addressed in latest push — good call that since compaction is serialization-only, it doesn't belong on RequestOptions. Changes:

  • Removed DiagnosticsVerbosity? from RequestOptions entirely
  • Added ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity) overloads on CosmosDiagnostics so callers control verbosity at serialization time
  • Kept CosmosClientOptions.DiagnosticsVerbosity as a client-wide default that flows to CosmosDiagnostics.Verbosity (used by parameterless ToString()) — useful for centralized logging pipelines
  • Simplified precedence: ToString(param) > Verbosity property (from client options) > env var > default

NaluTripician and others added 4 commits March 5, 2026 12:28
- Run openspec init to create openspec/ directory structure
- Configure openspec/config.yaml with Cosmos SDK project context and artifact rules
- Create openspec/README.md with comprehensive developer guide:
  - Workflow overview (propose → apply → archive)
  - Slash command reference
  - Writing good specs guide with examples
  - Best practices and anti-patterns
  - FAQ
- Update .github/copilot-instructions.md with OpenSpec section
- Update CONTRIBUTING.md with spec-driven development guidance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create behavioral specifications for all major SDK feature areas:

P0 - SDK Fundamentals + Critical Infrastructure:
- crud-operations: Core Create/Read/Replace/Upsert/Delete operations
- query-and-linq: SQL query execution, LINQ, FeedIterator, pagination
- partition-keys: Single, hierarchical/multi-hash, routing, extraction
- handler-pipeline: Handler chain ordering and responsibilities
- retry-and-failover: Cross-region retries, throttle retries, PPAF/PPCB
- cross-region-hedging: Availability strategies, thresholds, cancellation

P1 - SDK Design + Active Features:
- client-and-configuration: CosmosClient lifecycle, options, connection modes
- change-feed: Modes, processor, leases, partition distribution
- batch-and-transactional: TransactionalBatch, bulk execution
- diagnostics-and-observability: OpenTelemetry, diagnostics, tracing
- serialization: Serializer contracts, STJ vs Newtonsoft

P2 - Supplementary:
- container-and-database-management: DB/Container CRUD, indexing, throughput
- patch-operations: JSON patch, partial updates
- distributed-transactions: DTS (evolving, under active development)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ranches content

- Enhances config.yaml with prescriptive artifact rules (EARS notation, Mermaid diagrams, task separation, baseline tests, contract enforcement)
- Adds 3 new specs from feature-specs branch: client-side-encryption, consistency-and-session, transport-and-connectivity
- Adds specs/README.md catalog index organized by area
- Enhances all 14 existing specs with richer detail (tables, code blocks, edge cases, cross-references)
- Streamlines openspec/README.md for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/// overridden by the caller before calling ToString(), or by using the
/// ToString(DiagnosticsVerbosity) overload.
/// </summary>
public DiagnosticsVerbosity Verbosity { get; set; } = DiagnosticsVerbosity.Detailed;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why even have this property when there is a ToString overload accepting it? IMO this is not needed and introduces weird race conditions - just drop it - ToString() defaults to Detailed (backwards compatibility) and anyone else can use the new overloads.

Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the DiagnosticsVerbosity on CosmsoDiagnostics

/// Default: 8192 (8 KB). Minimum: 4096 (4 KB).
/// Can also be set via the AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable.
/// </summary>
public int MaxDiagnosticsSummarySizeBytes { get; set; } = 8192;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impact of it on the fidelity of the troubleshooting context?

- **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput
- **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed

**Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, this compacts to: first request + last request + 1 aggregated group per region.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross partition queries are other bigger source of issues.

/// aggregate statistics (count, total RU, min/max/P50 latency).
/// Respects MaxDiagnosticsSummarySizeBytes limit.
/// </summary>
Summary = 1,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the guidance for customers on when to use what?

`CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB.

**Impact:**
- **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON is verbose, thouhgts on encoding which might help/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts from offline discussion:

  • Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging. Have an easy way to decode the information for IcMs + general debugging. Decoding would have to also be available for customers. Possibly add a tool in the repo to do this. Could be standardized across SDKs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also important, how can we preserve important information such as which replicas are contacted/failing?

NaluTripician and others added 8 commits March 24, 2026 11:24
Adds a spec sheet for the diagnostics compaction feature that introduces
a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded
diagnostics output size. Summary mode groups requests by region, keeps
first/last request in full detail, and deduplicates middle requests with
aggregate statistics (count, RU, min/max/P50/avg latency).

Reference: Azure/azure-sdk-for-rust#3592

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tring/ToJson overloads

Address review feedback: DiagnosticsVerbosity only impacts serialization,
so it belongs on CosmosDiagnostics.ToString()/ToJsonString() methods
rather than on RequestOptions.

Changes:
- Remove DiagnosticsVerbosity from RequestOptions (section 3.3)
- Add ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity)
  overloads to CosmosDiagnostics (section 3.3)
- Simplify precedence chain (remove RequestOptions level)
- Update implementation plan, work items, and test plan

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…5688)

# Pull Request Template

## Description

Adds contracts and changelog updates to master. No contract changes.

## Type of change

Please delete options that are not relevant.

- [] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
## Description

Fixes #5620

Replaces the unsafe `(T)(object)stream` cast pattern with safe `is`
pattern matching in all `FromStream<T>` serializer implementations
across the SDK.

### Problem

The `FromStream<T>` method in multiple serializer implementations uses
the following pattern:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    return (T)(object)stream;
}
```

`typeof(Stream).IsAssignableFrom(typeof(T))` returns `true` when `T` is
`Stream` **or any subclass** (e.g., `MemoryStream`, `FileStream`). If
`T` is a specific `Stream` subclass but the runtime `stream` parameter
is a different `Stream` type, the cast `(T)(object)stream` throws a raw
`InvalidCastException` with no context about what went wrong.

**Example that throws:**
```csharp
// T = MemoryStream, but stream is actually a FileStream at runtime
serializer.FromStream<MemoryStream>(someFileStream); // InvalidCastException!
```

### Fix

Replaced the unsafe cast with safe `is` pattern matching:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    if (stream is T typedStream)
    {
        return typedStream;
    }

    throw new InvalidCastException(
        $"Stream of type '{stream.GetType().FullName}' is not compatible "
        + $"with the requested type '{typeof(T).FullName}'.");
}
```

This provides:
- ✅ Safe runtime type checking (no unexpected `InvalidCastException`)
- ✅ A descriptive error message identifying both the actual and expected
types
- ✅ No behavioral change for the common case (`T = Stream`)

### Files Changed

**Core SDK Serializers (2):**
- `Microsoft.Azure.Cosmos/src/Serializer/CosmosJsonDotNetSerializer.cs`
-
`Microsoft.Azure.Cosmos/src/Serializer/CosmosSystemTextJsonSerializer.cs`

**Sample Code (3):**
-
`Microsoft.Azure.Cosmos.Samples/Usage/SystemTextJson/CosmosSystemTextJsonSerializer.cs`
-
`Microsoft.Azure.Cosmos.Samples/Usage/ReEncryption/ReEncryptionSupport/ReEncryptionJsonSerializer.cs`
- `Microsoft.Azure.Cosmos.Samples/Usage/ItemManagement/Program.cs`

**Encryption Modules (2):**
- `Microsoft.Azure.Cosmos.Encryption/src/CosmosJsonDotNetSerializer.cs`
-
`Microsoft.Azure.Cosmos.Encryption.Custom/src/Common/CosmosJsonDotNetSerializer.cs`

**Unit Tests (2):**
-
`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosJsonSerializerUnitTests.cs`
— 2 new tests
-
`Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Json/CosmosSystemTextJsonSerializerTest.cs`
— 2 new tests

### Not Changed

- **`PatchOperationCore{T}.cs`** — Has a similar-looking pattern but
casts FROM `T` TO `Stream` (upcast), which always succeeds. No fix
needed.
- **Test utility serializers** — Internal test code, not
customer-facing.

### Testing

- 4 new unit tests added (2 per serializer):
- `ValidateFromStreamWithBaseStreamType` /
`TestFromStreamWithBaseStreamType` — Confirms
`FromStream<Stream>(memoryStream)` succeeds (regression test)
- `ValidateFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` /
`TestFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` —
Confirms `FromStream<FileStream>(memoryStream)` throws
`InvalidCastException` with a descriptive message containing both type
names
- All existing tests continue to pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#5679)

## Summary

This PR adds comprehensive unit test coverage for the FaultInjection
library, increasing the unit test count from 1 to 49.

> **Note:** This PR should be merged **after** PR #5675 (bug fixes) and
PR #5678 (code quality). 8 tests use `Assert.Inconclusive` for
validations that depend on fixes in those PRs — they will become fully
passing once those PRs are merged.

### Changes

**New Test File: `FaultInjectionBuilderValidationTests.cs`**

Contains 4 test classes with 49 test methods:

1. **`FaultInjectionBuilderValidationTests`** (21 tests) — #5670
- `FaultInjectionRuleBuilder`: null/empty ID, null condition/result, hit
limit validation, Gateway+Gone rejection
- `FaultInjectionServerErrorResultBuilder`: injection rate boundaries
(0, 0.5, 1.0, 1.1, -0.5), delay validation for all delay types,
WithDelay on non-delay type
- `FaultInjectionEndpointBuilder`: null database/container/feedRange,
negative replica count, valid build
- `FaultInjectionConnectionErrorResultBuilder`: threshold boundaries,
negative interval
- `FaultInjectionCustomServerErrorResultBuilder`: basic build, injection
rate validation

2. **`FaultInjectorUnitTests`** (5 tests) — #5671
- Null rules constructor, empty rules, GetApplicationContext before
init, unknown activity ID, GetClientOptions

3. **`FaultInjectionApplicationContextUnitTests`** (7 tests) — #5672
- Add/get by rule ID, get by activity ID, non-existent lookups, multiple
executions, GetAllRuleExecutions, concurrent access

4. **`FaultInjectionRuleLifecycleTests`** (16 tests) — #5673
- Enable/disable toggle, uninitialized hit count, default condition (All
types), all operation types enumeration, ToString formatting

**Bug Fix: `FaultInjectionServerErrorResult.ToString()`**
- Fixed `FormatException` caused by unbalanced braces in the format
string (`{3}}` → `{3}}}`).

### Test Results (on master)
- ✅ 41 passed
- ⏭️ 8 skipped (Inconclusive — depend on PRs #5675 and #5678)
- ❌ 0 failed

Parent tracking issue: #5652

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cketsHttpHandler. (#5693)

# Pull Request Template

## Description

Enables EnableMultipleHttp2Connections on SocketsHttpHandler to allow
the SDK to open additional HTTP/2 TCP connections when the concurrent
streams limit on an existing connection is reached, improving throughput
in Gateway (thin client) mode.

## Type of change

Please delete options that are not relevant.

- [X] Bug fix (non-breaking change which fixes an issue)


## Closing issues

To automatically close an issue: closes #IssueNumber
…read requests (#5685)

# Pull Request Template

## Description
Currently, Cosmos DB accounts are configured with a single default
consistency level, and the existing per-request override only allows
weakening it — there's no way to strengthen reads or choose a
fundamentally different read strategy without changing the account-level
setting. This becomes a limitation for workloads where certain reads
need the latest committed data via quorum reads.

ReadConsistencyStrategy introduces a new dimension of control by
allowing applications to specify their desired read behavior (Eventual,
Session, LatestCommitted, GlobalStrong) per-request or per-client,
completely independent of the
account's default consistency. Unlike the existing ConsistencyLevel
override, these strategies map directly to how the Direct layer reads
from replicas — single replica reads for Eventual, session-token-aware
reads for Session, quorum reads with GLSN barrier for LatestCommitted,
and quorum reads with GCLSN barrier for GlobalStrong.

This PR adds the ReadConsistencyStrategy property is exposed on all
read-path request options (ItemRequestOptions, QueryRequestOptions,
ChangeFeedRequestOptions, ReadManyRequestOptions), CosmosClientOptions.
The SDK sets the x-ms-cosmos-read-consistency-strategy header and
propagates the strategy to DocumentServiceRequestContext, where the
Direct package's ConsistencyReader and QuorumReader handle the actual
read mode selection.

## Type of change

Please delete options that are not relevant.

- [X] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber
…ew feedback

Address Fabian's review comment: drop the mutable Verbosity property from
CosmosDiagnostics to avoid thread-safety issues. Callers use the explicit
ToString(DiagnosticsVerbosity) / ToJsonString(DiagnosticsVerbosity) overloads
instead. Parameterless ToString() always returns Detailed (backward compat).

Changes:
- Remove Verbosity get/set property from CosmosDiagnostics API surface
- Update CosmosClientOptions doc to clarify it is a config value, not auto-flowed
- Simplify precedence order from 4 levels to 3
- Remove ResponseMessage.cs from modified files list (no verbosity propagation needed)
- Update WI-1 and WI-3 scope/acceptance criteria
- Replace ToString_UsesSummary_WhenVerbosityPropertySet test with
  ToString_Parameterless_AlwaysDetailed test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician and others added 2 commits March 24, 2026 11:24
ToString(DiagnosticsVerbosity) already covers this use case, making
ToJsonString(DiagnosticsVerbosity) redundant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move behavioral requirements into openspec/specs/diagnostics-and-observability/spec.md
  using EARS notation (WHEN/THEN/SHALL)
- Create openspec/changes/diagnostics-compaction/ with proposal, design, and tasks
- Remove old specs/diagnostics-compaction.md
- Update spec index README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician force-pushed the users/nalutripician/diagnostics-compaction-spec branch from ffbab35 to 43f6d9e Compare March 24, 2026 18:34
…/diagnostics-compaction-spec

# Conflicts:
#	openspec/config.yaml
NaluTripician added a commit that referenced this pull request Mar 31, 2026
Brings in the OpenSpec change spec (design, proposal, tasks) for the
diagnostics compaction feature alongside the implementation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants