[Internal] Diagnostics: Adds spec for CosmosDiagnostics compaction summary mode #5644
NaluTripician wants to merge 15 commits into master
Addressed in latest push — good call that since compaction is serialization-only, it doesn't belong on
- Run openspec init to create openspec/ directory structure
- Configure openspec/config.yaml with Cosmos SDK project context and artifact rules
- Create openspec/README.md with comprehensive developer guide:
  - Workflow overview (propose → apply → archive)
  - Slash command reference
  - Writing good specs guide with examples
  - Best practices and anti-patterns
  - FAQ
- Update .github/copilot-instructions.md with OpenSpec section
- Update CONTRIBUTING.md with spec-driven development guidance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create behavioral specifications for all major SDK feature areas:

P0 - SDK Fundamentals + Critical Infrastructure:
- crud-operations: Core Create/Read/Replace/Upsert/Delete operations
- query-and-linq: SQL query execution, LINQ, FeedIterator, pagination
- partition-keys: Single, hierarchical/multi-hash, routing, extraction
- handler-pipeline: Handler chain ordering and responsibilities
- retry-and-failover: Cross-region retries, throttle retries, PPAF/PPCB
- cross-region-hedging: Availability strategies, thresholds, cancellation

P1 - SDK Design + Active Features:
- client-and-configuration: CosmosClient lifecycle, options, connection modes
- change-feed: Modes, processor, leases, partition distribution
- batch-and-transactional: TransactionalBatch, bulk execution
- diagnostics-and-observability: OpenTelemetry, diagnostics, tracing
- serialization: Serializer contracts, STJ vs Newtonsoft

P2 - Supplementary:
- container-and-database-management: DB/Container CRUD, indexing, throughput
- patch-operations: JSON patch, partial updates
- distributed-transactions: DTS (evolving, under active development)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ranches content

- Enhances config.yaml with prescriptive artifact rules (EARS notation, Mermaid diagrams, task separation, baseline tests, contract enforcement)
- Adds 3 new specs from feature-specs branch: client-side-encryption, consistency-and-session, transport-and-connectivity
- Adds specs/README.md catalog index organized by area
- Enhances all 14 existing specs with richer detail (tables, code blocks, edge cases, cross-references)
- Streamlines openspec/README.md for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
specs/diagnostics-compaction.md
| /// overridden by the caller before calling ToString(), or by using the | ||
| /// ToString(DiagnosticsVerbosity) overload. | ||
| /// </summary> | ||
| public DiagnosticsVerbosity Verbosity { get; set; } = DiagnosticsVerbosity.Detailed; |
Why even have this property when there is a ToString overload accepting it? IMO this is not needed and introduces weird race conditions - just drop it - ToString() defaults to Detailed (backwards compatibility) and anyone else can use the new overloads.
FabianMeiswinkel left a comment
LGTM except for the DiagnosticsVerbosity on CosmosDiagnostics
specs/diagnostics-compaction.md
| /// Default: 8192 (8 KB). Minimum: 4096 (4 KB). | ||
| /// Can also be set via the AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE environment variable. | ||
| /// </summary> | ||
| public int MaxDiagnosticsSummarySizeBytes { get; set; } = 8192; |
What's the impact of it on the fidelity of the troubleshooting context?
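A minimal sketch of how the limit quoted in this thread might be resolved. The default (8192), the minimum (4096), and the environment variable name come from the quoted doc comment. The `SummarySizeLimit` class, its method names, and the choice to clamp low values rather than reject them are assumptions for illustration, not behavior stated by the spec.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; not part of the SDK.
public static class SummarySizeLimit
{
    public const int DefaultBytes = 8192; // from the quoted doc comment
    public const int MinimumBytes = 4096; // from the quoted doc comment
    public const string EnvironmentVariable = "AZURE_COSMOS_DIAGNOSTICS_MAX_SUMMARY_SIZE";

    // Parses a raw setting. Missing or unparseable input falls back to the default.
    // Values below the minimum are clamped up (an assumption; the spec may throw instead).
    public static int Parse(string raw)
    {
        if (!int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out int value))
        {
            return DefaultBytes;
        }

        return Math.Max(value, MinimumBytes);
    }

    // Reads the environment variable named in the doc comment and parses it.
    public static int FromEnvironment()
    {
        return Parse(Environment.GetEnvironmentVariable(EnvironmentVariable));
    }
}
```

Whichever policy the spec settles on for out-of-range values, keeping the parsing separate from the environment lookup makes the rule easy to unit test.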
specs/diagnostics-compaction.md
| - **Memory pressure** — large diagnostic strings increase GC overhead, especially at high throughput | ||
| - **Readability** — operators cannot quickly extract signal from noise when hundreds of identical retry entries are listed | ||
|
|
||
| **Example scenario:** A point read that encounters 50 retries due to 429 throttling in West US 2, then fails over to East US 2 with 10 more retries, produces ~60 full `StoreResponseStatistics` entries in the trace tree. With summary mode, this compacts to: first request + last request + 1 aggregated group per region. |
Cross-partition queries are another, bigger source of issues.
specs/diagnostics-compaction.md
| /// aggregate statistics (count, total RU, min/max/P50 latency). | ||
| /// Respects MaxDiagnosticsSummarySizeBytes limit. | ||
| /// </summary> | ||
| Summary = 1, |
What's the guidance for customers on when to use what?
specs/diagnostics-compaction.md
| `CosmosDiagnostics.ToString()` produces a JSON trace that grows **unboundedly** with retries. Each retry attempt creates a new child `ITrace` node containing a full `ClientSideRequestStatisticsTraceDatum` with complete `StoreResponseStatistics` and `HttpResponseStatistics` entries. In pathological scenarios (sustained 429 throttling, transient failures, cross-region failovers), a single operation's diagnostics can grow to hundreds of KB. | ||
|
|
||
| **Impact:** | ||
| - **Log truncation** — monitoring systems (Application Insights, Azure Monitor, etc.) silently drop oversized log entries |
JSON is verbose; any thoughts on an encoding that might help?
Thoughts from offline discussion:
- Consider a hybrid format with JSON text readable summary + encoded information that contains more details for debugging. Have an easy way to decode the information for IcMs + general debugging. Decoding would have to also be available for customers. Possibly add a tool in the repo to do this. Could be standardized across SDKs.
Also important, how can we preserve important information such as which replicas are contacted/failing?
Adds a spec sheet for the diagnostics compaction feature that introduces a DiagnosticsVerbosity option (Detailed vs Summary) to reduce unbounded diagnostics output size. Summary mode groups requests by region, keeps first/last request in full detail, and deduplicates middle requests with aggregate statistics (count, RU, min/max/P50/avg latency). Reference: Azure/azure-sdk-for-rust#3592 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
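The grouping described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the spec's implementation: `RequestStat`, `RegionSummary`, and `DiagnosticsCompactor` are hypothetical names, first and last are kept per region, entries for the same region are merged even when they are not contiguous, and P50 is taken as the lower median.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for one store or HTTP response entry.
public sealed record RequestStat(string Region, int StatusCode, double RequestCharge, double LatencyMs);

// One compacted group: first and last entries kept whole, the rest aggregated.
public sealed record RegionSummary(
    string Region,
    RequestStat First,
    RequestStat Last,
    int MiddleCount,
    double MiddleTotalRequestCharge,
    double MiddleMinLatencyMs,
    double MiddleMaxLatencyMs,
    double MiddleP50LatencyMs,
    double MiddleAvgLatencyMs);

public static class DiagnosticsCompactor
{
    // Groups entries by region, in order of each region's first appearance.
    public static List<RegionSummary> Compact(IReadOnlyList<RequestStat> stats)
    {
        var result = new List<RegionSummary>();
        foreach (IGrouping<string, RequestStat> group in stats.GroupBy(s => s.Region))
        {
            List<RequestStat> entries = group.ToList();

            // Everything strictly between the first and last entry is aggregated.
            List<RequestStat> middle = entries.Skip(1).Take(Math.Max(0, entries.Count - 2)).ToList();
            List<double> latencies = middle.Select(s => s.LatencyMs).OrderBy(l => l).ToList();
            bool hasMiddle = latencies.Count > 0;

            result.Add(new RegionSummary(
                group.Key,
                entries[0],
                entries[entries.Count - 1],
                middle.Count,
                middle.Sum(s => s.RequestCharge),
                hasMiddle ? latencies[0] : 0,
                hasMiddle ? latencies[latencies.Count - 1] : 0,
                hasMiddle ? latencies[(latencies.Count - 1) / 2] : 0, // lower median
                hasMiddle ? latencies.Average() : 0));
        }

        return result;
    }
}
```

For the scenario quoted in the review thread (50 throttled attempts in West US 2, then 10 in East US 2), this yields two groups whose aggregated middles cover 48 and 8 entries.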
…tring/ToJson overloads

Address review feedback: DiagnosticsVerbosity only impacts serialization, so it belongs on CosmosDiagnostics.ToString()/ToJsonString() methods rather than on RequestOptions.

Changes:
- Remove DiagnosticsVerbosity from RequestOptions (section 3.3)
- Add ToString(DiagnosticsVerbosity) and ToJsonString(DiagnosticsVerbosity) overloads to CosmosDiagnostics (section 3.3)
- Simplify precedence chain (remove RequestOptions level)
- Update implementation plan, work items, and test plan

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…5688) # Pull Request Template ## Description Adds contracts and changelog updates to master. No contract changes. ## Type of change Please delete options that are not relevant. - [] Bug fix (non-breaking change which fixes an issue) ## Closing issues To automatically close an issue: closes #IssueNumber --------- Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
## Description

Fixes #5620

Replaces the unsafe `(T)(object)stream` cast pattern with safe `is` pattern matching in all `FromStream<T>` serializer implementations across the SDK.

### Problem

The `FromStream<T>` method in multiple serializer implementations uses the following pattern:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    return (T)(object)stream;
}
```

`typeof(Stream).IsAssignableFrom(typeof(T))` returns `true` when `T` is `Stream` **or any subclass** (e.g., `MemoryStream`, `FileStream`). If `T` is a specific `Stream` subclass but the runtime `stream` parameter is a different `Stream` type, the cast `(T)(object)stream` throws a raw `InvalidCastException` with no context about what went wrong.

**Example that throws:**

```csharp
// T = MemoryStream, but stream is actually a FileStream at runtime
serializer.FromStream<MemoryStream>(someFileStream); // InvalidCastException!
```

### Fix

Replaced the unsafe cast with safe `is` pattern matching:

```csharp
if (typeof(Stream).IsAssignableFrom(typeof(T)))
{
    if (stream is T typedStream)
    {
        return typedStream;
    }

    throw new InvalidCastException(
        $"Stream of type '{stream.GetType().FullName}' is not compatible " +
        $"with the requested type '{typeof(T).FullName}'.");
}
```

This provides:
- ✅ Safe runtime type checking (no unexpected `InvalidCastException`)
- ✅ A descriptive error message identifying both the actual and expected types
- ✅ No behavioral change for the common case (`T = Stream`)

### Files Changed

**Core SDK Serializers (2):**
- `Microsoft.Azure.Cosmos/src/Serializer/CosmosJsonDotNetSerializer.cs`
- `Microsoft.Azure.Cosmos/src/Serializer/CosmosSystemTextJsonSerializer.cs`

**Sample Code (3):**
- `Microsoft.Azure.Cosmos.Samples/Usage/SystemTextJson/CosmosSystemTextJsonSerializer.cs`
- `Microsoft.Azure.Cosmos.Samples/Usage/ReEncryption/ReEncryptionSupport/ReEncryptionJsonSerializer.cs`
- `Microsoft.Azure.Cosmos.Samples/Usage/ItemManagement/Program.cs`

**Encryption Modules (2):**
- `Microsoft.Azure.Cosmos.Encryption/src/CosmosJsonDotNetSerializer.cs`
- `Microsoft.Azure.Cosmos.Encryption.Custom/src/Common/CosmosJsonDotNetSerializer.cs`

**Unit Tests (2):**
- `Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/CosmosJsonSerializerUnitTests.cs`: 2 new tests
- `Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Json/CosmosSystemTextJsonSerializerTest.cs`: 2 new tests

### Not Changed

- **`PatchOperationCore{T}.cs`**: Has a similar-looking pattern but casts FROM `T` TO `Stream` (upcast), which always succeeds. No fix needed.
- **Test utility serializers**: Internal test code, not customer-facing.

### Testing

- 4 new unit tests added (2 per serializer):
  - `ValidateFromStreamWithBaseStreamType` / `TestFromStreamWithBaseStreamType`: Confirms `FromStream<Stream>(memoryStream)` succeeds (regression test)
  - `ValidateFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError` / `TestFromStreamWithIncompatibleStreamTypeThrowsDescriptiveError`: Confirms `FromStream<FileStream>(memoryStream)` throws `InvalidCastException` with a descriptive message containing both type names
- All existing tests continue to pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#5679)

## Summary

This PR adds comprehensive unit test coverage for the FaultInjection library, increasing the unit test count from 1 to 49.

> **Note:** This PR should be merged **after** PR #5675 (bug fixes) and PR #5678 (code quality). 8 tests use `Assert.Inconclusive` for validations that depend on fixes in those PRs; they will become fully passing once those PRs are merged.

### Changes

**New Test File: `FaultInjectionBuilderValidationTests.cs`**

Contains 4 test classes with 49 test methods:

1. **`FaultInjectionBuilderValidationTests`** (21 tests), #5670
   - `FaultInjectionRuleBuilder`: null/empty ID, null condition/result, hit limit validation, Gateway+Gone rejection
   - `FaultInjectionServerErrorResultBuilder`: injection rate boundaries (0, 0.5, 1.0, 1.1, -0.5), delay validation for all delay types, WithDelay on non-delay type
   - `FaultInjectionEndpointBuilder`: null database/container/feedRange, negative replica count, valid build
   - `FaultInjectionConnectionErrorResultBuilder`: threshold boundaries, negative interval
   - `FaultInjectionCustomServerErrorResultBuilder`: basic build, injection rate validation
2. **`FaultInjectorUnitTests`** (5 tests), #5671
   - Null rules constructor, empty rules, GetApplicationContext before init, unknown activity ID, GetClientOptions
3. **`FaultInjectionApplicationContextUnitTests`** (7 tests), #5672
   - Add/get by rule ID, get by activity ID, non-existent lookups, multiple executions, GetAllRuleExecutions, concurrent access
4. **`FaultInjectionRuleLifecycleTests`** (16 tests), #5673
   - Enable/disable toggle, uninitialized hit count, default condition (All types), all operation types enumeration, ToString formatting

**Bug Fix: `FaultInjectionServerErrorResult.ToString()`**
- Fixed `FormatException` caused by unbalanced braces in the format string (`{3}}` → `{3}}}`).

### Test Results (on master)

- ✅ 41 passed
- ⏭️ 8 skipped (Inconclusive; depend on PRs #5675 and #5678)
- ❌ 0 failed

Parent tracking issue: #5652

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
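To illustrate the brace fix mentioned in this commit, here is a small sketch of how `string.Format` treats doubled braces. The format strings and the `BraceEscapingDemo` class are made up for illustration; they are not the actual string used in `FaultInjectionServerErrorResult.ToString()`.

```csharp
using System;

public static class BraceEscapingDemo
{
    // "{{" and "}}" emit literal braces, so the final "}}}" below is the
    // placeholder's closing brace followed by one escaped literal "}".
    public static string Balanced(int value)
    {
        return string.Format("{{ value: {0}}}", value);
    }

    // "{0}}" closes the placeholder and leaves a lone "}" behind,
    // which string.Format rejects with a FormatException.
    public static bool UnbalancedThrows(int value)
    {
        try
        {
            string.Format("{{ value: {0}}", value);
            return false;
        }
        catch (FormatException)
        {
            return true;
        }
    }
}
```

An interpolated string with the same mistake would fail at compile time instead, which may be worth considering for format strings that never vary at runtime.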
…cketsHttpHandler. (#5693) # Pull Request Template ## Description Enables EnableMultipleHttp2Connections on SocketsHttpHandler to allow the SDK to open additional HTTP/2 TCP connections when the concurrent streams limit on an existing connection is reached, improving throughput in Gateway (thin client) mode. ## Type of change Please delete options that are not relevant. - [X] Bug fix (non-breaking change which fixes an issue) ## Closing issues To automatically close an issue: closes #IssueNumber
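For readers unfamiliar with the setting, a minimal sketch of turning it on for a handler. `SocketsHttpHandler.EnableMultipleHttp2Connections` is a real .NET property (.NET 5 and later); the `Http2HandlerFactory` wrapper is hypothetical, and how the SDK wires its own handler is not shown in this PR text.

```csharp
using System.Net.Http;

// Hypothetical factory for illustration only.
public static class Http2HandlerFactory
{
    public static SocketsHttpHandler Create()
    {
        return new SocketsHttpHandler
        {
            // Allows extra HTTP/2 connections to the same server once the
            // concurrent-stream limit on existing connections is reached.
            EnableMultipleHttp2Connections = true,
        };
    }
}
```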
…read requests (#5685)

# Pull Request Template

## Description

Currently, Cosmos DB accounts are configured with a single default consistency level, and the existing per-request override only allows weakening it. There is no way to strengthen reads or choose a fundamentally different read strategy without changing the account-level setting. This becomes a limitation for workloads where certain reads need the latest committed data via quorum reads.

ReadConsistencyStrategy introduces a new dimension of control by allowing applications to specify their desired read behavior (Eventual, Session, LatestCommitted, GlobalStrong) per-request or per-client, completely independent of the account's default consistency. Unlike the existing ConsistencyLevel override, these strategies map directly to how the Direct layer reads from replicas: single replica reads for Eventual, session-token-aware reads for Session, quorum reads with GLSN barrier for LatestCommitted, and quorum reads with GCLSN barrier for GlobalStrong.

This PR exposes the ReadConsistencyStrategy property on all read-path request options (ItemRequestOptions, QueryRequestOptions, ChangeFeedRequestOptions, ReadManyRequestOptions) and on CosmosClientOptions. The SDK sets the x-ms-cosmos-read-consistency-strategy header and propagates the strategy to DocumentServiceRequestContext, where the Direct package's ConsistencyReader and QuorumReader handle the actual read mode selection.

## Type of change

Please delete options that are not relevant.

- [X] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber
…ew feedback

Address Fabian's review comment: drop the mutable Verbosity property from CosmosDiagnostics to avoid thread-safety issues. Callers use the explicit ToString(DiagnosticsVerbosity) / ToJsonString(DiagnosticsVerbosity) overloads instead. Parameterless ToString() always returns Detailed (backward compat).

Changes:
- Remove Verbosity get/set property from CosmosDiagnostics API surface
- Update CosmosClientOptions doc to clarify it is a config value, not auto-flowed
- Simplify precedence order from 4 levels to 3
- Remove ResponseMessage.cs from modified files list (no verbosity propagation needed)
- Update WI-1 and WI-3 scope/acceptance criteria
- Replace ToString_UsesSummary_WhenVerbosityPropertySet test with ToString_Parameterless_AlwaysDetailed test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
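The overload-only shape this commit settles on can be sketched as below. `SketchDiagnostics` is a hypothetical stand-in for `CosmosDiagnostics`, the output format is invented, and `Detailed = 0` is an assumed value (only `Summary = 1` appears in the quoted diff).

```csharp
using System;
using System.Collections.Generic;

public enum DiagnosticsVerbosity
{
    Detailed = 0, // assumed value; only "Summary = 1" appears in the quoted diff
    Summary = 1,
}

// Hypothetical stand-in for CosmosDiagnostics, reduced to the overload shape.
public sealed class SketchDiagnostics
{
    private readonly IReadOnlyList<string> attempts;

    public SketchDiagnostics(IReadOnlyList<string> attempts)
    {
        this.attempts = attempts;
    }

    // Parameterless call keeps today's behavior: always Detailed.
    public override string ToString()
    {
        return this.ToString(DiagnosticsVerbosity.Detailed);
    }

    // Verbosity is a per-call argument, so concurrent callers cannot
    // change each other's output the way a shared settable property could.
    public string ToString(DiagnosticsVerbosity verbosity)
    {
        if (verbosity == DiagnosticsVerbosity.Detailed || this.attempts.Count <= 2)
        {
            return string.Join(";", this.attempts);
        }

        int middle = this.attempts.Count - 2;
        return $"{this.attempts[0]};({middle} more);{this.attempts[this.attempts.Count - 1]}";
    }
}
```

Because the instance holds no verbosity state, two threads can serialize the same diagnostics object at different verbosity levels without coordinating.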
ToString(DiagnosticsVerbosity) already covers this use case, making ToJsonString(DiagnosticsVerbosity) redundant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move behavioral requirements into openspec/specs/diagnostics-and-observability/spec.md using EARS notation (WHEN/THEN/SHALL)
- Create openspec/changes/diagnostics-compaction/ with proposal, design, and tasks
- Remove old specs/diagnostics-compaction.md
- Update spec index README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ffbab35 to 43f6d9e
…/diagnostics-compaction-spec # Conflicts: # openspec/config.yaml
Brings in the OpenSpec change spec (design, proposal, tasks) for the diagnostics compaction feature alongside the implementation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary

Adds a spec sheet for the diagnostics compaction feature that introduces a `DiagnosticsVerbosity` option (`Detailed` vs `Summary`) to reduce unbounded diagnostics output size during high-retry scenarios (429 throttling, transient failures, cross-region failovers).

Spec location: `specs/diagnostics-compaction.md`

Key Design Decisions

- `Detailed`: no behavioral change for existing users
- `CosmosClientOptions.DiagnosticsVerbosity` + per-request `RequestOptions.DiagnosticsVerbosity` override + env vars
- `ITrace` tree is unchanged

Reference

Modeled after the Rust SDK's approach: Azure/azure-sdk-for-rust#3592

Work Items

- `DiagnosticsVerbosity` enum, properties on `CosmosClientOptions`/`RequestOptions`, env var support
- `ToString()` on verbosity, size enforcement, truncated fallback
- `ContractEnforcementTests` baselines

Please review the spec and leave feedback before implementation begins.

#5325