Commit b6818e8

Squash to get signed commits

1 parent 3215165 commit b6818e8

20 files changed: +3459 −0

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@

---
name: save-markdown-to-disk
description: >
  Save markdown content to a file on disk without bash HEREDOC corruption.
  Use this skill whenever you need to write markdown, code, or any multi-line
  text containing backticks, dollar signs, single quotes, or other shell
  metacharacters to a file.
allowed-tools: create, python3, shell
---
# Save Markdown to Disk

## Problem

Writing markdown to files using bash HEREDOC (`cat << 'EOF'`) breaks when content contains:

- Backticks (`` ` ``) — interpreted as command substitution even in some HEREDOC forms
- `$variable` — interpreted as shell expansion in unquoted HEREDOCs
- The HEREDOC delimiter appearing in the content itself
- Nested quotes and backslashes causing silent corruption

This is the #1 cause of garbled reports when agents write files via shell.
## Solution 1: Use the `create` tool

Use the `create` tool to write markdown files whenever it is available: it handles all escaping and encoding for you. Try it first whenever you need to save markdown content to disk.
## Solution 2: Use a quoted HEREDOC with a unique delimiter

A quoted HEREDOC with a unique delimiter prevents shell expansion. This can still fail if the content contains the delimiter itself or certain character combinations, and the bash tool may block the command when the document appears to contain dangerous shell commands. Example:
```bash
cat << 'UNIQUE_DELIMITER' > final_markdown.md
<your_markdown_content_here>
UNIQUE_DELIMITER
```
## Solution 3: Base64-encode the content

If the `create` tool is not available, base64-encoding the content sidesteps shell escaping entirely: the encoded string contains only characters that pass safely through the shell. Base64-encode the content yourself (do not rely on external tools), then decode it when writing the file:
```bash
# Decode the base64-encoded markdown content and write it to a file
echo "<your_base64_encoded_content_here>" | base64 --decode > final_markdown.md
```
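Since `python3` is also in the allowed tools, a further fallback (a sketch, not part of the original instructions) is to decode and write the file from Python, which involves no shell quoting at all:

```python
import base64

# The agent would normally produce this encoded string itself
# before emitting the command; this content is a placeholder.
markdown = "# Report\n\nContains `backticks`, $vars, and 'quotes'.\n"
encoded = base64.b64encode(markdown.encode()).decode()

# Decode and write bytes directly; no shell metacharacters are ever interpreted
with open("final_markdown.md", "wb") as f:
    f.write(base64.b64decode(encoded))
```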
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@

__pycache__/
Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@

---
name: windows-log-analysis
description: >
  Analyzes Windows AKS node log bundles collected by collect-windows-logs.ps1.
  Use this skill when asked to diagnose a Windows AKS node issue, investigate
  disk pressure, container failures, image accumulation, service crashes,
  network problems, or extension errors on a Windows node.
allowed-tools: shell
---
# Windows AKS Node Log Analysis Skill

## Overview

Windows AKS node log bundles are collected by `staging/cse/windows/debug/collect-windows-logs.ps1`.
Each collection run produces files prefixed with a timestamp in `yyyyMMdd-HHmmss` format. Multiple
snapshots may be present in a single extracted bundle — one per collection run.

This skill uses **sub-skill files** that instruct LLM sub-agents how to analyze each log category.
The sub-skills are in the `sub-skills/` directory relative to this file.
---

## How to Analyze a Log Bundle

### Step 1: Discover the Bundle Structure

List the extracted log directory to identify:

- Available snapshot timestamps (filenames matching `YYYYMMDD-HHMMSS-*`)
- Which file types are present
- Whether Extension-Logs zips/directories exist
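The discovery step can be sketched as a few shell commands; `BUNDLE_DIR` is an assumed path to the extracted bundle:

```shell
BUNDLE_DIR="./extracted-logs"   # assumed extraction path

# Unique snapshot timestamps (yyyyMMdd-HHmmss prefixes), one per collection run
ls "$BUNDLE_DIR" | grep -oE '^[0-9]{8}-[0-9]{6}' | sort -u

# File types present, with the timestamp prefix stripped
ls "$BUNDLE_DIR" | sed -E 's/^[0-9]{8}-[0-9]{6}-?//' | sort -u

# Extension log zips/directories, if collected
find "$BUNDLE_DIR" -iname '*extension*' 2>/dev/null || true
```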
### Step 2: Dispatch Sub-Skills

Read `common-reference.md` first — it contains shared encoding/format knowledge and **dispatch guidance** for choosing the right sub-skills based on symptoms.

**Always run** (triage):

| Sub-Skill | What It Covers |
|-----------|----------------|
| `common-reference.md` | Encoding, formats, thresholds, error codes, dispatch guidance |
| `analyze-containers.md` | Container restarts, crash-loops, pod readiness |
| `analyze-services.md` | Windows service health, node versions, OS info |
**Dispatch by symptom** (see `common-reference.md` § Dispatch Guidance for the full table):

| Sub-Skill | When to Run |
|-----------|-------------|
| `analyze-termination.md` | Pods stuck Terminating, zombie HCS, orphaned shims |
| `analyze-hcs.md` | HCS operational health, container lifecycle, vmcompute issues |
| `analyze-hns.md` | HNS endpoints, load balancers, CNI/DNS, WFP/VFP analysis |
| `analyze-kubeproxy.md` | Service routing, DSR policies, port range conflicts, SNAT |
| `analyze-images.md` | Dangling images, mutable tags, snapshot bloat, GC failures |
| `analyze-disk.md` | C: drive free space trends |
| `analyze-kubelet.md` | Node conditions, lease renewal, evictions, clock skew, certs |
| `analyze-memory.md` | Physical memory, pagefile, OOM, process memory |
| `analyze-crashes.md` | App crashes, BSODs, WER reports, kernel dumps |
| `analyze-csi.md` | CSI proxy, SMB/Azure Files mount failures, Azure Disk |
| `analyze-gmsa.md` | gMSA/CCG authentication, Kerberos, credential specs |
| `analyze-gpu.md` | GPU health, nvidia-smi, DirectX device plugin |
| `analyze-bootstrap.md` | Node provisioning, CSE flow, bootstrap config |
| `analyze-extensions.md` | Azure VM extension execution errors |

For unknown issues or comprehensive health checks, run all sub-skills in parallel.
### Step 3: Verify and Challenge Findings

Before synthesizing, apply skeptical review to each sub-skill's findings:

1. **Cross-validate**: Does finding A from one sub-skill contradict finding B from another? If so, investigate — one of them is wrong.
2. **Check proportionality**: Is the severity proportionate to the evidence? (e.g., 3 transient errors ≠ CRITICAL)
3. **Verify causal chains**: If claiming "A caused B", confirm timestamps show A preceded B and no other explanation fits better.
4. **Challenge your top diagnosis**: Actively look for evidence it's wrong. What would you expect to see if your diagnosis were correct but don't? What alternative diagnosis fits the same evidence?
5. **Separate observation from inference**: State what you directly observed vs. what you inferred. Mark inferences explicitly.
### Step 4: Synthesize Findings

Combine verified findings from all sub-skills into a unified diagnosis using the decision tree and root cause chains below.
### Step 5: State Overall Confidence

End the report with an explicit confidence assessment:

```markdown
## Confidence Assessment

**Primary diagnosis**: [your diagnosis]
**Confidence**: HIGH / MEDIUM / LOW
**Why this confidence level**: [1-2 sentences explaining what evidence supports it and what gaps remain]
**What would change my mind**: [what evidence, if found, would invalidate this diagnosis]
**What I couldn't verify**: [list any claims that lack full evidence]
```
---

## Synthesis Decision Tree

```
Any CRITICAL in containers?
├─ Yes, crash-looping containers
│   → Check images for dangling images / mutable tags
│   → Check crashes + memory for OOM or service crashes
│   → Check disk for pressure causing evictions
└─ Yes, pods not Ready
    → Check services for service crashes near the failure time
    → Check termination for zombie HCS state

Pods stuck in Terminating?
→ Check termination findings:
   - CRITICAL: containerd/kubelet reinstalled (services)
   - CRITICAL: stable shim PIDs across snapshots
   - CRITICAL/WARNING: HCS terminate failures
   - WARNING: Defender without containerd data path exclusion
→ Check images for snapshot bloat amplifying Defender latency
→ Check hcs for lifecycle completeness

Any CRITICAL in images?
→ Immediate: crictl rmi --prune
→ Root cause: switch to immutable image tags

Any CRITICAL in disk (< 15 GB free)?
→ Check images for dangling image count (most common cause)
→ Check crashes for WER dump accumulation

Any CRITICAL in hns?
├─ Endpoint leaks → Check termination for zombie HCS holding endpoints
├─ LB count drop → Check services for HNS restart events
└─ CNI failures → Check kubelet for DiskPressure/MemoryPressure

Any CRITICAL in kubeproxy?
├─ Port range conflicts → Check excludedportrange.txt vs NodePort range
├─ Stale LB rules → Check hns for LB inventory
└─ Service unreachable → Check hns for endpoint state

Any CRITICAL in kubelet?
├─ NotReady → Check crashes for kubelet crash / BSOD
├─ DiskPressure → Check disk + images
└─ MemoryPressure → Check memory for pagefile/RAM

Any CRITICAL in memory?
→ Check crashes for OOM-triggered crashes
→ Check containers for crash-loops from OOM kills

Any CRITICAL in crashes?
├─ BSOD/kernel dump → Escalate to Windows platform team
└─ containerd/kubelet crash → Check termination for orphaned HCS post-crash

Any CRITICAL in csi?
→ Check kubelet for volume mount timeout correlation
→ Check services for csi-proxy service state

Any CRITICAL in gmsa?
→ Check hcs for credential setup errors
→ Check hns for DNS resolution to domain controllers
→ Check kubelet for clock skew (Kerberos sensitivity)

Any CRITICAL in bootstrap?
→ Check extensions for CSE exit codes
→ Check services for startup ordering failures

Any CRITICAL in gpu?
→ Check kubelet for device plugin registration
→ Check services for GPU-related service state

Any CRITICAL in extensions?
→ Node likely failed to provision — check bootstrap for full timeline
```
---

## Root Cause Chain Tracing

Common root cause chains on Windows AKS nodes:

| Symptom | → Check | → Root Cause |
|---------|---------|--------------|
| Disk pressure | images (dangling count) | Mutable image tags causing accumulation |
| Crash-looping containers | crashes + memory (OOM, service crashes) | Memory exhaustion or service instability |
| Pods stuck Terminating | termination (reinstall, zombies) | containerd reinstalled without draining |
| Node not joining cluster | bootstrap + extensions (exit codes, CSE flow) | Extension download/execution failure |
| High restart counts | disk + memory | Disk pressure causing evictions + OOM |
| DNS resolution failures | hns (endpoints, DNS config) | HNS endpoint leaks or misconfigured DNS |
| SNAT exhaustion | kubeproxy (WFP netevents) | High outbound connection churn |
| Node NotReady | kubelet (conditions, lease renewal) | Lease renewal timeout or kubelet crash |
| Memory exhaustion / OOM | memory (available, pagefile, processes) | Undersized pagefile or memory leak |
| Unexpected reboots | crashes (Event 6008, WER, minidumps) | BSOD, containerd OOM, or Windows Update |
| containerd crash → orphaned containers | crashes + termination | containerd crash without clean shutdown |
| Slow pod termination | termination (Defender) + images | Defender scanning snapshots without path exclusions |
| Service routing broken | kubeproxy + hns (LB policies, endpoints) | Stale HNS policies or kube-proxy sync failure |
| Volume mount failures | csi + kubelet (mount timeouts) | Stale SMB mappings or credential expiry |
| gMSA auth failures | gmsa + hns (DNS) + kubelet (clock) | CCG plugin error, DC unreachable, or clock skew |
| GPU scheduling failures | gpu + kubelet (device plugin) | Driver mismatch or device plugin not registered |
---

## Timeline Correlation

When findings span multiple sub-skills, build a timeline:

1. **Anchor events**: Find the earliest significant event (reboot, service crash, reinstall)
2. **Cascade tracking**: Trace the effect forward in time:
   - Service reinstall → HCS zombies → pods stuck → disk fills
   - OOM event → containerd crash → container restarts → pod not Ready
   - HNS restart → LB policies lost → service unreachable
3. **Timestamp alignment**: Match timestamps across CSV events, snapshot prefixes, and kubectl output
4. **Snapshot comparison**: Use multi-snapshot data to distinguish "always broken" from "recently degraded"
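Timestamp alignment can be sketched in shell: convert a snapshot prefix into an ISO-like form so it sorts and compares directly against event timestamps from the CSVs (the example prefix is hypothetical):

```shell
# Convert a yyyyMMdd-HHmmss snapshot prefix to an ISO-like timestamp
# so it can be compared against event timestamps from the CSVs.
ts="20240102-130455"   # example snapshot prefix
iso="${ts:0:4}-${ts:4:2}-${ts:6:2}T${ts:9:2}:${ts:11:2}:${ts:13:2}"
echo "$iso"   # → 2024-01-02T13:04:55
```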
---

## Common Remediations

| Issue | Immediate Fix | Root Cause Fix |
|-------|---------------|----------------|
| Dangling images filling disk | `crictl rmi --prune` | Switch to immutable image tags |
| Pods stuck Terminating | `hcsdiag kill <id>` + force delete pod | Drain before reinstalling containerd |
| Crash-looping container | `kubectl describe pod` + check logs | Fix OOM/resource limits or application bug |
| Extension failure | Re-run CSE or reimage node | Fix network/firewall blocking downloads |
| HNS endpoint leaks | Restart HNS service or drain node | Fix workload churn, investigate HNS bugs |
| Node NotReady (lease) | Restart kubelet, check apiserver connectivity | Fix network path to apiserver |
| Memory pressure / OOM | Kill memory-hungry processes | Increase pagefile, fix memory leaks, set resource limits |
| BSOD / kernel crash | Reboot node, collect dumps | Escalate to Windows platform team with dump files |
| Defender slowing container ops | `Add-MpPreference -ExclusionPath "C:\ProgramData\containerd"` | Update CSE `Update-DefenderPreferences` to include containerd paths |
| Service routing broken | Restart kube-proxy, verify HNS LB state | Fix stale LB policy cleanup, update kube-proxy |
| Volume mount failures | Remove stale SMB mappings: `Remove-SmbGlobalMapping` | Fix credential rotation, update CSI proxy |
| gMSA auth failures | Verify DC connectivity + clock sync | Fix CCG plugin config, ensure DNS to DCs works |
| Port range conflicts | Adjust NodePort range to avoid excluded ranges | Configure service port ranges at cluster level |
