Welcome to Waza, a CLI tool for evaluating AI agent skills. This guide covers everything you need to get started: installation, creating evaluations, running benchmarks, and reviewing results in the dashboard.
Waza helps you:
- Define agent skills with comprehensive documentation and behavioral requirements
- Create test suites with realistic test cases and validation rules
- Run evaluations against different AI models to measure skill effectiveness
- Compare results across models and versions to track improvement
- View metrics in an interactive dashboard with live results, trends, and detailed analysis
Perfect for skill authors, platform teams, and developers building AI-powered applications.
Choose one of three methods:
The fastest way to get started. The script auto-detects your OS and architecture:
```bash
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
```

This downloads the latest release, verifies the checksum, and installs the waza binary to:

- `/usr/local/bin/waza` (if writable), or
- `~/bin/waza` (if `/usr/local/bin` is not writable)
After installation, verify it works:
```bash
waza --version
```

Note: If installed to `~/bin`, add it to your PATH:

```bash
export PATH="$HOME/bin:$PATH"
```

Requires Go 1.26 or later:

```bash
go install github.com/microsoft/waza/cmd/waza@latest
```

Verify installation:

```bash
waza --version
```

If you use Azure Developer CLI (azd), install waza as an extension:
```bash
# Add the waza extension registry
azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json

# Install the extension
azd ext install microsoft.azd.waza

# Verify it works
azd waza --help
```

All waza commands are available under `azd waza`:
```bash
azd waza init my-project
azd waza run eval.yaml
azd waza serve
```

Get a complete evaluation suite running in 5 minutes.
Create a new directory and initialize a waza project:
```bash
mkdir my-eval-suite
cd my-eval-suite
waza init
```

You'll be prompted to create your first skill. Enter a name like `code-explainer`, and the scaffolding is automatic. This creates:
```text
my-eval-suite/
├── skills/                      # Skill definitions
│   └── code-explainer/
│       └── SKILL.md             # Skill metadata and description
├── evals/                       # Evaluation suites
│   └── code-explainer/
│       ├── eval.yaml            # Evaluation configuration
│       ├── tasks/               # Test case definitions
│       │   ├── basic-usage.yaml
│       │   └── edge-cases.yaml
│       └── fixtures/            # Test data and resources
│           └── sample.py
├── .github/workflows/
│   └── eval.yml                 # CI/CD pipeline
├── .gitignore
└── README.md
```
Skip the prompt? Use `--no-skill`:

```bash
waza init --no-skill
```

If you didn't create one during init, add a new skill:

```bash
waza new skill code-analyzer
```

This scaffolds a new skill with SKILL.md and an eval suite. The interactive wizard will collect metadata (name, description, use cases).
Edit `evals/code-explainer/eval.yaml`:

```yaml
name: code-explainer
description: "Tests the agent's ability to explain code"
model: claude-sonnet-4.6
maxTokens: 4096

tasks:
  - id: basic-usage
    description: "Explain a simple Python function"
    fixture: sample.py
    expectedOutput:
      - contains: "function"
      - contains: "parameter"
      - regex: "returns.*value"

validators:
  - type: contains
    caseSensitive: false
  - type: text
    caseSensitive: false
```

Add test case YAML files in `tasks/`:
```yaml
# tasks/basic-usage.yaml
id: basic-usage
description: "Explain a simple Python function"
prompt: "Explain what this function does:\n{{fixture:sample.py}}"
expectedOutput:
  - type: contains
    value: "function"
  - type: text
    value: "returns.*value"
tags: ["basic", "core"]
```

Add fixtures in `fixtures/`:
```python
# fixtures/sample.py
def greet(name):
    """Return a greeting message."""
    return f"Hello, {name}!"
```

Execute the benchmark:
```bash
waza run evals/code-explainer/eval.yaml --context-dir evals/code-explainer/fixtures -v
```

Output:

- ✓ Passed — Task passed all validators
- ✗ Failed — Task failed one or more validators
- Summary: X passed, Y failed
Save results to a JSON file:
```bash
waza run evals/code-explainer/eval.yaml \
  --context-dir evals/code-explainer/fixtures \
  -o results.json
```

Launch the web dashboard:

```bash
waza serve
```

The browser opens automatically to http://localhost:3000. You'll see:
- Dashboard Overview — Run history, pass rate, model comparison
- Run Details — Individual task results, validation output
- Compare — Side-by-side model performance
- Trends — Pass rate over time
Initialize a waza project with directory structure.
Flags:
- `--no-skill` — Skip the first-skill creation prompt
What it creates:
- `skills/` — Skill definitions directory
- `evals/` — Evaluation suites directory
- `.github/workflows/eval.yml` — CI/CD pipeline
- `.gitignore` — With waza-specific entries
- `README.md` — Getting started guide
Example:
```bash
waza init my-skills --no-skill
```

Create a new skill with its evaluation suite.
Flags:
- `--template`, `-t` — Template pack (coming soon)
Modes:
Project mode (inside a `skills/` directory):

```bash
cd my-skills-repo
waza new skill code-explainer
```

Creates `skills/code-explainer/SKILL.md` and `evals/code-explainer/`.
Standalone mode (no `skills/` directory):

```bash
cd my-project
waza new skill my-skill
```

Creates `my-skill/` with SKILL.md, evals/, CI/CD pipeline, and README.
Examples:
```bash
# Interactive wizard in a terminal
waza new skill code-analyzer

# Non-interactive (CI/CD)
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

Record a prompt execution and generate a reusable task YAML from the observed run.
The generated task includes inferred validators based on:
- Assistant response content
- Tool usage sequence
- Skill invocation events
Flags:
- `--model` — Model used for recording (default: `claude-sonnet-4.5`)
- `--testname` — Test ID/name to write to task YAML (default: `auto-generated-test`)
- `--tags` — Comma-separated tags added to the generated task
- `--timeout` — Prompt execution timeout (default: `5m`)
- `--overwrite` — Replace output file if it already exists
- `--root` — Root directory used for skill discovery
Examples:
```bash
# Create a task from one recorded prompt run
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add task metadata and overwrite if needed
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```

Run an evaluation benchmark.
Arguments:
- `[eval.yaml]` — Path to evaluation spec file
- `[skill-name]` — Skill name (auto-detects eval.yaml)
- (none) — Auto-detect using workspace detection
Flags:
- `--context-dir` — Fixtures directory (default: `./fixtures`)
- `--output`, `-o` — Save results JSON
- `--verbose`, `-v` — Detailed progress output
- `--parallel` — Run tasks concurrently
- `--workers <n>` — Number of concurrent workers (default: 4)
- `--task <pattern>` — Filter tasks by name (glob pattern, can repeat)
- `--tags <pattern>` — Filter tasks by tags (glob pattern, can repeat)
- `--model <model>` — Override model (can repeat for multi-model runs)
- `--cache` — Enable result caching
- `--no-cache` — Disable caching (default)
- `--cache-dir <path>` — Cache directory (default: `.waza-cache`)
- `--format <format>` — Output format: `default`, `github-comment`
- `--interpret` — Print plain-language interpretation of results
- `--baseline` — Run A/B comparison: with skills vs without
Examples:
Run an evaluation from a spec file:
```bash
waza run evals/code-explainer/eval.yaml --context-dir evals/code-explainer/fixtures -v
```

Run a specific skill:

```bash
waza run code-explainer
```

Auto-detect in workspace:

```bash
# Single-skill workspace → runs that skill's eval
# Multi-skill workspace → runs all evals with summary
waza run
```

Run specific tasks:

```bash
waza run evals/code-explainer/eval.yaml --task "basic*" --task "edge*"
```

Run with specific models:

```bash
waza run evals/code-explainer/eval.yaml --model "gpt-4o" --model "claude-sonnet-4.6"
```

Save results:

```bash
waza run evals/code-explainer/eval.yaml -o results.json
```

Validate that a skill is ready for submission.
Arguments:
- `[skill-name]` — Skill name (e.g., `code-explainer`)
- `[skill-path]` — Path to skill directory (e.g., `skills/my-skill`)
- (none) — Auto-detect using workspace detection
What it checks:
- Compliance scoring — Validates SKILL.md frontmatter (Low/Medium/Medium-High/High)
- Token budget — Ensures SKILL.md is within limits
- Evaluation presence — Confirms eval.yaml exists
Output:
- ✓ Compliance level (e.g., "Medium-High")
- ✓ Token count / limit
- ✓ Evaluation status
- Suggestions for improvement
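The token-budget check can be roughly approximated locally before running the command. A minimal sketch using the common ~4-characters-per-token heuristic (both the ratio and the limit value here are illustrative assumptions, not waza's actual tokenizer or limit):

```python
# Rough local approximation of a token-budget check for SKILL.md.
# The chars/4 heuristic and the 5000-token limit are assumptions
# for illustration, not waza's actual accounting.
from pathlib import Path


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return len(text) // 4


def check_budget(path: str, limit: int = 5000) -> tuple[int, bool]:
    """Return (estimated tokens, whether the file fits the limit)."""
    tokens = estimate_tokens(Path(path).read_text(encoding="utf-8"))
    return tokens, tokens <= limit
```

Run `waza check` for the authoritative count; a sketch like this is only useful for a quick sanity check while editing.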
Examples:
Check a specific skill:
```bash
waza check code-explainer
```

Check by path:

```bash
waza check skills/my-skill
```

Auto-detect in workspace:

```bash
# Single skill → checks that skill
# Multiple skills → summary table for all
waza check
```

Launch the web dashboard to view evaluation results.
Flags:
- `--port <n>` — HTTP server port (default: 3000)
- `--no-browser` — Don't auto-open the browser
- `--results-dir <path>` — Directory to read results from (default: `.`)
- `--tcp <address>` — JSON-RPC TCP server (e.g., `:9000`)
- `--tcp-allow-remote` — Bind to all interfaces (default: loopback only)
- `--http` — Start HTTP dashboard (default)
Dashboard Pages:
- Overview — Run history, pass rates, model comparison
- Run Details — Individual task results, validation logs, timestamps
- Compare — Side-by-side model performance
- Trends — Pass rate and timing trends over time
- Live View — Real-time results during active runs
Examples:
Start the dashboard:
```bash
waza serve
```

Custom port:

```bash
waza serve --port 8080
```

Load results from a specific directory:

```bash
waza serve --results-dir ./archived-runs
```

Don't auto-open browser:

```bash
waza serve --no-browser
```

JSON-RPC TCP server (for IDE integration):

```bash
waza serve --tcp :9000
```

Cache evaluation results to avoid redundant runs:
```bash
# Enable caching (results stored in .waza-cache/)
waza run evals/code-explainer/eval.yaml --cache -o results.json
```

The cache stores results by task ID and model, keyed on the prompt and expected output. Same inputs = same cached results.
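Conceptually, such a cache key can be derived by hashing the identifying inputs so that identical tasks map to the same entry. A minimal sketch of the idea (illustrative only, not waza's actual cache implementation):

```python
# Illustrative sketch of a result cache keyed on task ID, model,
# prompt, and expected output -- not waza's actual implementation.
import hashlib
import json


def cache_key(task_id: str, model: str, prompt: str, expected: list) -> str:
    """Same inputs always hash to the same key, so an unchanged
    task/model/prompt combination can reuse a stored result."""
    payload = json.dumps(
        {"task": task_id, "model": model, "prompt": prompt, "expected": expected},
        sort_keys=True,  # stable serialization -> stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to the prompt or expected output produces a new key, which is why edited tasks re-run while untouched ones hit the cache.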
Use cases:
- Development — Iterate on validators without re-running tasks
- CI/CD — Avoid redundant API calls for unchanged evals
- Offline — Use cached results when connection is unavailable
Disable caching:
```bash
waza run evals/code-explainer/eval.yaml --no-cache
```

Specify cache location:

```bash
waza run evals/code-explainer/eval.yaml --cache --cache-dir ./my-cache
```

Run only specific tasks:
```bash
# By task name (glob patterns)
waza run evals/code-explainer/eval.yaml --task "basic*" --task "edge*"

# By tag
waza run evals/code-explainer/eval.yaml --tags "critical"
```

Task names and tags are defined in your task YAML files:
```yaml
# tasks/basic-usage.yaml
id: basic-usage
description: "Explain a simple function"
tags: ["basic", "core"]
```

Run tasks concurrently to speed up evaluation:
```bash
# Use default 4 workers
waza run evals/code-explainer/eval.yaml --parallel

# Use custom worker count
waza run evals/code-explainer/eval.yaml --parallel --workers 8
```

When to use parallel execution:
- Many tasks (50+)
- Long-running validators
- Development machines with spare CPU
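The `--workers` behavior follows the standard bounded worker-pool pattern: at most N tasks run at once, and results are collected as tasks finish. A generic sketch of that pattern (an illustration, not waza's internals):

```python
# Generic worker-pool sketch illustrating the pattern behind
# --parallel/--workers. The run_task body is a placeholder.
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: str) -> dict:
    """Placeholder for executing one eval task and judging it."""
    return {"id": task_id, "passed": True}


def run_all(task_ids: list[str], workers: int = 4) -> list[dict]:
    # At most `workers` tasks execute concurrently;
    # pool.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, task_ids))
```

With a pattern like this, total wall-clock time approaches the longest task rather than the sum of all tasks, which is why the benefit grows with task count.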
Run the same evaluation against multiple models:
```bash
waza run evals/code-explainer/eval.yaml \
  --model "gpt-4o" \
  --model "claude-sonnet-4.6" \
  --model "gpt-4-turbo"
```

Results are grouped by model. Compare them in the dashboard or with:

```bash
waza compare results-gpt4.json results-sonnet.json
```

Run waza as part of your GitHub Actions workflow:
```yaml
# .github/workflows/eval.yml
name: Evaluate Skills
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with:
          go-version: '1.26'
      - name: Install waza
        run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - name: Run evals
        run: |
          waza run --output results.json -v
      - name: Comment PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results.json'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `Evaluation Results: ${results.passCount}/${results.totalCount} passed`
            });
```

Capture detailed session logs for debugging:

```bash
waza run evals/code-explainer/eval.yaml --session-log --session-dir ./logs
```

Logs are stored in NDJSON format (one event per line):
```json
{"event":"task_started","id":"basic-usage","timestamp":"2024-01-15T10:30:00Z"}
{"event":"task_completed","id":"basic-usage","passed":true,"timestamp":"2024-01-15T10:30:05Z"}
```

Default format (human-readable):

```bash
waza run evals/code-explainer/eval.yaml
```

GitHub Comment format (for PR comments):

```bash
waza run evals/code-explainer/eval.yaml --format github-comment
```

JSON output (for automation):

```bash
waza run evals/code-explainer/eval.yaml -o results.json
```

The `waza serve` command launches an interactive web dashboard for viewing and analyzing evaluation results.
```bash
waza serve
```

The dashboard opens automatically at http://localhost:3000.
- Overview — Evaluation summary, pass rates, recent runs
- Run Details — Click a run to see task-by-task results
- Compare — Select multiple runs to compare models
- Trends — Historical pass rates and performance over time
- Live View — Real-time results during active evaluations
- Live updates — See results as tasks complete
- Search — Find runs, tasks, and models
- Filtering — Filter by status (passed/failed), tags, date range
- Export — Download results as CSV or JSON
- Dark mode — Switch to dark theme for comfortable viewing
Custom port:
```bash
waza serve --port 8080
```

Skip auto-open:

```bash
waza serve --no-browser
```

Load results from archive:

```bash
waza serve --results-dir ./previous-runs
```

JSON-RPC for IDE integration:

```bash
waza serve --tcp :9000
```

If you see "address already in use," the default port 3000 is taken.
Solution: Use a different port:
```bash
waza serve --port 3001
```

Or find and stop the process using port 3000:
```bash
# macOS/Linux
lsof -i :3000
kill -9 <PID>

# Windows
netstat -ano | findstr :3000
taskkill /PID <PID> /F
```

If the dashboard shows no results:
- Check results directory: Make sure you've saved results with `-o`:

  ```bash
  waza run evals/code-explainer/eval.yaml -o results.json
  ```

- Specify results directory: If results are in a subdirectory:

  ```bash
  waza serve --results-dir ./eval-results
  ```

- Verify file format: Results must be JSON files (`*.json`).
If you see "fixture directory not found":
- Check the path: Is your `--context-dir` correct?

  ```bash
  waza run eval.yaml --context-dir ./evals/fixtures
  ```

- Use relative paths: The path is relative to the spec file:

  ```text
  eval.yaml
  fixtures/
  ```

- From any directory: Specify absolute paths:

  ```bash
  waza run /path/to/eval.yaml --context-dir /path/to/fixtures
  ```
If tasks fail validation unexpectedly:
- Enable verbose output: See what the model returned:

  ```bash
  waza run eval.yaml -v
  ```

- Check your validators: Are the rules too strict?

  ```yaml
  validators:
    - type: contains
      value: "function"
      caseSensitive: false  # Don't be too strict
  ```

- Adjust expected output: Make rules realistic:

  ```yaml
  expectedOutput:
    - type: contains
      value: "important keyword"  # Not every word needs to match
  ```
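When debugging strictness, it can help to model the validator semantics directly against a saved model response. A minimal sketch of how a case-insensitive `contains` check and a regex-style `text` check might behave (semantics assumed from the examples in this guide, not taken from waza's source):

```python
# Assumed validator semantics for reasoning about strictness,
# based on this guide's examples -- not waza's actual code.
import re


def contains_passes(output: str, value: str, case_sensitive: bool = True) -> bool:
    """Substring check, optionally case-insensitive."""
    if not case_sensitive:
        output, value = output.lower(), value.lower()
    return value in output


def regex_passes(output: str, pattern: str) -> bool:
    # Search anywhere in the output, like "returns.*value" in the quickstart.
    return re.search(pattern, output) is not None


answer = "This Function returns a greeting value."
assert contains_passes(answer, "function", case_sensitive=False)
assert not contains_passes(answer, "function", case_sensitive=True)  # too strict
assert regex_passes(answer, "returns.*value")
```

Testing rules this way against real captured output often reveals that a failing task needs `caseSensitive: false` or a looser pattern rather than a prompt change.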
- Create your first skill — Step-by-step walkthrough
- Skill best practices — Guidelines for effective skills
- Validators reference — All validator types and options
- Token limits — Optimize SKILL.md for size
- Examples — Real eval suites to explore
- GitHub Issues — Report bugs or request features
- Discussions — Ask questions, share ideas
- Waza Examples — Runnable evaluation suites
Happy evaluating! 🚀