This guide explains how to use waza for evaluating skills in the microsoft/skills repository.
Waza is a Go CLI tool for running evaluations on AI agent skills. It's designed to integrate seamlessly with the microsoft/skills CI pipeline, allowing skill authors to validate their work before contributing.
- Go 1.26+: Required for building/installing waza
- Git: For cloning repositories
- GitHub Actions (for CI): Standard ubuntu-latest runner
This is the recommended approach for CI/CD pipelines:
```shell
# Install the latest version
go install github.com/microsoft/waza/cmd/waza@latest

# Verify the installation
waza --version
```

Benefits:
- No Docker required
- Fast installation (~30 seconds)
- Always gets the latest version
- Works on all platforms
If you prefer containerized environments:
```shell
# Clone the waza repository
git clone https://github.com/microsoft/waza.git
cd waza

# Build the Docker image
docker build -t waza:local .

# Run waza in a container
docker run -v "$(pwd)":/workspace waza:local run eval.yaml
```

Benefits:
- Isolated environment
- Reproducible builds
- No local Go installation needed
For development or local testing:
```shell
# Clone the repository
git clone https://github.com/microsoft/waza.git
cd waza

# Build the binary
make build

# Run the binary
./waza --version
```

Your skill repository should follow this structure:
```
your-skill/
├── SKILL.md              # Skill definition with frontmatter
├── eval/                 # Evaluation suite (optional but recommended)
│   ├── eval.yaml         # Main benchmark specification
│   ├── tasks/            # Individual task definitions
│   │   ├── task-1.yaml
│   │   └── task-2.yaml
│   └── fixtures/         # Context files for tasks
│       ├── file1.txt
│       └── file2.py
└── .github/
    └── workflows/
        └── eval.yml      # CI workflow for running evals
```
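If you prefer to create this layout by hand rather than with the wizard, a few shell commands will scaffold it (directory names mirror the structure above):

```shell
# Scaffold the recommended skill layout manually
mkdir -p your-skill/eval/tasks your-skill/eval/fixtures
mkdir -p your-skill/.github/workflows
touch your-skill/SKILL.md your-skill/eval/eval.yaml
```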
```shell
# Navigate to your skill directory
cd your-skill

# Run the interactive wizard
waza init eval

# Follow the prompts to configure your evaluation
```

If you already have a skill, create the eval suite using waza new:

```shell
waza new skill my-skill --output-dir eval
```

Create eval/eval.yaml:
```yaml
name: my-skill-eval
skill: my-skill
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  executor: mock        # Use mock for CI (no API keys)
  parallel: false

graders:
  - type: text
    name: output_check
    config:
      regex_match: ["expected pattern"]

tasks:
  - "tasks/*.yaml"
```

Create task files in eval/tasks/:
```yaml
# eval/tasks/example-task.yaml
id: example-task
name: Example Task
description: Demonstrate the skill

stimulus:
  message: "Explain what this code does"
  context_files:
    - "example.py"

graders:
  - output_check
```

Add context files to eval/fixtures/:
```python
# eval/fixtures/example.py
def hello():
    print("Hello, world!")
```

```shell
# Basic run
waza run eval/eval.yaml --verbose

# Save results to JSON
waza run eval/eval.yaml --output results.json

# Run specific tasks only
waza run eval/eval.yaml --task "example-*"

# Run with parallel execution
waza run eval/eval.yaml --parallel --workers 4
```

Copy the template workflow to your skill repository:
```shell
# From the waza repository
cp .github/workflows/skills-ci-example.yml \
   /path/to/your-skill/.github/workflows/eval.yml
```

Or download directly:

```shell
curl -o .github/workflows/eval.yml \
  https://raw.githubusercontent.com/microsoft/waza/main/.github/workflows/skills-ci-example.yml
```

Edit .github/workflows/eval.yml:
```yaml
on:
  pull_request:
    branches: [ main ]
    paths:
      - 'SKILL.md'
      - 'eval/**'
  push:
    branches: [ main ]

jobs:
  evaluate-skill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.26'

      - name: Install Waza
        run: go install github.com/microsoft/waza/cmd/waza@latest

      - name: Run Evaluation
        run: waza run eval/eval.yaml --verbose --output results.json

      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: results.json
```

For CI, use the mock executor (no API keys needed):
```yaml
# eval/eval.yaml
config:
  executor: mock  # Simulates agent behavior for testing
```

For production testing with real AI models, use the copilot-sdk executor:
```yaml
# eval/eval.yaml
config:
  executor: copilot-sdk
  model: claude-sonnet-4-20250514  # or gpt-4o, etc.
```

And set the GITHUB_TOKEN environment variable in your workflow:
```yaml
- name: Run Evaluation with Copilot
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: waza run eval/eval.yaml --verbose
```

Waza uses exit codes to indicate success or failure in CI:
| Exit Code | Meaning | CI Behavior |
|---|---|---|
| 0 | All tests passed | ✅ Workflow succeeds |
| 1 | One or more tests failed | ❌ Workflow fails |
| 2 | Configuration error (invalid YAML, missing files) | ❌ Workflow fails |
Example usage in CI:
```shell
# Run evaluation and fail the build if tests fail
waza run eval/eval.yaml || exit $?

# Capture the exit code for custom handling
waza run eval/eval.yaml
EXIT_CODE=$?
if [ $EXIT_CODE -eq 1 ]; then
  echo "Tests failed - review results"
elif [ $EXIT_CODE -eq 2 ]; then
  echo "Configuration error - check eval.yaml"
fi
```

Save results in JSON format for programmatic analysis:

```shell
waza run eval/eval.yaml --output results.json
```

Output structure:
```json
{
  "benchmark": {
    "name": "my-skill-eval",
    "skill": "my-skill",
    "version": "1.0"
  },
  "config": {
    "executor": "mock",
    "model": "mock-model",
    "trials_per_task": 1
  },
  "outcomes": [
    {
      "task_id": "example-task",
      "status": "passed",
      "score": 1.0,
      "grader_results": [...]
    }
  ],
  "summary": {
    "total_tasks": 1,
    "passed": 1,
    "failed": 0,
    "pass_rate": 1.0
  }
}
```

Capture detailed execution logs:

```shell
waza run eval/eval.yaml --transcript-dir transcripts/
```

This creates one JSON file per task execution in transcripts/.
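A results.json produced with --output is easy to post-process. As a sketch (assuming the structure shown above and nothing beyond the Python 3 standard library), this prints the pass rate and any failed task IDs:

```shell
# Print the pass rate and failed task IDs from results.json
python3 - <<'EOF'
import json, os

if not os.path.exists("results.json"):
    print("results.json not found - run waza with --output results.json first")
    raise SystemExit(0)

with open("results.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"pass rate: {summary['pass_rate']:.0%} "
      f"({summary['passed']}/{summary['total_tasks']})")
for outcome in results["outcomes"]:
    if outcome["status"] == "failed":
        print("failed:", outcome["task_id"])
EOF
```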
Waza supports multiple grader types for validating agent output:
| Grader | Purpose | Example Use Case |
|---|---|---|
| `code` | Python/JavaScript assertions | Validate data structures |
| `regex` | Pattern matching | Check output format |
| `file` | File existence/content | Verify generated files |
| `behavior` | Agent behavior constraints | Limit tool calls, duration |
| `action_sequence` | Tool call sequence validation | Verify workflow steps |
See docs/GRADERS.md for complete documentation.
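As an illustration only, a task might combine a pattern check with a file check along these lines. The field names below are hypothetical; docs/GRADERS.md is the authoritative reference for each grader's configuration:

```yaml
# Hypothetical sketch - grader config fields are illustrative
graders:
  - type: regex
    name: format_check
    config:
      regex_match: ["^SUMMARY:"]
  - type: file
    name: report_exists
    config:
      path: "output/report.md"
```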
The mock executor runs instantly without API calls:
```yaml
config:
  executor: mock
```

Use this for:
- Pull request validation
- Quick local testing
- Grader validation
Always run locally first:
```shell
waza run eval/eval.yaml --verbose
```

This catches configuration errors before pushing.
Include version in your eval.yaml:
```yaml
version: "1.0"
```

Update the version when making significant changes.
Use descriptive task IDs:

```yaml
id: fix-authentication-bug  # Good
id: task-1                  # Bad
```

Add clear descriptions:
```yaml
description: |
  The agent should identify the authentication bug in auth.py
  and provide a fix that preserves backward compatibility.
```

Use minimal context files to reduce token usage:
- Include only relevant code
- Remove comments and boilerplate
- Use snippets instead of full files
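For example, rather than copying an entire module into eval/fixtures/, extract just the function under test. The paths and function name below are illustrative:

```shell
mkdir -p eval/fixtures
# Copy only the authenticate() function (its def line through the next blank line)
if [ -f src/auth.py ]; then
  sed -n '/^def authenticate/,/^$/p' src/auth.py > eval/fixtures/auth_snippet.py
fi
```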
Ensure your eval file is at eval/eval.yaml or update the workflow:
```yaml
- run: waza run path/to/your/eval.yaml
```

Check your YAML syntax:
```shell
# Validate YAML
waza run eval/eval.yaml --verbose
```

Common issues:
- Incorrect indentation
- Missing required fields
- Invalid task references
Review the results:
```shell
waza run eval/eval.yaml --output results.json
jq '.outcomes[] | select(.status == "failed")' results.json
```

Check:
- Grader expectations match actual output
- Task descriptions are clear
- Fixtures contain necessary context
Waza requires Go 1.26+:
```yaml
- uses: actions/setup-go@v5
  with:
    go-version: '1.26'  # Not 1.22 or earlier
```

A minimal complete workflow:

```yaml
name: Evaluate Skill

on:
  pull_request:
    branches: [ main ]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.26'
      - run: go install github.com/microsoft/waza/cmd/waza@latest
      - run: waza run eval/eval.yaml --verbose
```

Testing against multiple models with a matrix:

```yaml
jobs:
  matrix-eval:
    strategy:
      matrix:
        model: [gpt-4o, claude-sonnet-4-20250514]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.26'
      - run: go install github.com/microsoft/waza/cmd/waza@latest
      - run: |
          sed -i "s/model: .*/model: ${{ matrix.model }}/" eval/eval.yaml
          waza run eval/eval.yaml --output results-${{ matrix.model }}.json
```

Running evaluations on a schedule:

```yaml
on:
  schedule:
    - cron: '0 0 * * *'  # Daily at midnight

jobs:
  nightly:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.26'
      - run: go install github.com/microsoft/waza/cmd/waza@latest
      - run: waza run eval/eval.yaml --verbose --output nightly-results.json
      - uses: actions/upload-artifact@v4
        with:
          name: nightly-results
          path: nightly-results.json
```

- Main Documentation: README.md
- Grader Reference: docs/GRADERS.md
- Example Evaluations: examples/
- CI Examples: examples/ci/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Repository: github.com/microsoft/waza