This repository contains `waza`, a CLI tool for evaluating Agent Skills. The primary implementation is Go (`waza-go/`). The Python implementation (`waza/`) is legacy and no longer actively developed.
When making changes, follow these guidelines to maintain consistency and quality.
Keep issues and tracking up to date:
- Tracking Issue: #66 - Waza Platform Roadmap
- PRD: docs/PRD.md
- When completing work, update the relevant GitHub issue
- Reference issue numbers in commit messages (e.g., `feat: Add tokens command #47`)
```
waza-go/
├── cmd/waza/                  # CLI entrypoint
│   └── main.go                # Command parsing and execution
├── internal/
│   ├── config/                # Configuration with functional options
│   ├── execution/             # AgentEngine interface and implementations
│   │   ├── engine.go          # Core engine interface
│   │   ├── mock.go            # Mock engine for testing
│   │   └── copilot.go         # Copilot SDK integration
│   ├── models/                # Data structures
│   │   ├── spec.go            # BenchmarkSpec (eval configuration)
│   │   ├── testcase.go        # TestCase (task definition)
│   │   └── outcome.go         # EvaluationOutcome (results)
│   ├── orchestration/         # TestRunner for coordinating execution
│   │   └── runner.go          # Benchmark orchestration
│   └── scoring/               # Validator interface and implementations
│       ├── validator.go       # Validator registry pattern
│       └── code_validators.go # Code and text validators
├── go.mod
├── go.sum
├── Makefile                   # Build and test commands
└── .golangci.yml              # Linter configuration
```
The Go implementation uses idiomatic Go naming:
| Concept | Go Name | Python Equivalent |
|---|---|---|
| Eval configuration | `BenchmarkSpec` | `EvalSpec` |
| Executor | `AgentEngine` | `BaseExecutor` |
| Grader | `Validator` | `Grader` |
| Task | `TestCase` | `Task` |
| Result | `EvaluationOutcome` | `EvalResult` |
Engines are configured with functional options:

```go
engine := execution.NewCopilotEngine(
    execution.WithModel("gpt-4o"),
    execution.WithTimeout(300*time.Second),
    execution.WithVerbose(true),
)
```

Every engine implements the `AgentEngine` interface:

```go
type AgentEngine interface {
    Execute(ctx context.Context, testCase *models.TestCase) (*models.ExecutionResult, error)
    Shutdown() error
}
```

Validators are registered through the registry pattern:

```go
registry := scoring.NewValidatorRegistry()
registry.Register("code", &scoring.CodeValidator{})
registry.Register("text", &scoring.TextValidator{})
```

Requires Go 1.26 or later. The module targets `go 1.26` (`go.mod`), which means Go 1.26 language features and standard library APIs are safe to use. If you want to rely on features or stdlib additions introduced after Go 1.26, first bump the `go` version in `go.mod` to that minimum version.
```sh
cd waza-go

# Build
make build
# or: go build -o waza ./cmd/waza

# Run tests
make test
# or: go test -v ./...

# Lint
make lint
# or: golangci-lint run

# Run evaluation
./waza run ../examples/code-explainer/eval.yaml --context-dir ../examples/code-explainer/fixtures -v
```

Every PR must leave tests in a passing state. This is non-negotiable:
- All existing tests must pass — run `go test ./...` before pushing. If your change breaks an existing test, fix it.
- New features require new tests — every new command, flag, grader, or internal function needs test coverage. No shipping untested code.
- Bug fixes require regression tests — if you fix a bug, add a test that would have caught it.
- Playwright e2e tests — if you change the dashboard (`web/`), run `cd web && npx playwright test --project=chromium` and fix any failures.
- CI is the gate — `Build and Test Go Implementation` and `Lint Go Code` must pass. PRs with failing tests do not merge.
Go CI is required for all PRs. Branch protection enforces:
- `Build and Test Go Implementation` must pass
- `Lint Go Code` must pass
The workflow is defined in `.github/workflows/go-ci.yml`.
Each task execution gets a fresh temp workspace with fixtures copied in:
- Runner reads files from original `--context-dir` (fixtures folder)
- Executor creates new temp workspace (e.g., `/tmp/waza-abc123/`)
- Files copied into temp workspace
- Agent works in temp workspace (edits happen here)
- Temp workspace destroyed after task
- Next task starts fresh with original fixtures
The original fixtures directory is never modified. This ensures task isolation.
Use Mermaid for all diagrams in markdown files (docs, design docs, plans). No ASCII art diagrams.
Always update documentation when making changes. The following files must be kept in sync:
| File | Purpose | Update When |
|---|---|---|
| `README.md` | Main project overview | Any CLI change, new feature |
| `waza-go/README.md` | Go implementation details | Go code changes |
| `docs/PRD.md` | Product requirements | Feature scope changes |
| `AGENTS.md` | Agent coding guidelines | Process/pattern changes |
| `site/` (GitHub Pages) | Public docs site (microsoft.github.io/waza) | Any feature add/change |
| `web/` (Dashboard) | Interactive eval dashboard | New data in results JSON, new views needed |
When adding or updating any feature:
- Check if `site/src/content/docs/` pages need updating (graders, CLI reference, guides, eval YAML)
- Check if the dashboard (`web/`) needs updates or new views to surface the feature
- Update `README.md` if user-facing
- Update `waza-go/README.md` usage section if CLI changes
- Build the docs site to verify: `cd site && npm run build`
- Add example in appropriate docs
- Update tracking issue #66 if related to roadmap
When adding a new CLI command or flag:
- Add to `site/src/content/docs/reference/cli.mdx`
- Add to `site/src/content/docs/guides/` if it needs a guide
- Update `README.md` Commands section
When completing a feature:
- Close related GitHub issue with comment
- Update tracking issue #66 checkbox
- Verify GitHub Pages site reflects the change (pages deploy on merge to main)
Documentation must be updated in real-time as features change. This is enforced by Saul (Documentation Lead) who reviews all PRs for doc impact.
| Change Type | Required Doc Updates |
|---|---|
| New CLI command or flag | `README.md` Commands section, `site/` CLI reference, `docs/GUIDE.md` |
| Changed CLI behavior | `README.md`, `site/` guides, `docs/GUIDE.md`, affected tutorials |
| New/changed dashboard view | `site/` dashboard guide, regenerate screenshots, `docs/DEMO-GUIDE.md` |
| Changed eval YAML schema | `README.md` YAML section, `site/` eval-yaml reference, example files |
| New validator/grader | `README.md` Validators section, `site/` graders page, `docs/GUIDE.md` |
| New sensei/dev feature | `site/` sensei guide, `README.md` |
| New data in results JSON | Check if dashboard (`web/`) needs a new view, column, or chart to surface it |
When dashboard UI changes, regenerate screenshots:
```sh
cd web && npx playwright test e2e/screenshots.spec.ts --project=chromium
```

Screenshots are saved to `docs/images/` and referenced throughout documentation.
When adding a new command:

- Add command handling in `cmd/waza/main.go`
- Implement logic in the appropriate `internal/` package
- Add tests in `*_test.go` files
- Update `waza-go/README.md`
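Subcommand dispatch in `cmd/waza/main.go` might look roughly like this sketch using the stdlib `flag` package; the actual parsing code may differ, and the `tokens` command here is only an illustration:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// run dispatches to a subcommand; returning an error keeps main testable.
func run(args []string) error {
	if len(args) < 1 {
		return fmt.Errorf("usage: waza <command> [flags]")
	}
	switch args[0] {
	case "tokens": // hypothetical new command
		fs := flag.NewFlagSet("tokens", flag.ContinueOnError)
		verbose := fs.Bool("v", false, "verbose output")
		if err := fs.Parse(args[1:]); err != nil {
			return err
		}
		fmt.Println("tokens command, verbose =", *verbose)
		return nil
	default:
		return fmt.Errorf("unknown command %q", args[0])
	}
}

func main() {
	if err := run(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Keeping parsing in a `run` function (rather than directly in `main`) makes the new command easy to cover in `*_test.go` files.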
When adding a new validator:

- Implement the `Validator` interface in `internal/scoring/`
- Register it in the `ValidatorRegistry`
- Add tests
- Document in README
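A hypothetical validator, assuming a simple `Validate(output string) (bool, error)` shape for the interface (the real interface in `internal/scoring` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// Validator is assumed here to take agent output and return pass/fail;
// it stands in for the real interface in internal/scoring.
type Validator interface {
	Validate(output string) (bool, error)
}

// ContainsValidator is an illustrative example: it passes when the output
// contains a required substring.
type ContainsValidator struct {
	Want string
}

func (v *ContainsValidator) Validate(output string) (bool, error) {
	return strings.Contains(output, v.Want), nil
}

// ValidatorRegistry maps a validator name (as referenced in eval YAML)
// to a registered instance.
type ValidatorRegistry struct {
	validators map[string]Validator
}

func NewValidatorRegistry() *ValidatorRegistry {
	return &ValidatorRegistry{validators: make(map[string]Validator)}
}

func (r *ValidatorRegistry) Register(name string, v Validator) {
	r.validators[name] = v
}

func (r *ValidatorRegistry) Lookup(name string) (Validator, bool) {
	v, ok := r.validators[name]
	return v, ok
}

func main() {
	registry := NewValidatorRegistry()
	registry.Register("contains", &ContainsValidator{Want: "hello"})

	v, _ := registry.Lookup("contains")
	ok, _ := v.Validate("well hello there")
	fmt.Println(ok)
}
```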
When adding a new engine:

- Implement the `AgentEngine` interface in `internal/execution/`
- Add configuration options
- Add tests
- Document usage
The `.github/CODEOWNERS` file automatically assigns reviewers:
- All files → @spboyer @chlowell @richardpark-msft
PRs to main require:
- Go CI must pass (`Build and Test Go Implementation`, `Lint Go Code`)
- Auto-merge enabled for convenience
Use conventional commits:
- `feat:` New feature
- `fix:` Bug fix
- `docs:` Documentation only
- `ci:` CI/CD changes
- `chore:` Maintenance tasks
- `refactor:` Code restructuring
Reference issues: `feat: Add tokens command #47`
These are generated/temporary and should not be committed:
- `results.json` - Eval results
- `coverage.txt` - Test coverage
- `waza` (binary) - Built executable
Build and run an example eval:

```sh
cd waza-go
make build
./waza run ../examples/code-explainer/eval.yaml -v
```

Run the tests:

```sh
cd waza-go
make test
```

Common flags:

- `-v, --verbose` - Verbose output
- `-o, --output` - Save results JSON
- `--context-dir` - Fixtures directory
See Tracking Issue #66 for the full roadmap.
| Epic | Priority | Description |
|---|---|---|
| E1: Go CLI Foundation | P0 | Core CLI commands |
| E2: Sensei Engine | P0 | Compliance scoring |
| E3: Evaluation Framework | P0 | Cross-model testing |
| E4: Token Management | P1 | Budget tracking |
| E5: Waza Skill | P1 | Conversational interface |
| E6: CI/CD Integration | P1 | GitHub Actions |
| E7: AZD Extension | P2 | azd packaging |