This document tracks the implementation progress of the Skills Eval Framework.
Phase 1 (Core Framework): build the foundational components.
| Task | Description | Status |
|---|---|---|
| 1.1 | Initialize Python project with pyproject.toml, dependencies | ✅ |
| 1.2 | Define data models/schemas (Task, Trial, Result, EvalSpec) | ✅ |
| 1.3 | Implement base Grader interface and code-based graders | ✅ |
| 1.4 | Implement eval Runner that orchestrates task execution | ✅ |
| 1.5 | Implement JSON reporter for results output | ✅ |
| 1.6 | Create CLI entrypoint (`waza run`, `waza init`) | ✅ |
Deliverable: ✅ Working CLI that can run basic evals with code-based graders.
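
As a rough illustration of tasks 1.3 and 1.4, a base grader interface with one deterministic, code-based grader might look like the sketch below. The names `GradeResult`, `Grader`, and `ContainsGrader` are hypothetical and are not taken from the actual `waza` package; the real interface in `graders/base.py` may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Outcome of grading one output (hypothetical shape, not waza's schema)."""
    passed: bool
    score: float  # 0.0 (fail) to 1.0 (pass)
    message: str = ""


class Grader(ABC):
    """Hypothetical abstract grader: subclasses implement grade()."""

    @abstractmethod
    def grade(self, output: str) -> GradeResult:
        """Return a GradeResult for a single skill output."""


class ContainsGrader(Grader):
    """Code-based grader that passes when a substring appears in the output."""

    def __init__(self, expected: str) -> None:
        self.expected = expected

    def grade(self, output: str) -> GradeResult:
        found = self.expected in output
        return GradeResult(
            passed=found,
            score=1.0 if found else 0.0,
            message="" if found else f"missing {self.expected!r}",
        )
```

A runner could then call `grade()` on each trial's output and collect the results, regardless of which concrete grader is configured.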
Phase 2 (Grading System): implement the full grading capabilities.
| Task | Description | Status |
|---|---|---|
| 2.1 | Implement trigger accuracy metric (shouldTrigger/shouldNotTrigger) | ✅ |
| 2.2 | Implement task completion metric with assertion-based grading | ✅ |
| 2.3 | Implement LLM-as-judge grader with configurable rubrics | ✅ |
| 2.4 | Implement behavior quality metrics (tool calls, efficiency) | ✅ |
| 2.5 | Implement composite scoring with configurable weights | ✅ |
Deliverable: ✅ Support for all three grader types with weighted composite scores.
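
To make tasks 2.1 and 2.5 concrete, here is a minimal sketch of a trigger accuracy metric and a weighted composite score. The function names, argument shapes, and example weights are assumptions chosen for illustration; they do not describe the framework's actual `metrics/` modules.

```python
from __future__ import annotations


def trigger_accuracy(fired_when_expected: list[bool],
                     fired_when_not_expected: list[bool]) -> float:
    """Fraction of prompts where skill activation matched the expectation.

    fired_when_expected: for each shouldTrigger prompt, whether the skill fired.
    fired_when_not_expected: for each shouldNotTrigger prompt, whether it fired.
    """
    correct = sum(fired_when_expected) + sum(
        not fired for fired in fired_when_not_expected
    )
    total = len(fired_when_expected) + len(fired_when_not_expected)
    return correct / total if total else 0.0


def composite_score(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of metric scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical example: three metrics with illustrative weights.
scores = {
    "trigger_accuracy": trigger_accuracy([True, True, False], [False, True]),
    "task_completion": 1.0,
    "behavior_quality": 0.5,
}
weights = {"trigger_accuracy": 0.2, "task_completion": 0.5, "behavior_quality": 0.3}
overall = composite_score(scores, weights)
```

Normalizing by the total weight means the configured weights do not have to sum to 1, which keeps partially specified weight sets usable.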
Phase 3 (Developer Experience): make it easy to adopt and use.
| Task | Description | Status |
|---|---|---|
| 3.1 | Create `waza init <skill-name>` scaffolding command | ✅ |
| 3.2 | Add markdown reporter for human-readable reports | ✅ |
| 3.3 | Create GitHub Actions workflow for CI integration | ✅ |
| 3.4 | Write comprehensive README with examples | ✅ |
| 3.5 | Add example eval suite for azure-deploy skill | ✅ |
Deliverable: ✅ Complete developer workflow from init to CI/CD.
Phase 4 (Eval-as-Skill): enable meta-evaluation within skill runtimes.
| Task | Description | Status |
|---|---|---|
| 4.1 | Create waza-runner skill with SKILL.md | ✅ |
| 4.2 | Implement skill instructions for running evals | ✅ |
| 4.3 | Add human review workflow support | ✅ |
| 4.4 | Test meta-evaluation capability | ✅ |
Deliverable: ✅ A skill that can evaluate other skills.
Phase 5 (Documentation): reach production-ready quality.
| Task | Description | Status |
|---|---|---|
| 5.1 | Add comprehensive test coverage (>80%) | ✅ (34 tests passing) |
| 5.2 | Write specification documentation | ✅ |
| 5.3 | Create tutorial for writing skill evals | ✅ |
| 5.4 | Add examples for different skill types | ✅ |
Deliverable: ✅ Production-ready framework with full documentation.
Phase 6 (Advanced Integration): real Copilot SDK testing, model comparison, and runtime telemetry.
| Task | Description | Status |
|---|---|---|
| 6.1 | Add copilot-sdk as optional dependency | ✅ |
| 6.2 | Create CopilotExecutor class wrapping SDK | ✅ |
| 6.3 | Add executor and model config options | ✅ |
| 6.4 | Add `--model` and `--executor` CLI flags | ✅ |
| 6.5 | Create `waza compare` command | ✅ |
| 6.6 | Create runtime telemetry module | ✅ |
| 6.7 | Add `waza analyze` command | ✅ |
Deliverable: ✅ Real integration testing with model comparison and runtime metrics.
waza/
├── waza/                      # Python package
│   ├── __init__.py
│   ├── cli.py                 # CLI entrypoint
│   ├── runner.py              # Eval orchestration
│   ├── graders/
│   │   ├── base.py            # Abstract grader interface
│   │   ├── code_graders.py    # Deterministic graders
│   │   ├── llm_graders.py     # LLM-as-judge graders
│   │   └── human_graders.py   # Human review workflow
│   ├── metrics/
│   │   ├── task_completion.py
│   │   ├── trigger_accuracy.py
│   │   ├── behavior_quality.py
│   │   └── composite.py
│   ├── reporters/
│   │   ├── json_reporter.py
│   │   ├── markdown_reporter.py
│   │   └── github_reporter.py
│   └── schemas/
│       ├── eval_spec.py
│       ├── task.py
│       └── results.py
├── waza-runner/               # Eval-as-skill
│   └── SKILL.md
├── examples/
├── tests/
├── pyproject.toml
└── README.md
- ⬚ Not started
- ⏳ In progress
- ✅ Complete
- ⚠️ Blocked
- ✅ Created project structure
- ✅ Implemented complete Phase 1 (Core Framework)
- ✅ Implemented Phase 2 (Grading System)
- ✅ Implemented Phase 3 (Developer Experience)
- ✅ Implemented Phase 4 (Eval-as-Skill)
- ✅ Implemented Phase 5 (Documentation)
- ✅ Implemented Phase 6 (Advanced Integration)
- ✅ 34 tests passing
- ✅ 48 files created
- ✅ CLI working with new commands:
  - `waza run` - with `--model` and `--executor` flags
  - `waza init` - scaffolds complete eval suite
  - `waza compare` - side-by-side model comparison
  - `waza analyze` - runtime telemetry analysis
  - `waza list-graders` - available grader types
  - `waza report` - generate reports from results
- ✅ Example evals for azure-deploy and cli-session-recorder skills
- ✅ Created DEMO-SCRIPT.md for video walkthrough
- ✅ Created Integration Testing and Telemetry documentation