Add prompt baseline test to check lisa_test_writer prompt quality by paxue · Pull Request #4365 · microsoft/lisa

paxue · 2026-03-20T01:35:35Z

Description

Add a prompt baseline test to check the quality of the "lisa_test_writer.prompt.md" prompt.
The default pass threshold is 70% for large cloud-hosted models and 65% for small local models.
The test can also be used to verify other prompts that help generate LISA tests.
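The threshold rule in the description could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `DEFAULT_THRESHOLDS`, `passes_baseline`, and the model-class keys are hypothetical, and treating a score exactly equal to the threshold as a pass is an assumption. Only the 70% and 65% defaults come from the description.

```python
from typing import Optional

# Hypothetical names. Only the 70% / 65% defaults are taken from the PR description.
DEFAULT_THRESHOLDS = {
    "cloud_large": 70.0,  # large cloud-hosted model
    "local_small": 65.0,  # small local model
}


def passes_baseline(
    score_percent: float,
    model_class: str,
    override: Optional[float] = None,
) -> bool:
    """Return True when the overall score meets the threshold.

    Assumption: a score equal to the threshold counts as a pass.
    """
    threshold = override if override is not None else DEFAULT_THRESHOLDS[model_class]
    return score_percent >= threshold


# Usage: the same score can pass for a small local model and fail for a cloud model.
result_local = passes_baseline(66.0, "local_small")
result_cloud = passes_baseline(66.0, "cloud_large")
```

Keeping the thresholds in one table with an optional override would let the same check be reused for other prompts without editing the function.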

Related Issue

Type of Change

Checklist

Description is filled in above
No credentials, secrets, or internal details are included
Peer review requested (if not, add required peer reviewers after raising PR)
Tests executed and results posted below

Test Validation

Key Test Cases:

Impacted LISA Features:

Tested Azure Marketplace Images:

Test Results

Image	VM Size	Result
		PASSED / FAILED / SKIPPED

Copilot

Pull request overview

Adds a prompt evaluation framework under .github/prompts/eval/ to baseline and score the quality of outputs produced by lisa_test_writer.prompt.md, using JSONL-defined cases and an LLM-as-judge scoring rubric.

Changes:

Added eval_runner.py to run prompt eval cases against a configurable LLM provider and emit scored results.
Added cases.jsonl with baseline evaluation prompts + rubrics across multiple capability dimensions.
Added documentation (README.md) describing setup, usage, scoring, and case design.
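As a rough illustration of JSONL-defined cases, the sketch below parses one JSON object per line into a small record type. The field names (`id`, `dimension`, `prompt`, `rubric`) and the sample records are assumptions made for this example; the actual schema of `cases.jsonl` is not shown on this page.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class EvalCase:
    # All field names here are hypothetical, chosen for illustration only.
    case_id: str
    dimension: str
    prompt: str
    rubric: List[str]


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse JSONL text: one JSON object per non-empty line."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON: {exc}") from exc
        cases.append(
            EvalCase(
                case_id=record["id"],
                dimension=record["dimension"],
                prompt=record["prompt"],
                rubric=list(record["rubric"]),
            )
        )
    return cases


# Made-up sample data, not taken from the real cases.jsonl.
SAMPLE = "\n".join(
    [
        json.dumps({"id": "case-1", "dimension": "structure",
                    "prompt": "Write a test case.", "rubric": ["has a test method"]}),
        "",
        json.dumps({"id": "case-2", "dimension": "style",
                    "prompt": "Write a test suite.", "rubric": ["has a docstring", "uses asserts"]}),
    ]
)

cases = load_cases(SAMPLE)
```

Reporting the line number on a parse failure makes a malformed case easy to locate, which matters once the case file grows.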

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`.github/prompts/eval/scripts/eval_runner.py`	Implements the runner (provider selection, generation, judging, scoring, output).
`.github/prompts/eval/cases.jsonl`	Defines baseline eval cases and rubrics used by the runner.
`.github/prompts/eval/README.md`	Documents how to run/evolve the prompt evaluation framework and interpret results.
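The generate, judge, and score flow described for the runner might look like the sketch below. It is an assumption-laden outline rather than the contents of `eval_runner.py`: the function names, the dictionary keys, and the scoring rule (percentage of rubric criteria the judge marks as met, averaged across cases) are all hypothetical, and the provider and judge are stubbed with plain functions so the example runs without any LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a real runner would call an LLM provider here.
Generate = Callable[[str], str]      # prompt -> model output
Judge = Callable[[str, str], bool]   # (output, criterion) -> criterion met?


def score_case(prompt: str, rubric: List[str], generate: Generate, judge: Judge) -> float:
    """Score one case as the percentage of rubric criteria the judge accepts."""
    if not rubric:
        return 0.0
    output = generate(prompt)
    met = sum(1 for criterion in rubric if judge(output, criterion))
    return 100.0 * met / len(rubric)


def run_eval(cases: List[dict], generate: Generate, judge: Judge) -> Tuple[float, Dict[str, float]]:
    """Return (overall average score, per-case scores keyed by case id)."""
    per_case = {
        case["id"]: score_case(case["prompt"], case["rubric"], generate, judge)
        for case in cases
    }
    overall = sum(per_case.values()) / len(per_case) if per_case else 0.0
    return overall, per_case


# Stubs standing in for the model under test and the LLM judge.
def fake_generate(prompt: str) -> str:
    return "def test_example():\n    assert True"


def fake_judge(output: str, criterion: str) -> bool:
    return criterion in output  # a real judge would ask an LLM instead


demo_cases = [
    {"id": "c1", "prompt": "p1", "rubric": ["def test_", "assert"]},
    {"id": "c2", "prompt": "p2", "rubric": ["def test_", "pytest.raises"]},
]
overall, per_case = run_eval(demo_cases, fake_generate, fake_judge)
print(overall, per_case)  # 75.0 {'c1': 100.0, 'c2': 50.0}
```

Passing the generator and judge in as callables keeps provider selection separate from scoring, so the scoring logic can be tested deterministically with stubs like these.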

.github/prompts/eval/scripts/eval_runner.py

.github/prompts/eval/README.md

.github/prompts/eval/scripts/eval_runner.py

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

.github/prompts/eval/scripts/eval_runner.py

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

.github/prompts/eval/scripts/eval_runner.py

modify PR review process to add some standard

paxue requested review from LiliDeng and johnsongeorge-w as code owners March 20, 2026 01:35

Copilot AI review requested due to automatic review settings March 20, 2026 01:35

Copilot started reviewing on behalf of paxue March 20, 2026 01:36

Copilot AI reviewed Mar 20, 2026

Copilot AI review requested due to automatic review settings March 20, 2026 01:49

Copilot started reviewing on behalf of paxue March 20, 2026 01:50

Copilot AI reviewed Mar 20, 2026

.github/prompts/eval/scripts/eval_runner.py (resolved)

.github/prompts/eval/scripts/eval_runner.py (resolved)

Copilot AI review requested due to automatic review settings April 4, 2026 00:05

paxue force-pushed the paxue/eval_prompt branch from ce1fc11 to e55a96a April 4, 2026 00:05

Copilot started reviewing on behalf of paxue April 4, 2026 00:05

Copilot AI reviewed Apr 4, 2026

add prompt evaluation for lisa_test_writer.prompt.md

9fbad18

modify PR review process to add some standard

paxue force-pushed the paxue/eval_prompt branch from e55a96a to 9fbad18 April 4, 2026 00:34

paxue wants to merge 1 commit into main from paxue/eval_prompt


2 participants
