Agent Metric Induction from Open-Ended Human Feedback
ICLR 2026
AutoLibra turns open-ended natural-language feedback on agent trajectories into fine-grained, interpretable evaluation metrics. Give it sentences like "the agent did not choose iPhone 14/15" or "this agent has too much autonomy", and it gives back metrics with definitions and concrete examples — ready to drive an LLM-as-a-Judge, diagnose failure modes, or serve as optimization targets for prompt engineering.
The pipeline has three LLM-driven operators:
- Feedback grounding — break feedback into (behavior, feedback, sign) aspects tied to a specific part of the trajectory.
- Behavior clustering — group similar aspects into metrics with definitions + positive/negative examples.
- LLM-as-a-Judge — score trajectories on the induced metrics, producing positive and negative traits.
Coverage and redundancy meta-metrics then measure how well the induced metric set covers unseen human feedback.
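To make the coverage/redundancy idea concrete, here is a minimal sketch (not the implementation in `autolibra-core`; the `matches` predicate, which would be an LLM judgment in practice, and the function name are assumptions): coverage is the fraction of held-out feedback aspects explained by at least one induced metric, and redundancy counts extra metrics matching an already-covered aspect.

```python
# Simplified sketch of coverage / redundancy over a held-out feedback set.
# `matches(aspect, metric)` is a placeholder for an LLM judgment of whether
# a grounded feedback aspect is an instance of an induced metric.
from typing import Callable

def meta_metrics(
    heldout_aspects: list[str],
    metrics: list[str],
    matches: Callable[[str, str], bool],
) -> tuple[float, float]:
    covered = 0        # aspects explained by at least one metric
    extra = 0          # matches beyond the first for covered aspects
    for aspect in heldout_aspects:
        hits = sum(matches(aspect, m) for m in metrics)
        if hits:
            covered += 1
            extra += hits - 1
    coverage = covered / len(heldout_aspects) if heldout_aspects else 0.0
    redundancy = extra / covered if covered else 0.0
    return coverage, redundancy
```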
uv is the supported package manager.

```bash
git clone https://github.com/Open-Social-World/autolibra
cd autolibra
uv sync
```

AutoLibra talks to Azure OpenAI. Create a `.env` in the repo root:

```
AZURE_API_KEY=...
AZURE_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_4O_MODEL=gpt-4o
AZURE_OPENAI_O3_MODEL=o3-mini
GITHUB_PERSONAL_ACCESS_TOKEN=... # optional, for some GitHub-backed dataset loaders
```

A minimal end-to-end run of the three operators:

```python
import asyncio
from openai import AsyncAzureOpenAI
from autolibra_core import (
    feedback_grounding,
    behavior_clustering,
    run_llm_eval,
    MetricTrainingInstance,
)


async def main():
    client = AsyncAzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)

    # 1. Build MetricTrainingInstance objects from your trajectories + feedback.
    #    See `autolibra_core.datasets.*` for loaders; each yields instances like:
    #    MetricTrainingInstance(task=..., agent_id=..., trajectory=..., feedback=...)
    instances: list[MetricTrainingInstance] = load_my_data()

    # 2. Ground feedback into aspects per instance.
    aspects = []
    for inst in instances:
        aspects.extend(await feedback_grounding(inst, client))

    # 3. Cluster aspects into metrics with definitions + behavior examples.
    induced = await behavior_clustering(aspects, client)
    metrics = induced.metrics

    # 4. Run LLM-as-a-Judge on the trajectories with the induced metrics.
    results = await run_llm_eval(instances, metrics, client)
    print(metrics)
    print(results)

asyncio.run(main())
```

Shared trajectories + feedback live on Hugging Face at
open-social-world/autolibra. Install git-lfs first, then:

```bash
git clone https://huggingface.co/datasets/open-social-world/autolibra .data
```

Preprocess or convert from the original sources with the dataset loaders in
`autolibra_core.datasets` (one module per dataset):

```bash
uv run python -m autolibra_core.datasets.webarena
uv run python -m autolibra_core.datasets.sotopia
uv run python -m autolibra_core.datasets.cogym
uv run python -m autolibra_core.datasets.balrog_babaisai
# ...
```

To contribute a new dataset, add a loader under
`packages/autolibra-core/src/autolibra_core/datasets/` that emits
MetricTrainingInstance objects, then push the converted artifacts to the
shared HF repo.
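As a rough template (the module name, file format, and field contents below are illustrative assumptions, not the actual loader API; match the real MetricTrainingInstance signature in `autolibra_core` when writing a loader), a loader only needs to turn each recorded trajectory and its human feedback into MetricTrainingInstance objects:

```python
# Hypothetical loader skeleton for a new dataset ("mydataset" is illustrative).
import json
from pathlib import Path

from autolibra_core import MetricTrainingInstance


def load_mydataset(root: Path) -> list[MetricTrainingInstance]:
    """Read raw trajectory/feedback records and emit training instances."""
    instances = []
    for path in sorted(root.glob("*.json")):
        record = json.loads(path.read_text())
        instances.append(
            MetricTrainingInstance(
                task=record["task"],
                agent_id=record["agent_id"],
                trajectory=record["trajectory"],  # step-by-step agent actions
                feedback=record["feedback"],      # open-ended human feedback
            )
        )
    return instances
```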
Two annotation frontends ship in src/tty/:

```bash
# Terminal UI
uv run python src/tty/tty_annotation.py \
    .data/webarena .data/annotations/webarena \
    --annotator-id <your-name>

# Streamlit UI (browser)
uv run streamlit run src/tty/tty_annotation.py \
    .data/sotopia .data/annotations/sotopia \
    -- --annotator-id <your-name> --use-streamlit

# Review annotations
uv run streamlit run src/tty/view_annotations.py \
    -- .data/annotations/sotopia/annotations
```

Annotators are shown each trajectory step by step and write a single piece of natural-language feedback per trajectory. See §2.1 of the paper for annotation guidelines.
src/training/ contains the end-to-end scripts used in the paper, including
the Gemini-based iterative induction pipeline from §4 ("AutoLibra as a
Ladder"). See src/training/README.md for the per-script breakdown.
The prompt-optimization harness used on Baba-Is-AI (Fig. 4 in the paper) is in
prompt_optimization/; environment submodules
(BALROG, etc.) are declared in .gitmodules.
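For intuition only, here is a minimal sketch of using induced metrics as prompt-optimization targets (it is not the harness in prompt_optimization/; `run_agent`, the candidate list, and the result attribute names are placeholders): each prompt variant is rolled out, its trajectories are scored with the LLM judge, and the highest-scoring prompt is kept.

```python
# Minimal sketch: induced metrics as prompt-optimization targets.
# `run_agent(prompt)` (collect trajectories for a prompt) and the
# `num_positive` / `num_negative` result fields are illustrative placeholders.
from autolibra_core import run_llm_eval


async def pick_best_prompt(prompt_candidates, metrics, client, run_agent):
    best_prompt, best_score = None, float("-inf")
    for prompt in prompt_candidates:
        instances = await run_agent(prompt)              # roll out the agent
        results = await run_llm_eval(instances, metrics, client)
        # Aggregate: positive minus negative traits across trajectories.
        score = sum(r.num_positive - r.num_negative for r in results)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt
```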
```
autolibra/
├── packages/
│   ├── autolibra-core/      # Core operators, evaluators, dataset loaders
│   └── osw-data/            # Trajectory / metric primitives
├── src/
│   ├── tty/                 # Annotation UIs (terminal + Streamlit)
│   ├── training/            # Iterative induction / eval pipelines
│   ├── tools/               # Analysis utilities
│   └── plot/                # Figure generation
├── prompt_optimization/     # Agent prompt-optimization harness
└── pyproject.toml
```
Issues and PRs welcome. Before opening a PR:

```bash
uv run pre-commit run --all-files
uv run pytest
```

```bibtex
@inproceedings{zhu2026autolibra,
  title     = {AutoLibra: Agent Metric Induction from Open-Ended Human Feedback},
  author    = {Hao Zhu and Phil Cuvin and Xinkai Yu and
               Charlotte Ka Yee Yan and Jason Zhang and Diyi Yang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4BjGVZ7Bxn}
}
```

Supported by ONR N000142412532, NSF IIS-2247357, and DARPA Friction for Accountability in Conversational Transactions. Compute credits from Google Cloud Platform and Modal. Thanks to the Stanford SALT Lab for feedback.
