Agent Metric Induction from Open-Ended Human Feedback
ICLR 2026
AutoLibra turns open-ended natural-language feedback on agent trajectories into fine-grained, interpretable evaluation metrics. Give it sentences like "the agent did not choose iPhone 14/15" or "this agent has too much autonomy", and it gives back metrics with definitions and concrete examples — ready to drive an LLM-as-a-Judge, diagnose failure modes, or serve as optimization targets for prompt engineering.
The pipeline has three LLM-driven operators:
- Feedback grounding — break feedback into (behavior, feedback, sign) aspects tied to a specific part of the trajectory.
- Behavior clustering — group similar aspects into metrics with definitions + positive/negative examples.
- LLM-as-a-Judge — score trajectories on the induced metrics, producing positive and negative traits.
Coverage and redundancy meta-metrics then measure how well the induced metric set covers unseen human feedback.
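To make the coverage/redundancy idea concrete, here is a minimal sketch (not the implementation in `autolibra-core`; the `matches` predicate, which would be an LLM judgment in practice, and the function name are assumptions): coverage is the fraction of held-out feedback aspects explained by at least one induced metric, and redundancy counts extra metrics matching an already-covered aspect.

```python
# Simplified sketch of coverage / redundancy over a held-out feedback set.
# `matches(aspect, metric)` is a placeholder for an LLM judgment of whether
# a grounded feedback aspect is an instance of an induced metric.
from typing import Callable

def meta_metrics(
    heldout_aspects: list[str],
    metrics: list[str],
    matches: Callable[[str, str], bool],
) -> tuple[float, float]:
    covered = 0        # aspects explained by at least one metric
    extra = 0          # matches beyond the first for covered aspects
    for aspect in heldout_aspects:
        hits = sum(matches(aspect, m) for m in metrics)
        if hits:
            covered += 1
            extra += hits - 1
    coverage = covered / len(heldout_aspects) if heldout_aspects else 0.0
    redundancy = extra / covered if covered else 0.0
    return coverage, redundancy
```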
uv is the supported package manager.

```bash
git clone https://github.com/Open-Social-World/autolibra
cd autolibra
uv sync
```

AutoLibra talks to Azure OpenAI. Create a `.env` in the repo root:

```
AZURE_API_KEY=...
AZURE_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_4O_MODEL=gpt-4o
AZURE_OPENAI_O3_MODEL=o3-mini
GITHUB_PERSONAL_ACCESS_TOKEN=... # optional, for some GitHub-backed dataset loaders
```

A minimal end-to-end run of the three operators:

```python
import asyncio
from openai import AsyncAzureOpenAI
from autolibra_core import (
    feedback_grounding,
    behavior_clustering,
    run_llm_eval,
    MetricTrainingInstance,
)


async def main():
    client = AsyncAzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)

    # 1. Build MetricTrainingInstance objects from your trajectories + feedback.
    #    See `autolibra_core.datasets.*` for loaders; each yields instances like:
    #    MetricTrainingInstance(task=..., agent_id=..., trajectory=..., feedback=...)
    instances: list[MetricTrainingInstance] = load_my_data()

    # 2. Ground feedback into aspects per instance.
    aspects = []
    for inst in instances:
        aspects.extend(await feedback_grounding(inst, client))

    # 3. Cluster aspects into metrics with definitions + behavior examples.
    induced = await behavior_clustering(aspects, client)
    metrics = induced.metrics

    # 4. Run LLM-as-a-Judge on the trajectories with the induced metrics.
    results = await run_llm_eval(instances, metrics, client)
    print(metrics)
    print(results)

asyncio.run(main())
```

Shared trajectories + feedback live on Hugging Face at
open-social-world/autolibra. Install git-lfs first, then:

```bash
git clone https://huggingface.co/datasets/open-social-world/autolibra .data
```

Preprocess or convert from the original sources with the dataset loaders in
`autolibra_core.datasets` (one module per dataset):

```bash
uv run python -m autolibra_core.datasets.webarena
uv run python -m autolibra_core.datasets.sotopia
uv run python -m autolibra_core.datasets.cogym
uv run python -m autolibra_core.datasets.balrog_babaisai
# ...
```

To contribute a new dataset, add a loader under
`packages/autolibra-core/src/autolibra_core/datasets/` that emits
MetricTrainingInstance objects, then push the converted artifacts to the
shared HF repo.
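As a rough template (the module name, file format, and field contents below are illustrative assumptions, not the actual loader API; match the real MetricTrainingInstance signature in `autolibra_core` when writing a loader), a loader only needs to turn each recorded trajectory and its human feedback into MetricTrainingInstance objects:

```python
# Hypothetical loader skeleton for a new dataset ("mydataset" is illustrative).
import json
from pathlib import Path

from autolibra_core import MetricTrainingInstance


def load_mydataset(root: Path) -> list[MetricTrainingInstance]:
    """Read raw trajectory/feedback records and emit training instances."""
    instances = []
    for path in sorted(root.glob("*.json")):
        record = json.loads(path.read_text())
        instances.append(
            MetricTrainingInstance(
                task=record["task"],
                agent_id=record["agent_id"],
                trajectory=record["trajectory"],  # step-by-step agent actions
                feedback=record["feedback"],      # open-ended human feedback
            )
        )
    return instances
```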
Two annotation frontends ship in src/tty/:

```bash
# Terminal UI
uv run python src/tty/tty_annotation.py \
    .data/webarena .data/annotations/webarena \
    --annotator-id <your-name>

# Streamlit UI (browser)
uv run streamlit run src/tty/tty_annotation.py \
    .data/sotopia .data/annotations/sotopia \
    -- --annotator-id <your-name> --use-streamlit

# Review annotations
uv run streamlit run src/tty/view_annotations.py \
    -- .data/annotations/sotopia/annotations
```

Annotators are shown each trajectory step by step and write a single piece of natural-language feedback per trajectory. See §2.1 of the paper for annotation guidelines.
src/training/ contains the end-to-end scripts used in the paper, including
the Gemini-based iterative induction pipeline from §4 ("AutoLibra as a
Ladder"). See src/training/README.md for the per-script breakdown.
The prompt-optimization harness used on Baba-Is-AI (Fig. 4 in the paper) is in
prompt_optimization/; environment submodules
(BALROG, etc.) are declared in .gitmodules.
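For intuition only, here is a minimal sketch of using induced metrics as prompt-optimization targets (it is not the harness in prompt_optimization/; `run_agent`, the candidate list, and the result attribute names are placeholders): each prompt variant is rolled out, its trajectories are scored with the LLM judge, and the highest-scoring prompt is kept.

```python
# Minimal sketch: induced metrics as prompt-optimization targets.
# `run_agent(prompt)` (collect trajectories for a prompt) and the
# `num_positive` / `num_negative` result fields are illustrative placeholders.
from autolibra_core import run_llm_eval


async def pick_best_prompt(prompt_candidates, metrics, client, run_agent):
    best_prompt, best_score = None, float("-inf")
    for prompt in prompt_candidates:
        instances = await run_agent(prompt)              # roll out the agent
        results = await run_llm_eval(instances, metrics, client)
        # Aggregate: positive minus negative traits across trajectories.
        score = sum(r.num_positive - r.num_negative for r in results)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt
```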
```
autolibra/
├── packages/
│   ├── autolibra-core/      # Core operators, evaluators, dataset loaders
│   └── osw-data/            # Trajectory / metric primitives
├── src/
│   ├── tty/                 # Annotation UIs (terminal + Streamlit)
│   ├── training/            # Iterative induction / eval pipelines
│   ├── tools/               # Analysis utilities
│   └── plot/                # Figure generation
├── prompt_optimization/     # Agent prompt-optimization harness
└── pyproject.toml
```
Issues and PRs welcome. Before opening a PR:

```bash
uv run pre-commit run --all-files
uv run pytest
```

```bibtex
@inproceedings{zhu2026autolibra,
  title     = {AutoLibra: Agent Metric Induction from Open-Ended Human Feedback},
  author    = {Hao Zhu and Phil Cuvin and Xinkai Yu and
               Charlotte Ka Yee Yan and Jason Zhang and Diyi Yang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4BjGVZ7Bxn}
}
```

Supported by ONR N000142412532, NSF IIS-2247357, and DARPA Friction for Accountability in Conversational Transactions. Compute credits from Google Cloud Platform and Modal. Thanks to the Stanford SALT Lab for feedback.
