Procedural Memory Benchmark

A standardized benchmark for evaluating procedural memory retrieval systems on task-oriented trajectories. Test how well your retrieval method captures procedural similarity across different task representations.

📋 Overview

This benchmark evaluates retrieval systems on their ability to identify procedurally similar trajectories. Currently available: ALFWorld (text-based household tasks). Coming soon: OSWorld (real computer environments with visual and multimodal retrieval).

  • 40 complexity-stratified queries (11 HARD, 14 MEDIUM, 15 EASY)
  • 336 AgentInstruct trajectories with state-action pairs
  • 6 ALFWorld task types (placement, heating, cooling, cleaning, examination, multi-object)
  • LLM-as-judge evaluation with procedural similarity scoring
  • Standard IR metrics (P@k, R@k, F1@k, NDCG@k, MAP)

📄 Paper

Proced-Mem: Benchmarking Procedural Memory Retrieval in Language Agents Across Domains
Ishant Kohar, Aswanth Krishnan
MemAgents Workshop @ ICLR 2026 (camera-ready forthcoming)

The ALFWorld sub-domain was first presented in our preliminary preprint: arXiv:2511.21730

If you use this benchmark in your work, please cite the paper (see Citation).

Roadmap

| Sub-domain | Status | Description |
| --- | --- | --- |
| ALFWorld | Available | 336 trajectories, 40 queries, 6 text-only retrieval methods, LLM-as-judge evaluation |
| OSWorld | Coming Soon | 206 computer tasks across Chrome, LibreOffice, Terminal & GNOME. 7 retrieval methods spanning text, visual, and multimodal modalities. Hierarchical ground truth at two granularity levels (L1: 15 patterns, L2: 5 clusters). Leave-one-out evaluation protocol. |

Key Features

  • ✅ Plug-and-play interface - Implement 3 methods, get comprehensive evaluation
  • ✅ Complexity-stratified analysis - Performance breakdown by query difficulty
  • ✅ Two baseline implementations - State-aware and action-only embeddings
  • ✅ Method-neutral design - No hints toward specific retrieval approaches
  • ✅ Reproducible evaluation - Standardized queries and ground truth annotations

🚀 Quick Start (5 minutes)

Prerequisites

  • Python 3.9+ (the code examples use built-in generic type hints such as list[...])
  • OpenAI API key (for LLM-as-judge evaluation)

Installation

# Install core package
pip install .

# Or install with baseline implementations
pip install .[baselines]

# Or install everything (baselines + LLM judge)
pip install .[all]

Set up API key

cp .env.example .env
# Edit .env and add your OpenAI API key
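
If you drive the benchmark from your own scripts, you can also load the key yourself. A minimal sketch assuming the python-dotenv package is installed (the benchmark may already handle this internally):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"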

Run baseline evaluation

from procedural_memory_benchmark import (
    AgentInstructEmbeddingRetrieval,
    BenchmarkRunner
)

# Initialize baseline retrieval system (state-aware embeddings)
print("πŸ”§ Setting up baseline retrieval system...")
retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup()  # Builds database on first use (~1-2 minutes)

# Run benchmark on EASY tier (quick test)
print("🎯 Running benchmark evaluation...")
runner = BenchmarkRunner(retrieval, llm_model="gpt-4")
result = runner.run_benchmark(
    complexity_tiers=["EASY"],
    max_queries_per_tier=3,  # Just 3 queries for quick test
    save_results=True
)

# View results
runner.print_summary(result)

Expected output:

📊 BENCHMARK RESULTS
System: agentinstruct_embedding
Queries evaluated: 3

Overall Metrics:
  MAP: 0.84
  P@1: 0.87
  P@5: 0.73
  NDCG@10: 0.86

Results saved to: results/benchmark_agentinstruct_embedding_20251120_195423.json

🔧 Adding Your Custom Retrieval System

Step 1: Implement the interface (3 required methods)

from procedural_memory_benchmark import RetrievalSystem, RetrievedTrajectory

class MyCustomRetrieval(RetrievalSystem):
    """Your custom retrieval implementation."""

    def retrieve(self, query: str, k: int = 5) -> list[RetrievedTrajectory]:
        """
        Retrieve top-k most relevant trajectories.

        Args:
            query: Task description (e.g., "Put a mug on the coffee maker")
            k: Number of results to return

        Returns:
            List of RetrievedTrajectory objects sorted by relevance
        """
        # YOUR RETRIEVAL LOGIC HERE
        results = your_search_function(query, k)

        # Convert to standardized format
        return [
            RetrievedTrajectory(
                trajectory_id=r['id'],
                task_instance_id=r['task_id'],
                task_description=r['description'],
                similarity_score=r['score'],  # Your similarity score
                total_steps=r['num_steps'],
                document_text=r['full_trajectory']  # Full trajectory for LLM evaluation
            )
            for r in results
        ]

    def get_system_name(self) -> str:
        """Return unique identifier for this system."""
        return "my_custom_retrieval"

    def get_system_info(self) -> dict:
        """Return system configuration metadata."""
        return {
            "method": "your_method_name",
            "model": "your_model_if_applicable",
            "corpus_size": 336,
            "description": "Brief description of your approach"
        }

Step 2: Run the benchmark

from procedural_memory_benchmark import BenchmarkRunner

# Initialize your system
retrieval = MyCustomRetrieval()

# Run full benchmark
runner = BenchmarkRunner(retrieval, llm_model="gpt-4")
result = runner.run_benchmark(
    complexity_tiers=["HARD", "MEDIUM", "EASY"],  # All tiers
    save_results=True
)

# Analyze results
runner.print_summary(result)

Step 3: Validate your implementation

python tests/validate_retrieval.py MyCustomRetrieval

📊 Benchmark Structure

Query Bank (40 queries)

  • HARD (11 queries): Multi-object tasks, composite procedures, temperature transformations
  • MEDIUM (14 queries): Heating tasks, standard placement with constraints
  • EASY (15 queries): Simple placement, basic examination tasks

Corpus (336 trajectories)

  • Source: AgentInstruct dataset (ALFWorld-compatible tasks)
  • Format: State-action pairs with task descriptions
  • Coverage: 6 task types across various complexity levels
  • Size: ~900KB bundled with package

Evaluation

  • LLM-as-judge: GPT-4 scores procedural similarity (0-10 scale)
  • Relevance threshold: Score ≥ 6 = relevant
  • Metrics: Precision@k, Recall@k, F1@k, NDCG@k, MAP (a minimal sketch follows this list)
  • Analysis: Overall + complexity-stratified performance
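
For intuition, here is a minimal sketch of how judge scores map onto these metrics. The helper names are illustrative, and the graded-gain NDCG shown here is an assumption (the benchmark may use binary gains):

import math

def precision_at_k(relevant, k):
    """P@k: fraction of the top-k ranked results judged relevant."""
    return sum(relevant[:k]) / k

def ndcg_at_k(gains, k):
    """NDCG@k over graded gains (here: raw 0-10 judge scores)."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

judge_scores = [9, 7, 3, 8, 2]                  # LLM-judge scores in ranked order
relevant = [int(s >= 6) for s in judge_scores]  # benchmark threshold: score >= 6
print(precision_at_k(relevant, 5))              # 0.6
print(ndcg_at_k(judge_scores, 5))               # ~0.98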

📖 Key Concepts

What is procedural memory?

Procedural memory captures how to perform tasks through action sequences. This benchmark evaluates whether retrieval systems can identify procedurally similar trajectories even when:

  • Objects differ (apple vs mug vs plate)
  • Locations differ (cabinet vs shelf vs drawer)
  • Surface details vary but core procedure matches

Example: Procedural similarity

Query: "Put a mug on the coffee maker"

Highly relevant (score 9-10):

  • "Take cup from shelf, navigate to coffee maker, place on coffee maker"
  • Objects differ (cup vs mug) but procedure identical ✅

Partially relevant (score 6-7):

  • "Take mug from counter, heat in microwave"
  • Shares mug-handling but different end goal ⚠️

Not relevant (score 0-3):

  • "Open cabinet and examine lamp"
  • No procedural overlap ❌
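
For a hands-on feel for how the embedding baselines rank cases like these, here is a minimal sketch using the sentence-transformers package with the same all-MiniLM-L6-v2 model the baselines use (package availability is an assumption, and the printed scores are illustrative, not ground truth):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # same model as the baselines

query = "Put a mug on the coffee maker"
candidates = [
    "Take cup from shelf, navigate to coffee maker, place on coffee maker",
    "Take mug from counter, heat in microwave",
    "Open cabinet and examine lamp",
]

# Cosine similarity between the query and each candidate trajectory summary
embeddings = model.encode([query] + candidates)
similarities = util.cos_sim(embeddings[0], embeddings[1:])[0]
for text, score in zip(candidates, similarities):
    print(f"{float(score):.2f}  {text}")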

πŸ“ Repository Structure

procedural-memory-benchmark/
├── procedural_memory_benchmark/     # Main package
│   ├── benchmark/                   # Core benchmark components
│   │   ├── retrieval_interface.py   # Abstract interface + baselines
│   │   ├── benchmark_runner.py      # Main orchestrator
│   │   ├── query_bank.py            # Query management
│   │   ├── metrics_calculator.py    # IR metrics
│   │   ├── complexity_analyzer.py   # Complexity analysis
│   │   └── data/
│   │       └── query_bank.json      # 40 stratified queries
│   ├── agentinstruct/               # AgentInstruct components
│   │   ├── corpus_loader.py         # Trajectory loading
│   │   ├── embedder.py              # Embedding generation
│   │   ├── database.py              # State-aware baseline
│   │   └── database_actions_only.py # Action-only baseline
│   ├── llm/                         # LLM evaluation
│   │   └── llm_reasoner.py          # GPT-4 judge
│   ├── utils/                       # Utilities
│   │   └── paths.py                 # Path resolution
│   └── data/
│       └── corpus/
│           └── agentinstruct_trajectories.json  # 336 trajectories
├── examples/                        # Example implementations
├── docs/                            # Documentation
├── tests/                           # Validation tools
└── README.md                        # This file

🎯 Baseline Systems

AgentInstructEmbeddingRetrieval (State-aware)

  • Method: Embeddings of task description + state-action pairs
  • Model: all-MiniLM-L6-v2 (384-dim)
  • Performance: MAP ~0.79, P@1 ~0.78 (full 40-query evaluation)

from procedural_memory_benchmark.benchmark import AgentInstructEmbeddingRetrieval

retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup()  # First-time database build

AgentInstructActionOnlyRetrieval (Action-only)

  • Method: Embeddings of pure action sequences (no states/descriptions)
  • Model: all-MiniLM-L6-v2 (384-dim)
  • Performance: MAP ~0.72, P@1 ~0.65 (full 40-query evaluation)

from procedural_memory_benchmark.benchmark import AgentInstructActionOnlyRetrieval

retrieval = AgentInstructActionOnlyRetrieval()
retrieval.setup()

πŸ” Accessing Query Bank Data

from procedural_memory_benchmark import QueryBank

bank = QueryBank()
bank.load()

# Get all queries
all_queries = bank.queries  # 40 queries

# Filter by complexity
hard_queries = bank.get_by_complexity("HARD")  # 11 queries
medium_queries = bank.get_by_complexity("MEDIUM")  # 14 queries
easy_queries = bank.get_by_complexity("EASY")  # 15 queries

# Filter by task type
examination_queries = bank.get_by_task_type(2)  # Examination tasks

# View query structure
query = all_queries[0]
print(f"Query: {query.task_description}")
print(f"Tier: {query.complexity_tier}")
print(f"Task type: {query.task_type}")
print(f"Factors: {query.complexity_factors}")

πŸ” Accessing Corpus Data

from procedural_memory_benchmark.agentinstruct import AgentInstructCorpusLoader

loader = AgentInstructCorpusLoader()
trajectories = loader.get_all_trajectories()  # 336 trajectories

# View trajectory structure
traj = trajectories[0]
print(f"Task: {traj.task_description}")
print(f"Steps: {traj.total_steps}")
print(f"State-action pairs: {len(traj.state_action_pairs)}")

# Get formatted text for your retrieval system
embedding_text = traj.get_embedding_text()  # Task + states + actions
action_sequence = traj.get_pure_action_sequence_text()  # Actions only
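
As a starting point for a custom system over this corpus, here is a minimal lexical baseline using scikit-learn TF-IDF. It is a sketch, not a reference implementation: scikit-learn is an assumption, the index is built in __init__ rather than a setup() hook for brevity, and the trajectory_id / task_instance_id attributes on trajectory objects are assumed to exist:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from procedural_memory_benchmark import RetrievalSystem, RetrievedTrajectory
from procedural_memory_benchmark.agentinstruct import AgentInstructCorpusLoader

class TfidfRetrieval(RetrievalSystem):
    """Toy lexical baseline: TF-IDF over full trajectory text."""

    def __init__(self):
        self.trajectories = AgentInstructCorpusLoader().get_all_trajectories()
        self.texts = [t.get_embedding_text() for t in self.trajectories]
        self.vectorizer = TfidfVectorizer()
        self.matrix = self.vectorizer.fit_transform(self.texts)

    def retrieve(self, query: str, k: int = 5) -> list[RetrievedTrajectory]:
        scores = cosine_similarity(self.vectorizer.transform([query]), self.matrix)[0]
        top = scores.argsort()[::-1][:k]  # indices of the k highest-scoring trajectories
        return [
            RetrievedTrajectory(
                trajectory_id=self.trajectories[i].trajectory_id,        # assumed attribute
                task_instance_id=self.trajectories[i].task_instance_id,  # assumed attribute
                task_description=self.trajectories[i].task_description,
                similarity_score=float(scores[i]),
                total_steps=self.trajectories[i].total_steps,
                document_text=self.texts[i],
            )
            for i in top
        ]

    def get_system_name(self) -> str:
        return "tfidf_baseline"

    def get_system_info(self) -> dict:
        return {"method": "tfidf", "corpus_size": len(self.trajectories)}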

❓ Troubleshooting

"OpenAI API key not found"

# Set environment variable
export OPENAI_API_KEY='your-key-here'

# Or create .env file
echo "OPENAI_API_KEY=your-key-here" > .env

"ChromaDB database not available"

# For baseline systems, build database first
retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup()  # This builds the database (~1-2 minutes)

Database location

Databases are stored in user cache by default:

  • Linux/Mac: ~/.cache/procedural_memory_benchmark/databases/
  • Windows: %LOCALAPPDATA%\procedural_memory_benchmark\databases\

Override with environment variable:

export PROCEDURAL_MEMORY_DB_PATH=/custom/path

Results location

Results are saved to ./results/ in your current working directory.
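
To inspect a saved run programmatically, here is a minimal sketch (the result-file schema is not documented here, so this only surfaces the top-level keys):

import glob
import json
import os

# Pick the most recently written results file
paths = glob.glob("results/benchmark_*.json")
latest = max(paths, key=os.path.getmtime)

with open(latest) as f:
    results = json.load(f)

print(latest)
print(list(results.keys()))  # inspect the structure before digging in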

📚 Further Documentation

Detailed guides live in the docs/ directory of this repository.

Citation

@article{kohar2025procedural,
  title={A Benchmark for Procedural Memory Retrieval in Language Agents},
  author={Kohar, Ishant and Krishnan, Aswanth},
  journal={arXiv preprint arXiv:2511.21730},
  year={2025},
  note={Extended version accepted at MemAgents Workshop @ ICLR 2026}
}

📄 License

Apache-2.0 License - See LICENSE file for details.

🤝 Contributing

Contributions welcome! Please open an issue or pull request.


Questions? Open an issue on GitHub or check the documentation.
