A standardized benchmark for evaluating procedural memory retrieval systems on task-oriented trajectories. Test how well your retrieval method captures procedural similarity across different task representations.
This benchmark evaluates retrieval systems on their ability to identify procedurally similar trajectories. Currently available: ALFWorld (text-based household tasks). Coming soon: OSWorld (real computer environments with visual and multimodal retrieval).
- 40 complexity-stratified queries (11 HARD, 14 MEDIUM, 15 EASY)
- 336 AgentInstruct trajectories with state-action pairs
- 6 ALFWorld task types (placement, heating, cooling, cleaning, examination, multi-object)
- LLM-as-judge evaluation with procedural similarity scoring
- Standard IR metrics (P@k, R@k, F1@k, NDCG@k, MAP)
Proced-Mem: Benchmarking Procedural Memory Retrieval in Language Agents Across Domains
Ishant K, Aswanth Krishnan
MemAgents Workshop @ ICLR 2026 (camera-ready forthcoming)
The ALFWorld sub-domain was first presented in our preliminary preprint: arXiv:2511.21730
If you use this benchmark in your work, please cite the paper (see Citation).
| Sub-domain | Status | Description |
|---|---|---|
| ALFWorld | Available | 336 trajectories, 40 queries, 6 text-only retrieval methods, LLM-as-judge evaluation |
| OSWorld | Coming Soon | 206 computer tasks across Chrome, LibreOffice, Terminal & GNOME. 7 retrieval methods spanning text, visual, and multimodal modalities. Hierarchical ground truth at two granularity levels (L1: 15 patterns, L2: 5 clusters). Leave-one-out evaluation protocol. |
✅ Plug-and-play interface - Implement 3 methods, get comprehensive evaluation
✅ Complexity-stratified analysis - Performance breakdown by query difficulty
✅ Two baseline implementations - State-aware and action-only embeddings
✅ Method-neutral design - No hints toward specific retrieval approaches
✅ Reproducible evaluation - Standardized queries and ground truth annotations
- Python 3.8+
- OpenAI API key (for LLM-as-judge evaluation)
# Install core package
pip install .
# Or install with baseline implementations
pip install .[baselines]
# Or install everything (baselines + LLM judge)
pip install .[all]

cp .env.example .env
# Edit .env and add your OpenAI API key

from procedural_memory_benchmark import (
AgentInstructEmbeddingRetrieval,
BenchmarkRunner
)
# Initialize baseline retrieval system (state-aware embeddings)
print("π§ Setting up baseline retrieval system...")
retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup() # Builds database on first use (~1-2 minutes)
# Run benchmark on EASY tier (quick test)
print("π― Running benchmark evaluation...")
runner = BenchmarkRunner(retrieval, llm_model="gpt-4")
result = runner.run_benchmark(
complexity_tiers=["EASY"],
max_queries_per_tier=3, # Just 3 queries for quick test
save_results=True
)
# View results
runner.print_summary(result)

Expected output:
BENCHMARK RESULTS
System: agentinstruct_embedding
Queries evaluated: 3
Overall Metrics:
MAP: 0.84
P@1: 0.87
P@5: 0.73
NDCG@10: 0.86
Results saved to: results/benchmark_agentinstruct_embedding_20251120_195423.json
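The saved file is plain JSON, so it can be inspected directly. Below is a minimal sketch for loading the most recent run; the filename pattern comes from the output above, but the top-level field layout depends on the runner's output format, so the sketch just lists the keys.

```python
import json
from pathlib import Path

# Pick the most recent results file in ./results/ (written by save_results=True above).
latest = max(Path("results").glob("benchmark_*.json"), key=lambda p: p.stat().st_mtime)

with latest.open() as f:
    data = json.load(f)

# Top-level keys depend on the runner's output format; list them to explore.
print(f"Loaded {latest.name}")
print(list(data.keys()))
```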
from procedural_memory_benchmark import RetrievalSystem, RetrievedTrajectory
class MyCustomRetrieval(RetrievalSystem):
"""Your custom retrieval implementation."""
def retrieve(self, query: str, k: int = 5) -> list[RetrievedTrajectory]:
"""
Retrieve top-k most relevant trajectories.
Args:
query: Task description (e.g., "Put a mug on the coffee maker")
k: Number of results to return
Returns:
List of RetrievedTrajectory objects sorted by relevance
"""
# YOUR RETRIEVAL LOGIC HERE
results = your_search_function(query, k)
# Convert to standardized format
return [
RetrievedTrajectory(
trajectory_id=r['id'],
task_instance_id=r['task_id'],
task_description=r['description'],
similarity_score=r['score'], # Your similarity score
total_steps=r['num_steps'],
document_text=r['full_trajectory'] # Full trajectory for LLM evaluation
)
for r in results
]
def get_system_name(self) -> str:
"""Return unique identifier for this system."""
return "my_custom_retrieval"
def get_system_info(self) -> dict:
"""Return system configuration metadata."""
return {
"method": "your_method_name",
"model": "your_model_if_applicable",
"corpus_size": 336,
"description": "Brief description of your approach"
}
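For reference, here is a minimal toy implementation of the three required methods. It ranks trajectories by word overlap between the query and each task description, which is far weaker than the bundled embedding baselines but shows how the pieces fit together. It relies on the corpus loader API described further below (AgentInstructCorpusLoader, get_all_trajectories(), get_embedding_text()); the per-trajectory ID attribute names are assumptions, so the sketch falls back to the corpus index.

```python
from procedural_memory_benchmark import RetrievalSystem, RetrievedTrajectory
from procedural_memory_benchmark.agentinstruct import AgentInstructCorpusLoader


class KeywordOverlapRetrieval(RetrievalSystem):
    """Toy baseline: rank trajectories by word overlap with the query."""

    def __init__(self):
        super().__init__()
        self._trajectories = AgentInstructCorpusLoader().get_all_trajectories()

    def retrieve(self, query: str, k: int = 5) -> list[RetrievedTrajectory]:
        query_words = set(query.lower().split())
        scored = []
        for i, traj in enumerate(self._trajectories):
            desc_words = set(traj.task_description.lower().split())
            # Jaccard overlap between query words and task-description words
            score = len(query_words & desc_words) / max(len(query_words | desc_words), 1)
            scored.append((score, i, traj))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [
            RetrievedTrajectory(
                # ID attribute names are assumptions; fall back to the corpus index.
                trajectory_id=getattr(traj, "trajectory_id", str(i)),
                task_instance_id=getattr(traj, "task_instance_id", str(i)),
                task_description=traj.task_description,
                similarity_score=score,
                total_steps=traj.total_steps,
                document_text=traj.get_embedding_text(),
            )
            for score, i, traj in scored[:k]
        ]

    def get_system_name(self) -> str:
        return "keyword_overlap_baseline"

    def get_system_info(self) -> dict:
        return {
            "method": "keyword_overlap",
            "corpus_size": len(self._trajectories),
            "description": "Jaccard word overlap on task descriptions (toy example)",
        }
```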
from procedural_memory_benchmark import BenchmarkRunner

# Initialize your system
retrieval = MyCustomRetrieval()
# Run full benchmark
runner = BenchmarkRunner(retrieval, llm_model="gpt-4")
result = runner.run_benchmark(
complexity_tiers=["HARD", "MEDIUM", "EASY"], # All tiers
save_results=True
)
# Analyze results
runner.print_summary(result)

python tests/validate_retrieval.py MyCustomRetrieval

- HARD (11 queries): Multi-object tasks, composite procedures, temperature transformations
- MEDIUM (14 queries): Heating tasks, standard placement with constraints
- EASY (15 queries): Simple placement, basic examination tasks
- Source: AgentInstruct dataset (ALFWorld-compatible tasks)
- Format: State-action pairs with task descriptions
- Coverage: 6 task types across various complexity levels
- Size: ~900KB bundled with package
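Conceptually, each trajectory pairs one task description with an ordered list of state-action steps. The sketch below is purely illustrative of that shape: the outer attribute names (task_description, total_steps, state_action_pairs) match the loader API shown later, but the inner "state"/"action" keys are assumptions, not the bundled schema; inspect the corpus via AgentInstructCorpusLoader for the real layout.

```python
# Illustrative only: inner field names are assumptions, not the bundled schema.
example_trajectory = {
    "task_description": "Put a mug on the coffee maker",
    "total_steps": 3,
    "state_action_pairs": [
        {"state": "You are in the kitchen. A mug sits on the shelf.", "action": "take mug from shelf"},
        {"state": "You are carrying the mug.", "action": "go to coffee maker"},
        {"state": "You are at the coffee maker.", "action": "put mug on coffee maker"},
    ],
}
```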
- LLM-as-judge: GPT-4 scores procedural similarity (0-10 scale)
- Relevance threshold: score ≥ 6 = relevant
- Metrics: Precision@k, Recall@k, F1@k, NDCG@k, MAP
- Analysis: Overall + complexity-stratified performance
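To make the pipeline concrete, here is a minimal sketch of how judge scores become metrics (this is not the package's MetricsCalculator, just the idea): scores are binarized at the ≥ 6 threshold, then standard ranked-retrieval formulas are applied to the resulting relevance list.

```python
# Minimal sketch, not the package's MetricsCalculator.
judge_scores = [9.0, 6.5, 4.0, 8.0, 2.0]    # LLM-judge scores for the top-5 results, in rank order
relevant = [s >= 6 for s in judge_scores]   # relevance threshold: score >= 6

def precision_at_k(rels, k):
    """Fraction of the top-k results judged relevant."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP over one ranking, normalized by the relevant items it contains."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / max(hits, 1)

print(precision_at_k(relevant, 5))   # 0.6
print(average_precision(relevant))   # ~0.92 for this ranking; MAP averages AP over all queries
```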
Procedural memory captures how to perform tasks through action sequences. This benchmark evaluates whether retrieval systems can identify procedurally similar trajectories even when:
- Objects differ (apple vs mug vs plate)
- Locations differ (cabinet vs shelf vs drawer)
- Surface details vary but core procedure matches
Query: "Put a mug on the coffee maker"
Highly relevant (score 9-10):
- "Take cup from shelf, navigate to coffee maker, place on coffee maker"
- Objects differ (cup vs mug) but procedure identical ✅
Partially relevant (score 6-7):
- "Take mug from counter, heat in microwave"
- Shares mug-handling but different end goal ⚠️
Not relevant (score 0-3):
- "Open cabinet and examine lamp"
- No procedural overlap ❌
procedural-memory-benchmark/
├── procedural_memory_benchmark/              # Main package
│   ├── benchmark/                            # Core benchmark components
│   │   ├── retrieval_interface.py            # Abstract interface + baselines
│   │   ├── benchmark_runner.py               # Main orchestrator
│   │   ├── query_bank.py                     # Query management
│   │   ├── metrics_calculator.py             # IR metrics
│   │   ├── complexity_analyzer.py            # Complexity analysis
│   │   └── data/
│   │       └── query_bank.json               # 40 stratified queries
│   ├── agentinstruct/                        # AgentInstruct components
│   │   ├── corpus_loader.py                  # Trajectory loading
│   │   ├── embedder.py                       # Embedding generation
│   │   ├── database.py                       # State-aware baseline
│   │   └── database_actions_only.py          # Action-only baseline
│   ├── llm/                                  # LLM evaluation
│   │   └── llm_reasoner.py                   # GPT-4 judge
│   ├── utils/                                # Utilities
│   │   └── paths.py                          # Path resolution
│   └── data/
│       └── corpus/
│           └── agentinstruct_trajectories.json  # 336 trajectories
├── examples/                                 # Example implementations
├── docs/                                     # Documentation
├── tests/                                    # Validation tools
└── README.md                                 # This file
- Method: Embeddings of task description + state-action pairs
- Model: all-MiniLM-L6-v2 (384-dim)
- Performance: MAP ~0.79, P@1 ~0.78 (full 40-query evaluation)
from procedural_memory_benchmark.benchmark import AgentInstructEmbeddingRetrieval
retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup()  # First-time database build

- Method: Embeddings of pure action sequences (no states/descriptions)
- Model: all-MiniLM-L6-v2 (384-dim)
- Performance: MAP ~0.72, P@1 ~0.65 (full 40-query evaluation)
from procedural_memory_benchmark.benchmark import AgentInstructActionOnlyRetrieval
retrieval = AgentInstructActionOnlyRetrieval()
retrieval.setup()

from procedural_memory_benchmark import QueryBank
bank = QueryBank()
bank.load()
# Get all queries
all_queries = bank.queries # 40 queries
# Filter by complexity
hard_queries = bank.get_by_complexity("HARD") # 11 queries
medium_queries = bank.get_by_complexity("MEDIUM") # 14 queries
easy_queries = bank.get_by_complexity("EASY") # 15 queries
# Filter by task type
examination_queries = bank.get_by_task_type(2) # Examination tasks
# View query structure
query = all_queries[0]
print(f"Query: {query.task_description}")
print(f"Tier: {query.complexity_tier}")
print(f"Task type: {query.task_type}")
print(f"Factors: {query.complexity_factors}")from procedural_memory_benchmark.agentinstruct import AgentInstructCorpusLoader
loader = AgentInstructCorpusLoader()
trajectories = loader.get_all_trajectories() # 336 trajectories
# View trajectory structure
traj = trajectories[0]
print(f"Task: {traj.task_description}")
print(f"Steps: {traj.total_steps}")
print(f"State-action pairs: {len(traj.state_action_pairs)}")
# Get formatted text for your retrieval system
embedding_text = traj.get_embedding_text() # Task + states + actions
action_sequence = traj.get_pure_action_sequence_text()  # Actions only
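The two bundled baselines embed exactly these two text views with all-MiniLM-L6-v2. If you want to build your own embedding index over the same texts, a minimal sketch using sentence-transformers and cosine similarity might look like the following (this mirrors the idea, not the baselines' exact code):

```python
# Sketch of an embedding retrieval over the corpus texts; requires `sentence-transformers`.
import numpy as np
from sentence_transformers import SentenceTransformer
from procedural_memory_benchmark.agentinstruct import AgentInstructCorpusLoader

model = SentenceTransformer("all-MiniLM-L6-v2")
trajectories = AgentInstructCorpusLoader().get_all_trajectories()

# State-aware view: task description + states + actions.
# Use get_pure_action_sequence_text() instead for an action-only index.
corpus_texts = [t.get_embedding_text() for t in trajectories]
corpus_emb = model.encode(corpus_texts, normalize_embeddings=True)

query_emb = model.encode(["Put a mug on the coffee maker"], normalize_embeddings=True)
scores = corpus_emb @ query_emb[0]        # cosine similarity (embeddings are normalized)
top5 = np.argsort(-scores)[:5]
for idx in top5:
    print(f"{scores[idx]:.3f}  {trajectories[idx].task_description}")
```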
# Set environment variable
export OPENAI_API_KEY='your-key-here'
# Or create .env file
echo "OPENAI_API_KEY=your-key-here" > .env# For baseline systems, build database first
retrieval = AgentInstructEmbeddingRetrieval()
retrieval.setup()  # This builds the database (~1-2 minutes)

Databases are stored in user cache by default:
- Linux/Mac: ~/.cache/procedural_memory_benchmark/databases/
- Windows: %LOCALAPPDATA%/procedural_memory_benchmark/databases/
Override with environment variable:
export PROCEDURAL_MEMORY_DB_PATH=/custom/path

Results are saved to ./results/ in your current working directory.
- API Reference: See docs/API_REFERENCE.md for complete interface documentation
- Examples: See examples/ for working implementations
- Validation: Use tests/validate_retrieval.py to test your implementation
@article{kohar2025procedural,
title={A Benchmark for Procedural Memory Retrieval in Language Agents},
author={Kohar, Ishant and Krishnan, Aswanth},
journal={arXiv preprint arXiv:2511.21730},
year={2025},
note={Extended version accepted at MemAgents Workshop @ ICLR 2026}
}

Apache-2.0 License - See LICENSE file for details.
Contributions welcome! Please open an issue or pull request.
Questions? Open an issue on GitHub or check the documentation.