ML Feature Store Documentation

Transform technical books into engaging, learnable content

A comprehensive guide to building production ML systems with feature stores, from batch pipelines to real-time streaming, LLM agents, and observability.

📚 Table of Contents

Part I: Foundations

Chapter 1: Building Machine Learning Systems
Learn the fundamentals of ML systems, the anatomy of production ML, and the evolution from stateless to stateful architectures.

Chapter 2: Machine Learning Pipelines
Understand feature, training, and inference pipelines: the three pillars of production ML systems.

Chapter 3: Air Quality Prediction Service
Build your first end-to-end ML system: a real-world air quality forecasting service with batch pipelines.

Chapter 4: Feature Stores
Discover why feature stores are essential for ML systems, their history, and core concepts.

Chapter 5: Hopsworks Feature Store
Get hands-on with Hopsworks: projects, feature groups, feature views, and the architecture of a modern feature store.

Part II: Feature Engineering

Chapter 6: Data Transformations
Master the taxonomy of data transformations: stateless, stateful, model-independent, and model-dependent.

Chapter 7: Model-Dependent Transforms (MDTs)
Learn on-demand transformations, vector similarity search, embeddings, and building RAG systems for LLMs.

Chapter 8: Batch Feature Pipelines
Build scalable batch pipelines with PySpark, implement data validation, and handle point-in-time correctness with ASOF joins.

Chapter 9: Streaming and Real-Time Features
Create streaming pipelines with Apache Flink and Feldera, implement rolling aggregations, and achieve subsecond feature freshness.

Part III: Training and Inference

Chapter 10: Training Pipelines
Build reproducible training pipelines, create training datasets with point-in-time correctness, and implement experiment tracking.

Chapter 11: Inference Pipelines
Deploy models for batch and online inference, implement caching strategies, and prevent online-offline skew.

Chapter 12: Agents and LLM Workflows
Build LLM-powered agents with tool usage (MCP), implement RAG with feature stores, and create multi-step AI workflows.

Part IV: Production Operations

Chapter 13: Testing AI Systems
Implement comprehensive testing strategies: unit tests, integration tests, data validation, and evals for LLMs.

Chapter 14: Observability and Monitoring
Set up logging and metrics for ML models and LLM agents, monitor for drift, and implement guardrails.

Chapter 15: TikTok's Recommender System
Study a real-world case: how TikTok builds the world's most valuable AI system with feature stores and real-time ML.

📊 Documentation Statistics

Total Chapters: 15
Total Words: ~109,000
Code Examples: 693
ASCII Diagrams: Extensive
Insight Boxes: 144

🎯 Learning Path

Beginner Track

Start here if you're new to ML systems:

Chapter 1 → Chapter 2 → Chapter 3
Chapter 4 → Chapter 5
Chapter 6 → Chapter 8
Chapter 10 → Chapter 11

Intermediate Track

For those familiar with ML basics:

Chapter 4 → Chapter 6 → Chapter 7
Chapter 8 → Chapter 9
Chapter 10 → Chapter 11 → Chapter 13

Advanced Track

For practitioners building production systems:

Chapter 7 → Chapter 9 (Real-time features + RAG)
Chapter 12 (LLM Agents)
Chapter 14 (Observability)
Chapter 15 (TikTok case study)

LLM/Agent Focused

Building AI systems with LLMs:

Chapter 7 (Embeddings + Vector similarity)
Chapter 12 (Agents and workflows)
Chapter 13 (Testing and evals)
Chapter 14 (Logging and guardrails)

🛠️ Technical Stack

The documentation covers these technologies:

Feature Stores:

Hopsworks
Feature engineering with Pandas, Polars, PySpark

Stream Processing:

Apache Flink (Java/SQL)
Feldera (SQL with incremental views)
Apache Kafka

ML Frameworks:

Scikit-Learn
XGBoost
TensorFlow/Keras
PyTorch

LLM/Agent Tools:

LangChain
Model Context Protocol (MCP)
Prompt engineering
RAG systems

Monitoring:

Prometheus
NannyML (drift detection)
Hopsworks logging

Infrastructure:

Apache Hudi (lakehouse)
RonDB (online feature store)
S3-compatible object stores
Kubernetes (KServe)

📖 How to Use This Documentation

Teaching Format

Each chapter follows a consistent learning structure:

Three-Level Explanations:

In plain English: Simple analogies for quick understanding
In technical terms: Precise definitions for depth
Why it matters: Real-world relevance

Visual Learning:

ASCII diagrams for architecture and data flow
Progressive examples from simple to complex
Code snippets with clear explanations

Insight Boxes:

> 💡 Insight
>
> Key patterns and best practices that connect concepts
> to broader ML system design principles.

Navigation

Each chapter has a numbered table of contents with anchor links
Previous/Next links at the bottom for sequential reading
Cross-references to related chapters throughout

Code Examples

All code examples are:

✅ Complete and runnable
✅ Explained with context
✅ Tested and validated
✅ Language-specific (Python, SQL, YAML, etc.)

🚀 Quick Start

For ML Engineers

Read Chapter 1 to understand ML systems
Follow Chapter 3 to build your first system
Deep dive into Chapter 6 for feature engineering

For Data Engineers

Start with Chapter 4 for feature store concepts
Read Chapter 8 for batch processing
Explore Chapter 9 for real-time pipelines

For MLOps Engineers

Begin with Chapter 13 for testing strategies
Study Chapter 14 for production monitoring
Review Chapter 11 for deployment patterns

For LLM Developers

Jump to Chapter 7 for RAG fundamentals
Read Chapter 12 for building agents
Follow the Chapter 14 sections on LLM observability

📝 Key Concepts Covered

Feature Engineering

Model-independent vs model-dependent transformations
Stateless vs stateful transformations
On-demand vs precomputed features
Point-in-time correctness with ASOF joins
Shift-left vs shift-right architectures
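
To make point-in-time correctness concrete, here is a minimal sketch of an ASOF-style lookup in plain Python. It is not taken from the chapters: the names `build_feature_index` and `asof_lookup`, the integer timestamps, and the sample rows are all hypothetical. The idea is that each label row may only see the most recent feature value recorded at or before the label's timestamp, so no future data leaks into training.

```python
from bisect import bisect_right
from collections import defaultdict


def build_feature_index(feature_rows):
    """Group (entity_id, event_time, value) rows by entity, sorted by event_time."""
    index = defaultdict(list)
    for entity_id, event_time, value in feature_rows:
        index[entity_id].append((event_time, value))
    for rows in index.values():
        rows.sort(key=lambda row: row[0])
    return index


def asof_lookup(index, entity_id, label_time):
    """Return the latest feature value with event_time <= label_time, or None."""
    rows = index.get(entity_id, [])
    times = [event_time for event_time, _ in rows]
    pos = bisect_right(times, label_time)  # count of rows at or before label_time
    return rows[pos - 1][1] if pos > 0 else None


# Hypothetical feature history: (entity_id, event_time, feature_value)
feature_rows = [
    ("sensor_a", 10, 0.5),
    ("sensor_a", 20, 0.7),
    ("sensor_a", 30, 0.9),
    ("sensor_b", 15, 1.2),
]
# Hypothetical label rows: (entity_id, label_time)
labels = [("sensor_a", 25), ("sensor_a", 5), ("sensor_b", 15)]

index = build_feature_index(feature_rows)
training_rows = [(e, t, asof_lookup(index, e, t)) for e, t in labels]
print(training_rows)
```

The label at time 25 picks up the value written at time 20, not the later one at time 30, and the label at time 5 gets `None` because no feature value existed yet. Dataframe libraries offer the same backward-looking join (for example `pandas.merge_asof`), which is usually the better choice at scale.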

Real-Time ML

Streaming feature pipelines
Windowed aggregations (tumbling, hopping, rolling)
Incremental views for scalable aggregations
Event-time vs processing-time
Watermarks and late data handling
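
As a rough illustration of how tumbling and hopping windows differ, the sketch below assigns events to windows by event time in plain Python. The function names, the integer timestamps, and the sample events are assumptions made for this example; a stream processor such as Flink or Feldera handles this, plus watermarks and late data, for you.

```python
from collections import defaultdict


def tumbling_sums(events, size):
    """Sum values per tumbling window; each event falls in exactly one window."""
    sums = defaultdict(float)
    for event_time, value in events:
        window_start = (event_time // size) * size
        sums[window_start] += value
    return dict(sums)


def hopping_window_starts(event_time, size, hop):
    """Return starts of all hopping windows [start, start + size) containing event_time."""
    start = (event_time // hop) * hop  # latest window start at or before the event
    starts = []
    while start > event_time - size:
        starts.append(start)
        start -= hop
    return sorted(starts)


# Hypothetical events: (event_time, value)
events = [(1, 10.0), (4, 5.0), (12, 2.0), (19, 3.0), (21, 1.0)]

print(tumbling_sums(events, size=10))
print(hopping_window_starts(12, size=10, hop=5))
```

With a window size of 10, the tumbling version puts the event at time 12 only in the window starting at 10. With a hop of 5, the same event belongs to two overlapping windows, starting at 5 and at 10.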

LLM Systems

Retrieval-Augmented Generation (RAG)
Vector similarity search and embeddings
Agents with tool usage (MCP)
Prompt engineering and templates
Evals and error analysis
Guardrails for safety
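
The retrieval step of RAG can be sketched as a brute-force cosine similarity search over embeddings. This is only an illustration under stated assumptions: the three-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `top_k` are made up here, and a real system would use an embedding model and a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k):
    """Return the ids of the k documents most similar to the query embedding."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings keyed by document id
documents = {
    "doc_air": [1.0, 0.0, 0.0],
    "doc_flink": [0.0, 1.0, 0.0],
    "doc_mixed": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

print(top_k(query, documents, k=2))
```

The ids returned by `top_k` would then be used to fetch the matching text chunks and place them in the LLM prompt as context.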

Production Operations

Monitoring for drift (feature, concept, prediction)
Logging transformed and untransformed features
Metrics for autoscaling
Testing strategies (unit, integration, evals)
CI/CD for ML pipelines
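
One simple way to quantify feature drift is the Population Stability Index (PSI), which compares how a feature's values are distributed across fixed bins in a reference window and a current window. The sketch below is an assumption-laden illustration rather than the method used in Chapter 14: the function name `psi`, the bin edges, the epsilon floor, and the sample values are all hypothetical.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""

    def bin_fractions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                is_last = i == len(edges) - 2
                # Bins are [low, high), except the last bin, which includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values in [0, 1]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.7]

print(psi(reference, reference, edges))
print(psi(reference, shifted, edges))
```

Identical distributions score zero, and the score grows as the current window moves away from the reference. A commonly quoted rule of thumb treats values above roughly 0.2 as significant drift, but the right threshold depends on the feature and should be validated on your own data.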

🔍 Search Tips

To find specific topics:

Feature stores basics: Chapters 4, 5
Batch processing: Chapter 8
Real-time/streaming: Chapter 9
Embeddings & RAG: Chapter 7
LLM agents: Chapter 12
Testing: Chapter 13
Monitoring & drift: Chapter 14
Production case study: Chapter 15

💡 Best Practices Highlighted

Throughout the documentation, you'll find insights on:

Architecture patterns: FTI pipelines, Lambda vs Kappa
Data quality: Validation, schema evolution, point-in-time correctness
Performance: Caching, incremental views, pushdown aggregations
Reliability: Testing, monitoring, drift detection
Scalability: Horizontal scaling, autoscaling, resource optimization
Maintainability: Code reuse, composable transformations, DRY principle

📚 Related Resources

This documentation is based on best practices from:

Feature Store Summit presentations
Production ML systems at scale (Uber, LinkedIn, TikTok)
Open source projects (Hopsworks, Apache Flink, Feldera)
LLM development practices (LangChain, OpenAI)

🎓 Prerequisites

Recommended knowledge:

Python programming (intermediate)
Basic SQL
Machine learning fundamentals
Understanding of APIs and microservices

Helpful but not required:

Distributed systems concepts
Data engineering experience
Docker/Kubernetes basics
Cloud platforms (AWS, GCP, Azure)

🤝 Contributing

This documentation follows the Technical Content Creation Guide (see CLAUDE.md):

Numbered hierarchical structure
Analogies before technical definitions
Progressive examples (simple → complex)
ASCII diagrams for visual learning
Working code examples with explanations

📜 License

Educational content for learning ML feature stores and production ML systems.

Ready to start? Begin with Chapter 1: Building Machine Learning Systems

Last updated: October 2025
Total documentation: ~979 KB
Format: GitHub-flavored Markdown
