
YZXBiz/ml-feature-store


ML Feature Store Documentation

Transform technical books into engaging, learnable content

A comprehensive guide to building production ML systems with feature stores, from batch pipelines to real-time streaming, LLM agents, and observability.


📚 Table of Contents

Part I: Foundations

Chapter 1: Building Machine Learning Systems
Learn the fundamentals of ML systems, the anatomy of production ML, and the evolution from stateless to stateful architectures.

Chapter 2: Machine Learning Pipelines
Understand feature, training, and inference pipelines, the three pillars of production ML systems.

Chapter 3: Air Quality Prediction Service
Build your first end-to-end ML system: a real-world air quality forecasting service with batch pipelines.

Chapter 4: Feature Stores
Discover why feature stores are essential for ML systems, their history, and core concepts.

Chapter 5: Hopsworks Feature Store
Get hands-on with Hopsworks: projects, feature groups, feature views, and the architecture of a modern feature store.


Part II: Feature Engineering

Chapter 6: Data Transformations
Master the taxonomy of data transformations: stateless, stateful, model-independent, and model-dependent.

Chapter 7: Model-Dependent Transforms (MDTs)
Learn on-demand transformations, vector similarity search, embeddings, and building RAG systems for LLMs.

Chapter 8: Batch Feature Pipelines
Build scalable batch pipelines with PySpark, implement data validation, and handle point-in-time correctness with ASOF joins.

Chapter 9: Streaming and Real-Time Features
Create streaming pipelines with Apache Flink and Feldera, implement rolling aggregations, and achieve subsecond feature freshness.


Part III: Training and Inference

Chapter 10: Training Pipelines
Build reproducible training pipelines, create training datasets with point-in-time correctness, and implement experiment tracking.

Chapter 11: Inference Pipelines
Deploy models for batch and online inference, implement caching strategies, and prevent online-offline skew.

Chapter 12: Agents and LLM Workflows
Build LLM-powered agents with tool usage (MCP), implement RAG with feature stores, and create multi-step AI workflows.


Part IV: Production Operations

Chapter 13: Testing AI Systems
Implement comprehensive testing strategies: unit tests, integration tests, data validation, and evals for LLMs.

Chapter 14: Observability and Monitoring
Set up logging and metrics for ML models and LLM agents, monitor for drift, and implement guardrails.

Chapter 15: TikTok's Recommender System
Study a real-world case: how TikTok builds the world's most valuable AI system with feature stores and real-time ML.


📊 Documentation Statistics

  • Total Chapters: 15
  • Total Words: ~109,000
  • Code Examples: 693
  • ASCII Diagrams: Extensive
  • Insight Boxes: 144

🎯 Learning Path

Beginner Track

Start here if you're new to ML systems:

  1. Chapter 1 → Chapter 2 → Chapter 3
  2. Chapter 4 → Chapter 5
  3. Chapter 6 → Chapter 8
  4. Chapter 10 → Chapter 11

Intermediate Track

For those familiar with ML basics:

  1. Chapter 4 → Chapter 6 → Chapter 7
  2. Chapter 8 → Chapter 9
  3. Chapter 10 → Chapter 11 → Chapter 13

Advanced Track

For practitioners building production systems:

  1. Chapter 7 → Chapter 9 (Real-time features + RAG)
  2. Chapter 12 (LLM Agents)
  3. Chapter 14 (Observability)
  4. Chapter 15 (TikTok case study)

LLM/Agent Focused

Building AI systems with LLMs:

  1. Chapter 7 (Embeddings + Vector similarity)
  2. Chapter 12 (Agents and workflows)
  3. Chapter 13 (Testing and evals)
  4. Chapter 14 (Logging and guardrails)

🛠️ Technical Stack

The documentation covers these technologies:

Feature Stores:

  • Hopsworks
  • Feature engineering with Pandas, Polars, PySpark

Stream Processing:

  • Apache Flink (Java/SQL)
  • Feldera (SQL with incremental views)
  • Apache Kafka

ML Frameworks:

  • Scikit-Learn
  • XGBoost
  • TensorFlow/Keras
  • PyTorch

LLM/Agent Tools:

  • LangChain
  • Model Context Protocol (MCP)
  • Prompt engineering
  • RAG systems

Monitoring:

  • Prometheus
  • NannyML (drift detection)
  • Hopsworks logging

Infrastructure:

  • Apache Hudi (lakehouse)
  • RonDB (online feature store)
  • S3-compatible object stores
  • Kubernetes (KServe)

📖 How to Use This Documentation

Teaching Format

Each chapter follows a consistent learning structure:

Three-Level Explanations:

  • In plain English: Simple analogies for quick understanding
  • In technical terms: Precise definitions for depth
  • Why it matters: Real-world relevance

Visual Learning:

  • ASCII diagrams for architecture and data flow
  • Progressive examples from simple to complex
  • Code snippets with clear explanations

Insight Boxes:

> 💡 Insight
>
> Key patterns and best practices that connect concepts
> to broader ML system design principles.

Navigation

  • Each chapter has a numbered table of contents with anchor links
  • Previous/Next links at the bottom for sequential reading
  • Cross-references to related chapters throughout

Code Examples

All code examples are:

  • ✅ Complete and runnable
  • ✅ Explained with context
  • ✅ Tested and validated
  • ✅ Language-specific (Python, SQL, YAML, etc.)

🚀 Quick Start

For ML Engineers

  1. Read Chapter 1 to understand ML systems
  2. Follow Chapter 3 to build your first system
  3. Deep dive into Chapter 6 for feature engineering

For Data Engineers

  1. Start with Chapter 4 for feature store concepts
  2. Work through Chapter 8 for batch processing
  3. Explore Chapter 9 for real-time pipelines

For MLOps Engineers

  1. Begin with Chapter 13 for testing strategies
  2. Study Chapter 14 for production monitoring
  3. Review Chapter 11 for deployment patterns

For LLM Developers

  1. Jump to Chapter 7 for RAG fundamentals
  2. Read Chapter 12 for building agents
  3. Follow the Chapter 14 sections on LLM observability

📝 Key Concepts Covered

Feature Engineering

  • Model-independent vs model-dependent transformations
  • Stateless vs stateful transformations
  • On-demand vs precomputed features
  • Point-in-time correctness with ASOF joins
  • Shift-left vs shift-right architectures
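
To make point-in-time correctness concrete, here is a minimal pure-Python sketch of a backward ASOF join. It is an illustration only, not the Hopsworks API or the code used in the chapters: the function name `asof_join`, the tuple layouts, and the sample card data are all assumptions made for this example. For each label row it picks the most recent feature value at or before the label's timestamp, so no future data leaks into training rows.

```python
from bisect import bisect_right


def asof_join(labels, features):
    """Attach to each label the latest feature value at or before its timestamp.

    labels:   iterable of (entity_key, label_time, label)   -- hypothetical layout
    features: iterable of (entity_key, event_time, value)   -- hypothetical layout
    """
    # Index feature rows per entity key, ordered by event time.
    by_key = {}
    for key, event_time, value in sorted(features, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(event_time)
        values.append(value)

    joined = []
    for key, label_time, label in labels:
        times, values = by_key.get(key, ([], []))
        # Position of the last feature row with event_time <= label_time.
        idx = bisect_right(times, label_time) - 1
        feature = values[idx] if idx >= 0 else None  # None: no past feature exists
        joined.append((key, label_time, label, feature))
    return joined


# Illustrative data: timestamps are plain integers.
features = [("card_1", 10, 100.0), ("card_1", 20, 250.0), ("card_2", 15, 40.0)]
labels = [("card_1", 12, 0), ("card_1", 25, 1), ("card_2", 5, 0)]

training_rows = asof_join(labels, features)
for row in training_rows:
    print(row)
```

The label for `card_2` at time 5 gets `None` because its only feature value arrives later, at time 15; a plain equality or latest-value join would either drop that row or leak the future value.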

Real-Time ML

  • Streaming feature pipelines
  • Windowed aggregations (tumbling, hopping, rolling)
  • Incremental views for scalable aggregations
  • Event-time vs processing-time
  • Watermarks and late data handling
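
The difference between tumbling and hopping windows can be sketched in a few lines of plain Python. This is a batch-style illustration under stated assumptions, not Flink or Feldera code: the function names, the integer event times, and the choice to ignore windows that start before time 0 are all hypothetical, and watermarks and late data are not handled.

```python
from collections import defaultdict


def tumbling_sum(events, size):
    """Sum values per non-overlapping window [start, start + size)."""
    totals = defaultdict(float)
    for event_time, value in events:
        window_start = (event_time // size) * size
        totals[window_start] += value
    return dict(totals)


def hopping_sum(events, size, hop):
    """Sum values per overlapping window [start, start + size), one every `hop`."""
    totals = defaultdict(float)
    for event_time, value in events:
        # Latest window start at or before the event, then walk back
        # through every earlier window that still covers the event.
        start = (event_time // hop) * hop
        while start > event_time - size and start >= 0:
            totals[start] += value
            start -= hop
    return dict(totals)


# (event_time_seconds, amount) pairs, keyed by event time rather than arrival time.
events = [(1, 10.0), (4, 5.0), (7, 2.0), (12, 1.0)]

tumbling = tumbling_sum(events, size=10)
hopping = hopping_sum(events, size=10, hop=5)
print(tumbling)
print(hopping)
```

With a hop smaller than the window size, each event contributes to several windows, which is why hopping aggregations cost more to maintain than tumbling ones.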

LLM Systems

  • Retrieval-Augmented Generation (RAG)
  • Vector similarity search and embeddings
  • Agents with tool usage (MCP)
  • Prompt engineering and templates
  • Evals and error analysis
  • Guardrails for safety
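
As a rough illustration of the retrieval step in RAG, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is a brute-force example under assumptions: the three-dimensional vectors, document ids, and the `top_k` helper are invented for illustration, and a real system would use an embedding model and an approximate nearest-neighbor index instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, index, k):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
index = {
    "doc_air_quality": [1.0, 0.0, 0.0],
    "doc_streaming": [0.0, 1.0, 0.0],
    "doc_forecasting": [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]

retrieved = top_k(query, index, k=2)
print(retrieved)
```

The retrieved ids would then be used to look up the matching text chunks, which are added to the LLM prompt as context.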

Production Operations

  • Monitoring for drift (feature, concept, prediction)
  • Logging transformed and untransformed features
  • Metrics for autoscaling
  • Testing strategies (unit, integration, evals)
  • CI/CD for ML pipelines
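
One simple way to quantify feature drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample. The sketch below is an assumption-laden illustration, not the method used by NannyML or by the chapters: the `psi` function, the equal-width binning, the epsilon floor, and the sample values are all choices made for this example.

```python
import math


def psi(reference, current, bins=4):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate reference sample: everything lands in one bin

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor at a small epsilon so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref = fractions(reference)
    cur = fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values at training time and in production.
reference = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 8, 8, 8, 8]

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
print(no_drift, drift)
```

Identical distributions give a PSI of zero, and the value grows as the production distribution moves away from the reference. Alert thresholds are a rule of thumb and should be tuned per feature.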

🔍 Search Tips

To find specific topics:

  • Feature store basics: Chapters 4, 5
  • Batch processing: Chapter 8
  • Real-time/streaming: Chapter 9
  • Embeddings & RAG: Chapter 7
  • LLM agents: Chapter 12
  • Testing: Chapter 13
  • Monitoring & drift: Chapter 14
  • Production case study: Chapter 15

💡 Best Practices Highlighted

Throughout the documentation, you'll find insights on:

  • Architecture patterns: FTI pipelines, Lambda vs Kappa
  • Data quality: Validation, schema evolution, point-in-time correctness
  • Performance: Caching, incremental views, pushdown aggregations
  • Reliability: Testing, monitoring, drift detection
  • Scalability: Horizontal scaling, autoscaling, resource optimization
  • Maintainability: Code reuse, composable transformations, DRY principle

📚 Related Resources

This documentation is based on best practices from:

  • Feature Store Summit presentations
  • Production ML systems at scale (Uber, LinkedIn, TikTok)
  • Open source projects (Hopsworks, Apache Flink, Feldera)
  • LLM development practices (LangChain, OpenAI)

🎓 Prerequisites

Recommended knowledge:

  • Python programming (intermediate)
  • Basic SQL
  • Machine learning fundamentals
  • Understanding of APIs and microservices

Helpful but not required:

  • Distributed systems concepts
  • Data engineering experience
  • Docker/Kubernetes basics
  • Cloud platforms (AWS, GCP, Azure)

🤝 Contributing

This documentation follows the Technical Content Creation Guide (see CLAUDE.md):

  • Numbered hierarchical structure
  • Analogies before technical definitions
  • Progressive examples (simple → complex)
  • ASCII diagrams for visual learning
  • Working code examples with explanations

📜 License

Educational content for learning ML feature stores and production ML systems.


Ready to start? Begin with Chapter 1: Building Machine Learning Systems


Last updated: October 2025
Total documentation: ~979 KB
Format: GitHub-flavored Markdown

About

Comprehensive guide to building and deploying ML Feature Stores - 15 chapters with interactive documentation
