Transform technical books into engaging, learnable content
A comprehensive guide to building production ML systems with feature stores, from batch pipelines to real-time streaming, LLM agents, and observability.
Chapter 1: Building Machine Learning Systems. Learn the fundamentals of ML systems, the anatomy of production ML, and the evolution from stateless to stateful architectures.
Chapter 2: Machine Learning Pipelines. Understand feature, training, and inference pipelines, the three pillars of production ML systems.
Chapter 3: Air Quality Prediction Service. Build your first end-to-end ML system: a real-world air quality forecasting service with batch pipelines.
Chapter 4: Feature Stores. Discover why feature stores are essential for ML systems, their history, and core concepts.
Chapter 5: Hopsworks Feature Store. Get hands-on with Hopsworks: projects, feature groups, feature views, and the architecture of a modern feature store.
Chapter 6: Data Transformations. Master the taxonomy of data transformations: stateless, stateful, model-independent, and model-dependent.
Chapter 7: Model-Dependent Transforms (MDTs). Learn on-demand transformations, vector similarity search, embeddings, and building RAG systems for LLMs.
Chapter 8: Batch Feature Pipelines. Build scalable batch pipelines with PySpark, implement data validation, and handle point-in-time correctness with ASOF joins.
Chapter 9: Streaming and Real-Time Features. Create streaming pipelines with Apache Flink and Feldera, implement rolling aggregations, and achieve subsecond feature freshness.
Chapter 10: Training Pipelines. Build reproducible training pipelines, create training datasets with point-in-time correctness, and implement experiment tracking.
Chapter 11: Inference Pipelines. Deploy models for batch and online inference, implement caching strategies, and prevent online-offline skew.
Chapter 12: Agents and LLM Workflows. Build LLM-powered agents with tool usage (MCP), implement RAG with feature stores, and create multi-step AI workflows.
Chapter 13: Testing AI Systems. Implement comprehensive testing strategies: unit tests, integration tests, data validation, and evals for LLMs.
Chapter 14: Observability and Monitoring. Set up logging and metrics for ML models and LLM agents, monitor for drift, and implement guardrails.
Chapter 15: TikTok's Recommender System. Study a real-world case: how TikTok builds the world's most valuable AI system with feature stores and real-time ML.
- Total Chapters: 15
- Total Words: ~109,000
- Code Examples: 693
- ASCII Diagrams: Extensive
- Insight Boxes: 144
Start here if you're new to ML systems:
- Chapter 1 → Chapter 2 → Chapter 3
- Chapter 4 → Chapter 5
- Chapter 6 → Chapter 8
- Chapter 10 → Chapter 11
For those familiar with ML basics:
- Chapter 4 → Chapter 6 → Chapter 7
- Chapter 8 → Chapter 9
- Chapter 10 → Chapter 11 → Chapter 13
For practitioners building production systems:
- Chapter 7 → Chapter 9 (Real-time features + RAG)
- Chapter 12 (LLM Agents)
- Chapter 14 (Observability)
- Chapter 15 (TikTok case study)
Building AI systems with LLMs:
- Chapter 7 (Embeddings + Vector similarity)
- Chapter 12 (Agents and workflows)
- Chapter 13 (Testing and evals)
- Chapter 14 (Logging and guardrails)
The documentation covers these technologies:
Feature Stores:
- Hopsworks
- Feature engineering with Pandas, Polars, PySpark
Stream Processing:
- Apache Flink (Java/SQL)
- Feldera (SQL with incremental views)
- Apache Kafka
ML Frameworks:
- Scikit-Learn
- XGBoost
- TensorFlow/Keras
- PyTorch
LLM/Agent Tools:
- LangChain
- Model Context Protocol (MCP)
- Prompt engineering
- RAG systems
Monitoring:
- Prometheus
- NannyML (drift detection)
- Hopsworks logging
Infrastructure:
- Apache Hudi (lakehouse)
- RonDB (online feature store)
- S3-compatible object stores
- Kubernetes (KServe)
Each chapter follows a consistent learning structure:
Three-Level Explanations:
- In plain English: Simple analogies for quick understanding
- In technical terms: Precise definitions for depth
- Why it matters: Real-world relevance
Visual Learning:
- ASCII diagrams for architecture and data flow
- Progressive examples from simple to complex
- Code snippets with clear explanations
Insight Boxes:
> 💡 Insight
>
> Key patterns and best practices that connect concepts
> to broader ML system design principles.
- Each chapter has a numbered table of contents with anchor links
- Previous/Next links at the bottom for sequential reading
- Cross-references to related chapters throughout
All code examples are:
- ✅ Complete and runnable
- ✅ Explained with context
- ✅ Tested and validated
- ✅ Language-specific (Python, SQL, YAML, etc.)
- Read Chapter 1 to understand ML systems
- Follow Chapter 3 to build your first system
- Deep dive into Chapter 6 for feature engineering
- Start with Chapter 4 for feature store concepts
- Learn Chapter 8 for batch processing
- Explore Chapter 9 for real-time pipelines
- Begin with Chapter 13 for testing strategies
- Study Chapter 14 for production monitoring
- Review Chapter 11 for deployment patterns
- Jump to Chapter 7 for RAG fundamentals
- Read Chapter 12 for building agents
- Follow Chapter 14 sections on LLM observability
- Model-independent vs model-dependent transformations
- Stateless vs stateful transformations
- On-demand vs precomputed features
- Point-in-time correctness with ASOF joins
- Shift-left vs shift-right architectures
- Streaming feature pipelines
- Windowed aggregations (tumbling, hopping, rolling)
- Incremental views for scalable aggregations
- Event-time vs processing-time
- Watermarks and late data handling
- Retrieval-Augmented Generation (RAG)
- Vector similarity search and embeddings
- Agents with tool usage (MCP)
- Prompt engineering and templates
- Evals and error analysis
- Guardrails for safety
- Monitoring for drift (feature, concept, prediction)
- Logging transformed and untransformed features
- Metrics for autoscaling
- Testing strategies (unit, integration, evals)
- CI/CD for ML pipelines
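To make one of the concepts above concrete, here is a minimal sketch of point-in-time correctness with an ASOF join, written in plain Python rather than taken from any chapter. The rows, the `pm25` feature, and the `asof_join` helper are hypothetical and exist only for illustration. Each label is matched with the most recent feature value observed at or before the label's timestamp, so no future data leaks into a training row.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature rows: (event_time, pm25 reading), sorted by event_time.
feature_rows = [
    (datetime(2025, 1, 1, 8), 12.0),
    (datetime(2025, 1, 1, 12), 18.5),
    (datetime(2025, 1, 1, 16), 25.1),
]

# Hypothetical label rows: (event_time, label).
label_rows = [
    (datetime(2025, 1, 1, 7), "good"),
    (datetime(2025, 1, 1, 13), "moderate"),
    (datetime(2025, 1, 1, 16), "poor"),
]


def asof_join(labels, features):
    """Attach to each label the latest feature value with event_time <= label time."""
    feature_times = [t for t, _ in features]
    joined = []
    for label_time, label in labels:
        # bisect_right counts the feature timestamps that are <= label_time.
        idx = bisect_right(feature_times, label_time)
        # No earlier feature exists: keep the row but leave the value missing.
        value = features[idx - 1][1] if idx > 0 else None
        joined.append((label_time, label, value))
    return joined


training_rows = asof_join(label_rows, feature_rows)
for row in training_rows:
    print(row)
```

The first label precedes every feature row, so its feature value is missing rather than borrowed from the future. In pandas, `merge_asof` provides the same backward-looking match for DataFrames.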
To find specific topics:
- Feature stores basics: Chapters 4, 5
- Batch processing: Chapter 8
- Real-time/streaming: Chapter 9
- Embeddings & RAG: Chapter 7
- LLM agents: Chapter 12
- Testing: Chapter 13
- Monitoring & drift: Chapter 14
- Production case study: Chapter 15
Throughout the documentation, you'll find insights on:
- Architecture patterns: FTI pipelines, Lambda vs Kappa
- Data quality: Validation, schema evolution, point-in-time correctness
- Performance: Caching, incremental views, pushdown aggregations
- Reliability: Testing, monitoring, drift detection
- Scalability: Horizontal scaling, autoscaling, resource optimization
- Maintainability: Code reuse, composable transformations, DRY principle
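As a rough illustration of the drift detection mentioned under reliability, the sketch below computes a Population Stability Index (PSI) between a reference sample and a newer sample of one numeric feature. It is a hand-rolled example using only the standard library, not the API of NannyML or Hopsworks. The function name, bin count, and sample data are assumptions chosen for illustration.

```python
import math


def population_stability_index(reference, current, bins=10):
    """Compare two samples of one numeric feature; 0.0 means identical bin shares."""
    lo, hi = min(reference), max(reference)
    # Bin edges come from the reference sample; guard against a zero-width range.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical data: training-time values and a shifted batch seen at inference time.
training_values = [float(v) for v in range(100)]
serving_values = [v + 50.0 for v in training_values]

psi_same = population_stability_index(training_values, training_values)
psi_shifted = population_stability_index(training_values, serving_values)
print(f"same sample: {psi_same:.3f}, shifted sample: {psi_shifted:.3f}")
```

A value near zero indicates that the two samples fill the bins in similar proportions, and larger values indicate a bigger shift. The alerting threshold is application-specific and is deliberately left out of this sketch.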
This documentation is based on best practices from:
- Feature Store Summit presentations
- Production ML systems at scale (Uber, LinkedIn, TikTok)
- Open source projects (Hopsworks, Apache Flink, Feldera)
- LLM development practices (LangChain, OpenAI)
Recommended knowledge:
- Python programming (intermediate)
- Basic SQL
- Machine learning fundamentals
- Understanding of APIs and microservices
Helpful but not required:
- Distributed systems concepts
- Data engineering experience
- Docker/Kubernetes basics
- Cloud platforms (AWS, GCP, Azure)
This documentation follows the Technical Content Creation Guide (see CLAUDE.md):
- Numbered hierarchical structure
- Analogies before technical definitions
- Progressive examples (simple → complex)
- ASCII diagrams for visual learning
- Working code examples with explanations
Educational content for learning ML feature stores and production ML systems.
Ready to start? Begin with Chapter 1: Building Machine Learning Systems
- Last updated: October 2025
- Total documentation: ~979 KB
- Format: GitHub-flavored Markdown