Commits 9355586 to 9cbf7e6
This commit adds the core UNDO logging system for PostgreSQL, implementing ZHeap-inspired physical UNDO with Compensation Log Records (CLRs) for crash-safe transaction rollback and standby replication support.

Key features:
- Physical UNDO application using memcpy() for direct page modification
- CLR (Compensation Log Record) generation during transaction rollback
- Shared buffer integration (UNDO pages use the standard buffer pool)
- UndoRecordSet architecture with chunk-based organization
- UNDO worker for automatic cleanup of old records
- Per-persistence-level record sets (permanent/unlogged/temp)

Architecture:
- UNDO logs stored in $PGDATA/base/undo/ with 64-bit UndoRecPtr:
  40-bit offset (1TB per log) + 24-bit log number (16M logs)
- Integrated with PostgreSQL's shared_buffers (no separate cache)
- WAL-logged CLRs ensure crash safety and standby replay
Extends UNDO with a per-relation model that records logical operations for the purposes of recovery or in support of MVCC visibility tracking. Unlike cluster-wide UNDO (which stores complete tuple data globally), per-relation UNDO stores logical operation metadata in a relation-specific UNDO fork.

Architecture:
- Separate UNDO fork per relation (relfilenode.undo)
- Metapage (block 0) tracks head/tail/free chain pointers
- Data pages contain UNDO records with operation metadata
- WAL resource manager (RM_RELUNDO_ID) for crash recovery
- Two-phase protocol: RelUndoReserve() / RelUndoFinish() / RelUndoCancel()

Record types:
- RELUNDO_INSERT: Tracks inserted TID range
- RELUNDO_DELETE: Tracks deleted TID
- RELUNDO_UPDATE: Tracks old/new TID pair
- RELUNDO_TUPLE_LOCK: Tracks tuple lock acquisition
- RELUNDO_DELTA_INSERT: Tracks columnar delta insertion

Table AM integration:
- relation_init_undo: Create UNDO fork during CREATE TABLE
- tuple_satisfies_snapshot_undo: MVCC visibility via UNDO chain
- relation_vacuum_undo: Discard old UNDO records during VACUUM

This complements cluster-wide UNDO by providing table-AM-specific UNDO management without global coordination overhead.
Implements a minimal table access method that exercises the per-relation UNDO subsystem. Validates end-to-end functionality: UNDO fork creation, record insertion, chain walking, and crash recovery.

Implemented operations:
- INSERT: Full implementation with UNDO record creation
- Sequential scan: Forward-only table scan
- CREATE/DROP TABLE: UNDO fork lifecycle management
- VACUUM: UNDO record discard

This test AM stores tuples in simple heap-like pages using a custom TestUndoTamTupleHeader (t_len, t_xmin, t_self) followed by MinimalTuple data. Pages use standard PageHeaderData and PageAddItem().

Two-phase UNDO protocol demonstration:
1. Insert tuple onto data page (PageAddItem)
2. Reserve UNDO space (RelUndoReserve)
3. Build UNDO record (header + payload)
4. Commit UNDO record (RelUndoFinish)
5. Register for rollback (RegisterPerRelUndo)

Introspection:
- test_undo_tam_dump_chain(regclass): Walk UNDO fork, return all records

Testing:
- sql/undo_tam.sql: Basic INSERT/scan operations
- t/058_undo_tam_crash.pl: Crash recovery validation

This test module is NOT suitable for production use. It serves only to validate the per-relation UNDO infrastructure and demonstrate table AM integration patterns.
Extends per-relation UNDO from metadata-only (MVCC visibility) to supporting transaction rollback. When a transaction aborts, per-relation UNDO chains are applied asynchronously by background workers.

Architecture:
- Async-only rollback via a background worker pool
- Work queue protected by RelUndoWorkQueueLock
- Catalog access safe in workers (proper transaction state)
- Test helper (RelUndoProcessPendingSync) for deterministic testing

Extended data structures:
- RelUndoRecordHeader gains info_flags and tuple_len
- RELUNDO_INFO_HAS_TUPLE flag indicates tuple data present
- RELUNDO_INFO_HAS_CLR / CLR_APPLIED for crash safety

Rollback operations:
- RELUNDO_INSERT: Mark inserted tuples as LP_UNUSED
- RELUNDO_DELETE: Restore deleted tuple via memcpy (stored in UNDO)
- RELUNDO_UPDATE: Restore old tuple version (stored in UNDO)
- RELUNDO_TUPLE_LOCK: Remove lock marker
- RELUNDO_DELTA_INSERT: Restore original column data

Transaction integration:
- RegisterPerRelUndo: Track relation UNDO chains per transaction
- GetPerRelUndoPtr: Chain UNDO records within a relation
- ApplyPerRelUndo: Queue work for background workers on abort
- StartRelUndoWorker: Spawn a worker if none is running

Async rationale: Per-relation UNDO cannot be applied synchronously during ROLLBACK because catalog access (relation_open) is not allowed in the TRANS_ABORT state. Background workers execute in a proper transaction context, avoiding the constraint. This matches the ZHeap architecture, where UNDO application is deferred to background processes.

WAL:
- XLOG_RELUNDO_APPLY: Compensation log records (CLRs) for applied UNDO
- Prevents double-application after crash recovery

Testing:
- sql/undo_tam_rollback.sql: Validates INSERT rollback
- test_undo_tam_process_pending(): Drain the work queue synchronously
Implements production-ready WAL features for the per-relation UNDO
resource manager: async I/O, consistency checking, parallel redo,
and compression validation.
Async I/O optimization:
When INSERT records reference both data page (block 0) and metapage
(block 1), issue prefetch for block 1 before reading block 0. This
allows both I/Os to proceed in parallel, reducing crash recovery stall
time. Uses pgaio batch mode when io_method is worker or io_uring.
Pattern (note that pgaio_exit_batchmode() must be guarded by the same
condition as pgaio_enter_batchmode()):

    if (has_metapage && io_method != IOMETHOD_SYNC)
        pgaio_enter_batchmode();
    relundo_prefetch_block(record, 1);  /* start async read */
    process_block_0();                  /* overlaps with metapage I/O */
    process_block_1();                  /* should already be in cache */
    if (has_metapage && io_method != IOMETHOD_SYNC)
        pgaio_exit_batchmode();
Consistency checking:
All redo functions validate WAL record fields before application:
- Bounds checks: offsets < BLCKSZ, counters within range
- Monotonicity: counters advance, pd_lower increases
- Cross-field validation: record fits within page
- Type validation: record types in valid range
- Post-condition checks: updated values are reasonable
Parallel redo support:
Implements startup/cleanup/mask callbacks required for multi-core
crash recovery:
- relundo_startup: Initialize per-backend state
- relundo_cleanup: Release per-backend resources
- relundo_mask: Mask LSN, checksum, free space for page comparison
Page dependency rules:
- Different pages replay in parallel (no ordering constraints)
- Same page: INIT precedes INSERT (enforced by page LSN)
- Metapage updates are sequential (buffer lock serialization)
Compression validation:
WAL compression (wal_compression GUC) automatically compresses full
page images via XLogCompressBackupBlock(). Test validates 40-46%
reduction for RELUNDO FPIs with lz4, pglz, and zstd algorithms.
Test: t/059_relundo_wal_compression.pl measures WAL volume with/without
compression for identical workloads.
Implement the UNDO subsystem changes needed for Constant-Time Recovery (CTR). At abort time, transactions register in the Aborted Transaction Map (ATM) for O(1) visibility checks instead of performing synchronous rollback. A background Logical Revert worker lazily applies UNDO chains from ATM entries.

Specifically:
- Add ATM shared-memory structure with 16 LWLock partitions, WAL-logged add/forget operations, and a redo handler (new resource manager RM_ATM_ID).
- Add a Logical Revert background worker that scans the ATM for unreverted entries, applies their per-relation UNDO chains, then removes them.
- Complete tuple data storage in per-relation UNDO records via the new RelUndoFinishWithTuple() write path and a working RelUndoReadRecordWithTuple() read path.
- Enable and complete rollback functions for all five record types (INSERT, DELETE, UPDATE, TUPLE_LOCK, DELTA_INSERT) in RelUndoApplyChain(), removing #ifdef NOT_USED guards.
- Wire in per-relation CLR (Compensation Log Record) support for crash-safe Logical Revert: each applied UNDO record gets a CLR so recovery skips already-applied operations.
- Modify the abort path in ApplyPerRelUndo() to try ATM insertion first, falling back to synchronous rollback only when the ATM is full.
- Call ATMRecoveryFinalize() after WAL redo to log the unreverted entry count for the Logical Revert worker to process.
This commit provides examples and architectural documentation for the UNDO subsystems. It is intended for reviewers and committers to understand the design decisions and usage patterns.

Contents:
- 01-basic-undo-setup.sql: Cluster-wide UNDO basics
- 02-undo-rollback.sql: Rollback demonstrations
- 03-undo-subtransactions.sql: Subtransaction handling
- 04-transactional-fileops.sql: FILEOPS usage
- 05-undo-monitoring.sql: Monitoring and statistics
- 06-per-relation-undo.sql: Per-relation UNDO with test_undo_tam
- DESIGN_NOTES.md: Comprehensive architecture documentation
- README.md: Examples overview

This commit should NOT be merged. It exists only to provide context and documentation for the patch series.
Introduce the IndexPrune framework, which allows index access methods to register callbacks for proactively pruning dead index entries when UNDO records are discarded. This avoids accumulating dead tuples that would otherwise require VACUUM to clean up.

Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard() on UNDO discard

Individual index AM implementations follow in subsequent commits.
Placeholder for index pruning design documentation. To be populated when design notes are split by subsystem.
Register IndexPrune callbacks in the B-tree access method handler. nbtprune.c implements dead-entry detection and removal using UNDO discard notifications, allowing proactive cleanup without full VACUUM.
Register IndexPrune callbacks in the hash access method handler. hashprune.c implements dead-entry detection and removal using UNDO discard notifications for hash indexes.
Register IndexPrune callbacks in the GIN access method handler. ginprune.c implements dead-entry detection and removal using UNDO discard notifications for GIN indexes.
Register IndexPrune callbacks in the GiST access method handler. gistprune.c implements dead-entry detection and removal using UNDO discard notifications for GiST indexes.
Register IndexPrune callbacks in the SP-GiST access method handler. spgprune.c implements dead-entry detection and removal using UNDO discard notifications for SP-GiST indexes.
Add VACUUM statistics tracking for UNDO-pruned index entries and verbose output. Include comprehensive test suite exercising index pruning across all supported index access methods via test_undo_tam.
Introduce the FILEOPS deferred-operations infrastructure following the Berkeley DB fileops.src model. Each filesystem operation is a composable unit with its own WAL record type, redo handler, and descriptor. This commit provides the core machinery only, with no specific operations:
- PendingFileOp linked list for deferred operations
- FileOpsDoPendingOps() executor at transaction commit/abort
- Subtransaction support (AtSubCommit/AtSubAbort/PostPrepare)
- WAL resource manager shell (RM_FILEOPS_ID)
- Platform portability layer (fsync_parent, FileOpsSync)
- GUC: enable_transactional_fileops
- Transaction lifecycle hooks in xact.c

Individual operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, etc.) are added in subsequent commits.
Implement transactional file creation (BDB: __fop_create). Files are created immediately so they can be used within the transaction. If register_delete is true, the file is automatically deleted on abort.
API: FileOpsCreate(path, flags, mode, register_delete) -> fd
WAL: XLOG_FILEOPS_CREATE with idempotent redo (creates parent dirs if missing on standbys).

Implement deferred file deletion (BDB: __fop_remove). Deletion is scheduled for transaction commit or abort, not executed immediately.
API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion is driven by XACT commit/abort records).
On Windows: uses pgunlink() with retry on EACCES.

Implement deferred file rename (BDB: __fop_rename). The rename is scheduled for commit time using durable_rename(), which handles fsync ordering on Unix and MoveFileEx with retry on Windows.
API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).

Implement WAL-logged file write at offset (BDB: __fop_write). Data is written immediately using pwrite() and fsynced for durability.
API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.
On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
Implement WAL-logged file truncation. Executed immediately with XLogFlush before the irreversible operation (following the SMGR_TRUNCATE pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.
API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.
Implement WAL-logged file metadata operations. CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits (only _S_IREAD/_S_IWRITE; no group/other support). CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses ACLs for ownership, not uid/gid). Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir() (no mode parameter, permissions inherited from parent). RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on all platforms, _rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink() (NTFS junction points) on Windows. Registers delete-on-abort. LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA() on Windows (NTFS only). Registers delete-on-abort. Both create links idempotently during redo (unlink first if exists).
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.
FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:
- Linux: <sys/xattr.h> setxattr/removexattr
- macOS: <sys/xattr.h> with extra options parameter
- FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
- Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
- Fallback: returns ENOTSUP (the operation is still WAL-logged but becomes
  a no-op on unsupported platforms, keeping the WAL stream portable)
Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
Add regression tests for all FILEOPS operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK, SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay. Update the transactional fileops example script with the expanded operation set following the Berkeley DB fileops.src model.
Adds opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.
How heap uses UNDO:
INSERT operations:
- Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
- Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
- On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan
UPDATE operations:
- Write UNDO record with complete old tuple version before update
- On abort: UndoReplay() restores old tuple version from UNDO
DELETE operations:
- Write UNDO record with complete deleted tuple data
- On abort: UndoReplay() resurrects tuple from UNDO record
MVCC visibility:
- Tuples reference UNDO chain via xmin/xmax
- HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
- Enables reconstructing tuple state as of any snapshot
Configuration:
CREATE TABLE t (...) WITH (enable_undo=on);
The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
Value proposition:
1. Faster rollback: No heap scan required, UNDO chains are sequential
- Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
- UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)
2. Cleaner abort handling: UNDO records are self-contained
- No need to track which heap pages were modified
- Works across crashes (UNDO is WAL-logged)
3. Foundation for future features:
- Multi-version concurrency control without bloat
- Faster VACUUM (can discard entire UNDO segments)
- Point-in-time recovery improvements
Trade-offs:
Costs:
- Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
- UNDO log space: Requires space for UNDO records until no longer visible
- Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed
Benefits:
- Primarily valuable for workloads with:
- Frequent aborts (e.g., speculative execution, deadlocks)
- Long-running transactions needing old snapshots
- Hot UPDATE workloads benefiting from cleaner rollback
Not recommended for:
- Bulk load workloads (COPY: 2x write amplification without abort benefit)
- Append-only tables (rare aborts mean cost without benefit)
- Space-constrained systems (UNDO retention increases storage)
When beneficial:
- OLTP with high abort rates (>5%)
- Systems with aggressive pruning needs (frequent VACUUM)
- Workloads requiring historical visibility (audit, time-travel queries)
Integration points:
- heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
- Heap pruning respects undo_retention to avoid discarding needed UNDO
- pg_upgrade compatibility: UNDO disabled for upgraded tables
Background workers:
- Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
- Rollback itself is synchronous (via UndoReplay() during transaction abort)
- Workers periodically trim UNDO logs based on undo_retention and snapshot visibility
This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Document the cluster-wide UNDO architecture including UNDO log design, record format, transaction integration, and heap AM integration details.
Formatting-only changes from pgindent: typedef brace alignment, pointer spacing, comment wrapping, and function argument alignment.