Skip to content

feat(sync-service): hibernate before suspend to enable GC#4284

Draft
alco wants to merge 11 commits into
mainfrom
alco/hibernate-then-suspend
Draft

feat(sync-service): hibernate before suspend to enable GC#4284
alco wants to merge 11 commits into
mainfrom
alco/hibernate-then-suspend

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented May 6, 2026

Summary

When consumer suspend is enabled, consumers now hibernate first (triggering GC) before eventually suspending (terminating). Previously, consumers went directly to suspend without hibernating, and due to a large timeout for suspend, consumer processes could hold onto garbage memory for longer than necessary.

  • Add shape_suspend_after config (default 60s) - delay between hibernation and suspension
  • Consumer hibernates on hibernate_after timeout, schedules suspend timer
  • Suspend timer fires after suspend_after ms, terminates process if still idle
  • Any activity cancels the pending suspend timer, restarting the cycle
  • Update ConsumerRegistry.enable_suspend/4 to include suspend_after parameter

Test Plan

  • New test: consumer hibernates first, then suspends after suspend_after ms
  • New test: activity during hibernation cancels pending suspend timer
  • Updated existing enable_suspend test for new 4-arg function
  • All 36 consumer tests pass

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

❌ 5 Tests Failed:

Tests completed Failed Passed Skipped
3421 5 3416 15
View the top 1 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
29) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:507
     ** (EXIT from #PID<0.9367.0>) killed
View the full list of 4 ❄️ flaky test(s)
test/horton-pull-wake-e2e.test.ts > pull-wake Horton e2e with mocked LLM > dispatches explicit runner-policy wakes and Horton writes mocked responses

Flake rate in main: 100.00% (Passed 0 times, Failed 1 times)

Stack Traces | 20.1s run time
Error: Timed out after 20000ms
 ❯ waitFor test/test-utils.ts:31:9
 ❯ test/horton-pull-wake-e2e.test.ts:185:5
test/wake-registry.test.ts > Wake Registry Integration > WakeEvent in subscriber stream includes source and change details

Flake rate in main: 100.00% (Passed 0 times, Failed 1 times)

Stack Traces | 10s run time
Error: Timed out waiting for 1 wakes (got 0)
 ❯ Timeout._onTimeout test/wake-registry.test.ts:838:13
test/wake-registry.test.ts > Wake Registry Integration > event append triggers wake delivery for change condition

Flake rate in main: 100.00% (Passed 0 times, Failed 1 times)

Stack Traces | 10s run time
Error: Timed out waiting for 1 wakes (got 0)
 ❯ Timeout._onTimeout test/wake-registry.test.ts:838:13
test/wake-registry.test.ts > Wake Registry Integration > spawn with wake registers condition and delivers wake on child run completion

Flake rate in main: 100.00% (Passed 0 times, Failed 1 times)

Stack Traces | 10.1s run time
Error: Timed out waiting for 1 wakes (got 0)
 ❯ Timeout._onTimeout test/wake-registry.test.ts:838:13

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@alco alco added the claude label May 6, 2026
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Claude Code Review

Summary

Iteration 3 review. Since iteration 2 (2026-05-07), the only change is a test-file formatting cleanup (fe0cbe4fa, 2026-05-18). The substantive iteration-2 findings — the activity/suspend-timer race and the :configure_suspend invalidation gap — are unchanged in the source. Re-raising them, plus one new flag for the implementation-plan file checked into the package.

What's Working Well (Unchanged Since Iteration 2)

  • Backwards-compatible config plumbing for shape_suspend_after is still consistent across runtime.exs, Config.@defaults, StackConfig, StackSupervisor schema, and Application.
  • The generation-counter idea remains a clean lock-free invalidation primitive — if the generation gets bumped on the events that should invalidate the timer (see Important Working with Postgres WAL format from Elixir #1).
  • Test formatting in consumer_test.exs:1564-1566 and :1512-1514 is now consistent with repo conventions.
  • Integration test (shape-suspension-resumption.lux) correctly threads ELECTRIC_SHAPE_SUSPEND_AFTER=200ms.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

1. (Re-raised from iter. 2) Activity in the last hibernate_after ms of the suspend window still terminates the consumer

File: packages/sync-service/lib/electric/shapes/consumer.ex — every activity handler that returns state.hibernate_after (:173, :180, :184, :190, :201, :209, :231, :251, :290, :299, :306, handle_apply_event_result/2).

Status from iter. 2: still unaddressed. state.suspend_generation is only bumped inside schedule_suspend_timer/1 (:430-434), which is only called from handle_info(:timeout, state) (:394-403). No activity handler bumps the generation directly, and none cancel the pending :erlang.send_after. So when activity at time A arrives in the window [T + suspend_after − hibernate_after, T + suspend_after], the original :suspend_timeout (gen=N) message wins the race against the next :timeout (which would have bumped state.suspend_generation to N+1), generation == state.suspend_generation evaluates true at consumer.ex:409, and consumer_suspend_enabled?/consumer_can_suspend? are still true, so the consumer terminates at :412 despite the recent activity.

Bug-zone size with current defaults (hibernate_after=30s, suspend_after=10m from config.ex:88-96 — note the default is now 10 minutes, not 60s as in iter. 2): a transaction landing in the last 30s of the 10-minute suspend window — 5% of the window. With the docstring example (60_000, 4 * 60_000, 60_000 * 20), the bug zone is 25% of the window.

Why the new "activity during hibernation cancels pending suspend" test (consumer_test.exs:1568-1629) does not catch this: the test sends activity at t≈80ms with hibernate_after=10 and shape_suspend_after=200. The bug zone is t∈[190, 210]ms (the last 10ms). At t=80, the next :timeout fires at ~t=90ms and bumps the generation to 2 long before the original gen=1 timer arrives at ~t=210ms. The test asserts liveness at t≈240ms but never lands activity inside the bug zone, so it does not actually exercise the cancellation contract — it only proves that timing happens to favor the post-activity :timeout. The test comment at :1620 ("new timer started at ~80ms, would fire at ~280ms") is also off: the new suspend timer is only scheduled when :timeout fires next at ~t=90ms, not at t=80ms when activity arrives.

Why it is still Important, not Critical: terminated consumers restart on the next request — wasted work, not data loss. But the claim in .changeset/hibernate-then-suspend.md ("Any activity cancels the pending suspend timer, restarting the cycle") and in the :suspend_timeout docstring at consumer.ex:405-407 is not actually true today.

Suggested fix (same as iter. 2 — restated for completeness): bump the generation on every activity. Either inline at each call site:

state = %{state | suspend_generation: state.suspend_generation + 1}

Or via a helper used by every handler that resets state.hibernate_after:

defp invalidate_suspend_timer(%{suspend_generation: gen} = state) do
  %{state | suspend_generation: gen + 1}
end

Plus a regression test that actually lands in the bug zone, e.g.:

@tag hibernate_after: 50, shape_suspend_after: 100, with_pure_file_storage_opts: [flush_period: 1]
@tag suspend: true
test "activity in last hibernate_after ms of suspend window does not terminate", ctx do
  # ... setup, wait until t≈10ms hibernated ...
  Process.sleep(80)  # now in last 50ms of 100ms suspend window
  ShapeLogCollector.handle_event(txn2, ctx.stack_id)
  Process.sleep(50)  # past original suspend_timer at ~t=110ms
  assert Process.alive?(consumer_pid)
end

2. (Re-raised from iter. 2) :configure_suspend does not invalidate pending suspend timer

File: packages/sync-service/lib/electric/shapes/consumer.ex:389-392

def handle_info({:configure_suspend, hibernate_after, suspend_after, jitter_period}, state) do
  state = %{state | hibernate_after: hibernate_after, suspend_after: suspend_after}
  {:noreply, state, Enum.random(hibernate_after..jitter_period)}
end

Unchanged since iter. 2. If suspend was previously enabled with a pending gen=N timer, then enable_suspend/4 is called again with new params, the new params are stored but the old gen=N timer fires at its original time and terminates the consumer using the old suspend_after value. Bumping the generation here (same helper as #1) makes the call idempotent and predictable.

3. (New) 837-line implementation plan markdown checked into packages/sync-service/

File: packages/sync-service/2026-05-06-hibernate-then-suspend.md (new, 837 lines)

This file is the agent-authored task-by-task implementation plan ("REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development..."). It documents the planned changes step-by-step, with commit messages and shell snippets — clearly a working document, not project documentation. It does not belong in the source tree:

  • It will rot quickly: the plan references line numbers and code snippets that diverged from what was actually shipped (e.g. it describes cancel_suspend_timer/1 and reply_with_timeout/1 helpers that ended up replaced by the generation-counter approach).
  • It is checked into the packages/sync-service/ root, which is otherwise reserved for lib/, test/, config/, etc. and a small set of top-level files (mix.exs, README.md, CLAUDE.md).
  • The actual user-visible change is already summarized in .changeset/hibernate-then-suspend.md.

Suggested fix: remove packages/sync-service/2026-05-06-hibernate-then-suspend.md from the PR (or move it to .review-context/ / a gist / the PR description if it is worth keeping a record). Adding a *.plan.md pattern to .gitignore for future agentic work would also help.

Suggestions (Nice to Have)

1. (Re-raised from iter. 2) Add an assertion that proves the cancellation fired

File: packages/sync-service/test/electric/shapes/consumer_test.exs:1568-1629

Once Important #1 is fixed, consider adding an assertion that :sys.get_state(consumer_pid).suspend_generation > 1 after activity, in addition to the timing-based refute_receive. Proves the contract instead of relying on timing.

2. (New, minor) Test comments at consumer_test.exs:1617-1621 are slightly off

# Original timer started at ~30ms, would fire at ~230ms
# We are now at ~80ms, wait 160ms to reach ~240ms
# But new timer started at ~80ms, would fire at ~280ms

The original suspend timer is scheduled when :timeout first fires (~t=10ms after txn1, because hibernate_after=10), not at "~30ms" (which is just when the test asserts hibernation status). And the "new timer started at ~80ms" line is wrong: the new suspend timer is only scheduled when the consumer hibernates again, at t≈90ms (10ms after the activity at t=80ms), not at t=80ms when activity arrives. The test still passes; the comments just do not match the actual algorithm.

Issue Conformance

No linked issue (carry-over). PR description and the changeset both claim "Any activity cancels the pending suspend timer, restarting the cycle." Once Important #1 is fixed, that claim will be accurate. Today it is a partial truth — it is true for activity arriving in the first (suspend_after − hibernate_after) ms of the window, false for the last hibernate_after ms.

Previous Review Status

Iter. 2 item Status as of iter. 3
Important #1 — Activity does not invalidate suspend timer ❌ Unchanged in source; re-raised above
Important #2:configure_suspend does not invalidate pending timer ❌ Unchanged in source; re-raised above
Suggestion #1 — Explicit assertion that cancellation fired ❌ Not addressed; re-raised
Suggestion #2suspend_generation overflow comment ➖ Trivial; leaving as-is

Net new commits since iter. 2: fe0cbe4fa ("Fix formatting in tests") — a 6-line whitespace adjustment in consumer_test.exs. No code-behavior changes.


Review iteration: 3 | 2026-05-18

@alco alco force-pushed the alco/hibernate-then-suspend branch from 3d7ade0 to 6029612 Compare May 7, 2026 10:14
alco and others added 11 commits May 18, 2026 14:12
Add new config option to control the delay between hibernation and
suspension. Default is 60 seconds. This prepares for hibernate-then-suspend
behavior where consumers first hibernate (to trigger GC) before suspending.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tate

Add state fields to track the scheduled suspend timer and the configured
delay between hibernation and suspension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When suspend is enabled, consumers now:
1. Hibernate first on timeout (triggering GC)
2. Schedule a suspend timer for suspend_after ms later
3. Suspend (terminate) when the suspend timer fires

Any activity cancels the pending suspend timer, restarting the cycle.
This ensures GC runs before eventual process termination.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change enable_suspend/3 to enable_suspend/4, adding the suspend_after
parameter to configure the delay between hibernation and suspension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying:
- Consumer hibernates first, then suspends after suspend_after ms
- Activity during hibernation cancels the pending suspend timer
- Update enable_suspend test for new 4-arity function
- Update existing suspend test to use explicit suspend_after tag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Clarify enable_suspend/4 docstring: jitter controls hibernation timing,
  suspension happens suspend_after ms after hibernation
- Update :configure_suspend handler comment to describe hibernate-then-suspend
- Add defensive guard for suspend_after: nil in schedule_suspend_timer/1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…n test

With hibernate-then-suspend, consumers hibernate first then wait
suspend_after before terminating. Set ELECTRIC_SHAPE_SUSPEND_AFTER=200ms
alongside ELECTRIC_SHAPE_HIBERNATE_AFTER=200ms so the test completes
within the expected timeframe.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@alco alco force-pushed the alco/hibernate-then-suspend branch from 6029612 to fe0cbe4 Compare May 18, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant