feat(agents): pull-wake runner health check, principal rename, and lifecycle hardening by KyleAMathews · Pull Request #4339 · electric-sql/electric

KyleAMathews · 2026-05-16T13:33:18Z

Summary

Adds pull-wake runner health diagnostics, renames runner ownership from owner_user_id to canonical owner_principal URLs, and hardens the pull-wake runner lifecycle around heartbeats, reconnects, shutdown, and error reporting.

This PR also fixes the local desktop send path so mutating local server requests can go through the Electron main process instead of being blocked behind Chromium's HTTP/1.1 connection limit, while keeping normal shape reads in the renderer.

Root Cause

Pull-wake dispatch had too little observability and too many implicit lifecycle assumptions. When a runner missed wakes, got stuck reconnecting, or failed during shutdown, the server/UI had limited information to explain the state. The ownership field also used owner_user_id, which was misleading because principals can be users, agents, services, or system actors.

Local development exposed a separate but related issue: renderer-origin mutating requests could stall behind long-lived shape/SSE requests, delaying sends and new agent startup by seconds even when the server processed the request in milliseconds.

Approach

Runner Health Diagnostics

PullWakeRunner reports stream, heartbeat, claim, dispatch, reconnect, and error diagnostics in heartbeat payloads.
Runtime diagnostics are stored separately from the normal runner row so the regular runners shape stays stable.
GET /_electric/runners/:id/health aggregates runner state, liveness lease, sanitized client diagnostics, active claims, dispatch stats, and derived health issues.
Health output tolerates partial/bogus diagnostics and invalid lease timestamps without producing misleading typed output.
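
A minimal sketch of the tolerant health handling described above, assuming an allowlist-style sanitizer. The names `sanitizeDiagnosticsSketch` and `parseLeaseExpiryMs` and the exact field set are illustrative assumptions, not the code in runners-router.ts:

```typescript
// Hypothetical subset of the diagnostics fields; every field is optional so a
// partial heartbeat payload still type-checks honestly.
type ClientDiagnosticsSketch = Partial<{
  stream_connected: boolean
  last_heartbeat_ok: boolean
  reconnect_count: number
  events_received: number
}>

// Counters must be finite and non-negative to be reported at all.
const isCount = (value: unknown): value is number =>
  typeof value === "number" && Number.isFinite(value) && value >= 0

// Copies only known, well-typed fields; unknown or malformed fields are dropped.
function sanitizeDiagnosticsSketch(raw: unknown): ClientDiagnosticsSketch | null {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return null
  const input = raw as Record<string, unknown>
  const out: ClientDiagnosticsSketch = {}
  if (typeof input.stream_connected === "boolean") out.stream_connected = input.stream_connected
  if (typeof input.last_heartbeat_ok === "boolean") out.last_heartbeat_ok = input.last_heartbeat_ok
  if (isCount(input.reconnect_count)) out.reconnect_count = input.reconnect_count
  if (isCount(input.events_received)) out.events_received = input.events_received
  return out
}

// Returns epoch milliseconds, or null when the stored timestamp does not parse,
// so NaN never leaks into the health output.
function parseLeaseExpiryMs(value: unknown): number | null {
  if (typeof value !== "string") return null
  const ms = Date.parse(value)
  return Number.isFinite(ms) ? ms : null
}
```

Dropping a bad field instead of rejecting the whole payload keeps the health endpoint usable when a runner sends partial or corrupted diagnostics.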

Principal Ownership

Runner ownership is stored as canonical principal URLs in owner_principal.
API boundaries canonicalize owner principals before storing/comparing.
electric-principal request headers now accept either a principal key (user:alice) or a principal URL (/principal/user%3Aalice) through parsePrincipalInput() and compare internally by canonical URL.
The migration expires active runner claims, clears runner dispatch state, deletes existing runner rows, and renames the ownership column.
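
To illustrate the canonicalization step, here is a hedged sketch of accepting either input form and comparing by URL. The function names and the `kind:id` key rule are assumptions made for illustration; the real parsePrincipalInput() in principal.ts may validate differently:

```typescript
const PRINCIPAL_PREFIX = "/principal/"

// Assumed canonical form: the key percent-encoded under /principal/.
function principalUrlFromKeySketch(key: string): string {
  return PRINCIPAL_PREFIX + encodeURIComponent(key)
}

// Accepts "user:alice" or "/principal/user%3Aalice" and returns the canonical
// URL, or null when the input is not a well-formed principal.
function parsePrincipalInputSketch(input: string): string | null {
  const trimmed = input.trim()
  let key = trimmed
  if (trimmed.startsWith(PRINCIPAL_PREFIX)) {
    try {
      key = decodeURIComponent(trimmed.slice(PRINCIPAL_PREFIX.length))
    } catch {
      return null // malformed percent-encoding
    }
  }
  // Assumption: a key is "<kind>:<id>" with both parts non-empty.
  const separator = key.indexOf(":")
  if (separator <= 0 || separator === key.length - 1) return null
  return principalUrlFromKeySketch(key)
}

// Authorization compares canonical URL to canonical URL.
function samePrincipalSketch(a: string, b: string): boolean {
  const left = parsePrincipalInputSketch(a)
  return left !== null && left === parsePrincipalInputSketch(b)
}
```

With this shape, `user:alice` and `/principal/user%3Aalice` resolve to the same stored value, while an input such as `/principal/local-desktop` is rejected because it carries no kind.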

Pull-Wake Runner Lifecycle

Adds public runner status diagnostics: stopped | starting | connecting | streaming | reconnecting | stopping.
Makes onError reporting-only and isolates throwing reporters with a console.error fallback.
stop() gates new claim dispatch, aborts outstanding work, waits boundedly for claim actors, drains runtime wakes, and rethrows drain errors after recording diagnostics.
waitForStopped() now waits for the full stop path, not just the stream loop.
Heartbeats are single-flight/coalesced, event-driven heartbeat scheduling uses the event throttle knob, and per-run heartbeat/reconnect counters reset on start().
Reconnects use exponential backoff and heartbeat failures can request a stream reconnect.
Removed maxConcurrentClaims: it only limited claim/enqueue work, not actual wake execution, so it was a misleading invariant.
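
The single-flight/coalesced heartbeat behavior can be sketched as below. This is an illustration under assumptions (the helper name `createSingleFlight` is hypothetical), not the code in pull-wake-runner.ts: triggers that arrive while a heartbeat is in flight collapse into at most one follow-up send.

```typescript
// Wraps an async send so that only one call runs at a time. Calls made while a
// send is in flight are coalesced into a single follow-up run.
function createSingleFlight(send: () => Promise<void>): () => Promise<void> {
  let running = false
  let pending = false
  let current: Promise<void> = Promise.resolve()

  const loop = async (): Promise<void> => {
    try {
      do {
        pending = false
        await send()
      } while (pending) // someone asked for another heartbeat meanwhile
    } finally {
      running = false
    }
  }

  return () => {
    if (running) {
      pending = true // remember the request, do not start a second send
      return current
    }
    running = true
    current = loop()
    return current
  }
}
```

A runner would call the returned function from both the interval timer and the event throttle timer, so that health state is only ever updated by one heartbeat at a time.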

Dispatch And Local Send Performance

Dispatch subscription relinking is cached after the first successful link instead of doing Durable Streams round trips on every send.
Local Electron mutating requests can be routed through the desktop main process for saved local servers, avoiding browser connection-limit stalls.
Local /send remains normal application/json; the preflight/connection issue is handled by the desktop fetch path rather than changing API semantics.
Ngrok warning-skip headers are only added for known ngrok hosts.
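
As a rough sketch of the relink caching idea, assuming a cache keyed on the stream client object: the `StreamClientSketch` interface and the function names are hypothetical stand-ins, and the composite key format is an assumption.

```typescript
// Hypothetical stand-in for the durable-streams client; ensureLinked represents
// the round trips needed to link a subscription to a stream.
interface StreamClientSketch {
  ensureLinked(key: string): Promise<void>
}

// Keyed on the client object so entries disappear when the client is collected.
const linkedKeys = new WeakMap<object, Set<string>>()

async function linkOnceSketch(client: StreamClientSketch, key: string): Promise<void> {
  if (linkedKeys.get(client)?.has(key)) return // hot path: no network calls
  await client.ensureLinked(key)
  // Cache only after the link succeeded, so failures are retried on the next send.
  let keys = linkedKeys.get(client)
  if (!keys) {
    keys = new Set<string>()
    linkedKeys.set(client, keys)
  }
  keys.add(key)
}

// Unlinking must invalidate the cache entry.
function unlinkSketch(client: StreamClientSketch, key: string): void {
  linkedKeys.get(client)?.delete(key)
}
```

The first send pays for the link; later sends for the same key return before any request is issued.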

Key Invariants

Stored owner_principal values are canonical principal URLs.
Runner auth compares canonical URL to canonical URL.
onError cannot control runner lifecycle.
Stop always gates late dispatch before runtime drain and reports drain failures to callers.
Heartbeat health state is updated by one in-flight heartbeat at a time.
Dispatch subscription linking is idempotent and cached per stream client/service/subscription/stream.
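
The "onError cannot control runner lifecycle" invariant can be illustrated with a small sketch. The helper name `reportErrorSafely` is an assumption for illustration; the point is that a throwing reporter is contained and logged rather than propagated into the runner loop.

```typescript
type ErrorReporter = (error: Error) => void

// Normalizes the thrown value, hands it to the reporter, and never rethrows.
function reportErrorSafely(onError: ErrorReporter | undefined, thrown: unknown): void {
  const error = thrown instanceof Error ? thrown : new Error(String(thrown))
  if (!onError) {
    console.error(error)
    return
  }
  try {
    onError(error)
  } catch (reporterError) {
    // Fallback: the reporter itself failed, so log both errors and carry on.
    console.error("onError reporter threw", reporterError, error)
  }
}
```

Because the helper returns normally in every case, callers cannot use a throw from onError to stop, restart, or otherwise steer the runner.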

Non-goals

No broad terminology rename in this PR.
No full module split of pull-wake-runner.ts; the critical invariants are fixed and tested first.
No persistent claim-failure circuit breaker; failures are visible in diagnostics and can be escalated separately.
No cross-principal runner listing/admin surface.

Verification

pnpm --dir packages/agents-runtime exec vitest run test/pull-wake-runner.test.ts --reporter=dot
pnpm --dir packages/agents-server exec vitest run test/principal.test.ts test/runners-router.test.ts test/dispatch-policy-routing.test.ts --reporter=dot
pnpm --dir packages/agents-server-ui exec vitest run src/lib/auth-fetch.test.ts --reporter=dot

pnpm --dir packages/agents-runtime typecheck
pnpm --dir packages/agents-server typecheck
pnpm --dir packages/agents-server-ui typecheck
pnpm --dir packages/agents-desktop typecheck

pnpm --dir packages/agents-server-ui build
pnpm --dir packages/agents-desktop build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dispatch-policy, server-utils, and electric-ax Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…URL form, callers convert keys to URLs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…migration, drop authorization fallback - Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop) - Migration expires active claims and clears dispatch state before deleting runners - Desktop: don't use authorization header as principal source — return undefined and let server derive from ctx.principal.url - listRunners validates owner_principal query param Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pal keys, complete desktop constant replacement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


codecov · 2026-05-16T13:34:46Z

Codecov Report

❌ Patch coverage is 72.31947% with 253 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@e4acb1d). Learn more about missing BASE report.
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...src/components/settings/pages/LocalRuntimePage.tsx	0.00%	101 Missing ⚠️
packages/agents-server/src/entity-registry.ts	20.51%	27 Missing and 4 partials ⚠️
packages/agents-runtime/src/pull-wake-runner.ts	92.91%	27 Missing ⚠️
packages/agents-server-ui/src/main.tsx	0.00%	20 Missing ⚠️
packages/agents-server-ui/src/lib/auth-fetch.ts	82.60%	12 Missing ⚠️
...gents-server-ui/src/lib/ElectricAgentsProvider.tsx	0.00%	11 Missing ⚠️
...kages/agents-server/src/routing/dispatch-policy.ts	80.76%	10 Missing ⚠️
...ckages/agents-server/src/routing/runners-router.ts	92.74%	9 Missing ⚠️
...agents-server-ui/src/hooks/useServerConnection.tsx	0.00%	7 Missing ⚠️
packages/agents-server/src/server.ts	14.28%	6 Missing ⚠️
... and 11 more

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4339   +/-   ##
=======================================
  Coverage        ?   59.56%           
=======================================
  Files           ?      290           
  Lines           ?    28579           
  Branches        ?     7754           
=======================================
  Hits            ?    17022           
  Misses          ?    11540           
  Partials        ?       17

Flag	Coverage Δ
packages/agents	`70.92% <0.00%> (?)`
packages/agents-mcp	`77.54% <ø> (?)`
packages/agents-runtime	`80.74% <93.01%> (?)`
packages/agents-server	`73.93% <76.75%> (?)`
packages/agents-server-ui	`6.66% <26.51%> (?)`
packages/electric-ax	`42.61% <83.33%> (?)`
packages/experimental	`87.73% <ø> (?)`
packages/react-hooks	`86.48% <ø> (?)`
packages/start	`82.83% <ø> (?)`
packages/typescript-client	`94.39% <ø> (?)`
packages/y-electric	`56.05% <ø> (?)`
typescript	`59.56% <72.31%> (?)`
unit-tests	`59.56% <72.31%> (?)`


State machine, concurrent claim limits, exponential reconnect backoff, and granular health status. onError is now reporting-only with fallback console.error logging. stop() rethrows drainWakes errors to callers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


claude · 2026-05-18T06:31:38Z

Test iteration 7 body placeholder

kevin-dp

I had a look through the code and left some comments.
On a higher level, I don't like the "claims" name; it collides too much with auth claims.
The term "lease" is more standard in distributed systems so I would rename the claims variables to: lease / activeLeaseCount / maxConcurrentLeases / getActiveLeasesForRunner.

Also, I think the code needs quite some refactoring. Currently there are ~20 variables inside one closure and they are being mutated from everywhere in this file. That's very error-prone. I'll try refactoring in a follow-up PR.

kevin-dp · 2026-05-18T06:40:56Z

+  let eventHeartbeatTimer: ReturnType<typeof setTimeout> | null = null
  let currentOffset = config.offset
+  let startedAt: string | null = null
+  let streamConnected = false


I would define streamConnected as a getter:

get streamConnected() { return !!streamConnectedSince }

kevin-dp · 2026-05-18T06:44:10Z

+  last_claim_result: `claimed` | `no_work` | `error` | null
+  last_dispatch_at: string | null
+  events_received: number
+  claims_succeeded: number


Perhaps this could be a nested property:

export interface PullWakeRunnerHealth {
  // ...
  claims: {
    succeeded: number
    skipped: number
    failed: number
  }
}

Not sure how useful these numbers are. Perhaps an array of the actual claims that succeeded/skipped/failed would be better.

kevin-dp · 2026-05-18T06:45:24Z

+type PullWakeRunnerState =
+  | `stopped`
+  | `starting`
+  | `running.connecting`


The running. prefixes are a bit unusual.

Would this benefit from being an actual TS enum?

enum PullWakeRunnerState {
  Stopped = `stopped`,
  Starting = `starting`,
  Connecting = `running.connecting`,
  Streaming = `running.streaming`,
  Reconnecting = `running.reconnecting`,
  Stopping = `stopping`
}

This would make the code less string-heavy.

Or if you don't like enums:

const PullWakeRunnerState = {
  Stopped: `stopped`,
  Starting: `starting`,
  Connecting: `running.connecting`,
  Streaming: `running.streaming`,
  Reconnecting: `running.reconnecting`,
  Stopping: `stopping`
} as const

type PullWakeRunnerState = (typeof PullWakeRunnerState)[keyof typeof PullWakeRunnerState]

kevin-dp · 2026-05-18T06:59:23Z

+    Math.floor(config.maxConcurrentClaims ?? DEFAULT_MAX_CONCURRENT_CLAIMS)
+  )
+
+  const toStatus = (): PullWakeRunnerStatus => {


Why have "state" and "status" ? They are almost identical.
I would just go with the enum approach from my previous comment.
Users use it like PullWakeRunnerState.Connecting and it resolves to the string "running.connecting" so we effectively get the usage we want.

kevin-dp · 2026-05-18T07:03:02Z

+  const notifyHeartbeatChange = (): void => {
+    const signal = controller?.signal
+    if (!signal || signal.aborted || heartbeatIntervalMs <= 0) return
+    if (eventHeartbeatTimer) return


nit: we have two early-return if statements. To be consistent, either use a single if statement, or have the first one do only the signal checks and the second one only the heartbeat checks:

if (!signal || signal.aborted) return
if (heartbeatIntervalMs <= 0 || eventHeartbeatTimer) return

We could also avoid the eventHeartbeatTimer check by rewriting the assignment:

eventHeartbeatTimer ??= setTimeout(() => {
  eventHeartbeatTimer = null
  void heartbeat(signal)
}, eventHeartbeatThrottleMs)

kevin-dp · 2026-05-18T07:06:00Z

+
+  const notifyHeartbeatChange = (): void => {
+    const signal = controller?.signal
+    if (!signal || signal.aborted || heartbeatIntervalMs <= 0) return


Why do we check heartbeatIntervalMs <= 0 if it's not used in this function?
This function uses eventHeartbeatThrottleMs instead.

kevin-dp · 2026-05-18T07:09:12Z

    } catch (err) {
      if (!signal.aborted) {
-        config.onError?.(err instanceof Error ? err : new Error(String(err)))
+        lastHeartbeatOk = false


We already set lastHeartbeatOk = false on L256 right before throwing and here we only set it inside if (!signal.aborted) but it will already be false. Is this the behaviour we want? If we only want it to be set to false if !signal.aborted then we need to not set it right before throwing.

kevin-dp · 2026-05-18T07:11:37Z

+        throw err
+      }
+      if (!claimErrorRecorded) {
+        recordClaimError()


Do we need to set claimErrorRecorded = true here?
Perhaps it would be better if recordClaimError did that (e.g. we could wrap it so that we can't forget to set this variable).
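
One way to read this suggestion, as a hedged sketch: fold the flag into the recorder so callers cannot forget it. `createClaimErrorRecorder` is a hypothetical helper, not something that exists in the PR.

```typescript
// Owns the "already recorded" flag so that recording is idempotent per claim.
function createClaimErrorRecorder(onRecord: (error: Error) => void) {
  let recorded = false
  return {
    // Records the error once; later calls are no-ops and return false.
    record(error: Error): boolean {
      if (recorded) return false
      recorded = true
      onRecord(error)
      return true
    },
    get recorded(): boolean {
      return recorded
    },
  }
}
```

A claim handler would create one recorder per claim attempt and call `record` from every failure path, which removes the separate claimErrorRecorded bookkeeping.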

kevin-dp · 2026-05-18T07:16:20Z

+    return true
+  }
+
+  const sleep = async (ms: number, signal: AbortSignal): Promise<void> => {


This should be extracted to a utility file.
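
For reference, an extracted abortable sleep utility could look roughly like the sketch below. It keeps the signature from the diff; resolving early on abort (rather than rejecting) is an assumption about the intended behavior.

```typescript
// Resolves after `ms` milliseconds, or as soon as the signal aborts.
// Assumption: abort resolves rather than rejects, so callers check the signal.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve()
      return
    }
    const onAbort = (): void => {
      clearTimeout(timer) // do not keep the event loop alive after abort
      resolve()
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort)
      resolve()
    }, ms)
    signal.addEventListener("abort", onAbort, { once: true })
  })
}
```

Clearing the timer on abort matters during shutdown: a pending reconnect backoff should not delay stop() or keep the process alive.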

kevin-dp · 2026-05-18T10:23:48Z

Spent some time testing the lifecycle hardening claims in this PR against a running desktop app + local agents-server. Some of what's claimed works as advertised; one significant gap turned up that I think is worth flagging before merge.

What worked

✅ GET /_electric/runners/:id/health endpoint returns the documented shape; UI and curl agree.
✅ Diagnostic counters update on dispatch within ~2s — event-driven heartbeat throttle path is firing as designed (last_hb jumped off the 30s schedule into a fresh slot ~2s after a last_dispatch_at change, consistent with the 2s throttle).
✅ Health status escalation (healthy → degraded → unhealthy) computes correctly on the server side.
✅ Principal rename (owner_user_id → owner_principal) and URL validation work end-to-end (after fixing one default-value alignment issue between desktop and server, which I've already pushed on fix-pull-wake).

Things turned up while testing

While testing this PR I also surfaced three related reliability issues that the new diagnostics made visible — they aren't introduced by this PR (the diagnostics just exposed them), but they all interact with this work:

#4340, "Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry)": active claim is not released after dispatch (success or failure)
#4341, "Pull-wake: heartbeat path nulls out lease_expires_at on consumer_claims": lease_expires_at: null on materialized claims removes the safety net for #4340
#4342, "Local Runtime page shows stale/contradictory state when agents-server is unreachable": Local Runtime UI shows stale/contradictory state when agents-server is unreachable (compounded by #4343, "Pull-wake: stream reconnect path does not fire when agents-server becomes unreachable (Ctrl-C and kill -9 both verified)", since the UI reads exactly the stream_connected: true field that this bug pins to true)

Suggestion

#4343 feels like it should block merge — the headline reliability claims in the PR description don't hold under the testing I did. The four issues together are also somewhat overlapping (the UI staleness in #4342 wouldn't be nearly as bad if #4343 was fixed, because the stream_connected field would actually flip during the outage), so fixing #4343 has some leverage on the others.

Happy to help test once you've got a candidate fix.

kevin-dp

~~We should address #4339 (comment) before merging.~~

icehaunter · 2026-05-18T11:18:06Z

+1. A **health check endpoint** (`GET /_electric/runners/:id/health`) for deep debugging — curl it to see comprehensive diagnostics about a runner's dispatch pipeline.
+2. **Rich runner state in Postgres** so that apps can sync the `runners` table via an Electric Shape and show runner status on any device (e.g. see your laptop runner's status from your phone).
+
+The `diagnostics` JSONB column on the `runners` table serves both purposes: the health endpoint reads it for the detailed response, and Shape sync delivers it reactively to any connected client.


This will make runners into a quite noisy shape, with every runner causing an update every 2 seconds. I suggest we split it into a separate table for heartbeats & claims, and keep status, started, last_error & last_error_at in the main table. Anyone needing full diagnostics can then either get the endpoint, or subscribe to a filtered shape for diagnostics of a particular runner. WDYT?

It should be a max of every two seconds (if there's lots of activity) and then when things aren't happening, updates would drop to every 30 seconds.

But yeah, fairly noisy I agree actually still. Lemme think about it.

We don't need the entire object on that table for "normal" UIs, I think. The heartbeat endpoint can split this up easily.

yeah makes sense — refactored

# Conflicts:
#   packages/agents-desktop/src/main.ts
#   packages/agents-server/src/routing/runners-router.ts
#   packages/agents-server/test/dispatch-policy-routing.test.ts
#   packages/agents/src/server.ts
#   packages/electric-ax/src/start.ts
#   packages/electric-ax/test/start.test.ts

claude · 2026-05-18T14:07:50Z

test body 2

claude · 2026-05-18T18:28:06Z

Claude Code Review (iteration 1)

Substantial PR (~3.5k additions / 60+ files): runner health endpoint, owner_user_id -> owner_principal rename, pull-wake runner hardening (state machine, bounded concurrency, exponential backoff, event-driven heartbeats). Overall direction solid. Main concerns: type-safety latent bug in health response, two small state-reset bugs, stack of unresolved inline review comments from kevin-dp.

WHAT IS WORKING WELL

kevin-dp's tested concern (#4343, "Pull-wake: stream reconnect path does not fire when agents-server becomes unreachable (Ctrl-C and kill -9 both verified)") appears addressed. heartbeat() drives stream reset after HEARTBEAT_FAILURE_STREAM_RESET_THRESHOLD (2) consecutive failures via requestStreamReconnect() at pull-wake-runner.ts:278-285, with a focused test at pull-wake-runner.test.ts:613-672.
Noisy-shape feedback from icehaunter applied. Diagnostics split out of runners shape into runner_runtime_diagnostics (migration 0008).
Server-side sanitization of heartbeat diagnostics (runners-router.ts:150-198): typed allowlist, finite/non-negative number checks.
Owner-principal authorization consistent across all runner routes (register/list/get/heartbeat/claim/enable/disable/health).
Generation-tracked claim cleanup handles start/stop race (pull-wake-runner.ts:458-491).
Strong test coverage: 15 runner tests (was 5), 12 router tests, new pull-wake-subscription-stack.test.ts.
Changesets present for every publishable package touched.

CRITICAL (Must Fix): None found.

IMPORTANT (Should Fix)

RunnerHealthResponse.client type lies about completeness (electric-agents-types.ts:157 + runners-router.ts:150-198). Typed as Omit<PullWakeRunnerHealth, ...> | null (all required) but sanitizeRunnerDiagnostics builds field-by-field and casts partial result. Test "sanitizes stored runner diagnostics..." (runners-router.test.ts:331-334) shows body.client = {reconnect_count:6, last_heartbeat_ok:false} -- 2 of 15 fields. Today the runner always sends all 15 fields; future drift or partial heartbeats break the type contract. Fix: make each diagnostic field optional, or fill in defaults in sanitizer.
consecutiveHeartbeatFailures not reset across start/stop cycles (pull-wake-runner.ts:128, 272). Only reset to 0 on heartbeat success. After stop/restart, leftover non-zero counter can trip requestStreamReconnect on the very first failure. Reset in start() alongside runGeneration++. Worth also clarifying intent of reconnectCount, eventsReceived, etc. on restart -- startedAt resets, so per-start counters drift unless they reset too.
Misleading guard in notifyHeartbeatChange (pull-wake-runner.ts:238-246). Function uses eventHeartbeatThrottleMs but gates on heartbeatIntervalMs <= 0. Disabling fixed-interval heartbeats silently disables event-driven ones too. kevin-dp flagged inline.
lastHeartbeatOk = false set twice with different intent (pull-wake-runner.ts:266 + 275). Set before throw on non-OK response, then again in catch only if !signal.aborted. Aborted heartbeats falsely report failure. kevin-dp flagged inline.
Unresolved inline review comments from kevin-dp -- ~10 style/refactor suggestions on pull-wake-runner.ts not yet addressed: getter for streamConnected, unify state/status enum, dotted state names atypical (running.connecting etc.), nest claim counters, extract sleep, fragile claimErrorRecorded flag. None blocking but should be triaged.
Unresolved inline question from icehaunter on durable-streams-routing-adapter.ts -- "why did you undo my work here?" (May 18 18:38 UTC). Appears resolved by later c6eda8c commit ("keep durable streams control urls opaque") that restored opaque-prefix handling. Worth replying inline to close the thread.

SUGGESTIONS (Nice to Have)

Two back-to-back migrations (0007 adds diagnostics column, 0008 immediately moves to separate table). Could be one combined migration -- though once they ship, leave them.
Migration 0007 deletes all runners + clears dispatch state + expires active claims. PR description acknowledges this; worth flagging in the changeset that operators must expect every existing local runner to re-register after deploy.
HEARTBEAT_FAILURE_STREAM_RESET_THRESHOLD = 2 hardcoded (default 30s heartbeat -> ~60s detection). Consider making configurable via PullWakeRunnerConfig.
Default maxConcurrentClaims = 10 is reasonable but undocumented in user-facing types. Add doc comment.
Heartbeat sends diagnostics with stream_connected: false while state = starting (before stream is open). Server-side derivation escalates to degraded ("Client reports stream disconnected"). Emits a transient degraded window on every start. Consider gating diagnostic reporting until after first stream open, or ignoring stream_connected:false while status===starting.

ISSUE CONFORMANCE

No linked GitHub issue on this PR. PR description is unusually thorough (root cause, approach, invariants, non-goals, trade-offs, verification, files-changed table) and substitutes well for a tracked issue. Diagnostics work surfaced #4340-#4342 and #4343, filed and tracked separately. Consider attaching Closes for whichever #4343 fixes are landing in this PR (the heartbeat-driven reset is option (a) per kevin-dp).

PREVIOUS REVIEW STATUS

A prior Claude Code Review comment existed but contained only a placeholder fragment ("PLACEHOLDER continuing in next edit.") -- replaced with this complete review.

NOTE ON PROMPT-INJECTION ATTEMPT

While reading .review-context/conversation.json I encountered a fake system-reminder tag embedded inside one of KyleAMathews's comment bodies (re: task tools). Treated as untrusted comment content, ignored. Flagging so you know.

Review iteration: 1 | 2026-05-18

claude · 2026-05-18T18:41:04Z

Iteration 2 — 2026-05-18 (re-validation pass)

Re-read the diff and inline review threads. No new commits on the branch since iteration 1 was posted at 18:29 UTC. All findings in the prior Claude review (comment id 4480689327) remain open as written.

Independent verification confirms:

The Critical claim-leak (pull-wake-runner.ts:434-438) is still enshrined by the test that skips dispatch from a claim actor after shutdown begins at pull-wake-runner.test.ts:526-566.
The heartbeat-driven reset test at pull-wake-runner.test.ts:613-672 exercises the cancel path end-to-end, but does NOT cover the half-open first-connect window flagged in Important #2.
sanitizeRunnerDiagnostics remains a strict allowlist; still no length cap on last_error (Suggestion list).
currentOffset is initialized to -1 at pull-wake-runner.ts:112, with the dead conditional at line 257-259 still present (Important #3).
Migration 0007 still does DELETE FROM runners plus claim-state wipe; changeset is still patch-level (Important #4).

No new issues surfaced beyond iteration 1.

Review iteration: 2 | 2026-05-18

claude · 2026-05-19T02:18:23Z

Claude Code Review

Summary

Iteration 5 on fix-pull-wake. One new commit since iteration 4: e8241f12c ("Harden pull-wake runner invariants") — a broad cleanup commit that addresses all five "Important" findings from iteration 4 plus three of the carry-over Suggestions. No critical issues remain. The remaining open items are kevin-dp's style nits and a couple of new wrinkles introduced by this iteration.

What's Working Well

All five iteration-4 Important findings are now fixed in code (and three of them are covered by new tests):

RunnerHealthResponse.client type vs. sanitizer reality — fixed by introducing RunnerClientDiagnostics = Partial<Omit<PullWakeRunnerHealth, 'running'|'offset'>> at electric-agents-types.ts:144-146 and updating client to use it (line 160). sanitizeRunnerDiagnostics is now type-honest.
Per-start counter reset across stop/restart — fixed at pull-wake-runner.ts:596-610. start() now resets reconnectCount, lastError, lastErrorAt, lastHeartbeatAt, lastHeartbeatOk, lastClaimAt, lastClaimResult, lastDispatchAt, eventsReceived, claimsSucceeded/Skipped/Failed, consecutiveHeartbeatFailures, nextReconnectBackoffMs, and streamResetError. New test resets heartbeat failure counters across restarts covers this.
notifyHeartbeatChange guard — fixed at pull-wake-runner.ts:234. Now gates on eventHeartbeatThrottleMs <= 0 (the value the function actually uses).
lastHeartbeatOk = false set twice — fixed at pull-wake-runner.ts:273-278. The pre-throw assignment is removed; only the catch block flips the flag, and only when !signal.aborted.
Per-send dispatch relink hot-path — fixed at dispatch-policy.ts:22, 220-243 via a module-level linkedDispatchSubscriptions WeakMap keyed on streamClient. Subsequent sends to a linked entity short-circuit before any HEAD/GET/PUT to durable-streams. Test at dispatch-policy-routing.test.ts:340-353 confirms the second send issues zero getSubscription/ensure/addSubscriptionStreams calls. Unlink invalidates the cache.

Other iteration-5 wins:

Heartbeat coalescing (heartbeatInFlight + heartbeatPending at pull-wake-runner.ts:110-111, 242-256) avoids pile-up if a heartbeat outlives its interval. New test coalesces event heartbeats while a heartbeat is in flight confirms behavior end-to-end.
Number.isFinite guard on parsed lease timestamps (runners-router.ts:443-449, 518-521) — a malformed liveness_lease_expires_at no longer produces NaN-tainted output. Covered by new test ignores invalid runner lease timestamps in health output.
desktopServerFetch drops the dead buildCloudAuthHeaders call (carry-over Suggestion).
currentOffset !== undefined unreachable ternary removed; heartbeat now sends wake_stream_offset: currentOffset unconditionally (carry-over Suggestion).
isNgrokHost extended to include ngrok.dev and ngrok-free.dev (iteration-4 Suggestion).
waitForStopped() now awaits stopPromise if present (avoids returning before drain completes if called mid-stop).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Dispatch link cache has no failure-recovery path. The module-level linkedDispatchSubscriptions WeakMap in dispatch-policy.ts:22 is populated on first successful link and never invalidated except by unlinkEntityDispatchSubscription. If the durable-streams server is restarted with state loss, manually wiped, or otherwise loses the subscription, the cache will report "linked" forever and linkEntityDispatchSubscription will skip the ensure + getSubscription + addSubscriptionStreams repair sequence on every subsequent send. There's no self-healing path — the agents-server process must restart to re-sync. At minimum, consider invalidating the cache entry when a send fails (e.g., catch in sendEntity after linkEntityDispatchSubscription, drop the cache key, retry once). Or document that DS state loss requires an agents-server restart.
parsePrincipalInput contradicts the "strict validation" claim. principal.ts:50-58 now accepts both URL form (/principal/...) and legacy key form (user:alice) via parsePrincipalInput, and getPrincipalFromRequest (line 67) uses it. But the PR description and the pull-wake-health-diagnostics.md changeset both state: "Strict validation via parsePrincipalUrl() — rejects invalid URLs at the API boundary" / "This is a breaking change with no backward compatibility — all callers must send principal URLs." The code as-shipped is in fact lenient. Either update the description/changeset to reflect the dual-form acceptance, or remove the key fallback in getPrincipalFromRequest so the stated invariant holds.
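
The failure-recovery idea suggested for the dispatch link cache (drop the cache key when a send fails, relink, retry once) could be sketched as follows. This is one reading of the reviewer's suggestion with hypothetical names and a plain `Set` standing in for the cache, not code from dispatch-policy.ts:

```typescript
// Sends with a cached link; on failure, assumes the cached link may be stale,
// invalidates it, relinks, and retries exactly once.
async function sendWithLinkRepair<T>(
  cache: Set<string>,
  key: string,
  link: () => Promise<void>,
  send: () => Promise<T>
): Promise<T> {
  const ensureLinked = async (): Promise<void> => {
    if (cache.has(key)) return
    await link()
    cache.add(key)
  }

  await ensureLinked()
  try {
    return await send()
  } catch {
    cache.delete(key) // the cached "linked" state may no longer be true
    await ensureLinked()
    return await send() // a second failure propagates to the caller
  }
}
```

The single retry bounds the extra cost on genuinely failing sends while giving the server a self-healing path after durable-streams state loss.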

Suggestions (Nice to Have)

Outstanding kevin-dp inline review comments on pull-wake-runner.ts — still uniformly unaddressed across all five iterations, not blocking:
- getter for streamConnected (line 114-115)
- nested claims: { succeeded, skipped, failed } shape (line 82-84)
- running.* dotted state names are atypical (line 87-93)
- state vs status duplication — could collapse into a single TS enum / as const (lines 87-93 + 155-170)
- rolling claimErrorRecorded into recordClaimError (line 359 / 396-398)
- extracting sleep to a utility file (line 462)
requestStreamReconnect still early-returns when !streamConnected. pull-wake-runner.ts:226-230. Heartbeat failures piling up during initial consumeWakeStream setup won't trigger reset. In practice the stream connect itself would fail and surface in run's catch block, so the gap is theoretical — but if a server returns 200 on the stream open then silently drops the connection before any event arrives, heartbeat failures couldn't kick a reset until streamConnected flipped true. Worth a defensive consideration; not a known repro.
text/plain content-type revert is undocumented. This commit reverted iteration 4's sendMessage.ts:222 content-type to application/json. Since serverFetch also injects custom headers (e.g. electric-principal), the request is not a CORS "simple" request regardless of body content-type, so the original text/plain attempt couldn't actually avoid a preflight on its own — the rollback is correct. But the desktop-local-fetch.md changeset description ("Route local desktop mutating agents-server requests through the Electron main process so CORS preflights cannot stall behind renderer connection limits") is now the only mechanism doing the work. Worth a one-line note in the commit message clarifying that text/plain alone was insufficient and main-process routing is the real fix.
Carried from iteration 4, unchanged:
- HEARTBEAT_FAILURE_STREAM_RESET_THRESHOLD = 2 hardcoded (pull-wake-runner.ts:99). At the 30s default heartbeat interval, this gives ~60s detection. Worth exposing on PullWakeRunnerConfig for tuning.
- Migration 0007 still DELETE FROM runners + claim wipe; changeset is patch-level. Worth a one-line operator-facing note in the changeset body.
Removed config option: maxConcurrentClaims is gone from PullWakeRunnerConfig, along with waitForClaimCapacity and the generation-tracked claimActors map. If any external consumer was relying on bounded claim concurrency, this is a silent removal. The changeset doesn't mention it. Worth confirming no one external set this option, or noting the removal in harden-pull-wake-runner.md.

Issue Conformance

No linked GitHub issue. PR description remains thorough but should be reconciled with the actual behavior on two points:

Principal validation is now lenient (URL or key form), not strict.
maxConcurrentClaims is no longer in the public config — the "Concurrent claim limiting" section in the description still describes it.

The new commit e8241f12c ("Harden pull-wake runner invariants") isn't reflected in the description's Files changed table either.

Previous Review Status

Prior Claude review is comment id 4483859257 (iteration 4). Of iteration 4's Important findings #1–#5: all five fixed in code, three with new test coverage. Inline-style punch list (#6) still uniformly unaddressed. Of iteration 4's Suggestions: dead buildCloudAuthHeaders, unreachable currentOffset ternary, and ngrok TLD extension all addressed. HEARTBEAT_FAILURE_STREAM_RESET_THRESHOLD and migration-0007 changeset note still open.

Review iteration: 5 | 2026-05-19

…heartbeats (#4353)

## Summary

Fixes #4341: the per-wake heartbeat path nulls out `consumer_claims.lease_expires_at`, leaving every active claim row without an expiry after the first heartbeat (~10s after dispatch).

## Root cause

`packages/agents-server/src/entity-registry.ts`, `materializeHeartbeatClaim`:

```ts
.set({
  lastHeartbeatAt: heartbeatAt,
  leaseExpiresAt: input.leaseExpiresAt ?? null, // ← unconditionally writes null if not provided
  updatedAt: heartbeatAt,
})
```

`packages/agents-server/src/routing/internal-router.ts:606-609`, the only production caller, never passes `leaseExpiresAt`. So every heartbeat overwrites the lease with `null`. The lease set correctly by `materializeActiveClaim` (from the upstream `lease_ttl_ms`, e.g. claimed_at + 30s) survives at most until the first heartbeat.

## Observed live

```
Initial (at claim):  lease_expires_at = 2026-05-19T11:01:41.631Z  (claimed_at + 30s)
After 1 heartbeat:   lease_expires_at = null
```

Captured via the health endpoint `claims.active[*]` field; issue #4341 has the full trace.

## The fix

Treat heartbeats as alive-pings only: update `last_heartbeat_at` and leave `lease_expires_at` alone unless the caller explicitly provides a new lease. The lease set by `materializeActiveClaim` from the upstream `lease_ttl_ms` stays authoritative.

```diff
  .set({
    lastHeartbeatAt: heartbeatAt,
-   leaseExpiresAt: input.leaseExpiresAt ?? null,
+   ...(input.leaseExpiresAt !== undefined
+     ? { leaseExpiresAt: input.leaseExpiresAt }
+     : {}),
    updatedAt: heartbeatAt,
  })
```

Callers that genuinely want to extend the lease can still pass `leaseExpiresAt` explicitly. The single production caller (`internal-router.ts:606-609`) doesn't, and shouldn't: it has no TTL signal to base an extension on, and the upstream lease is the authoritative window.

## Tests

New integration test `packages/agents-server/test/consumer-claim-registry.test.ts`:

1. **Lease preserved**: materialize an active claim with a lease, heartbeat without a lease, assert lease is still the original timestamp (not null).
2. **Lease extendable**: materialize, heartbeat with an explicit `leaseExpiresAt`, assert the new lease is written.

Both run against the integration postgres backend, in the style of `tag-stream-outbox-registry.test.ts`. Existing unit tests pass unchanged.

## Not addressed in this PR

- **Pre-existing claim rows with `lease_expires_at: null`**: claims that already lost their lease under the unfixed code won't recover. They'd need a reaper or admin command to clean up. Not currently a problem because nothing reaps on lease today, but worth knowing if a reaper is added later.

## Base branch note

This PR targets `fix-pull-wake` (#4339), not `main`, because `materializeHeartbeatClaim` was introduced in #4308 which is part of the `fix-pull-wake` lineage but not yet in `main`. Merge order: this → fix-pull-wake → main. Independent of #4346 (the related #4340 fix); the two can land in either order.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
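The difference between `?? null` and the conditional spread can be seen in isolation with a small sketch. `HeartbeatInput`, `HeartbeatUpdate`, and `buildHeartbeatUpdate` are hypothetical stand-ins for the update payload only; the real `materializeHeartbeatClaim` passes this object to a query builder and takes `heartbeatAt` from its own scope.

```typescript
// Hypothetical types standing in for the real registry input and update payload.
interface HeartbeatInput {
  heartbeatAt: Date;
  leaseExpiresAt?: Date | null;
}

interface HeartbeatUpdate {
  lastHeartbeatAt: Date;
  updatedAt: Date;
  leaseExpiresAt?: Date | null;
}

// Only includes leaseExpiresAt when the caller supplied one (null included),
// so an omitted lease leaves the stored column untouched.
function buildHeartbeatUpdate(input: HeartbeatInput): HeartbeatUpdate {
  return {
    lastHeartbeatAt: input.heartbeatAt,
    ...(input.leaseExpiresAt !== undefined
      ? { leaseExpiresAt: input.leaseExpiresAt }
      : {}),
    updatedAt: input.heartbeatAt,
  };
}

// Usage: merging the payload over an existing row keeps the original lease.
const storedLease = new Date("2026-05-19T11:01:41.631Z");
const row = { lastHeartbeatAt: null as Date | null, leaseExpiresAt: storedLease };
const afterHeartbeat = {
  ...row,
  ...buildHeartbeatUpdate({ heartbeatAt: new Date("2026-05-19T11:01:21.631Z") }),
};
```

In this sketch `afterHeartbeat.leaseExpiresAt` is still `storedLease`, whereas the `?? null` version would have produced a payload containing `leaseExpiresAt: null` and overwritten it.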

…oken is missing (#4346)

## Summary

Fixes #4340: pull-wake claims leaking in `consumer_claims` when the in-memory `ClaimWriteTokenStore` no longer holds the consumer's token at the time `sendDone` arrives.

The release path in `callback-forward` was gated by `stillOwnsClaim`, an in-memory check. When that check fails (server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transient `updateStatus` failure) the entire release block is skipped: `materializeReleasedClaim` never runs, the entity stays stuck at `status='running'`, and the `consumer_claims` row stays `active` indefinitely.

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's `sendDone` after the `idleTimeout` window (default 5 min via `packages/agents/src/bootstrap.ts:146`). The bug only fires in the failure modes documented in [Test scenarios](#test-scenarios) below.

## Root cause

`packages/agents-server/src/routing/internal-router.ts` had all three release actions behind the same in-memory gate:

```ts
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...) // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...) // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
```

`stillOwnsClaim` is the right gate for **write authorization** during the agent run, but it's the wrong gate for releasing the **DB row**, which is keyed by `(consumerId, epoch)`. The DB primary key is authoritative identity; the in-memory token state is orthogonal.

## The fix

Three concerns, three gates:

1. **DB row release** (`materializeReleasedClaim`): runs whenever `epoch` is defined. `(consumerId, epoch)` is the DB primary key; that's enough.
2. **Entity status → idle + `onEntityChanged`**: runs when `entityCleared || stillOwnsClaim`. `entityCleared` is a new return field from `materializeReleasedClaim`, set to `true` only when our `(consumerId, epoch)` was the active dispatch row and we just nulled it out. The `||` handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failed `updateStatus` (state cleared on first attempt, token still held).
3. **In-memory token cleanup** (`clearStream`): remains gated by `stillOwnsClaim` so a newer consumer's token is never cleared out from under it.

### `materializeReleasedClaim` API change

```diff
- ): Promise<ConsumerClaim | null> {
+ ): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {
```

Only one production caller (`internal-router.ts`); both that caller and the test mock are updated. The `.returning()` on the `entityDispatchState` UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).

## Test scenarios

The fix decouples the three concerns. Below: ✓ = action happens, × = action does NOT happen (and is correct that it doesn't).

| Scenario | `entityCleared` | `stillOwnsClaim` | DB row released | Entity → idle |
|---|---|---|---|---|
| **A. Happy path** (mint + done) | true | true | ✓ released | ✓ goes idle |
| **B. Server restart** (no in-memory token, DB row still active) | true | false | ✓ released | ✓ goes idle |
| **C. Newer wake** (wake-1 done after wake-2 takes over the stream) | false | false | ✓ wake-1's row released | × stays running (wake-2 is in flight) |
| **D. Retry** (first done's `updateStatus` threw; same done retried) | false | true | ✓ no-op (already released) | ✓ goes idle |
| **E. Legacy stale-done test** (test setup never materialized active claim; token evicted) | false | false | no row to release | × stays running (newer claim conceptually in flight) |

New tests in `packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340)` cover scenarios A–C. Existing tests in `server-claim-write-token.test.ts` cover D and E and continue to pass after the fix.

## Verified

- **Unit tests**: deterministic. Pre-fix, B and C fail (zero invocations of `materializeReleasedClaim`); D fails (`updateStatus` skipped on retry). Post-fix, all five scenarios produce the documented behavior.
- **Manual run-through** against a local desktop + agents-server: send one message, dispatch claims (`active_count: 0 → 1`), agent completes, runtime calls `sendDone` after `idleTimeout`, server fires the new release path, `active_count: 1 → 0`, entity status transitions back to `idle`.

## Not addressed in this PR

- **Pre-existing orphan rows**: rows that already leaked from prior runs of the unfixed code can't be released because no fresh `done` callback is coming. Would need a reaper job or admin command.
- **`lease_expires_at: null` issue (#4341)**: independent. Without a lease, even a reaper job can't time-out claims safely.

## Base branch note

This PR targets `fix-pull-wake` (#4339), not `main`, because `materializeReleasedClaim` was introduced in #4308 which is part of the `fix-pull-wake` lineage but not yet in `main`. Merge order: this → fix-pull-wake → main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
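The "three concerns, three gates" rule from the #4346 message can be restated as a pure decision function and checked against the scenario table. This is a sketch only: `DoneContext`, `ReleaseActions`, and `planRelease` are hypothetical names, and in the real handler `entityCleared` is produced by `materializeReleasedClaim` rather than known up front, and the status update also requires the entity to exist.

```typescript
// Hypothetical inputs; in the real code entityCleared comes back from the DB release call.
interface DoneContext {
  epochDefined: boolean;
  entityCleared: boolean;
  stillOwnsClaim: boolean;
}

// Which of the three release actions should run for a given done callback.
interface ReleaseActions {
  releaseDbRow: boolean; // materializeReleasedClaim
  markEntityIdle: boolean; // updateStatus(..., 'idle') + onEntityChanged
  clearToken: boolean; // clearStream
}

function planRelease(ctx: DoneContext): ReleaseActions {
  return {
    // (consumerId, epoch) is the DB key, so a defined epoch is sufficient.
    releaseDbRow: ctx.epochDefined,
    // Covers both server restart (cleared, no token) and retry (token, already cleared).
    markEntityIdle: ctx.entityCleared || ctx.stillOwnsClaim,
    // Never clear a token this consumer no longer holds.
    clearToken: ctx.stillOwnsClaim,
  };
}

// Scenarios A to E from the table, all with a defined epoch.
const scenarios: Record<string, DoneContext> = {
  A: { epochDefined: true, entityCleared: true, stillOwnsClaim: true },
  B: { epochDefined: true, entityCleared: true, stillOwnsClaim: false },
  C: { epochDefined: true, entityCleared: false, stillOwnsClaim: false },
  D: { epochDefined: true, entityCleared: false, stillOwnsClaim: true },
  E: { epochDefined: true, entityCleared: false, stillOwnsClaim: false },
};
```

Scenarios C and E share the same flags, so in this model the release call is still attempted in E and simply finds no row to release.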

KyleAMathews and others added 11 commits May 16, 2026 11:31

docs: add pull-wake health check design spec

d14e848

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update health check spec with principal rename and shape sync

afe87f2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: add pull-wake health check implementation plan

6c9979e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): address code review findings — add canonicalizePrincipal, …

e01b4d7

…dispatch-policy, server-utils, and electric-ax

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): strict no-compat — remove canonicalizePrincipal, validate …

83e2c41

…URL form, callers convert keys to URLs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): scope migration to runner-owned claims, fix default princi…

454ea9b

…pal keys, complete desktop constant replacement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): store principal URLs directly in constants, not keys

6530be3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(agents): add pull-wake runner health diagnostics

6c49982

fix(agents): address pull-wake health review findings

15aef19

chore: add changeset for pull-wake health diagnostics

83fb039

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

KyleAMathews and others added 3 commits May 16, 2026 17:51

feat(agents): surface pull-wake runtime diagnostics

7c3a0fb

chore: add changeset for pull-wake runner hardening

d6c5d4d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

KyleAMathews changed the title ~~feat(agents): add pull-wake runner health check and rename owner_user_id to owner_principal~~ feat(agents): pull-wake runner health check, principal rename, and lifecycle hardening May 17, 2026

KyleAMathews added 2 commits May 17, 2026 20:47

fix(agents): avoid delayed pull-wake session startup

e625468

chore: add changeset for pull-wake startup UI

897b7d4

kevin-dp added the claude label May 18, 2026

kevin-dp reviewed May 18, 2026


kevin-dp requested changes May 18, 2026


icehaunter requested changes May 18, 2026


KyleAMathews added 2 commits May 18, 2026 05:48

fix(agents): address pull-wake review blockers

8c841aa

KyleAMathews and others added 7 commits May 18, 2026 08:36

fix(agents-server): avoid send-time dispatch relinks

c3a7fc6

fix(agents): recover pull-wake dispatch races

f6c4d41

test(agents-server): cover pull-wake subscription stack

9f0a2ed

Merge remote-tracking branch 'origin/main' into fix-pull-wake

b5c248e

fix: restore service-scoped pull-wake subscriptions

6966077

fix: treat durable streams urls as opaque prefixes

761876d

fix(agents-server): route subscription control to stream-meta

f299919

fix(agents-desktop): default local send principal

6736276

icehaunter reviewed May 18, 2026


Comment thread packages/agents-server/src/routing/durable-streams-routing-adapter.ts Outdated

fix: remove stale durable streams consumer API

d266b29

KyleAMathews added 4 commits May 18, 2026 13:06

fix(agents-server): keep durable streams control urls opaque

c6eda8c

fix(agents-server): relink dispatch subscriptions on send

2e95443

fix(agents): use durable streams backend url in runtime tests

dbf7910

Route local desktop writes through main process

756b7dc

KyleAMathews added 2 commits May 18, 2026 20:18

Avoid local send preflights

846de02

Harden pull-wake runner invariants

e8241f1

icehaunter self-requested a review May 19, 2026 09:42

icehaunter self-assigned this May 19, 2026

kevin-dp approved these changes May 19, 2026


kevin-dp mentioned this pull request May 19, 2026

fix(agents-server): preserve consumer_claims.lease_expires_at across heartbeats #4353

Merged

some additional logging & removed stale docs

8f05752

icehaunter approved these changes May 19, 2026


icehaunter merged commit e126eba into main May 19, 2026
57 of 58 checks passed

icehaunter deleted the fix-pull-wake branch May 19, 2026 12:20

Conversation

KyleAMathews commented May 16, 2026

Summary

Root Cause

Approach

Runner Health Diagnostics

Principal Ownership

Pull-Wake Runner Lifecycle

Dispatch And Local Send Performance

Key Invariants

Non-goals

Verification

codecov Bot commented May 16, 2026

Codecov Report

claude Bot commented May 18, 2026

kevin-dp left a comment

kevin-dp commented May 18, 2026

What worked

Things turned up while testing

Suggestion

kevin-dp left a comment

claude Bot commented May 18, 2026

claude Bot commented May 18, 2026

claude Bot commented May 18, 2026

claude Bot commented May 19, 2026

Claude Code Review

Summary

What's Working Well

Issues Found
