Fix/discovery panic hardening by EddieHouston · Pull Request #212 · Blockstream/electrs

EddieHouston · 2026-04-23T12:13:52Z

Summary

Addresses a panic seen in production in the discovery-jobs thread that was visible in logs. The code had a comment: // FIXME This was an unreachable but it was reached.

Replace panicking assert! with warn log in remove_unhealthy_service — assert!(server.services.remove(&job.service)) panicked when the service was absent from the healthy set due to corrupted state; now logs a warning and continues.
Handle the missing-addr branch — the adjacent else FIXME is upgraded to a structured warn with hostname and addr.
Recover poisoned RwLocks — all .unwrap() calls on queue and healthy lock acquisitions replaced with .unwrap_or_else(|e| e.into_inner()) so a prior panic doesn't permanently wedge the manager.
Wrap health-check loop in catch_unwind — spawn_jobs_thread now catches any unhandled panic per iteration, logs an error, and keeps the thread alive.
Eliminate .pop().unwrap() — run_health_check used a TOCTOU pattern (peek under read lock, then pop under write lock); replaced with let-else that returns early if the queue drains between the two lock acquisitions.
Add structured debug logging to save_healthy_service and remove_unhealthy_service.

Tests

Commit history is structured as tests-first: the test commit introduces 3 intentionally failing tests against the original code, and the fix commit turns them all green.

These tests require --features electrum-discovery to compile and run — the same flag needed to build the discovery module itself. CI does not currently run with this flag, so to verify locally:

cargo test --lib --features electrum-discovery -- discovery

Test	What it proves
`remove_unhealthy_service_missing_service_does_not_panic`	The original production panic is gone
`remove_unhealthy_service_missing_addr_does_not_panic`	The FIXME-that-was-reached is handled
`remove_unhealthy_service_removes_server_when_last_service_gone`	Normal removal still works
`remove_unhealthy_service_keeps_server_when_other_services_remain`	Partial removal still works
`poisoned_healthy_lock_does_not_panic_on_get_servers`	Poisoned lock recovery on read path
`poisoned_queue_lock_does_not_panic_on_run_health_check`	Poisoned lock recovery on write path

The pre-existing live-network test is marked #[ignore] so it is skipped during local test runs unless explicitly opted in with -- --ignored.

Three tests are intentionally failing against the current code: - remove_unhealthy_service_missing_service_does_not_panic (assert! panic) - poisoned_healthy_lock_does_not_panic_on_get_servers (unwrap on poisoned lock) - poisoned_queue_lock_does_not_panic_on_run_health_check (unwrap on poisoned lock) Also marks the pre-existing live-network test with #[ignore] so it does not run in CI.

Replace assert! with warn! in remove_unhealthy_service, recover poisoned RwLocks via into_inner() throughout, and wrap the jobs loop body in catch_unwind so a panic cannot kill the discovery thread. Log the panic payload on catch so errors are actionable in production. Add comments documenting the two-layer defense, the lock-poison recovery tradeoff, and the single-consumer TOCTOU assumption.

EddieHouston requested review from RCasatta, philippem and shesek April 23, 2026 12:13

EddieHouston added 2 commits April 23, 2026 14:14

EddieHouston force-pushed the fix/discovery-panic-hardening branch from 0307c74 to 685527a Compare April 23, 2026 12:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/discovery panic hardening#212

Fix/discovery panic hardening#212
EddieHouston wants to merge 2 commits intoBlockstream:new-indexfrom
EddieHouston:fix/discovery-panic-hardening

EddieHouston commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EddieHouston commented Apr 23, 2026

Summary

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant