Skip to content

Fix/discovery panic hardening#212

Open
EddieHouston wants to merge 2 commits intoBlockstream:new-indexfrom
EddieHouston:fix/discovery-panic-hardening
Open

Fix/discovery panic hardening#212
EddieHouston wants to merge 2 commits intoBlockstream:new-indexfrom
EddieHouston:fix/discovery-panic-hardening

Conversation

@EddieHouston
Copy link
Copy Markdown
Collaborator

Summary

Addresses a panic seen in production in the discovery-jobs thread that was visible in logs. The code had a comment: // FIXME This was an unreachable but it was reached.

  • Replace panicking assert! with warn log in remove_unhealthy_serviceassert!(server.services.remove(&job.service)) panicked when the service was absent from the healthy set due to corrupted state; now logs a warning and continues.
  • Handle the missing-addr branch — the adjacent else FIXME is upgraded to a structured warn with hostname and addr.
  • Recover poisoned RwLocks — all .unwrap() calls on queue and healthy lock acquisitions replaced with .unwrap_or_else(|e| e.into_inner()) so a prior panic doesn't permanently wedge the manager.
  • Wrap health-check loop in catch_unwindspawn_jobs_thread now catches any unhandled panic per iteration, logs an error, and keeps the thread alive.
  • Eliminate .pop().unwrap()run_health_check used a TOCTOU pattern (peek under read lock, then pop under write lock); replaced with let-else that returns early if the queue drains between the two lock acquisitions.
  • Add structured debug logging to save_healthy_service and remove_unhealthy_service.

Tests

Commit history is structured as tests-first: the test commit introduces 3 intentionally failing tests against the original code, and the fix commit turns them all green.

These tests require --features electrum-discovery to compile and run — the same flag needed to build the discovery module itself. CI does not currently run with this flag, so to verify locally:

cargo test --lib --features electrum-discovery -- discovery

Test What it proves
remove_unhealthy_service_missing_service_does_not_panic The original production panic is gone
remove_unhealthy_service_missing_addr_does_not_panic The FIXME-that-was-reached is handled
remove_unhealthy_service_removes_server_when_last_service_gone Normal removal still works
remove_unhealthy_service_keeps_server_when_other_services_remain Partial removal still works
poisoned_healthy_lock_does_not_panic_on_get_servers Poisoned lock recovery on read path
poisoned_queue_lock_does_not_panic_on_run_health_check Poisoned lock recovery on write path

The pre-existing live-network test is marked #[ignore] so it is skipped during local test runs unless explicitly opted in with -- --ignored.

  Three tests are intentionally failing against the current code:
  - remove_unhealthy_service_missing_service_does_not_panic (assert! panic)
  - poisoned_healthy_lock_does_not_panic_on_get_servers (unwrap on poisoned lock)
  - poisoned_queue_lock_does_not_panic_on_run_health_check (unwrap on poisoned lock)

  Also marks the pre-existing live-network test with #[ignore] so it
  does not run in CI.
  Replace assert! with warn! in remove_unhealthy_service, recover
  poisoned RwLocks via into_inner() throughout, and wrap the jobs loop
  body in catch_unwind so a panic cannot kill the discovery thread.
  Log the panic payload on catch so errors are actionable in production.
  Add comments documenting the two-layer defense, the lock-poison
  recovery tradeoff, and the single-consumer TOCTOU assumption.
@EddieHouston EddieHouston force-pushed the fix/discovery-panic-hardening branch from 0307c74 to 685527a Compare April 23, 2026 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant