[OpenShell] 'openshell forward list' reports STATUS: dead but there is no supervisor, auto-restart, or alerting — silently unreachable dashboards #874

@davidglogan

Description

openshell forward start <port> <sandbox> spawns an SSH-backed port-forward from the host to a port inside the sandbox pod. When the underlying SSH process dies (e.g. gateway pod restart, kubectl delete pod, transient network blip), the forward becomes unusable but openshell keeps the entry in its internal state and openshell forward list simply marks the STATUS as dead:

SANDBOX      BIND      PORT     PID        STATUS
my-assistant 127.0.0.1 18789    1803950    dead

No log message is emitted. No supervisor kicks in. No operator-visible warning appears. The dashboard at http://127.0.0.1:18789/ silently returns connection-refused. An operator tunnelling from a remote machine just sees "Unable to connect" with no indication of the cause.

The status field is already being computed — the same list command that prints dead has the information needed to act — but no code path consumes it.

Environment

  • OpenShell v0.0.26 (bundled with NemoClaw v0.0.17 and v0.0.18)
  • Host: Linux (x86_64); forward is an ssh process spawned by openshell forward start
  • Trigger: anything that breaks the underlying SSH tunnel — most commonly gateway pod crash or sandbox pod restart

Reproduction

  1. Start a managed forward:
    openshell forward start 18789 my-assistant -g nemoclaw --background
    openshell forward list       # STATUS: running
    
  2. Break the underlying transport — easiest is kubectl delete pod -n openshell my-assistant on the cluster, which terminates the sandbox the SSH session is bound to.
  3. Wait ~30 seconds for the new pod to be Ready.
  4. openshell forward list now shows STATUS: dead. The forward process is gone. Port 18789 no longer listens on the host.
  5. Nothing else happens. The dead status is the only signal. The dashboard remains unreachable until the operator manually runs openshell forward stop 18789 and then openshell forward start 18789 <sandbox> --background again.

Expected behaviour

One of:

  1. Supervisor default on: openshell forward start should, by default, keep the forward alive — if the underlying ssh dies, the supervisor restarts it. Expose --no-auto-restart for operators who want the current behaviour.
  2. Explicit restart command: at minimum, add openshell forward restart <port> as a first-class command that's idempotent (no-op if already running, restart if dead). Currently the only recovery is stop then start --background manually.
  3. Alerting: emit a warning to the gateway logs whenever a forward is observed dead on list, or expose a way to subscribe to status changes.

Observability-without-remediation is a sharp edge for any production-adjacent deployment.
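As a rough illustration of option 3, the sketch below compares two snapshots of forward status and produces one warning line for each forward that went from running to dead. Everything here is hypothetical: the `Status` enum, the `ForwardKey` alias, the `newly_dead_warnings` function, and the warning text are illustration names, not existing OpenShell APIs or log formats.

```rust
use std::collections::BTreeMap;

/// Liveness as rendered in the STATUS column (illustrative type, not OpenShell's).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Running,
    Dead,
}

/// Key for one forward: (sandbox name, host port).
pub type ForwardKey = (String, u16);

/// Hypothetical helper: return one warning per forward that was running in
/// `previous` and is dead in `current`. Forwards that were already dead, or
/// that are new in `current`, produce no warning.
pub fn newly_dead_warnings(
    previous: &BTreeMap<ForwardKey, Status>,
    current: &BTreeMap<ForwardKey, Status>,
) -> Vec<String> {
    current
        .iter()
        .filter(|(key, status)| {
            **status == Status::Dead && previous.get(*key) == Some(&Status::Running)
        })
        .map(|((sandbox, port), _)| {
            format!("WARN forward {sandbox}:{port} transitioned running -> dead")
        })
        .collect()
}
```

Warning only on the running-to-dead transition, rather than on every dead entry, avoids repeating the same message each time list is run.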

Actual behaviour

openshell forward --help:

  start  Start forwarding a local port to a sandbox
  stop   Stop a background port forward
  list   List active port forwards

No restart, no supervise, no monitor, no reconcile. Three verbs, none of which act on the dead status.

Additional sharp edge discovered

When an operator runs openshell forward start <port> <sandbox> without --background, the command blocks on a foreground SSH tunnel. When the operator's own ssh session to the host drops (e.g. the operator logs out of the host), the foreground command dies with it, but not cleanly: the child ssh process may persist, bound to the port and untracked by openshell. openshell forward list then reports "No active forwards" while ss -tlnp shows a process still holding the port, and openshell forward stop 18789 says "No active forward found." Operators are left with an orphan tunnel that serves traffic but cannot be managed by openshell.

Fix suggestion: warn or error when openshell forward start is invoked without --background over a non-interactive ssh session (e.g., detect SSH_CONNECTION env + non-TTY stdin).
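A minimal sketch of that detection, assuming the check runs at the top of the start command. The function names are hypothetical; `SSH_CONNECTION` is the variable sshd sets for sessions it starts, and `std::io::IsTerminal` needs Rust 1.70 or newer.

```rust
use std::env;
use std::io::{self, IsTerminal};

/// Pure decision (hypothetical name): warn when a foreground forward is
/// started inside an SSH session whose stdin is not a terminal.
pub fn should_warn_foreground(background: bool, in_ssh_session: bool, stdin_is_tty: bool) -> bool {
    !background && in_ssh_session && !stdin_is_tty
}

/// Same decision, fed from the real process environment.
pub fn should_warn_foreground_now(background: bool) -> bool {
    // Present when the process was started from an sshd-managed session.
    let in_ssh_session = env::var_os("SSH_CONNECTION").is_some();
    // False when stdin is a pipe, a file, or /dev/null.
    let stdin_is_tty = io::stdin().is_terminal();
    should_warn_foreground(background, in_ssh_session, stdin_is_tty)
}
```

Keeping the decision separate from the environment lookup makes the rule easy to unit-test and easy to tighten later, for example to also warn on interactive SSH sessions.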

Source citation

crates/openshell-core/src/forward.rs (full implementation; ~150 lines). Functions defined:

  • forward_pid_dir() — where PID files live
  • forward_pid_path(name, port) — compute PID file path
  • write_forward_pid(name, port, pid, sandbox_id, bind_addr) — record a forward
  • find_ssh_forward_pid(sandbox_id, port) — fall back to pgrep to find the actual SSH PID
  • read_forward_pid(name, port) — parse PID file back
  • pid_is_alive(pid) — kill -0 <pid> check
  • pid_matches_forward(pid, port, sandbox_id) — validate the PID still matches an SSH command line

There is no restart function, no supervisor, no retry, no healthcheck loop. pid_is_alive is how the STATUS column is computed — it's a point-in-time check, nothing more. When pid_is_alive(pid) returns false, the CLI renders dead and does nothing else.

That means the dead status in openshell forward list is observation without remediation by design. Fixing this bug means adding new code paths, not changing existing ones.
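To make the gap concrete, here is a hedged sketch of a point-in-time liveness probe and of a consumer that acts on its result. This is not the code in forward.rs: the body of `pid_is_alive` below is only one way to perform a `kill -0` check (it goes through `sh` so the shell's built-in kill is used), and `RestartAction` and `restart_action` are hypothetical names.

```rust
use std::process::{Command, Stdio};

/// Point-in-time probe equivalent to `kill -0 <pid>`.
/// Caveat: also returns false when the process exists but belongs to another
/// user, because the signal permission check fails.
pub fn pid_is_alive(pid: u32) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -0 {pid}"))
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

/// What an idempotent restart would do for one forward (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum RestartAction {
    /// No PID file was recorded for this port.
    NotFound,
    /// The recorded process is alive; leave it alone.
    Noop,
    /// The recorded process is gone; start a new one.
    Respawn,
}

/// Turn the liveness observation into an action instead of only rendering it.
pub fn restart_action(recorded_pid: Option<u32>, is_alive: impl Fn(u32) -> bool) -> RestartAction {
    match recorded_pid {
        None => RestartAction::NotFound,
        Some(pid) if is_alive(pid) => RestartAction::Noop,
        Some(_) => RestartAction::Respawn,
    }
}
```

A real implementation would presumably also run pid_matches_forward before treating a live PID as healthy, since a recycled PID can pass the liveness check while belonging to an unrelated process.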

Impact

  • Silent dashboard unreachability after routine operations (pod restart, CR edit, rolling update). This is the "disappearing dashboard" symptom several users have reported.
  • Operator mental model mismatch: openshell forward list looks like a managed service list, but the entries are not actually being managed.
  • Related: contributes to a larger downstream NemoClaw bug in which a pod restart leaves the gateway and port-forward dead and recovery is hidden inside connect. Because there is no dedicated supervisor, recovery is only reachable as a side effect of the interactive nemoclaw connect command.

Suggested fix (preference order)

  • Short term: add openshell forward restart <port> as an idempotent command. No-op if alive, restart if dead. Trivial to implement — all the PID-detection plumbing already exists.
  • Medium term: have openshell forward list auto-restart any dead entries it observes (with a --no-auto-restart opt-out), or emit a log warning on every observed dead forward.
  • Medium term: warn or error when openshell forward start is invoked without --background over a non-interactive ssh session.
  • Longer term: a proper supervisor process that owns all forwards and keeps them alive.
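For the longer-term item, a supervisor would mostly be a liveness check plus a bounded retry with backoff. The sketch below shows one supervision pass over a single forward. All names are hypothetical, the 1 s base and 30 s cap are arbitrary illustration values, and the liveness check, respawn, and sleep are passed in as closures so the logic does not depend on any OpenShell internals.

```rust
use std::time::Duration;

/// Delay before restart attempt `attempt` (0-based): doubles from `base`,
/// capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// One supervision pass. Returns true if the forward is alive when the pass
/// ends, false if every restart attempt failed.
pub fn supervise_once(
    mut is_alive: impl FnMut() -> bool,
    mut respawn: impl FnMut() -> bool,
    mut sleep: impl FnMut(Duration),
    max_attempts: u32,
) -> bool {
    if is_alive() {
        return true; // healthy: nothing to do
    }
    for attempt in 0..max_attempts {
        // Wait before each attempt so a restarting pod has time to become Ready.
        sleep(backoff_delay(attempt, Duration::from_secs(1), Duration::from_secs(30)));
        if respawn() && is_alive() {
            return true;
        }
    }
    false
}
```

In a real supervisor the closures would wrap the existing PID-file helpers and `std::thread::sleep`, and a false return is the natural place to emit an operator-visible warning.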

Evidence

  • Live repro 2026-04-17: openshell forward list showed STATUS: dead after the pod recovered from a kubectl delete pod. Dashboard at 127.0.0.1:18789 returned connection-refused until the operator manually ran openshell forward stop 18789 && openshell forward start 18789 <sandbox> --background.
  • openshell forward --help output (v0.0.26, unchanged in NemoClaw v0.0.18 install).
  • ss -tlnp | grep 18789 during the orphan-ssh state showed ssh pid=2821709 bound to 127.0.0.1:18789, untracked by openshell forward list.
  • Source review of crates/openshell-core/src/forward.rs confirms no supervisor code exists.

Happy to attach a nemoclaw debug --output <file>.tgz capture if useful.
