Description
openshell forward start <port> <sandbox> spawns an SSH-backed port-forward from the host to a port inside the sandbox pod. When the underlying SSH process dies (e.g. gateway pod restart, kubectl delete pod, transient network blip), the forward becomes unusable but openshell keeps the entry in its internal state and openshell forward list simply marks the STATUS as dead:
SANDBOX       BIND       PORT   PID      STATUS
my-assistant  127.0.0.1  18789  1803950  dead
No log message is emitted. No supervisor kicks in. No operator-visible warning appears. The dashboard at http://127.0.0.1:18789/ silently returns connection-refused. An operator tunnelling from a remote machine just sees "Unable to connect" with no indication of the cause.
The status field is already being computed — the same list command that prints dead has the information needed to act — but no code path consumes it.
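Because the status is already in the list output, a consumer is easy to sketch outside of openshell. The following is a minimal shell sketch, not part of openshell: it assumes the five-column layout shown above (SANDBOX, BIND, PORT, PID, STATUS), and the function names `dead_forwards` and `warn_dead_forwards` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `openshell forward list` output on stdin and
# print "<port> <sandbox>" for every row whose STATUS column is "dead".
# Assumes the five-column layout shown in this issue; the header row never
# matches because its fifth column is the literal word STATUS.
dead_forwards() {
    awk 'NF == 5 && $5 == "dead" { print $3, $1 }'
}

# Hypothetical helper: emit one warning line on stderr per dead forward.
warn_dead_forwards() {
    dead_forwards | while read -r port sandbox; do
        echo "WARNING: forward to $sandbox on port $port is dead" >&2
    done
}
```

A cron job or wrapper could run `openshell forward list | warn_dead_forwards` to get the missing operator-visible signal until the CLI emits one itself.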
Environment
- OpenShell v0.0.26 (bundled with NemoClaw v0.0.17 and v0.0.18)
- Host: Linux (x86_64); the forward is an ssh process spawned by openshell forward start
- Trigger: anything that breaks the underlying SSH tunnel — most commonly gateway pod crash or sandbox pod restart
Reproduction
- Start a managed forward:
openshell forward start 18789 my-assistant -g nemoclaw --background
openshell forward list # STATUS: running
- Break the underlying transport. The easiest way is kubectl delete pod -n openshell my-assistant on the cluster, which terminates the sandbox the SSH session is bound to.
- Wait ~30 seconds for the new pod to be Ready.
- openshell forward list now shows STATUS: dead. The forward process is gone. Port 18789 no longer listens on the host.
- Nothing else happens. The dead status is the only signal. The dashboard remains unreachable until the operator manually runs openshell forward stop 18789 and then openshell forward start 18789 <sandbox> --background again.
Expected behaviour
One of:
- Supervisor default on: openshell forward start should, by default, keep the forward alive; if the underlying ssh dies, the supervisor restarts it. Expose --no-auto-restart for operators who want the current behaviour.
- Explicit restart command: at minimum, add openshell forward restart <port> as a first-class, idempotent command (no-op if already running, restart if dead). Currently the only recovery is a manual stop followed by start --background.
- Alerting: emit a warning to the gateway logs whenever a forward is observed dead on list, or expose a way to subscribe to status changes.
Observability-without-remediation is a sharp edge for any production-adjacent deployment.
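Until a restart verb exists, the idempotent behaviour described above can be approximated with a wrapper around the three existing verbs. This is a sketch under assumptions, not the proposed implementation: `forward_status`, `forward_restart`, and the `OPENSHELL` override variable are hypothetical names, the list layout is assumed to be the five columns shown in the Description, and starting a forward that has no entry at all is a design choice of this sketch. The `-g nemoclaw` flag from the reproduction is omitted and would need to be added where relevant.

```shell
#!/bin/sh
# OPENSHELL is a hypothetical override so the wrapper can be tested without
# the real CLI; it defaults to the openshell binary on PATH.
: "${OPENSHELL:=openshell}"

# forward_status <port>: print the STATUS column for <port>,
# or nothing if the port has no entry in the list.
forward_status() {
    "$OPENSHELL" forward list | awk -v p="$1" '$3 == p { print $5 }'
}

# forward_restart <port> <sandbox>:
#   running -> no-op
#   dead    -> stop the stale entry, then start again in the background
#   absent  -> start in the background (assumption of this sketch)
forward_restart() {
    port=$1
    sandbox=$2
    case "$(forward_status "$port")" in
        running) return 0 ;;
        dead)    "$OPENSHELL" forward stop "$port" ;;
    esac
    "$OPENSHELL" forward start "$port" "$sandbox" --background
}
```

Calling `forward_restart 18789 my-assistant` repeatedly is safe: it only touches the forward when the list reports it as dead or missing.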
Actual behaviour
openshell forward --help:
start Start forwarding a local port to a sandbox
stop Stop a background port forward
list List active port forwards
No restart, no supervise, no monitor, no reconcile. Three verbs, none of which act on the dead status.
Additional sharp edge discovered
When an operator runs openshell forward start <port> <sandbox> without --background, the command blocks on a foreground SSH tunnel. When that foreground session's ssh connection drops (e.g. the operator logs out of the host), the tunnel dies with it — but not cleanly. The child ssh process may persist, bound to the port, untracked by openshell. openshell forward list then reports "No active forwards" while ss -tlnp shows a process still holding the port. openshell forward stop 18789 says "No active forward found." Operators are left with an orphan tunnel that serves traffic but cannot be managed by openshell.
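The orphan state can be detected by comparing what is listening against what openshell tracks. The sketch below is an illustration only: `orphan_ssh_listeners` is a hypothetical name, it assumes the usual `ss -tlnp` layout in which the local address and port are the fourth column and the process column contains `"ssh"` and `pid=<n>`, and the caller has to supply the list of tracked ports.

```shell
#!/bin/sh
# orphan_ssh_listeners <tracked-ports>: read `ss -tlnp` output on stdin and
# print "<port> <pid>" for each ssh listener whose port is NOT in the
# space-separated list of ports that openshell still tracks.
orphan_ssh_listeners() {
    awk -v tracked=" $1 " '
        /"ssh"/ {
            n = split($4, addr, ":")   # column 4 is local address:port
            port = addr[n]             # last piece, so IPv6 colons are harmless
            if (index(tracked, " " port " ")) next
            if (match($0, /pid=[0-9]+/))
                print port, substr($0, RSTART + 4, RLENGTH - 4)
        }'
}
```

For example, `ss -tlnp | orphan_ssh_listeners "18789"` prints nothing when 18789 is tracked, and prints the port and PID of any other ssh listener. Reading the process column for processes owned by other users generally requires elevated privileges.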
Fix suggestion: warn or error when openshell forward start is invoked without --background over a non-interactive ssh session (e.g., detect SSH_CONNECTION env + non-TTY stdin).
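The detection suggested here needs only two checks. The following shell sketch shows the idea; the function names are hypothetical, and the real fix would live in the Rust CLI rather than in a wrapper.

```shell
#!/bin/sh
# Returns 0 (true) when the caller looks like a non-interactive SSH session:
# sshd sets SSH_CONNECTION for the session, and `[ -t 0 ]` is false when
# stdin is not attached to a terminal.
is_noninteractive_ssh() {
    [ -n "${SSH_CONNECTION:-}" ] && ! [ -t 0 ]
}

# Hypothetical guard to run before a foreground `openshell forward start`.
# Prints a warning and returns 1 when the foreground tunnel is likely to be
# orphaned; returns 0 otherwise.
guard_foreground_forward() {
    if is_noninteractive_ssh; then
        echo "warning: foreground forward over a non-interactive SSH session; consider --background" >&2
        return 1
    fi
    return 0
}
```

A wrapper could call `guard_foreground_forward || exit 1` to turn the warning into a hard error.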
Source citation
crates/openshell-core/src/forward.rs (full implementation; ~150 lines). Functions defined:
- forward_pid_dir(): where PID files live
- forward_pid_path(name, port): compute PID file path
- write_forward_pid(name, port, pid, sandbox_id, bind_addr): record a forward
- find_ssh_forward_pid(sandbox_id, port): fall back to pgrep to find the actual SSH PID
- read_forward_pid(name, port): parse PID file back
- pid_is_alive(pid): kill -0 <pid> check
- pid_matches_forward(pid, port, sandbox_id): validate the PID still matches an SSH command line
There is no restart function, no supervisor, no retry, no healthcheck loop. pid_is_alive is how the STATUS column is computed — it's a point-in-time check, nothing more. When pid_is_alive(pid) returns false, the CLI renders dead and does nothing else.
That means the dead status in openshell forward list is observation without remediation by design. Fixing this bug means adding new code paths, not changing existing ones.
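To make the point-in-time nature of the check concrete, here is a shell equivalent of the kill -0 test and the status it drives. It is an illustration of the technique, not a transcription of forward.rs; `render_status` is a hypothetical name.

```shell
#!/bin/sh
# pid_is_alive <pid>: `kill -0` delivers no signal; it only checks whether a
# signal could be sent. It also fails if the process exists but belongs to
# another user, so a false result is not a strict proof of death.
pid_is_alive() {
    kill -0 "$1" 2>/dev/null
}

# render_status <pid>: the only thing the check can produce is a label.
# Nothing here restarts, retries, or logs.
render_status() {
    if pid_is_alive "$1"; then
        echo running
    else
        echo dead
    fi
}
```

Because PIDs can be reused by unrelated processes, a liveness check alone can also report a false "running", which is presumably why the source pairs it with pid_matches_forward.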
Impact
- Silent dashboard unreachability after routine operations (pod restart, CR edit, rolling update). This is the "disappearing dashboard" symptom several users have reported.
- Operator mental model mismatch: openshell forward list looks like a managed service list, but the entries are not actually being managed.
- Related: contributes to a larger downstream NemoClaw bug, "pod restart leaves gateway and port-forward dead, recovery hidden inside connect". There, the recovery is only reachable as a side effect of the interactive nemoclaw connect command, because there is no dedicated supervisor.
Suggested fix (preference order)
- Short term: add openshell forward restart <port> as an idempotent command. No-op if alive, restart if dead. Trivial to implement, since all the PID-detection plumbing already exists.
- Medium term: have openshell forward list auto-restart any dead entries it observes (with a --no-auto-restart opt-out), or emit a log warning on every observed dead forward.
- Medium term: warn or error when openshell forward start is invoked without --background over a non-interactive ssh session.
- Longer term: a proper supervisor process that owns all forwards and keeps them alive.
Evidence
- Live repro 2026-04-17: openshell forward list showed STATUS: dead after the pod recovered from a kubectl delete pod. Dashboard at 127.0.0.1:18789 returned connection-refused until the operator manually ran openshell forward stop 18789 && openshell forward start 18789 <sandbox> --background.
- openshell forward --help output (v0.0.26, unchanged in NemoClaw v0.0.18 install).
- ss -tlnp | grep 18789 during the orphan-ssh state showed ssh pid=2821709 bound to 127.0.0.1:18789, untracked by openshell forward list.
- Source review of crates/openshell-core/src/forward.rs confirms no supervisor code exists.
Happy to attach a nemoclaw debug --output <file>.tgz capture if useful.