[OpenShell] 'openshell forward list' reports STATUS: dead but there is no supervisor, auto-restart, or alerting — silently unreachable dashboards

## Description

`openshell forward start <port> <sandbox>` spawns an SSH-backed port-forward from the host to a port inside the sandbox pod. When the underlying SSH process dies (e.g. gateway pod restart, `kubectl delete pod`, transient network blip), the forward becomes unusable but openshell keeps the entry in its internal state and `openshell forward list` simply marks the STATUS as `dead`:

```
SANDBOX      BIND      PORT     PID        STATUS
my-assistant 127.0.0.1 18789    1803950    dead
```

No log message is emitted. No supervisor kicks in. No operator-visible warning appears. The dashboard at `http://127.0.0.1:18789/` silently returns connection-refused. An operator tunnelling from a remote machine just sees "Unable to connect" with no indication of the cause.

The status field is already being computed — the same `list` command that prints `dead` has the information needed to act — but no code path consumes it.
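To illustrate how little is needed to consume that status, here is a minimal shell sketch that reads `openshell forward list` output and turns every `dead` row into a warning. It assumes the five-column layout shown above (SANDBOX, BIND, PORT, PID, STATUS) with one header row; the function names are hypothetical and not part of openshell.

```shell
# Print "SANDBOX PORT PID" for every row whose STATUS column is "dead".
# Reads `openshell forward list` output on stdin.
# Assumption: columns are SANDBOX BIND PORT PID STATUS, preceded by one header row.
dead_forwards() {
    awk 'NR > 1 && $5 == "dead" { print $1, $3, $4 }'
}

# Turn each dead row into an operator-visible warning on stderr.
warn_dead_forwards() {
    dead_forwards | while read -r sandbox port pid; do
        echo "WARNING: forward $port -> $sandbox (pid $pid) is dead" >&2
    done
}
```

Usage would be `openshell forward list | warn_dead_forwards`, for example from a cron job, until something equivalent exists in the CLI itself.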

## Environment

- OpenShell v0.0.26 (bundled with NemoClaw v0.0.17 and v0.0.18)
- Host: Linux (x86_64); forward is an ssh process spawned by `openshell forward start`
- Trigger: anything that breaks the underlying SSH tunnel — most commonly gateway pod crash or sandbox pod restart

## Reproduction

1. Start a managed forward:
   ```
   openshell forward start 18789 my-assistant -g nemoclaw --background
   openshell forward list       # STATUS: running
   ```
2. Break the underlying transport — easiest is `kubectl delete pod -n openshell my-assistant` on the cluster, which terminates the sandbox the SSH session is bound to.
3. Wait ~30 seconds for the new pod to be Ready.
4. `openshell forward list` now shows `STATUS: dead`. The forward process is gone. Port 18789 no longer listens on the host.
5. Nothing else happens. The `dead` status is the only signal. The dashboard remains unreachable until the operator manually runs `openshell forward stop 18789` and then `openshell forward start 18789 <sandbox> --background` again.

## Expected behaviour

One of:

1. **Supervisor default on:** `openshell forward start` should, by default, keep the forward alive — if the underlying ssh dies, the supervisor restarts it. Expose `--no-auto-restart` for operators who want the current behaviour.
2. **Explicit restart command:** at minimum, add `openshell forward restart <port>` as a first-class command that's idempotent (no-op if already running, restart if dead). Currently the only recovery is `stop` then `start --background` manually.
3. **Alerting:** emit a warning to the gateway logs whenever a forward is observed dead on `list`, or expose a way to subscribe to status changes.

Observability-without-remediation is a sharp edge for any production-adjacent deployment.
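As a rough sketch of what option 2 could look like from the outside, the wrapper below emulates an idempotent restart using only the existing `list`, `stop`, and `start` verbs. `forward_restart` is a hypothetical name, and the column positions are assumed from the `list` output shown in the description.

```shell
# forward_restart PORT [extra `forward start` args...]
# No-op if the forward on PORT is not dead; stop + start if it is dead.
# Returns 2 if `openshell forward list` has no row for PORT.
forward_restart() {
    port=$1; shift
    # Find the row for this port. Assumption: columns are SANDBOX BIND PORT PID STATUS.
    row=$(openshell forward list |
        awk -v p="$port" 'NR > 1 && $3 == p { print $1, $5; exit }')
    [ -n "$row" ] || { echo "no forward tracked on port $port" >&2; return 2; }
    sandbox=${row% *}     # first field of "SANDBOX STATUS"
    status=${row##* }     # last field of "SANDBOX STATUS"
    if [ "$status" != "dead" ]; then
        echo "forward $port -> $sandbox is $status; nothing to do"
        return 0
    fi
    # The same two commands an operator has to run by hand today.
    openshell forward stop "$port"
    openshell forward start "$port" "$sandbox" "$@" --background
}
```

For the repro above this would be called as `forward_restart 18789 -g nemoclaw`. A built-in command could skip the text parsing and read the PID files directly.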

## Actual behaviour

`openshell forward --help`:

```
  start  Start forwarding a local port to a sandbox
  stop   Stop a background port forward
  list   List active port forwards
```

No `restart`, no `supervise`, no `monitor`, no `reconcile`. Three verbs, none of which act on the `dead` status.

## Additional sharp edge discovered

When an operator runs `openshell forward start <port> <sandbox>` *without* `--background`, the command blocks on a foreground SSH tunnel. When the operator's own SSH session to the host drops (e.g. the operator logs out of the host), the foreground command dies with it, but not cleanly. The child ssh process may persist, bound to the port, untracked by openshell. `openshell forward list` then reports "No active forwards" while `ss -tlnp` shows a process still holding the port. `openshell forward stop 18789` says "No active forward found." Operators are left with an orphan tunnel that serves traffic but cannot be managed by openshell.
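The orphan state can be detected by cross-checking the two views. The sketch below is a hypothetical helper, not openshell functionality: it pulls the listening PID for a port out of `ss -tlnp` and reports it when `openshell forward list` has no row for that port. It assumes the usual `ss` layout (local address in the fourth column, process info as `users:(("ssh",pid=N,fd=M))`); note that `ss -p` only shows process details for processes the caller is allowed to inspect.

```shell
# Print the PID listening on PORT, given `ss -tlnp` output on stdin.
# Assumption: column 4 is "ADDR:PORT" and the process column contains "pid=N".
listener_pid() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            if (match($0, /pid=[0-9]+/))
                print substr($0, RSTART + 4, RLENGTH - 4)
        }'
}

# Succeed (and print a message) when something listens on PORT
# but `openshell forward list` has no row for that port.
is_orphan_forward() {
    pid=$(ss -tlnp | listener_pid "$1")
    [ -n "$pid" ] || return 1          # nothing is listening at all
    if openshell forward list |
        awk -v p="$1" 'NR > 1 && $3 == p { found = 1 } END { exit !found }'; then
        return 1                       # tracked by openshell, not an orphan
    fi
    echo "port $1 held by untracked pid $pid"
}
```

Once `is_orphan_forward 18789` has reported the PID, the only cleanup available today is killing that process by hand.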

Fix suggestion: warn or error when `openshell forward start` is invoked without `--background` over a non-interactive ssh session (e.g., detect `SSH_CONNECTION` env + non-TTY stdin).
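A minimal sketch of that detection, written as a shell wrapper rather than as a change to openshell itself. Both function names are hypothetical; the check is simply "`SSH_CONNECTION` is set and stdin is not a terminal", as suggested above.

```shell
# Succeeds when the caller is inside an SSH session with no terminal on stdin.
in_noninteractive_ssh() {
    [ -n "${SSH_CONNECTION:-}" ] && ! [ -t 0 ]
}

# Hypothetical guard around `openshell forward start`:
# refuse a foreground start in that context, pass everything else through.
guarded_forward_start() {
    case " $* " in
        *" --background "*) ;;   # background starts are unaffected
        *)
            if in_noninteractive_ssh; then
                echo "refusing foreground forward in a non-interactive SSH session; add --background" >&2
                return 1
            fi
            ;;
    esac
    openshell forward start "$@"
}
```

Whether this should be a warning or a hard error is a design choice; the sketch uses an error because the failure mode it guards against leaves an unmanageable process behind.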

## Source citation

`crates/openshell-core/src/forward.rs` (full implementation; ~150 lines). Functions defined:

- `forward_pid_dir()` — where PID files live
- `forward_pid_path(name, port)` — compute PID file path
- `write_forward_pid(name, port, pid, sandbox_id, bind_addr)` — record a forward
- `find_ssh_forward_pid(sandbox_id, port)` — fall back to pgrep to find the actual SSH PID
- `read_forward_pid(name, port)` — parse PID file back
- `pid_is_alive(pid)` — `kill -0 <pid>` check
- `pid_matches_forward(pid, port, sandbox_id)` — validate the PID still matches an SSH command line

There is **no restart function, no supervisor, no retry, no healthcheck loop.** `pid_is_alive` is how the STATUS column is computed — it's a point-in-time check, nothing more. When `pid_is_alive(pid)` returns `false`, the CLI renders `dead` and does nothing else.

That means the `dead` status in `openshell forward list` is observation without remediation by design. Fixing this bug means adding new code paths, not changing existing ones.
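For readers who have not opened the file, this is roughly what the described check amounts to, rendered in shell rather than Rust. It is an illustration of the behaviour described above, not a transcription of `forward.rs`; `forward_status` is a hypothetical name.

```shell
# Point-in-time liveness probe: signal 0 delivers nothing, it only tests
# whether the PID exists and the caller is allowed to signal it.
# Caveat: a live process owned by another user also fails this check.
pid_is_alive() {
    kill -0 "$1" 2>/dev/null
}

# What the STATUS column boils down to: one probe, one word, no follow-up action.
forward_status() {
    if pid_is_alive "$1"; then
        echo running
    else
        echo dead
    fi
}
```

Nothing here loops, retries, or records the transition, which is why a forward can sit in `dead` indefinitely.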

## Impact

- **Silent dashboard unreachability** after routine operations (pod restart, CR edit, rolling update). This is the "disappearing dashboard" symptom several users have reported.
- **Operator mental model mismatch:** `openshell forward list` looks like a managed service list, but the entries are not actually being managed.
- **Related:** contributes to a larger downstream NemoClaw bug where `pod restart leaves gateway and port-forward dead, recovery hidden inside connect` — the recovery is only reachable as a side effect of the interactive `nemoclaw connect` command, because there's no dedicated supervisor.

## Suggested fix (preference order)

- **Short term:** add `openshell forward restart <port>` as an idempotent command. No-op if alive, restart if dead. Trivial to implement — all the PID-detection plumbing already exists.
- **Medium term:** have `openshell forward list` auto-restart any `dead` entries it observes (with a `--no-auto-restart` opt-out), or emit a log warning on every observed dead forward.
- **Medium term:** warn or error when `openshell forward start` is invoked without `--background` over a non-interactive ssh session.
- **Longer term:** a proper supervisor process that owns all forwards and keeps them alive.
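Until a real supervisor exists, the longer-term behaviour can be approximated with a polling loop around the existing commands. The sketch below is an operator-side workaround under stated assumptions: `watch_forwards` is a hypothetical name, the column layout is taken from the `list` output in the description, and any extra flags the original `start` used (such as `-g nemoclaw`) would have to be added by hand.

```shell
# watch_forwards INTERVAL [MAX_PASSES]
# Every INTERVAL seconds, restart each forward that `list` reports as dead.
# MAX_PASSES = 0 (the default) means run until interrupted.
watch_forwards() {
    interval=$1
    max_passes=${2:-0}
    pass=0
    while :; do
        # Assumption: columns are SANDBOX BIND PORT PID STATUS after one header row.
        openshell forward list |
            awk 'NR > 1 && $5 == "dead" { print $1, $3 }' |
            while read -r sandbox port; do
                echo "$(date '+%Y-%m-%dT%H:%M:%S') restarting dead forward $port -> $sandbox" >&2
                # </dev/null keeps the commands from swallowing the remaining rows.
                openshell forward stop "$port" </dev/null
                openshell forward start "$port" "$sandbox" --background </dev/null
            done
        pass=$((pass + 1))
        [ "$max_passes" -gt 0 ] && [ "$pass" -ge "$max_passes" ] && break
        sleep "$interval"
    done
}
```

Running `watch_forwards 10` in a terminal multiplexer or a systemd user unit would cover the pod-restart case from the reproduction, with the caveat that a restart attempted before the new pod is Ready may fail and only succeed on a later pass.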

## Evidence

- Live repro 2026-04-17: `openshell forward list` showed `STATUS: dead` after the sandbox pod recovered from a `kubectl delete pod`. The dashboard at `127.0.0.1:18789` returned connection-refused until the operator manually ran `openshell forward stop 18789 && openshell forward start 18789 <sandbox> --background`.
- `openshell forward --help` output (v0.0.26, unchanged in NemoClaw v0.0.18 install).
- `ss -tlnp | grep 18789` during the orphan-ssh state showed `ssh pid=2821709` bound to `127.0.0.1:18789`, untracked by `openshell forward list`.
- Source review of `crates/openshell-core/src/forward.rs` confirms no supervisor code exists.

Happy to attach a `nemoclaw debug --output <file>.tgz` capture if useful.


