fix(help): indefinite hang on container --help / help / no-args #1455

Open

chrisgeo wants to merge 4 commits into apple:main from full-chaos:main

Conversation

@chrisgeo

Type of Change

  • Bug fix
  • New feature
  • [?] Breaking change
  • Documentation update

Motivation and Context

See the Summary below: rendering help must not depend on a reachable API server.

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

Summary

container --help, container help, and container (no args) hang indefinitely on systems where com.apple.container.apiserver is dead, wedged, or stale-registered in launchd. This matches the symptom pattern in #1329, #798, and #621.

This PR fixes both the immediate symptom and the deeper XPC-timeout defect that makes the hang unbounded. Four atomic commits:

  1. Document the investigation in docs/internal/help-freeze-analysis.md
  2. Skip the API server ping for help and no-args paths — small surgical fix in Application.swift, HelpCommand.swift, DefaultCommand.swift
  3. Make XPCClient.send's timeout actually cancel pending replies — replaces the broken TaskGroup pattern with a single-resume gate
  4. Split XPCClient.send into a mutating-safe send(_:) and an explicit send(_:timeoutForIdempotentRequest:) so mutating requests can no longer be silently timed out (added after adversarial review; see the fourth commit message below)

Why two fix commits, not one

There are two independent defects:

  • Defect A (symptom). Application.main's --help catch block calls createPluginLoader(), which calls ClientHealthCheck.ping() to fetch appRoot/installRoot/logRoot, purely so printModifiedHelpText can enrich help with plugin commands. The ping is structurally unnecessary — PluginLoader.alterCLIHelpText only reads pluginDirectories and pluginFactories.

  • Defect B (depth). XPCClient.send's responseTimeout doesn't actually unblock the function. It races Task.sleep against the XPC reply inside a withThrowingTaskGroup, but the structured-concurrency cleanup awaits the XPC child task, which is suspended in a withCheckedThrowingContinuation that only resumes when the C xpc_connection_send_message_with_reply callback fires. Cancelling a Swift Task does not cancel the C call. If the apiserver is wedged with no reply and no connection invalidation, the child never resumes and the group never returns — regardless of the supplied timeout.

Defect A is the freeze you see today on --help. Defect B is the reason the freeze is indefinite rather than 10s, and it affects every CLI path that calls XPCClient.send (the seven ClientHealthCheck.ping callers plus most of Sources/Services/ContainerAPIService/Client/).
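To make Defect B concrete, here is a minimal sketch of the problematic shape. The type and helper names are illustrative placeholders, not the actual XPCClient source:

```swift
import XPC

struct SketchTimeoutError: Error {}

func sketchSendWithRacedTimeout(
    over connection: xpc_connection_t,
    message: xpc_object_t,
    responseTimeout: Duration
) async throws -> xpc_object_t {
    try await withThrowingTaskGroup(of: xpc_object_t.self) { group in
        group.addTask {
            // Suspends until the reply handler runs. Cancelling this Swift task
            // does NOT cancel xpc_connection_send_message_with_reply.
            await withCheckedContinuation { (continuation: CheckedContinuation<xpc_object_t, Never>) in
                xpc_connection_send_message_with_reply(connection, message, nil) { reply in
                    continuation.resume(returning: reply)
                }
            }
        }
        group.addTask {
            try await Task.sleep(for: responseTimeout)
            throw SketchTimeoutError()
        }
        defer { group.cancelAll() }
        // If the timeout wins, the line below throws -- but structured
        // concurrency still awaits the wedged XPC child before the group scope
        // can unwind, so the caller never actually observes the timeout.
        return try await group.next()!
    }
}
```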

If you'd prefer to land just one of the two, please say so — the commits are independent.

Why the simpler XPC patch wasn't used

The minimal patch for Defect B would have been to wrap the continuation in withTaskCancellationHandler and call xpc_connection_cancel in onCancel. That works for one-shot clients (ClientHealthCheck, ClientImage, ClientKernel, ClientVolume, ClientDiskUsage) but would silently brick the long-lived reusable clients in:

  • Sources/Services/ContainerAPIService/Client/ContainerClient.swift:36
  • Sources/Services/ContainerAPIService/Client/NetworkClient.swift:56

xpc_connection_cancel is irreversible — every subsequent send on a cancelled connection fails with XPC_ERROR_CONNECTION_INVALID. So we use a single-resume gate (ResumptionState) instead. The XPC reply callback is allowed to fire whenever it wants; we just stop waiting for it after timeout/cancel. The connection stays valid for the next send.
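For reviewers who want the gist without opening the diff, here is a minimal sketch of the single-resume-gate idea. The names (OneShotGate, sketchSend, XPCSendTimedOut) are illustrative stand-ins for the PR's ResumptionState code, and the withTaskCancellationHandler wrapping is elided for brevity:

```swift
import Foundation
import XPC

struct XPCSendTimedOut: Error {}

// Whichever of reply / timeout wins resumes the continuation exactly once; the
// loser's tryResume is a no-op, so a late XPC reply is dropped without touching
// the (still valid) connection.
final class OneShotGate: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<xpc_object_t, Error>?

    init(_ continuation: CheckedContinuation<xpc_object_t, Error>) {
        self.continuation = continuation
    }

    @discardableResult
    func tryResume(with result: Result<xpc_object_t, Error>) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        guard let pending = continuation else { return false }  // someone already won
        continuation = nil
        pending.resume(with: result)
        return true
    }
}

func sketchSend(
    over connection: xpc_connection_t,
    message: xpc_object_t,
    timeout: Duration
) async throws -> xpc_object_t {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<xpc_object_t, Error>) in
        let gate = OneShotGate(continuation)

        xpc_connection_send_message_with_reply(connection, message, nil) { reply in
            gate.tryResume(with: .success(reply))              // no-op if the timeout won
        }

        Task {
            try? await Task.sleep(for: timeout)
            gate.tryResume(with: .failure(XPCSendTimedOut()))  // no-op if the reply won
        }
    }
}
```

The important property is that tryResume is the only way to resume the continuation, so a reply that arrives after the timeout is a silent no-op rather than a double resume or a connection cancellation.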

Known regression in this PR

Plugin enrichment in --help / help / no-args output is removed. The output is the original OVERVIEW: ... USAGE: ... CONTAINER SUBCOMMANDS: ... block, without a trailing PLUGINS: section. We chose this minimal-blast-radius patch over a "filesystem-only plugin discovery" refactor so the freeze fix isn't entangled with API design questions on PluginLoader. The follow-up shape is sketched in docs/internal/help-freeze-analysis.md — happy to do it in this PR or as a follow-up; let us know your preference.

Verification

  • swift build --product container — clean

  • make test — 366/366 unit tests pass

  • make swift-fmt-check — clean

  • Manual smoke test on macOS 26 with com.apple.container.apiserver registered with launchd but inactive (active count = 0, state = spawn scheduled):

    $ time .build/debug/container --help
    ... full help output ...
    real  0m0.11s
    
    $ time .build/debug/container help
    ... full help output ...
    real  0m0.11s
    
    $ time .build/debug/container
    ... full help output ...
    real  0m0.11s
    

    All exit 0 immediately. Before this PR these would hang indefinitely.

What this PR explicitly does not fix

  1. ClientHealthCheck.ping's 60s default xpcRegistrationTimeout (all current call sites override to 2–10s)
  2. Reusable ContainerClient / NetworkClient calling XPCClient.send without a timeout
  3. launchctl bootout-style recovery as a built-in CLI command

These are all viable follow-ups. See the analysis doc for the full list.

Question for reviewers

I'd like to add a unit test that injects a non-firing reply into an XPCClient to lock in both the timeout behavior and the reusable-client guarantee. Would you prefer it in this PR, or as a follow-up?

Workaround for affected users (until merged)

launchctl bootout gui/$(id -u)/com.apple.container.apiserver 2>/dev/null
launchctl bootout user/$(id -u)/com.apple.container.apiserver 2>/dev/null
container system start

Full reasoning, the call-chain trace, the audit of affected XPC callers, and the design tradeoffs are in docs/internal/help-freeze-analysis.md. The doc is intentionally written for adversarial review — please push back on anything that doesn't hold up.

@chrisgeo chrisgeo marked this pull request as draft April 27, 2026 13:26

Adds docs/internal/help-freeze-analysis.md describing the two defects that
combine to produce an indefinite hang on `container --help` when
`com.apple.container.apiserver` is dead, wedged, or stale-registered in
launchd:

  A. The help path requires the daemon to be reachable, because
     `Application.main` calls `createPluginLoader()` (which pings the
     API server) before printing help.

  B. `XPCClient.send`'s timeout cannot actually unblock the function:
     the structured TaskGroup must await the XPC child task, which is
     suspended in a `withCheckedThrowingContinuation` that only resumes
     when the C callback fires.

The document is intended to be reviewed alongside the two follow-up
commits that implement the fixes.

Help rendering must not depend on `com.apple.container.apiserver` being
reachable. When the API server is dead, wedged, or stale-registered in
launchd, the previous behavior was an indefinite hang on:

  container --help
  container help
  container

All three paths called `Application.createPluginLoader()` (which pings
the API server to fetch `appRoot`/`installRoot`/`logRoot`) just to
enrich the help text with plugin commands. The ping is structurally
unnecessary for help: `PluginLoader.alterCLIHelpText` only reads
`pluginDirectories` and `pluginFactories`.

This commit removes the call from each help path and extends
`printModifiedHelpText` with an optional `unavailableMessage:` so that
contexts which deliberately skipped plugin loading do not print the
misleading 'PLUGINS: not available, run `container system start`'
notice. `DefaultCommand` is reordered so the API server is contacted
only when there is a plugin command to dispatch.
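For illustration, a hedged sketch of the shape described above — the real printModifiedHelpText lives in the CLI target and only the `unavailableMessage:` label is taken from this commit; the other parameter names are guesses:

```swift
func printModifiedHelpText(
    _ baseHelpText: String,
    enrichedBy enrich: ((String) -> String)?,       // e.g. PluginLoader.alterCLIHelpText
    unavailableMessage: String? = nil
) {
    if let enrich {
        print(enrich(baseHelpText))                  // plugins loaded: enrich as before
    } else if let unavailableMessage {
        print(baseHelpText + "\nPLUGINS:\n  \(unavailableMessage)")  // plugins expected, daemon unreachable
    } else {
        print(baseHelpText)                          // help / --help / no-args: deliberately skipped loading
    }
}
```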

Plugin enrichment in help output is removed by this commit. A follow-up
can restore it by extracting filesystem-only plugin discovery from
`PluginLoader.findPlugins` (see docs/internal/help-freeze-analysis.md
for the proposed shape).

Verified by running each path with no apiserver running on macOS 26:
all three return immediately with exit 0 and the original
`OVERVIEW: A container platform for macOS` block.

The previous implementation raced `Task.sleep` against the XPC reply
inside a `withThrowingTaskGroup`. When the timeout won, structured
concurrency required the group to await the XPC child task before the
group scope could return — but that child was suspended in a
`withCheckedThrowingContinuation` that only resumes when the C
`xpc_connection_send_message_with_reply` callback fires. Cancelling a
Swift Task does not cancel the underlying C call. If the remote service
was wedged (no reply, no connection invalidation), the child never
resumed and the group never returned, regardless of the supplied
`responseTimeout`. The `responseTimeout` parameter was therefore
silently ineffective in exactly the failure mode it was meant to
mitigate.

This commit replaces the TaskGroup with a single-resume gate
(`ResumptionState`) over a `CheckedContinuation` wrapped in a
`withTaskCancellationHandler`. The continuation is resumed by whichever
of the following completes first:

  1. The XPC reply callback fires.
  2. `responseTimeout` elapses.
  3. The current Task is cancelled.

Late completions from the other paths are dropped silently, so the
underlying XPC connection remains valid for subsequent sends. This is
required for callers that hold a long-lived `XPCClient`
(`ContainerClient`, `NetworkClient`); a simpler design that called
`xpc_connection_cancel` on timeout would brick those clients after a
single timed-out send.

Tradeoffs documented in docs/internal/help-freeze-analysis.md:
  - On timeout/cancel, the eventual late XPC reply is retained by XPC
    until the connection is released. For short-lived clients this is
    GC'd within milliseconds; for long-lived reusable clients the worst
    case is one orphaned `xpc_object_t` per timed-out send.
  - The unstructured `Task` that runs the timeout sleep is not
    cancelled when the parent task is cancelled; it wakes up later and
    becomes a no-op via `tryResume`.

Reviewers: a unit test that injects a connection with a non-firing reply
would meaningfully cover both the timeout path and the reusable-client
guarantee. Happy to add it in this PR or as a follow-up — preference?

The Codex adversarial review of this branch flagged that
`XPCClient.send(_:responseTimeout:)` can drop a late XPC reply after
its timeout fires. For idempotent reads (`ClientHealthCheck.ping`,
`list` operations) that is a deliberate tradeoff: the connection
remains valid for subsequent sends and the next caller can re-issue
the request. For mutating operations the same behavior is unsafe:
the caller surfaces `.timeout`, the user retries, and the original
operation may still commit on the server — duplicate or out-of-order
container/network state under any slow-but-not-dead daemon.

An independent audit of the call sites contradicted the operator note
written when the freeze fix was first proposed. Four mutating call
sites were already reaching the unsafe path:

  - ContainerClient.create (containerCreate, 60s default via xpcSend)
  - NetworkClient.create   (networkCreate,  60s default via xpcSend)
  - NetworkClient.delete   (networkDelete,  60s default via xpcSend)
  - SandboxClient.create   (sandboxCreateEndpoint, 60s timeout: param)

This commit removes that footgun at the API surface so future call
sites cannot reach for it by accident:

  send(_:)                                 -- mutating-safe; no timeout
  send(_:timeoutForIdempotentRequest:)     -- explicit; late-reply drop
                                              acknowledged at call site

The old `responseTimeout:` spelling is retained as
`@available(*, unavailable, ...)` so any reintroduction in a future
patch fails to build with a teaching error pointing at the two
overloads.
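A hedged sketch of the resulting surface (parameter and type names are illustrative; bodies are stubbed — the point is the overload split and the unavailable tombstone):

```swift
import XPC

public struct SketchXPCClient {
    /// Mutating-safe: no client-side timeout. Completes only when the daemon
    /// replies or the connection is invalidated; cancellation is honored only
    /// before the message is dispatched.
    public func send(_ message: xpc_object_t) async throws -> xpc_object_t {
        fatalError("sketch only")
    }

    /// Idempotent reads only: the call site explicitly acknowledges that a late
    /// reply may be dropped after `timeout` and the request re-issued.
    public func send(
        _ message: xpc_object_t,
        timeoutForIdempotentRequest timeout: Duration
    ) async throws -> xpc_object_t {
        fatalError("sketch only")
    }

    /// Tombstone: reintroducing the old spelling fails to build with a teaching
    /// error that points at the two overloads above.
    @available(*, unavailable, message: "Use send(_:) for mutating requests or send(_:timeoutForIdempotentRequest:) for idempotent reads")
    public func send(_ message: xpc_object_t, responseTimeout: Duration) async throws -> xpc_object_t {
        fatalError("unavailable")
    }
}
```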

Cancellation contract:

  - send(_:) checks Task.isCancelled before dispatch via
    Task.checkCancellation(); after dispatch, cancellation is ignored
    and the call completes only when the daemon replies or the
    underlying connection is invalidated. Honoring cancellation after
    dispatch would re-introduce the same late-commit ambiguity as a
    timeout.
  - send(_:timeoutForIdempotentRequest:) keeps the existing
    reply/timeout/cancellation race semantics, with late replies
    dropped silently so reusable clients keep working.

Call-site migrations:

  - ContainerClient gains an `xpcSendIdempotent(message:timeout:)`
    helper. `create` uses the no-timeout `xpcSend(message:)`; `list`
    uses the idempotent helper with its existing 10s bound.
  - NetworkClient (APIService) follows the same split: `create` and
    `delete` use the no-timeout helper; `list` keeps its 1s bound via
    the idempotent helper.
  - SandboxClient.create drops its `timeout:` parameter; the only
    caller (ContainersService) was already passing the default.
  - ClientHealthCheck.ping calls the idempotent overload with a
    non-optional Duration. All seven ping callers in ContainerCommands
    are unchanged at the call site.

Tests: a new ContainerXPCTests target uses an in-process
`xpc_endpoint_create`-based listener so the contract can be exercised
without a live mach service. Six tests cover both overloads:

  - idempotentTimeoutReturnsWithinBound — verifies the .timeout error
    code (not .interrupted) and that elapsed time is within the
    expected window
  - reusableClientSurvivesIdempotentTimeout — same XPCClient instance
    survives a timeout and can complete a follow-up send
  - lateReplyAfterIdempotentTimeoutIsIgnoredCleanly — server replies
    after the client has timed out; subsequent send still works
  - plainSendCompletesWhenServerReplies — happy path
  - plainSendIgnoresCancellationAfterDispatch — Task.cancel() after
    dispatch must NOT short-circuit; the task waits for the reply
  - plainSendHonorsCancellationBeforeDispatch — pre-dispatch
    cancellation surfaces CancellationError

What this commit does not address:

  - No idempotency token or recovery query (Codex's third suggestion).
    This commit prevents the unsafe combination at the API; it does not
    give callers a way to safely time out a mutating request and then
    ask the daemon "did it actually commit?".
  - Reusable ContainerClient/NetworkClient mutating calls now have no
    timeout (correctly so, per the new contract). Wedged-daemon
    scenarios will hang those callers indefinitely; the user-visible
    workaround (`launchctl bootout`) remains the only escape today.

Both are reasonable follow-ups but out of scope for closing the freeze
regression.
@chrisgeo chrisgeo marked this pull request as ready for review April 27, 2026 20:14
@chrisgeo chrisgeo changed the title from "DRAFT: fix(help): indefinite hang on container --help / help / no-args" to "fix(help): indefinite hang on container --help / help / no-args" Apr 27, 2026