fix(help): indefinite hang on container --help / help / no-args #1455

Open

chrisgeo wants to merge 4 commits into apple:main from full-chaos:main

Conversation

@chrisgeo

Type of Change

  • Bug fix
  • New feature
  • [?] Breaking change
  • Documentation update

Motivation and Context

See the Summary below: rendering help must not depend on a reachable API server.

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

Summary

container --help, container help, and container (no args) hang indefinitely on systems where com.apple.container.apiserver is dead, wedged, or stale-registered in launchd. This matches the symptom pattern in #1329, #798, and #621.

This PR fixes both the immediate symptom and the deeper XPC-timeout defect that makes the hang unbounded. Four atomic commits:

  1. Document the investigation in docs/internal/help-freeze-analysis.md
  2. Skip the API server ping for help and no-args paths — small surgical fix in Application.swift, HelpCommand.swift, DefaultCommand.swift
  3. Make XPCClient.send's timeout actually cancel pending replies — replaces the broken TaskGroup pattern with a single-resume gate
  4. Split XPCClient.send into a mutating-safe send(_:) and an explicit send(_:timeoutForIdempotentRequest:) so mutating requests can no longer be silently timed out (added after adversarial review; see the fourth commit message below)

Why two fix commits, not one

There are two independent defects:

  • Defect A (symptom). Application.main's --help catch block calls createPluginLoader(), which calls ClientHealthCheck.ping() to fetch appRoot/installRoot/logRoot, purely so printModifiedHelpText can enrich help with plugin commands. The ping is structurally unnecessary — PluginLoader.alterCLIHelpText only reads pluginDirectories and pluginFactories.

  • Defect B (depth). XPCClient.send's responseTimeout doesn't actually unblock the function. It races Task.sleep against the XPC reply inside a withThrowingTaskGroup, but the structured-concurrency cleanup awaits the XPC child task, which is suspended in a withCheckedThrowingContinuation that only resumes when the C xpc_connection_send_message_with_reply callback fires. Cancelling a Swift Task does not cancel the C call. If the apiserver is wedged with no reply and no connection invalidation, the child never resumes and the group never returns — regardless of the supplied timeout.

Defect A is the freeze you see today on --help. Defect B is the reason the freeze is indefinite rather than 10s, and it affects every CLI path that calls XPCClient.send (the seven ClientHealthCheck.ping callers plus most of Sources/Services/ContainerAPIService/Client/).
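To make Defect B concrete, here is a minimal sketch of the problematic shape. The type and helper names are illustrative placeholders, not the actual XPCClient source:

```swift
import XPC

struct SketchTimeoutError: Error {}

func sketchSendWithRacedTimeout(
    over connection: xpc_connection_t,
    message: xpc_object_t,
    responseTimeout: Duration
) async throws -> xpc_object_t {
    try await withThrowingTaskGroup(of: xpc_object_t.self) { group in
        group.addTask {
            // Suspends until the reply handler runs. Cancelling this Swift task
            // does NOT cancel xpc_connection_send_message_with_reply.
            await withCheckedContinuation { (continuation: CheckedContinuation<xpc_object_t, Never>) in
                xpc_connection_send_message_with_reply(connection, message, nil) { reply in
                    continuation.resume(returning: reply)
                }
            }
        }
        group.addTask {
            try await Task.sleep(for: responseTimeout)
            throw SketchTimeoutError()
        }
        defer { group.cancelAll() }
        // If the timeout wins, the line below throws -- but structured
        // concurrency still awaits the wedged XPC child before the group scope
        // can unwind, so the caller never actually observes the timeout.
        return try await group.next()!
    }
}
```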

If you'd prefer to land just one of the two, please say so — the commits are independent.

Why the simpler XPC patch wasn't used

The minimal patch for Defect B would have been to wrap the continuation in withTaskCancellationHandler and call xpc_connection_cancel in onCancel. That works for one-shot clients (ClientHealthCheck, ClientImage, ClientKernel, ClientVolume, ClientDiskUsage) but would silently brick the long-lived reusable clients in:

  • Sources/Services/ContainerAPIService/Client/ContainerClient.swift:36
  • Sources/Services/ContainerAPIService/Client/NetworkClient.swift:56

xpc_connection_cancel is irreversible — every subsequent send on a cancelled connection fails with XPC_ERROR_CONNECTION_INVALID. So we use a single-resume gate (ResumptionState) instead. The XPC reply callback is allowed to fire whenever it wants; we just stop waiting for it after timeout/cancel. The connection stays valid for the next send.
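For reviewers who want the gist without opening the diff, here is a minimal sketch of the single-resume-gate idea. The names (OneShotGate, sketchSend, XPCSendTimedOut) are illustrative stand-ins for the PR's ResumptionState code, and the withTaskCancellationHandler wrapping is elided for brevity:

```swift
import Foundation
import XPC

struct XPCSendTimedOut: Error {}

// Whichever of reply / timeout wins resumes the continuation exactly once; the
// loser's tryResume is a no-op, so a late XPC reply is dropped without touching
// the (still valid) connection.
final class OneShotGate: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<xpc_object_t, Error>?

    init(_ continuation: CheckedContinuation<xpc_object_t, Error>) {
        self.continuation = continuation
    }

    @discardableResult
    func tryResume(with result: Result<xpc_object_t, Error>) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        guard let pending = continuation else { return false }  // someone already won
        continuation = nil
        pending.resume(with: result)
        return true
    }
}

func sketchSend(
    over connection: xpc_connection_t,
    message: xpc_object_t,
    timeout: Duration
) async throws -> xpc_object_t {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<xpc_object_t, Error>) in
        let gate = OneShotGate(continuation)

        xpc_connection_send_message_with_reply(connection, message, nil) { reply in
            gate.tryResume(with: .success(reply))              // no-op if the timeout won
        }

        Task {
            try? await Task.sleep(for: timeout)
            gate.tryResume(with: .failure(XPCSendTimedOut()))  // no-op if the reply won
        }
    }
}
```

The important property is that tryResume is the only way to resume the continuation, so a reply that arrives after the timeout is a silent no-op rather than a double resume or a connection cancellation.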

Known regression in this PR

Plugin enrichment in --help / help / no-args output is removed. The output is the original OVERVIEW: ... USAGE: ... CONTAINER SUBCOMMANDS: ... block, without a trailing PLUGINS: section. We chose this minimal-blast-radius patch over a "filesystem-only plugin discovery" refactor so the freeze fix isn't entangled with API design questions on PluginLoader. The follow-up shape is sketched in docs/internal/help-freeze-analysis.md — happy to do it in this PR or as a follow-up; let us know your preference.

Verification

  • swift build --product container — clean

  • make test — 366/366 unit tests pass

  • make swift-fmt-check — clean

  • Manual smoke test on macOS 26 with com.apple.container.apiserver registered with launchd but inactive (active count = 0, state = spawn scheduled):

    $ time .build/debug/container --help
    ... full help output ...
    real  0m0.11s
    
    $ time .build/debug/container help
    ... full help output ...
    real  0m0.11s
    
    $ time .build/debug/container
    ... full help output ...
    real  0m0.11s
    

    All exit 0 immediately. Before this PR these would hang indefinitely.

What this PR explicitly does not fix

  1. ClientHealthCheck.ping's 60s default xpcRegistrationTimeout (all current call sites override to 2–10s)
  2. Reusable ContainerClient / NetworkClient calling XPCClient.send without a timeout
  3. launchctl bootout-style recovery as a built-in CLI command

These are all viable follow-ups. See the analysis doc for the full list.

Question for reviewers

I'd like to add a unit test that injects a non-firing reply into an XPCClient to lock in both the timeout behavior and the reusable-client guarantee. Would you prefer it in this PR, or as a follow-up?

Workaround for affected users (until merged)

launchctl bootout gui/$(id -u)/com.apple.container.apiserver 2>/dev/null
launchctl bootout user/$(id -u)/com.apple.container.apiserver 2>/dev/null
container system start

Full reasoning, the call-chain trace, the audit of affected XPC callers, and the design tradeoffs are in docs/internal/help-freeze-analysis.md. The doc is intentionally written for adversarial review — please push back on anything that doesn't hold up.

@chrisgeo chrisgeo marked this pull request as draft April 27, 2026 13:26

Adds docs/internal/help-freeze-analysis.md describing the two defects that
combine to produce an indefinite hang on `container --help` when
`com.apple.container.apiserver` is dead, wedged, or stale-registered in
launchd:

  A. The help path requires the daemon to be reachable, because
     `Application.main` calls `createPluginLoader()` (which pings the
     API server) before printing help.

  B. `XPCClient.send`'s timeout cannot actually unblock the function:
     the structured TaskGroup must await the XPC child task, which is
     suspended in a `withCheckedThrowingContinuation` that only resumes
     when the C callback fires.

The document is intended to be reviewed alongside the two follow-up
commits that implement the fixes.

Help rendering must not depend on `com.apple.container.apiserver` being
reachable. When the API server is dead, wedged, or stale-registered in
launchd, the previous behavior was an indefinite hang on:

  container --help
  container help
  container

All three paths called `Application.createPluginLoader()` (which pings
the API server to fetch `appRoot`/`installRoot`/`logRoot`) just to
enrich the help text with plugin commands. The ping is structurally
unnecessary for help: `PluginLoader.alterCLIHelpText` only reads
`pluginDirectories` and `pluginFactories`.

This commit removes the call from each help path and extends
`printModifiedHelpText` with an optional `unavailableMessage:` so that
contexts which deliberately skipped plugin loading do not print the
misleading 'PLUGINS: not available, run `container system start`'
notice. `DefaultCommand` is reordered so the API server is contacted
only when there is a plugin command to dispatch.
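For illustration, a hedged sketch of the shape described above — the real printModifiedHelpText lives in the CLI target and only the `unavailableMessage:` label is taken from this commit; the other parameter names are guesses:

```swift
func printModifiedHelpText(
    _ baseHelpText: String,
    enrichedBy enrich: ((String) -> String)?,       // e.g. PluginLoader.alterCLIHelpText
    unavailableMessage: String? = nil
) {
    if let enrich {
        print(enrich(baseHelpText))                  // plugins loaded: enrich as before
    } else if let unavailableMessage {
        print(baseHelpText + "\nPLUGINS:\n  \(unavailableMessage)")  // plugins expected, daemon unreachable
    } else {
        print(baseHelpText)                          // help / --help / no-args: deliberately skipped loading
    }
}
```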

Plugin enrichment in help output is removed by this commit. A follow-up
can restore it by extracting filesystem-only plugin discovery from
`PluginLoader.findPlugins` (see docs/internal/help-freeze-analysis.md
for the proposed shape).

Verified by running each path with no apiserver running on macOS 26:
all three return immediately with exit 0 and the original
`OVERVIEW: A container platform for macOS` block.

The previous implementation raced `Task.sleep` against the XPC reply
inside a `withThrowingTaskGroup`. When the timeout won, structured
concurrency required the group to await the XPC child task before the
group scope could return — but that child was suspended in a
`withCheckedThrowingContinuation` that only resumes when the C
`xpc_connection_send_message_with_reply` callback fires. Cancelling a
Swift Task does not cancel the underlying C call. If the remote service
was wedged (no reply, no connection invalidation), the child never
resumed and the group never returned, regardless of the supplied
`responseTimeout`. The `responseTimeout` parameter was therefore
silently ineffective in exactly the failure mode it was meant to
mitigate.

This commit replaces the TaskGroup with a single-resume gate
(`ResumptionState`) over a `CheckedContinuation` wrapped in a
`withTaskCancellationHandler`. The continuation is resumed by whichever
of the following completes first:

  1. The XPC reply callback fires.
  2. `responseTimeout` elapses.
  3. The current Task is cancelled.

Late completions from the other paths are dropped silently, so the
underlying XPC connection remains valid for subsequent sends. This is
required for callers that hold a long-lived `XPCClient`
(`ContainerClient`, `NetworkClient`); a simpler design that called
`xpc_connection_cancel` on timeout would brick those clients after a
single timed-out send.

Tradeoffs documented in docs/internal/help-freeze-analysis.md:
  - On timeout/cancel, the eventual late XPC reply is retained by XPC
    until the connection is released. For short-lived clients this is
    GC'd within milliseconds; for long-lived reusable clients the worst
    case is one orphaned `xpc_object_t` per timed-out send.
  - The unstructured `Task` that runs the timeout sleep is not
    cancelled when the parent task is cancelled; it wakes up later and
    becomes a no-op via `tryResume`.

Reviewers: a unit test that injects a connection with a non-firing reply
would meaningfully cover both the timeout path and the reusable-client
guarantee. Happy to add it in this PR or as a follow-up — preference?

The Codex adversarial review of this branch flagged that
`XPCClient.send(_:responseTimeout:)` can drop a late XPC reply after
its timeout fires. For idempotent reads (`ClientHealthCheck.ping`,
`list` operations) that is a deliberate tradeoff: the connection
remains valid for subsequent sends and the next caller can re-issue
the request. For mutating operations the same behavior is unsafe:
the caller surfaces `.timeout`, the user retries, and the original
operation may still commit on the server — duplicate or out-of-order
container/network state under any slow-but-not-dead daemon.

An independent audit of the call sites contradicted the operator note
written when the freeze fix was first proposed. Four mutating call
sites were already reaching the unsafe path:

  - ContainerClient.create (containerCreate, 60s default via xpcSend)
  - NetworkClient.create   (networkCreate,  60s default via xpcSend)
  - NetworkClient.delete   (networkDelete,  60s default via xpcSend)
  - SandboxClient.create   (sandboxCreateEndpoint, 60s timeout: param)

This commit removes that footgun at the API surface so future call
sites cannot reach for it by accident:

  send(_:)                                 -- mutating-safe; no timeout
  send(_:timeoutForIdempotentRequest:)     -- explicit; late-reply drop
                                              acknowledged at call site

The old `responseTimeout:` spelling is retained as
`@available(*, unavailable, ...)` so any reintroduction in a future
patch fails to build with a teaching error pointing at the two
overloads.
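A hedged sketch of the resulting surface (parameter and type names are illustrative; bodies are stubbed — the point is the overload split and the unavailable tombstone):

```swift
import XPC

public struct SketchXPCClient {
    /// Mutating-safe: no client-side timeout. Completes only when the daemon
    /// replies or the connection is invalidated; cancellation is honored only
    /// before the message is dispatched.
    public func send(_ message: xpc_object_t) async throws -> xpc_object_t {
        fatalError("sketch only")
    }

    /// Idempotent reads only: the call site explicitly acknowledges that a late
    /// reply may be dropped after `timeout` and the request re-issued.
    public func send(
        _ message: xpc_object_t,
        timeoutForIdempotentRequest timeout: Duration
    ) async throws -> xpc_object_t {
        fatalError("sketch only")
    }

    /// Tombstone: reintroducing the old spelling fails to build with a teaching
    /// error that points at the two overloads above.
    @available(*, unavailable, message: "Use send(_:) for mutating requests or send(_:timeoutForIdempotentRequest:) for idempotent reads")
    public func send(_ message: xpc_object_t, responseTimeout: Duration) async throws -> xpc_object_t {
        fatalError("unavailable")
    }
}
```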

Cancellation contract:

  - send(_:) checks Task.isCancelled before dispatch via
    Task.checkCancellation(); after dispatch, cancellation is ignored
    and the call completes only when the daemon replies or the
    underlying connection is invalidated. Honoring cancellation after
    dispatch would re-introduce the same late-commit ambiguity as a
    timeout.
  - send(_:timeoutForIdempotentRequest:) keeps the existing
    reply/timeout/cancellation race semantics, with late replies
    dropped silently so reusable clients keep working.

Call-site migrations:

  - ContainerClient gains an `xpcSendIdempotent(message:timeout:)`
    helper. `create` uses the no-timeout `xpcSend(message:)`; `list`
    uses the idempotent helper with its existing 10s bound.
  - NetworkClient (APIService) follows the same split: `create` and
    `delete` use the no-timeout helper; `list` keeps its 1s bound via
    the idempotent helper.
  - SandboxClient.create drops its `timeout:` parameter; the only
    caller (ContainersService) was already passing the default.
  - ClientHealthCheck.ping calls the idempotent overload with a
    non-optional Duration. All seven ping callers in ContainerCommands
    are unchanged at the call site.

Tests: a new ContainerXPCTests target uses an in-process
`xpc_endpoint_create`-based listener so the contract can be exercised
without a live mach service. Six tests cover both overloads:

  - idempotentTimeoutReturnsWithinBound — verifies the .timeout error
    code (not .interrupted) and that elapsed time is within the
    expected window
  - reusableClientSurvivesIdempotentTimeout — same XPCClient instance
    survives a timeout and can complete a follow-up send
  - lateReplyAfterIdempotentTimeoutIsIgnoredCleanly — server replies
    after the client has timed out; subsequent send still works
  - plainSendCompletesWhenServerReplies — happy path
  - plainSendIgnoresCancellationAfterDispatch — Task.cancel() after
    dispatch must NOT short-circuit; the task waits for the reply
  - plainSendHonorsCancellationBeforeDispatch — pre-dispatch
    cancellation surfaces CancellationError

What this commit does not address:

  - No idempotency token or recovery query (Codex's third suggestion).
    This commit prevents the unsafe combination at the API; it does not
    give callers a way to safely time out a mutating request and then
    ask the daemon "did it actually commit?".
  - Reusable ContainerClient/NetworkClient mutating calls now have no
    timeout (correctly so, per the new contract). Wedged-daemon
    scenarios will hang those callers indefinitely; the user-visible
    workaround (`launchctl bootout`) remains the only escape today.

Both are reasonable follow-ups but out of scope for closing the freeze
regression.
@chrisgeo chrisgeo marked this pull request as ready for review April 27, 2026 20:14
@chrisgeo chrisgeo changed the title from "DRAFT: fix(help): indefinite hang on container --help / help / no-args" to "fix(help): indefinite hang on container --help / help / no-args" Apr 27, 2026