fix(help): indefinite hang on container --help / help / no-args #1455
chrisgeo wants to merge 4 commits into apple:main from
Adds docs/internal/help-freeze-analysis.md describing the two defects that
combine to produce an indefinite hang on `container --help` when
`com.apple.container.apiserver` is dead, wedged, or stale-registered in
launchd:
A. The help path requires the daemon to be reachable, because
`Application.main` calls `createPluginLoader()` (which pings the
API server) before printing help.
B. `XPCClient.send`'s timeout cannot actually unblock the function:
the structured TaskGroup must await the XPC child task, which is
suspended in a `withCheckedThrowingContinuation` that only resumes
when the C callback fires.
The document is intended to be reviewed alongside the two follow-up
commits that implement the fixes.
Help rendering must not depend on `com.apple.container.apiserver` being reachable. When the API server is dead, wedged, or stale-registered in launchd, the previous behavior was an indefinite hang on:

    container --help
    container help
    container

All three paths called `Application.createPluginLoader()` (which pings the API server to fetch `appRoot`/`installRoot`/`logRoot`) just to enrich the help text with plugin commands. The ping is structurally unnecessary for help: `PluginLoader.alterCLIHelpText` only reads `pluginDirectories` and `pluginFactories`.

This commit removes the call from each help path and extends `printModifiedHelpText` with an optional `unavailableMessage:` so that contexts which deliberately skipped plugin loading do not print the misleading 'PLUGINS: not available, run `container system start`' notice. `DefaultCommand` is reordered so the API server is contacted only when there is a plugin command to dispatch.

Plugin enrichment in help output is removed by this commit. A follow-up can restore it by extracting filesystem-only plugin discovery from `PluginLoader.findPlugins` (see docs/internal/help-freeze-analysis.md for the proposed shape).

Verified by running each path with no apiserver running on macOS 26: all three return immediately with exit 0 and the original `OVERVIEW: A container platform for macOS` block.
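The reordering can be sketched as follows. This is a hypothetical simplification, not the actual `Application.swift`: the argument handling and the `printModifiedHelpText` body are stand-in stubs; only the control-flow shape (help returns before any daemon contact) is taken from the commit.

```swift
// Sketch only: help paths short-circuit before createPluginLoader(),
// so a dead or wedged apiserver can no longer block help output.
struct Application {
    static func run(arguments: [String]) -> Int32 {
        let wantsHelp = arguments.isEmpty
            || arguments.contains("--help")
            || arguments.first == "help"
        if wantsHelp {
            // Filesystem-only rendering; no XPC ping happens on this path.
            printModifiedHelpText(unavailableMessage: nil)
            return 0
        }
        // Only non-help paths would reach createPluginLoader(), and
        // DefaultCommand contacts the API server only when a plugin
        // command actually needs to be dispatched.
        return 0
    }

    static func printModifiedHelpText(unavailableMessage: String?) {
        print("OVERVIEW: A container platform for macOS")
        if let note = unavailableMessage {
            print(note)  // e.g. a deliberate "plugins skipped" notice
        }
    }
}
```

The key property is that the help branch never awaits anything, so no daemon state can delay it.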
The previous implementation raced `Task.sleep` against the XPC reply
inside a `withThrowingTaskGroup`. When the timeout won, structured
concurrency required the group to await the XPC child task before the
group scope could return — but that child was suspended in a
`withCheckedThrowingContinuation` that only resumes when the C
`xpc_connection_send_message_with_reply` callback fires. Cancelling a
Swift Task does not cancel the underlying C call. If the remote service
was wedged (no reply, no connection invalidation), the child never
resumed and the group never returned, regardless of the supplied
`responseTimeout`. The `responseTimeout` parameter was therefore
silently ineffective in exactly the failure mode it was meant to
mitigate.
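An illustrative reconstruction of the defective pattern (not the original source; the C reply callback is simulated with a stored closure) shows why the group scope cannot be escaped by the timeout:

```swift
import Foundation

enum XPCError: Error { case timeout }

// Stand-in for xpc_connection_send_message_with_reply: if the daemon is
// wedged, this stored callback is simply never invoked.
var pendingReply: ((String) -> Void)?

func brokenSend(responseTimeout: Duration) async throws -> String {
    try await withThrowingTaskGroup(of: String.self) { group in
        group.addTask {
            try await withCheckedThrowingContinuation { cont in
                // Resumes ONLY if the "C callback" fires; cancelling the
                // Swift child task does not resume this continuation.
                pendingReply = { cont.resume(returning: $0) }
            }
        }
        group.addTask {
            try await Task.sleep(for: responseTimeout)
            throw XPCError.timeout
        }
        defer { group.cancelAll() }
        // If the timeout child throws first, this rethrows -- but leaving
        // the group scope still awaits the XPC child. With no reply and no
        // connection invalidation, that await never completes, so the
        // function never returns despite the supplied timeout.
        return try await group.next()!
    }
}
```

When the callback does fire, the happy path works; when it never fires, the structured cleanup is what hangs, not the race itself.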
This commit replaces the TaskGroup with a single-resume gate
(`ResumptionState`) over a `CheckedContinuation` wrapped in a
`withTaskCancellationHandler`. The continuation is resumed by whichever
of the following completes first:
1. The XPC reply callback fires.
2. `responseTimeout` elapses.
3. The current Task is cancelled.
Late completions from the other paths are dropped silently, so the
underlying XPC connection remains valid for subsequent sends. This is
required for callers that hold a long-lived `XPCClient`
(`ContainerClient`, `NetworkClient`); a simpler design that called
`xpc_connection_cancel` on timeout would brick those clients after a
single timed-out send.
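A minimal sketch of the single-resume gate follows. The name `ResumptionState` comes from the commit, but this implementation is an assumption about its shape, and `dispatch` is a stand-in for `xpc_connection_send_message_with_reply`; the real cancellation path via `withTaskCancellationHandler` is noted in a comment rather than implemented.

```swift
import Foundation

// Hypothetical gate: exactly one of {reply, timeout, cancel} resumes the
// continuation; later completions are silent no-ops, and the underlying
// XPC connection is never cancelled.
final class ResumptionState<T>: @unchecked Sendable {
    private let lock = NSLock()
    private var resume: ((Result<T, Error>) -> Void)?

    init(onFirst resume: @escaping (Result<T, Error>) -> Void) {
        self.resume = resume
    }

    /// First caller wins; late completions are dropped so the connection
    /// stays valid for subsequent sends.
    @discardableResult
    func tryResume(with result: Result<T, Error>) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        guard let r = resume else { return false }
        resume = nil
        r(result)
        return true
    }
}

enum SendError: Error { case timeout }

// Sketch of the send shape. `dispatch` may call its reply closure once,
// late, or never -- just like a wedged daemon.
func send(timeout: Duration,
          dispatch: (@escaping (Result<String, Error>) -> Void) -> Void) async throws -> String {
    try await withCheckedThrowingContinuation { cont in
        let state = ResumptionState<String> { cont.resume(with: $0) }
        dispatch { state.tryResume(with: $0) }             // 1. XPC reply
        Task {                                             // 2. timeout
            try? await Task.sleep(for: timeout)
            state.tryResume(with: .failure(SendError.timeout))
        }
        // 3. In the real fix, a withTaskCancellationHandler funnels Task
        //    cancellation through the same gate (omitted here).
    }
}
```

Because the loser of the race merely fails `tryResume`, nothing ever touches the connection itself, which is what keeps long-lived clients usable after a timed-out send.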
Tradeoffs documented in docs/internal/help-freeze-analysis.md:
- On timeout/cancel, the eventual late XPC reply is retained by XPC
until the connection is released. For short-lived clients this is
GC'd within milliseconds; for long-lived reusable clients the worst
case is one orphaned `xpc_object_t` per timed-out send.
- The unstructured `Task` that runs the timeout sleep is not
cancelled when the parent task is cancelled; it wakes up later and
becomes a no-op via `tryResume`.
Reviewers: a unit test that injects a connection with a non-firing reply
would meaningfully cover both the timeout path and the reusable-client
guarantee. Happy to add it in this PR or as a follow-up — preference?
The Codex adversarial review of this branch flagged that
`XPCClient.send(_:responseTimeout:)` can drop a late XPC reply after
its timeout fires. For idempotent reads (`ClientHealthCheck.ping`,
`list` operations) that is a deliberate tradeoff: the connection
remains valid for subsequent sends and the next caller can re-issue
the request. For mutating operations the same behavior is unsafe:
the caller surfaces `.timeout`, the user retries, and the original
operation may still commit on the server — duplicate or out-of-order
container/network state under any slow-but-not-dead daemon.
An independent audit of the call sites contradicted the operator note
written when the freeze fix was first proposed. Four mutating call
sites were already reaching the unsafe path:
- ContainerClient.create (containerCreate, 60s default via xpcSend)
- NetworkClient.create (networkCreate, 60s default via xpcSend)
- NetworkClient.delete (networkDelete, 60s default via xpcSend)
- SandboxClient.create (sandboxCreateEndpoint, 60s timeout: param)
This commit removes that footgun at the API surface so future call
sites cannot reach for it by accident:
send(_:) -- mutating-safe; no timeout
send(_:timeoutForIdempotentRequest:) -- explicit; late-reply drop
acknowledged at call site
The old `responseTimeout:` spelling is retained as
`@available(*, unavailable, ...)` so any reintroduction in a future
patch fails to build with a teaching error pointing at the two
overloads.
Cancellation contract:
- send(_:) checks Task.isCancelled before dispatch via
Task.checkCancellation(); after dispatch, cancellation is ignored
and the call completes only when the daemon replies or the
underlying connection is invalidated. Honoring cancellation after
dispatch would re-introduce the same late-commit ambiguity as a
timeout.
- send(_:timeoutForIdempotentRequest:) keeps the existing
reply/timeout/cancellation race semantics, with late replies
dropped silently so reusable clients keep working.
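The resulting surface might look like the stub below. Only the two overload spellings, the `@available(*, unavailable, ...)` retirement of `responseTimeout:`, and the pre-dispatch `Task.checkCancellation()` come from the commit text; the types, bodies, and error message are illustrative.

```swift
import Foundation

struct XPCMessage {}

struct XPCClient {
    // Mutating-safe: no timeout. Cancellation is honored only before
    // dispatch; afterwards the call completes when the daemon replies or
    // the connection is invalidated.
    func send(_ message: XPCMessage) async throws -> XPCMessage {
        try Task.checkCancellation()  // pre-dispatch cancellation check
        // ... dispatch over XPC and await the reply (stubbed) ...
        return message
    }

    // Idempotent-only: the caller acknowledges at the call site that a
    // late reply may be dropped and the request re-issued.
    func send(_ message: XPCMessage,
              timeoutForIdempotentRequest: Duration) async throws -> XPCMessage {
        // ... reply/timeout/cancellation race via the single-resume gate ...
        return message
    }

    // The old spelling fails to build with a teaching error.
    @available(*, unavailable, message: "Use send(_:) for mutating requests or send(_:timeoutForIdempotentRequest:) for idempotent ones.")
    func send(_ message: XPCMessage, responseTimeout: Duration?) async throws -> XPCMessage {
        fatalError("unavailable")
    }
}
```

Any future call site that reaches for `responseTimeout:` gets a compile error pointing at the two intended overloads rather than silently unsafe behavior.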
Call-site migrations:
- ContainerClient gains an `xpcSendIdempotent(message:timeout:)`
helper. `create` uses the no-timeout `xpcSend(message:)`; `list`
uses the idempotent helper with its existing 10s bound.
- NetworkClient (APIService) follows the same split: `create` and
`delete` use the no-timeout helper; `list` keeps its 1s bound via
the idempotent helper.
- SandboxClient.create drops its `timeout:` parameter; the only
caller (ContainersService) was already passing the default.
- ClientHealthCheck.ping calls the idempotent overload with a
non-optional Duration. All seven ping callers in ContainerCommands
are unchanged at the call site.
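The call-site split above can be sketched with stub types (the helper names `xpcSend`/`xpcSendIdempotent` come from the commit; the routes, bodies, and stub client are invented for illustration):

```swift
import Foundation

struct XPCMessage { var route: String }

// Stub exposing only the two surviving overloads (assumed shape).
struct XPCClient {
    func send(_ m: XPCMessage) async throws -> XPCMessage { m }
    func send(_ m: XPCMessage, timeoutForIdempotentRequest: Duration) async throws -> XPCMessage { m }
}

struct ContainerClient {
    let client = XPCClient()

    // Mutating path: no timeout, so a user retry can never race a
    // still-pending commit on the daemon.
    func xpcSend(message: XPCMessage) async throws -> XPCMessage {
        try await client.send(message)
    }

    // Read path: bounded wait; a dropped late reply is safe because the
    // request can simply be re-issued.
    func xpcSendIdempotent(message: XPCMessage, timeout: Duration) async throws -> XPCMessage {
        try await client.send(message, timeoutForIdempotentRequest: timeout)
    }

    func create() async throws {
        _ = try await xpcSend(message: XPCMessage(route: "containerCreate"))
    }

    func list() async throws {
        _ = try await xpcSendIdempotent(message: XPCMessage(route: "containerList"), timeout: .seconds(10))
    }
}
```

The point of the split is that each call site names its own safety contract: a mutating route cannot compile against the timeout-bearing helper without an explicit, reviewable change.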
Tests: a new ContainerXPCTests target uses an in-process
`xpc_endpoint_create`-based listener so the contract can be exercised
without a live mach service. Six tests cover both overloads:
- idempotentTimeoutReturnsWithinBound — verifies the .timeout error
code (not .interrupted) and that elapsed time is within the
expected window
- reusableClientSurvivesIdempotentTimeout — same XPCClient instance
survives a timeout and can complete a follow-up send
- lateReplyAfterIdempotentTimeoutIsIgnoredCleanly — server replies
after the client has timed out; subsequent send still works
- plainSendCompletesWhenServerReplies — happy path
- plainSendIgnoresCancellationAfterDispatch — Task.cancel() after
dispatch must NOT short-circuit; the task waits for the reply
- plainSendHonorsCancellationBeforeDispatch — pre-dispatch
cancellation surfaces CancellationError
What this commit does not address:
- No idempotency token or recovery query (Codex's third suggestion).
This commit prevents the unsafe combination at the API; it does not
give callers a way to safely time out a mutating request and then
ask the daemon "did it actually commit?".
- Reusable ContainerClient/NetworkClient mutating calls now have no
timeout (correctly so, per the new contract). Wedged-daemon
scenarios will hang those callers indefinitely; the user-visible
workaround (`launchctl bootout`) remains the only escape today.
Both are reasonable follow-ups but out of scope for closing the freeze
regression.
Summary
`container --help`, `container help`, and `container` (no args) hang indefinitely on systems where `com.apple.container.apiserver` is dead, wedged, or stale-registered in launchd. This matches the symptom pattern in #1329, #798, and #621. This PR fixes both the immediate symptom and the deeper XPC-timeout defect that makes the hang unbounded. Three atomic commits:
1. Adds `docs/internal/help-freeze-analysis.md`
2. Decouples help rendering from the daemon (`Application.swift`, `HelpCommand.swift`, `DefaultCommand.swift`)
3. Makes the `XPCClient.send` timeout actually cancel pending replies — replaces the broken TaskGroup pattern with a single-resume gate

Why two commits, not one
There are two independent defects:
Defect A (symptom). `Application.main`'s `--help` catch block calls `createPluginLoader()` → `ClientHealthCheck.ping()` to fetch `appRoot`/`installRoot`/`logRoot` purely so `printModifiedHelpText` can enrich help with plugin commands. The ping is structurally unnecessary — `PluginLoader.alterCLIHelpText` only reads `pluginDirectories` and `pluginFactories`.

Defect B (depth). `XPCClient.send`'s `responseTimeout` doesn't actually unblock the function. It races `Task.sleep` against the XPC reply inside a `withThrowingTaskGroup`, but the structured-concurrency cleanup awaits the XPC child task, which is suspended in a `withCheckedThrowingContinuation` that only resumes when the C `xpc_connection_send_message_with_reply` callback fires. Cancelling a Swift Task does not cancel the C call. If the apiserver is wedged with no reply and no connection invalidation, the child never resumes and the group never returns — regardless of the supplied timeout.

Defect A is the freeze you see today on `--help`. Defect B is the reason the freeze is indefinite rather than 10s, and it affects every CLI path that calls `XPCClient.send` (the seven `ClientHealthCheck.ping` callers plus most of `Sources/Services/ContainerAPIService/Client/`).

If you'd prefer to land just one of the two, please say so — the commits are independent.
Why the simpler XPC patch wasn't used
The minimal patch for Defect B would have been to wrap the continuation in `withTaskCancellationHandler` and call `xpc_connection_cancel` in `onCancel`. That works for one-shot clients (`ClientHealthCheck`, `ClientImage`, `ClientKernel`, `ClientVolume`, `ClientDiskUsage`) but would silently brick the long-lived reusable clients in:

- `Sources/Services/ContainerAPIService/Client/ContainerClient.swift:36`
- `Sources/Services/ContainerAPIService/Client/NetworkClient.swift:56`

`xpc_connection_cancel` is irreversible — every subsequent `send` on a cancelled connection fails with `XPC_ERROR_CONNECTION_INVALID`. So we use a single-resume gate (`ResumptionState`) instead. The XPC reply callback is allowed to fire whenever it wants; we just stop waiting for it after timeout/cancel. The connection stays valid for the next send.

Known regression in this PR
Plugin enrichment in `--help` / `help` / no-args output is removed. The output is the original `OVERVIEW: ... USAGE: ... CONTAINER SUBCOMMANDS: ...` block, without a trailing `PLUGINS:` section. We chose this minimal-blast-radius patch over a "filesystem-only plugin discovery" refactor so the freeze fix isn't entangled with API design questions on `PluginLoader`. The follow-up shape is sketched in `docs/internal/help-freeze-analysis.md` — happy to do it in this PR or as a follow-up; let us know your preference.

Verification
- `swift build --product container` — clean
- `make test` — 366/366 unit tests pass
- `make swift-fmt-check` — clean

Manual smoke test on macOS 26 with `com.apple.container.apiserver` registered with launchd but inactive (`active count = 0`, `state = spawn scheduled`): all exit 0 immediately. Before this PR these would hang indefinitely.
What this PR explicitly does not fix
- `ClientHealthCheck.ping`'s 60s default `xpcRegistrationTimeout` (all current call sites override to 2–10s)
- `ContainerClient`/`NetworkClient` calling `XPCClient.send` without a timeout
- `launchctl bootout`-style recovery as a built-in CLI command

These are all viable follow-ups. See the analysis doc for the full list.
Question for reviewers
I'd like to add a unit test that injects a non-firing reply into an `XPCClient` to lock in both the timeout behavior and the reusable-client guarantee. Would you prefer it in this PR, or as a follow-up?

Workaround for affected users (until merged)
Full reasoning, the call-chain trace, the audit of affected XPC callers, and the design tradeoffs are in `docs/internal/help-freeze-analysis.md`. The doc is intentionally written for adversarial review — please push back on anything that doesn't hold up.