Skip to content

Fix race condition on SandboxService.waiters#1289

Merged
JaewonHur merged 2 commits intoapple:mainfrom
JaewonHur:fix-hang-2
Mar 5, 2026
Merged

Fix race condition on SandboxService.waiters#1289
JaewonHur merged 2 commits intoapple:mainfrom
JaewonHur:fix-hang-2

Conversation

@JaewonHur
Copy link
Contributor

@JaewonHur JaewonHur commented Mar 4, 2026

This PR fixes #1277.

SandboxService.waiters had a consistency issue (not exactly race).
SandboxService.wait XPC can be executed on arbitrary id, and it will hang forever if no other handler resumes it. Without knowing this internal, the high level entity can run into this issue, and deadlock.

This PR simplifies the mental model: SandboxService.waiters[id]: ExitWaiter(continuations, exitCode) can only be in three states: i) non-existing, ii) existing with nil exitCode, and iii) existing with concrete exitCode.

If it is non-existing, no handler has been registered to resume it later. If existing with nil exitCode, It is guaranteed the registered continuations will be resumed later with a concrete exitCode. Finally, if already a concrete exitCode, a handler has been registered, and already resumed (with that exitCode).

Thus, SandboxService.wait should return immediately if waiters[id] is non-existing or existing with a concrete exitCode (as no handler will resume it later). It should only block when waiters[id] is existing with nil exitCode as it is guaranteed to be resumed later. By doing so, we can guarantee there is no deadlock at all.

For that this PR does followings:

  1. Introduce ExitMonitor class to updates continuations and exitCode all together atomically. Initially, state variable saved the exitCode, but it cannot be tied with continuations as they are protected by different primitives (i.e., lock and actor).
  2. Gather waiters related operations into a single actor method, guaranteeing those are performed atomically under actor protection---i.e., we actually don't need Mutex here.
  3. Ensure initialized waiters are released (i.e., resumed) later (under any possible circumstances).
  4. Move process.wait after process.start in io.handleProcess to run SandboxService.wait only after the waiters[id] is initialized.

By doing fourth step, we can guarantee SandboxService.wait can meet only one of two following ExitMonitor state: i) existing with nil exitCode, or ii) existing with concrete exitCode (in case the process exited too early). In both cases, exitCode is preserved and returned.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

[Why is this change needed?]

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

cc.resume(returning: ExitStatus(exitCode: -1))
let (added, exitCode) = self.addWaiter(id: id, cont: cc)
if !added {
cc.resume(returning: ExitStatus(exitCode: exitCode ?? -1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat idea

@JaewonHur JaewonHur merged commit 0557b83 into apple:main Mar 5, 2026
4 of 6 checks passed
saehejkang pushed a commit to saehejkang/container that referenced this pull request Mar 6, 2026
This PR fixes apple#1277.

`SandboxService.waiters` had a consistency issue (not exactly race).
`SandboxService.wait` XPC can be executed on arbitrary `id`, and it will
hang forever if no other handler resumes it. Without knowing this
internal, the high level entity can run into this issue, and deadlock.

This PR simplifies the mental model: **`SandboxService.waiters[id]:
ExitWaiter(continuations, exitCode)` can only be in three states: i)
non-existing, ii) existing with nil `exitCode`, and iii) existing with
concrete `exitCode`.**

**If it is non-existing, no handler has been registered to resume it
later. If existing with nil `exitCode`, It is guaranteed the registered
`continuations` will be resumed later with a concrete `exitCode`.
Finally, if already a concrete `exitCode`, a handler has been
registered, and already resumed (with that `exitCode`).**

Thus, `SandboxService.wait` should return immediately if `waiters[id]`
is non-existing or existing with a concrete `exitCode` (as no handler
will resume it later). It should only block when `waiters[id]` is
existing with nil `exitCode` as it is guaranteed to be resumed later. By
doing so, we can guarantee there is no deadlock at all.

For that this PR does followings:
1. Introduce `ExitMonitor` class to updates `continuations` and
`exitCode` all together atomically. Initially, `state` variable saved
the `exitCode`, but it cannot be tied with `continuations` as they are
protected by different primitives (i.e., lock and actor).
2. Gather `waiters` related operations into a single actor method,
guaranteeing those are performed atomically under actor
protection---i.e., we actually don't need Mutex here.
3. Ensure initialized `waiters` are released (i.e., resumed) later
(under any possible circumstances).
4. Move `process.wait` after `process.start` in `io.handleProcess` to
run `SandboxService.wait` only after the `waiters[id]` is initialized.

By doing fourth step, we can guarantee `SandboxService.wait` can meet
only one of two following `ExitMonitor` state: i) existing with nil
`exitCode`, or ii) existing with concrete `exitCode` (in case the
process exited too early). In both cases, `exitCode` is preserved and
returned.

## Type of Change
- [X] Bug fix
- [ ] New feature  
- [ ] Breaking change
- [ ] Documentation update

## Motivation and Context
[Why is this change needed?]

## Testing
- [X] Tested locally
- [ ] Added/updated tests
- [ ] Added/updated docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: container run -it <invalid executable> hangs

2 participants