
fix: multi-tab concurrency with concurrent skill installs + create lock #7

Merged

Jing-yilin merged 8 commits into main from fix/multi-tab-race on Apr 21, 2026
Conversation

@Jing-yilin (Contributor)

Problem

The installing mutex blocks the entire sandbox per tab, but Hermes sessions are independent -- multiple skills can run concurrently in one sandbox. This causes:

  1. Tab B blocked by Tab A's install lock → cold-creates a second sandbox
  2. create mutation dedup deletes Tab A's record → orphans Tab A's Daytona sandbox
  3. New user + multiple tabs → both cold-create simultaneously

Fix

Replace the installing mutex with a lighter concurrency model:

| Operation     | Old (installing lock)               | New                                           |
| ------------- | ----------------------------------- | --------------------------------------------- |
| Skill install | Mutex (blocks other tabs)           | No lock (concurrent uploads to separate dirs) |
| Cold create   | No protection (race)                | acquireCreateLock mutation (atomic)           |
| Read sandbox  | claimSandbox mutation (writes lock) | getSandbox query (read-only)                  |

Key changes

  • claimSandbox mutation → getSandbox query (no write, multiple tabs read simultaneously)
  • New acquireCreateLock mutation (atomically inserts pending- placeholder, returns creating if another tab holds it)
  • installing poolState → creating (only during cold sandbox creation)
  • Removed dangerous dedup from create mutation
  • Install path no longer sets/restores poolState -- sandbox stays active throughout
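The core of the new model is the check-then-insert in acquireCreateLock. A minimal sketch of those semantics, using an in-memory Map as a hypothetical stand-in for the Convex sandboxes table (in the real code, the check and insert run inside a single Convex mutation, which is what makes the pair atomic):

```typescript
// In-memory stand-in for the Convex sandboxes table, used only to
// illustrate the acquireCreateLock semantics sketched in this PR.
type SandboxRecord = {
  id: string;
  poolState: "active" | "creating";
  createdAt: number;
};

const sandboxes = new Map<string, SandboxRecord>(); // keyed by userId

type LockResult = "acquired" | "creating" | "exists";

function acquireCreateLock(userId: string, now: number): LockResult {
  const existing = sandboxes.get(userId);
  if (existing) {
    // A real sandbox already exists: no cold create needed.
    if (existing.poolState === "active") return "exists";
    // Another tab holds the create lock: this tab should wait.
    return "creating";
  }
  // Insert the "pending-" placeholder that acts as the create lock.
  sandboxes.set(userId, {
    id: `pending-${userId}`,
    poolState: "creating",
    createdAt: now,
  });
  return "acquired";
}
```

The first tab to call this gets "acquired" and proceeds to cold-create; every later tab for the same user gets "creating" and waits on the getSandbox query.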

Scenarios

| Scenario                                   | Result                                                |
| ------------------------------------------ | ----------------------------------------------------- |
| Two tabs, same skill, active sandbox       | Both take instant path (0ms each)                     |
| Two tabs, different skills, active sandbox | Both install concurrently to separate dirs            |
| Two tabs, no sandbox (new user)            | First acquires lock, second sees "creating" and waits |
| Tab opens while another installs           | Reads active sandbox, proceeds normally               |

Tests

118 tests passing.

Replace the installing mutex with a lighter concurrency model:

- Remove 'installing' poolState (was blocking entire sandbox per tab)
- Add 'creating' poolState (only for cold sandbox creation)
- Replace claimSandbox mutation with getSandbox query (read-only, no lock)
- Add acquireCreateLock mutation (atomic check + placeholder insert)
- Install path: no lock needed, multiple tabs can install different
  skills concurrently (each uploads to separate disk directory)
- Cold create path: acquireCreateLock ensures only one tab creates,
  others see 'creating' status and wait
- Remove dangerous dedup from create mutation

Scenarios now handled correctly:
- Two tabs open same skill (active): both take instant path
- Two tabs open different skills (active): both install concurrently
- Two tabs cold-create (new user): only first creates, second waits
- Tab opens while another installs: reads active sandbox, proceeds

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@vercel

vercel bot commented Apr 20, 2026

The latest updates on your projects.

| Project      | Deployment | Actions          | Updated (UTC)       |
| ------------ | ---------- | ---------------- | ------------------- |
| tryskills.sh | Ready      | Preview, Comment | Apr 20, 2026 7:38pm |

@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Waiting tabs never resume after the creator finishes

handleLaunch() returns as soon as getSandbox reports { status: "creating" } (src/app/[...skillPath]/page.tsx:175-179). In that path the component stays in phase === "launching", but nothing re-invokes handleLaunch when the query later flips to { status: "found" }. The comment says the query will auto-update, but that only rerenders the component; it does not restart launch because the auto-launch effect is gated on phase === "config" (src/app/[...skillPath]/page.tsx:289-306). Result: the second tab sits on the spinner indefinitely unless the user manually retries.

Refs: src/app/[...skillPath]/page.tsx:175, src/app/[...skillPath]/page.tsx:289

[high] The retry branch can spin forever on stale query/auth state

At src/app/[...skillPath]/page.tsx:185-190, both !lock and lock.status === "exists" recurse via void handleLaunch(config). That new call uses the same render's closed-over userSandbox value, so it never observes the eventual query update. This is especially easy to hit while useQuery(api.sandboxes.getSandbox, ...) is still undefined or still skipped because isAuthenticated has not flipped yet (src/app/[...skillPath]/page.tsx:49-52). A transient auth/network failure also lands in !lock, so we can end up hammering acquireCreateLock in a loop instead of waiting for auth/query state to settle.

Refs: src/app/[...skillPath]/page.tsx:49, src/app/[...skillPath]/page.tsx:185

[high] Stale pending placeholders now block launches for up to two days

getSandbox() treats any pending-* record as { status: "creating" } regardless of poolState (convex/sandboxes.ts:144-148). The cleanup cron no longer deletes stale pending records; it first marks poolState === "creating" as stopped and then keeps stopped records for two days (convex/crons.ts:30-39). If the creating tab crashes after acquireCreateLock() inserts the placeholder, every future launch will keep seeing "creating" until that stopped placeholder finally ages out.

Refs: convex/sandboxes.ts:144, convex/crons.ts:30


Verification note: bun test src/__tests__/lib/sandbox/types.test.ts passes on the PR head in a temp checkout. I did not use the full bun test run as review signal because the suite currently fails for unrelated harness/env issues in this repo (import.meta.glob, missing DOM globals, vi.mocked).

1. Waiting tabs auto-resume: useEffect watches userSandbox query and
   re-triggers handleLaunch when status changes from 'creating' to
   'found'. No more infinite spinner on second tab.

2. Remove recursive handleLaunch retry: 'exists'/'creating' from
   acquireCreateLock now sets state and returns. The useEffect handles
   re-triggering when the query updates. No more spin loop.

3. Stale pending placeholders expire after 2 minutes: getSandbox and
   acquireCreateLock ignore/delete pending records older than 2min.
   Crashed tabs no longer block future launches.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
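Fix 3 above introduces a TTL on the pending placeholder. A hypothetical sketch of that staleness rule (constant and helper names are illustrative, not the repo's actual identifiers):

```typescript
// Staleness rule from fix 3: a "pending-" placeholder older than the
// TTL is treated as expired, so getSandbox and acquireCreateLock can
// ignore or delete locks left behind by a crashed tab instead of
// blocking every future launch.
const PENDING_LOCK_TTL_MS = 2 * 60 * 1000; // raised to 5 minutes in a later round

function isStalePendingLock(
  recordId: string,
  createdAt: number,
  now: number
): boolean {
  return recordId.startsWith("pending-") && now - createdAt > PENDING_LOCK_TTL_MS;
}
```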
@Jing-yilin (Contributor, Author) left a comment

[high] Concurrent installs still race on shared sandbox config

The new install path explicitly allows multiple tabs to call installSkill() against the same sandbox with no lock (src/app/[...skillPath]/page.tsx:123-165). But installSkill() still rewrites the sandbox-global /home/daytona/.hermes/.env and /home/daytona/.hermes/config.yaml on every launch (src/lib/sandbox/daytona.ts:282-288), and same-skill launches still upload into the same skill directory (src/lib/sandbox/daytona.ts:276-297). That means two overlapping launches can still clobber each other: different model/provider/API-key selections become last-writer-wins, and even the same skill can be uploading into the same path concurrently. So the mutex removal fixed the tab-level blocking, but it also removed the only serialization around shared sandbox state.

When multiple tabs install concurrently, .env/config.yaml writes
could race. Since all tabs share the same user settings, skip
config writes when configHash matches (content would be identical).
Only write on first install or when user changes provider/model/key.

installSkill() now accepts { skipConfigWrite: boolean } option.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
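The skip decision boils down to comparing hashes of the config that would be written. A sketch under the assumption that the hash covers the provider/model/key selection (the repo's actual configHash scheme may differ):

```typescript
import { createHash } from "node:crypto";

// Illustrative hash over the user's provider/model/key selection.
function configHash(cfg: {
  provider: string;
  model: string;
  apiKey: string;
}): string {
  return createHash("sha256")
    .update(`${cfg.provider}\n${cfg.model}\n${cfg.apiKey}`)
    .digest("hex");
}

// When the hash already recorded for the sandbox matches the hash of
// the config this launch would write, the write would be byte-identical
// and can be skipped, removing the .env/config.yaml race between tabs.
function shouldSkipConfigWrite(
  storedHash: string | undefined,
  nextHash: string
): boolean {
  return storedHash === nextHash;
}
```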
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Waiting tabs never recover after a stale create lock expires

When handleLaunch() sees userSandbox.status === "creating" or acquireCreateLock() returns "creating", it parks the tab in phase === "launching" and returns (src/app/[...skillPath]/page.tsx:175-199). The only retry path is the effect at src/app/[...skillPath]/page.tsx:273-280, and that only re-runs launch when userSandbox.status === "found". After the creator crashes, getSandbox() starts returning null once the placeholder ages past the 2 minute cutoff (convex/sandboxes.ts:145-152), but nothing retries acquireCreateLock() from that null state. Result: tabs that were already waiting keep spinning indefinitely even though the stale lock has effectively expired; only a manual retry or a brand new tab can recover.

[medium] Concurrent installs can drop each other from installedSkills

The hotswap path still derives installedSkills from the stale sandbox.installedSkills snapshot captured before the upload finishes, then sends that full array back in updatePoolState() (src/app/[...skillPath]/page.tsx:156-164). On the server side, updatePoolState() replaces the whole field rather than unioning with the current document value (convex/sandboxes.ts:271-284). So if two tabs concurrently install different new skills, the last mutation wins and removes the other skill from the DB record. The sandbox on disk still has both installs, but future launches of the dropped skill miss the instant path and the dashboard shows an incomplete installed-skill list.

@Jing-yilin (Contributor, Author) left a comment

1. Waiting tabs now also retry when userSandbox becomes null (stale
   pending lock expired after 2min). The useEffect skips only when
   status is still 'creating'; null or 'found' both trigger retry.

2. Replace client-side installedSkills array replacement with
   server-side addInstalledSkill mutation that atomically reads
   current array and appends. Prevents concurrent installs from
   dropping each other's entries.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
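Fix 2's server-side append can be sketched against an in-memory stand-in for the sandbox document (the real addInstalledSkill mutation gets its atomicity from Convex running each mutation as a transaction; skill names below are illustrative):

```typescript
// In-memory stand-in for the Convex sandbox document, showing only the
// read-current-then-append shape that replaces the client-sent array.
const sandboxDoc: { installedSkills: string[] } = { installedSkills: [] };

function addInstalledSkill(skill: string): void {
  // Read the current array and append, instead of overwriting the field
  // with a snapshot the client captured before other tabs finished.
  if (!sandboxDoc.installedSkills.includes(skill)) {
    sandboxDoc.installedSkills.push(skill);
  }
}
```

Because each call reads the current array, two concurrent installs of different skills both land in the list instead of the last writer dropping the other's entry.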
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Placeholder handoff briefly advertises "no sandbox", so waiting tabs can cold-create a second sandbox

After a successful cold create, the creator deletes the pending-* record before it inserts the real sandbox record (src/app/[...skillPath]/page.tsx:247-249). Waiting tabs now deliberately retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:274-281). In that handoff window getSandbox() returns null, so a waiter re-enters handleLaunch(), acquireCreateLock() sees neither a real sandbox nor a fresh pending one (convex/sandboxes.ts:182-214), and it can acquire a brand-new create lock. That recreates the duplicate/orphaned-sandbox race this PR is trying to eliminate. The pending->real handoff needs to be atomic, or at least insert the real record before removing the placeholder.

[high] The new 2-minute stale-lock cutoff can expire during a healthy cold create

getSandbox() and acquireCreateLock() now treat any placeholder older than 2 minutes as stale (convex/sandboxes.ts:146-149, convex/sandboxes.ts:187-199), and waiting tabs retry as soon as the query becomes null (src/app/[...skillPath]/page.tsx:274-281). But createHermesSandbox() can legitimately run longer than that: the snapshot create path allows 120s, the fallback create path allows 300s, and waitForHealth() adds another 120s before counting upload/setup time (src/lib/sandbox/daytona.ts:109-153, src/lib/sandbox/daytona.ts:220, src/lib/sandbox/daytona.ts:352-364). Under a slow or fallback create, other tabs will misclassify the live placeholder as stale, delete it, and start a second sandbox while the first one is still booting. The create lock needs either a much longer TTL or a heartbeat/refresh mechanism while creation is in flight.

1. Insert real sandbox record BEFORE deleting placeholder to prevent
   null window. Waiting tabs see the real record immediately via
   getSandbox query, never a gap where both are absent.

2. Increase stale pending lock TTL from 2min to 5min. Cold create
   can take up to 5min (300s fallback + health check). The old 2min
   cutoff could expire during a healthy create, causing duplicate
   sandbox creation.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
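The ordering in fix 1 can be sketched with an in-memory stand-in for the sandboxes table (helper names are hypothetical):

```typescript
// Insert the real record first, then delete the "pending-" placeholder,
// so a waiting tab polling getSandbox never observes a window with no
// record at all -- the window that would let it grab a fresh create
// lock and cold-create a duplicate sandbox.
const records = new Map<string, { poolState: "active" | "creating" }>();

function finishColdCreate(userId: string, sandboxId: string): void {
  records.set(sandboxId, { poolState: "active" }); // 1. real record first
  records.delete(`pending-${userId}`);             // 2. then release the lock
}

function hasAnyRecord(userId: string, sandboxId: string): boolean {
  return records.has(sandboxId) || records.has(`pending-${userId}`);
}
```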
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Failed real-record insert still reopens the duplicate-create race

After createHermesSandbox() succeeds, the new handoff path sets the UI to running, swallows any createSandboxRecord() failure, and then unconditionally deletes the pending-* placeholder (src/app/[...skillPath]/page.tsx:241-265). If that mutation fails because of a transient auth/network problem, the Daytona sandbox is already alive but there is no real Convex record for it. Waiting tabs now explicitly retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:275-282), so as soon as the placeholder is removed they can acquire a fresh create lock and cold-create a second sandbox, orphaning the first one. This path needs to keep the placeholder until the real record is durably written, or tear the sandbox down when the insert fails.

Refs: src/app/[...skillPath]/page.tsx:241, src/app/[...skillPath]/page.tsx:248, src/app/[...skillPath]/page.tsx:264, src/app/[...skillPath]/page.tsx:276

…round 5)

If the real record insert fails after createHermesSandbox succeeds,
destroy the sandbox immediately instead of leaving it orphaned.
Only delete the placeholder after the real record is durably written.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
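A sketch of the round-5 ordering, with hypothetical callbacks standing in for the real Convex mutations and Daytona SDK calls:

```typescript
// Persist the real record first; on failure destroy the live sandbox
// rather than orphaning it; delete the placeholder only after the
// record is durably written.
async function finishCreateSafely(ops: {
  persistRecord: () => Promise<void>;
  destroySandbox: () => Promise<void>;
  deletePlaceholder: () => Promise<void>;
}): Promise<"persisted" | "destroyed"> {
  try {
    await ops.persistRecord(); // must succeed before the lock is released
  } catch {
    await ops.destroySandbox(); // no durable record -> don't leave it orphaned
    return "destroyed";
  }
  await ops.deletePlaceholder(); // safe: the real record now exists
  return "persisted";
}
```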
@Jing-yilin (Contributor, Author) left a comment

[high] Mixed-config concurrent installs can still publish the wrong configHash

When a launch starts from a sandbox whose hash already matches, we set sameConfig and call installSkill(..., { skipConfigWrite: true }) (src/app/[...skillPath]/page.tsx:93, src/app/[...skillPath]/page.tsx:147; src/lib/sandbox/daytona.ts:283-293). But the success path still unconditionally writes that launch's configHash back to Convex (src/app/[...skillPath]/page.tsx:157-163). So if tab A starts with the current config, tab B concurrently changes provider/model/key and rewrites the sandbox config files, and then A finishes last, Convex flips back to A's old hash even though the sandbox is still configured with B's settings. The next launch with A's settings can now take the instant path and reuse a sandbox running the wrong model/provider/key. This needs a reread/CAS before persisting the hash, or only updating configHash when this launch actually rewrote the config files.

[medium] Failed sandbox-record persistence leaves the page in a fake "running" state

After createHermesSandbox() returns we immediately set session, sandboxState = "running", and phase = "running" before trying to persist the real record (src/app/[...skillPath]/page.tsx:241-245). If createSandboxRecord() then fails, the catch only sets sandboxState/sandboxError (src/app/[...skillPath]/page.tsx:266-271). phase stays "running" and session stays populated, so the render path still shows ChatPanel instead of the retry/error UI even though we are tearing the sandbox down and have no durable record for it. That path should clear the session and move back to launching or config before surfacing the error.

1. Only update configHash in Convex when config files were actually
   written (skipConfigWrite=false). Prevents stale hash republication
   when concurrent installs with different configs overlap.

2. Clear session and reset phase to 'config' when createSandboxRecord
   fails. Previously the page stayed in 'running' phase showing
   ChatPanel while the sandbox was being destroyed.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
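Fix 1's rule can be sketched with an in-memory stand-in for the Convex-side configHash field (the real update is a Convex patch):

```typescript
// Only a launch that actually rewrote the config files publishes its
// hash, so a skip-write launch that finishes last cannot roll the
// stored hash back behind a concurrent tab's newer config.
let storedConfigHash: string | undefined;

function publishConfigHash(wroteConfig: boolean, launchHash: string): void {
  if (wroteConfig) {
    storedConfigHash = launchHash; // only writers update the hash
  }
  // skip-write launches leave the stored hash untouched
}
```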
@Jing-yilin (Contributor, Author) left a comment

[medium] Round 6 error handling now falls through to a blank screen

The new createSandboxRecord() failure path resets phase to "config" (src/app/[...skillPath]/page.tsx:266-275), but the config-phase render only covers sign-in, key loading, onboarding, userCancelled, or readyToAutoLaunch (src/app/[...skillPath]/page.tsx:421-460). In the normal auto-launch case, userCancelled.current is still false and autoLaunchLock is still set, so none of those branches render anything. That means a transient record-persist failure now drops the user onto an empty page with no visible error or retry path, instead of the launch error UI this change was trying to restore.

Keeping the page in phase === "launching" for this error, or adding an explicit config-phase fallback when sandboxState === "error", would preserve a recoverable UI.

Setting phase to 'config' on createSandboxRecord failure left the
page blank because the config render path has no error branch in
auto-launch mode. Stay in 'launching' phase where LaunchProgress
renders the error with retry button.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@Jing-yilin Jing-yilin merged commit 74f1693 into main Apr 21, 2026
2 checks passed