fix: multi-tab concurrency with concurrent skill installs + create lock #7
Jing-yilin merged 8 commits into main from
Conversation
Replace the installing mutex with a lighter concurrency model:
- Remove 'installing' poolState (was blocking entire sandbox per tab)
- Add 'creating' poolState (only for cold sandbox creation)
- Replace claimSandbox mutation with getSandbox query (read-only, no lock)
- Add acquireCreateLock mutation (atomic check + placeholder insert)
- Install path: no lock needed; multiple tabs can install different skills concurrently (each uploads to a separate disk directory)
- Cold create path: acquireCreateLock ensures only one tab creates; others see 'creating' status and wait
- Remove dangerous dedup from create mutation

Scenarios now handled correctly:
- Two tabs open same skill (active): both take instant path
- Two tabs open different skills (active): both install concurrently
- Two tabs cold-create (new user): only first creates, second waits
- Tab opens while another installs: reads active sandbox, proceeds

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
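The atomic check-plus-placeholder-insert at the heart of this model can be sketched with an in-memory stand-in (a standalone TypeScript sketch: the `Map` plays the role of the Convex table, and while the names `acquireCreateLock`, `poolState`, and the `pending-` placeholder mirror the PR, this is illustrative, not the real mutation):

```typescript
// Hypothetical in-memory model of the create-lock acquisition. A Convex
// mutation runs transactionally, so the check-then-insert below is atomic
// there; in this sketch, single-threaded execution gives the same property.
type SandboxRecord = { id: string; poolState: "active" | "creating"; createdAt: number };

const table = new Map<string, SandboxRecord>(); // keyed by userId

function acquireCreateLock(userId: string, now: number): "acquired" | "creating" | "exists" {
  const existing = table.get(userId);
  if (existing) {
    // Another tab already created the sandbox, or is creating it right now.
    return existing.poolState === "creating" ? "creating" : "exists";
  }
  // No record yet: insert a pending placeholder so other tabs see "creating".
  table.set(userId, { id: `pending-${now}`, poolState: "creating", createdAt: now });
  return "acquired";
}
```

Only the first caller gets `"acquired"`; every later tab observes the placeholder and waits instead of cold-creating a second sandbox.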
Jing-yilin
left a comment
Codex Review
[high] Waiting tabs never resume after the creator finishes
handleLaunch() returns as soon as getSandbox reports { status: "creating" } (src/app/[...skillPath]/page.tsx:175-179). In that path the component stays in phase === "launching", but nothing re-invokes handleLaunch when the query later flips to { status: "found" }. The comment says the query will auto-update, but that only rerenders the component; it does not restart launch because the auto-launch effect is gated on phase === "config" (src/app/[...skillPath]/page.tsx:289-306). Result: the second tab sits on the spinner indefinitely unless the user manually retries.
Refs: src/app/[...skillPath]/page.tsx:175, src/app/[...skillPath]/page.tsx:289
[high] The retry branch can spin forever on stale query/auth state
At src/app/[...skillPath]/page.tsx:185-190, both !lock and lock.status === "exists" recurse via void handleLaunch(config). That new call uses the same render's closed-over userSandbox value, so it never observes the eventual query update. This is especially easy to hit while useQuery(api.sandboxes.getSandbox, ...) is still undefined or still skipped because isAuthenticated has not flipped yet (src/app/[...skillPath]/page.tsx:49-52). A transient auth/network failure also lands in !lock, so we can end up hammering acquireCreateLock in a loop instead of waiting for auth/query state to settle.
Refs: src/app/[...skillPath]/page.tsx:49, src/app/[...skillPath]/page.tsx:185
[high] Stale pending placeholders now block launches for up to two days
getSandbox() treats any pending-* record as { status: "creating" } regardless of poolState (convex/sandboxes.ts:144-148). The cleanup cron no longer deletes stale pending records; it first marks poolState === "creating" as stopped and then keeps stopped records for two days (convex/crons.ts:30-39). If the creating tab crashes after acquireCreateLock() inserts the placeholder, every future launch will keep seeing "creating" until that stopped placeholder finally ages out.
Refs: convex/sandboxes.ts:144, convex/crons.ts:30
Verification note: bun test src/__tests__/lib/sandbox/types.test.ts passes on the PR head in a temp checkout. I did not use the full bun test run as review signal because the suite currently fails for unrelated harness/env issues in this repo (import.meta.glob, missing DOM globals, vi.mocked).
1. Waiting tabs auto-resume: useEffect watches the userSandbox query and re-triggers handleLaunch when status changes from 'creating' to 'found'. No more infinite spinner on the second tab.
2. Remove recursive handleLaunch retry: 'exists'/'creating' from acquireCreateLock now sets state and returns. The useEffect handles re-triggering when the query updates. No more spin loop.
3. Stale pending placeholders expire after 2 minutes: getSandbox and acquireCreateLock ignore/delete pending records older than 2min. Crashed tabs no longer block future launches.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
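The retry condition the effect checks can be distilled into a pure function (a hypothetical distillation for illustration; `phase` and the status values follow the PR's terminology, but this is not the component code):

```typescript
// Model of the auto-resume decision: only a tab parked on the spinner
// ("launching") retries, and only once the creator has finished ("found").
type SandboxStatus = { status: "found" | "creating" } | null | undefined;

function shouldRetryLaunch(phase: string, userSandbox: SandboxStatus): boolean {
  // undefined = query still loading; "creating" = creator still working.
  return phase === "launching" && userSandbox?.status === "found";
}
```

Because the reactive query pushes the `"creating"` → `"found"` transition to every subscribed tab, a waiter needs no polling: the effect fires on the update and re-invokes the launch.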
Jing-yilin
left a comment
[high] Concurrent installs still race on shared sandbox config
The new install path explicitly allows multiple tabs to call installSkill() against the same sandbox with no lock (src/app/[...skillPath]/page.tsx:123-165). But installSkill() still rewrites the sandbox-global /home/daytona/.hermes/.env and /home/daytona/.hermes/config.yaml on every launch (src/lib/sandbox/daytona.ts:282-288), and same-skill launches still upload into the same skill directory (src/lib/sandbox/daytona.ts:276-297). That means two overlapping launches can still clobber each other: different model/provider/API-key selections become last-writer-wins, and even the same skill can be uploading into the same path concurrently. So the mutex removal fixed the tab-level blocking, but it also removed the only serialization around shared sandbox state.
When multiple tabs install concurrently, .env/config.yaml writes could race. Since all tabs share the same user settings, skip config writes when configHash matches (the content would be identical). Only write on the first install or when the user changes provider/model/key. installSkill() now accepts a { skipConfigWrite: boolean } option.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
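The skip decision reduces to hashing the config inputs and comparing against the sandbox's stored hash. A minimal sketch, assuming the hash covers provider/model/key (the field names and helper are illustrative, not the PR's actual code):

```typescript
import { createHash } from "node:crypto";

interface LaunchConfig { provider: string; model: string; apiKey: string }

// Hash the settings that feed .env/config.yaml; an identical hash means the
// files on disk would be byte-identical, so rewriting them is redundant.
function configHash(cfg: LaunchConfig): string {
  return createHash("sha256")
    .update(`${cfg.provider}\n${cfg.model}\n${cfg.apiKey}`)
    .digest("hex");
}

function shouldSkipConfigWrite(cfg: LaunchConfig, sandboxHash: string | null): boolean {
  // null = first install on this sandbox: the files must be written.
  return sandboxHash !== null && sandboxHash === configHash(cfg);
}
```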
Jing-yilin
left a comment
Codex Review
[high] Waiting tabs never recover after a stale create lock expires
When handleLaunch() sees userSandbox.status === "creating" or acquireCreateLock() returns "creating", it parks the tab in phase === "launching" and returns (src/app/[...skillPath]/page.tsx:175-199). The only retry path is the effect at src/app/[...skillPath]/page.tsx:273-280, and that only re-runs launch when userSandbox.status === "found". After the creator crashes, getSandbox() starts returning null once the placeholder ages past the 2 minute cutoff (convex/sandboxes.ts:145-152), but nothing retries acquireCreateLock() from that null state. Result: tabs that were already waiting keep spinning indefinitely even though the stale lock has effectively expired; only a manual retry or a brand new tab can recover.
[medium] Concurrent installs can drop each other from installedSkills
The hotswap path still derives installedSkills from the stale sandbox.installedSkills snapshot captured before the upload finishes, then sends that full array back in updatePoolState() (src/app/[...skillPath]/page.tsx:156-164). On the server side, updatePoolState() replaces the whole field rather than unioning with the current document value (convex/sandboxes.ts:271-284). So if two tabs concurrently install different new skills, the last mutation wins and removes the other skill from the DB record. The sandbox on disk still has both installs, but future launches of the dropped skill miss the instant path and the dashboard shows an incomplete installed-skill list.
1. Waiting tabs now also retry when userSandbox becomes null (stale pending lock expired after 2min). The useEffect skips only when status is still 'creating'; null or 'found' both trigger a retry.
2. Replace the client-side installedSkills array replacement with a server-side addInstalledSkill mutation that atomically reads the current array and appends. Prevents concurrent installs from dropping each other's entries.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
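The append-on-the-server fix can be modeled in a few lines (an in-memory sketch: the `Map` stands in for the Convex table, and while `addInstalledSkill` is the PR's mutation name, this body is illustrative):

```typescript
// Server-side append: read the CURRENT array inside the mutation and union
// in the new skill, instead of accepting a full (possibly stale) replacement
// array from the client. Concurrent installs of different skills therefore
// never drop each other's entries.
const installedSkills = new Map<string, string[]>(); // sandboxId -> skills

function addInstalledSkill(sandboxId: string, skill: string): string[] {
  const current = installedSkills.get(sandboxId) ?? [];
  const next = current.includes(skill) ? current : [...current, skill];
  installedSkills.set(sandboxId, next);
  return next;
}
```

The last-writer-wins hazard disappears because no client ever writes the whole field; each mutation only adds its own entry.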
Jing-yilin
left a comment
Codex Review
[high] Placeholder handoff briefly advertises "no sandbox", so waiting tabs can cold-create a second sandbox
After a successful cold create, the creator deletes the pending-* record before it inserts the real sandbox record (src/app/[...skillPath]/page.tsx:247-249). Waiting tabs now deliberately retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:274-281). In that handoff window getSandbox() returns null, so a waiter re-enters handleLaunch(), acquireCreateLock() sees neither a real sandbox nor a fresh pending one (convex/sandboxes.ts:182-214), and it can acquire a brand-new create lock. That recreates the duplicate/orphaned-sandbox race this PR is trying to eliminate. The pending->real handoff needs to be atomic, or at least insert the real record before removing the placeholder.
[high] The new 2-minute stale-lock cutoff can expire during a healthy cold create
getSandbox() and acquireCreateLock() now treat any placeholder older than 2 minutes as stale (convex/sandboxes.ts:146-149, convex/sandboxes.ts:187-199), and waiting tabs retry as soon as the query becomes null (src/app/[...skillPath]/page.tsx:274-281). But createHermesSandbox() can legitimately run longer than that: the snapshot create path allows 120s, the fallback create path allows 300s, and waitForHealth() adds another 120s before counting upload/setup time (src/lib/sandbox/daytona.ts:109-153, src/lib/sandbox/daytona.ts:220, src/lib/sandbox/daytona.ts:352-364). Under a slow or fallback create, other tabs will misclassify the live placeholder as stale, delete it, and start a second sandbox while the first one is still booting. The create lock needs either a much longer TTL or a heartbeat/refresh mechanism while creation is in flight.
1. Insert the real sandbox record BEFORE deleting the placeholder to prevent a null window. Waiting tabs see the real record immediately via the getSandbox query, never a gap where both are absent.
2. Increase the stale pending lock TTL from 2min to 5min. Cold create can take up to 5min (300s fallback + health check). The old 2min cutoff could expire during a healthy create, causing duplicate sandbox creation.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
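Both fixes can be sketched together in one in-memory model (the `Rec` shape, `finishCreate`, and the constant name are hypothetical; the 5-minute TTL and the insert-before-delete ordering are the PR's):

```typescript
type Rec = { kind: "pending" | "real"; createdAt: number };
const records = new Map<string, Rec[]>(); // per-user sandbox table

const PENDING_TTL_MS = 5 * 60 * 1000; // cold create can run ~300s + health check

// Handoff order matters: insert the real record first, then drop the
// placeholder, so a waiter's getSandbox never observes "no sandbox at all".
function finishCreate(userId: string, now: number): void {
  const list = records.get(userId) ?? [];
  const withReal = [...list, { kind: "real" as const, createdAt: now }];
  records.set(userId, withReal.filter((r) => r.kind !== "pending"));
}

function getSandbox(userId: string, now: number): "found" | "creating" | null {
  const list = records.get(userId) ?? [];
  if (list.some((r) => r.kind === "real")) return "found";
  const pending = list.find((r) => r.kind === "pending");
  if (pending && now - pending.createdAt <= PENDING_TTL_MS) return "creating";
  return null; // no sandbox, or the pending lock went stale
}
```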
Jing-yilin
left a comment
Codex Review
[high] Failed real-record insert still reopens the duplicate-create race
After createHermesSandbox() succeeds, the new handoff path sets the UI to running, swallows any createSandboxRecord() failure, and then unconditionally deletes the pending-* placeholder (src/app/[...skillPath]/page.tsx:241-265). If that mutation fails because of a transient auth/network problem, the Daytona sandbox is already alive but there is no real Convex record for it. Waiting tabs now explicitly retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:275-282), so as soon as the placeholder is removed they can acquire a fresh create lock and cold-create a second sandbox, orphaning the first one. This path needs to keep the placeholder until the real record is durably written, or tear the sandbox down when the insert fails.
Refs: src/app/[...skillPath]/page.tsx:241, src/app/[...skillPath]/page.tsx:248, src/app/[...skillPath]/page.tsx:264, src/app/[...skillPath]/page.tsx:276
…round 5) If the real record insert fails after createHermesSandbox succeeds, destroy the sandbox immediately instead of leaving it orphaned. Only delete the placeholder after the real record is durably written. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
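The round-5 ordering can be sketched as a small control-flow helper (the three callbacks are hypothetical stand-ins for createSandboxRecord, the placeholder delete, and Daytona teardown; failure is modeled as a boolean return for a synchronous sketch):

```typescript
// Only release the create lock (delete the placeholder) after the real
// record is durably written; a failed insert tears the sandbox down instead
// of leaving it orphaned with no Convex record pointing at it.
function persistOrDestroy(
  createRecord: () => boolean,   // false models a transient auth/network failure
  deletePlaceholder: () => void,
  destroySandbox: () => void,
): "persisted" | "destroyed" {
  if (!createRecord()) {
    destroySandbox();            // no durable record -> do not leave it alive
    return "destroyed";
  }
  deletePlaceholder();           // safe: waiters now see the real record
  return "persisted";
}
```

Keeping the placeholder alive through the failure path is what closes the race: a waiter that retries during the failure still sees "creating" rather than null, so it cannot cold-create a duplicate.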
Jing-yilin
left a comment
[high] Mixed-config concurrent installs can still publish the wrong configHash
When a launch starts from a sandbox whose hash already matches, we set sameConfig and call installSkill(..., { skipConfigWrite: true }) (src/app/[...skillPath]/page.tsx:93, src/app/[...skillPath]/page.tsx:147; src/lib/sandbox/daytona.ts:283-293). But the success path still unconditionally writes that launch's configHash back to Convex (src/app/[...skillPath]/page.tsx:157-163). So if tab A starts with the current config, tab B concurrently changes provider/model/key and rewrites the sandbox config files, and then A finishes last, Convex flips back to A's old hash even though the sandbox is still configured with B's settings. The next launch with A's settings can now take the instant path and reuse a sandbox running the wrong model/provider/key. This needs a reread/CAS before persisting the hash, or only updating configHash when this launch actually rewrote the config files.
[medium] Failed sandbox-record persistence leaves the page in a fake "running" state
After createHermesSandbox() returns we immediately set session, sandboxState = "running", and phase = "running" before trying to persist the real record (src/app/[...skillPath]/page.tsx:241-245). If createSandboxRecord() then fails, the catch only sets sandboxState/sandboxError (src/app/[...skillPath]/page.tsx:266-271). phase stays "running" and session stays populated, so the render path still shows ChatPanel instead of the retry/error UI even though we are tearing the sandbox down and have no durable record for it. That path should clear the session and move back to launching or config before surfacing the error.
1. Only update configHash in Convex when the config files were actually written (skipConfigWrite=false). Prevents stale hash republication when concurrent installs with different configs overlap.
2. Clear the session and reset phase to 'config' when createSandboxRecord fails. Previously the page stayed in the 'running' phase showing ChatPanel while the sandbox was being destroyed.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
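The hash-persistence rule reduces to one conditional (a hypothetical distillation; the parameter names are illustrative):

```typescript
// Persist this launch's configHash only when this launch actually rewrote
// the config files. If the write was skipped, another tab may have changed
// the files since we read them, so republishing our hash could advertise a
// config the sandbox no longer has.
function hashToPersist(
  launchHash: string,
  wroteConfig: boolean,
  currentDbHash: string | null,
): string | null {
  return wroteConfig ? launchHash : currentDbHash;
}
```

With this rule, the mixed-config race collapses: the tab that last wrote the files is also the only tab allowed to publish a hash, so the stored hash always describes the on-disk config.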
Jing-yilin
left a comment
[medium] Round 6 error handling now falls through to a blank screen
The new createSandboxRecord() failure path resets phase to "config" (src/app/[...skillPath]/page.tsx:266-275), but the config-phase render only covers sign-in, key loading, onboarding, userCancelled, or readyToAutoLaunch (src/app/[...skillPath]/page.tsx:421-460). In the normal auto-launch case, userCancelled.current is still false and autoLaunchLock is still set, so none of those branches render anything. That means a transient record-persist failure now drops the user onto an empty page with no visible error or retry path, instead of the launch error UI this change was trying to restore.
Keeping the page in phase === "launching" for this error, or adding an explicit config-phase fallback when sandboxState === "error", would preserve a recoverable UI.
Setting phase to 'config' on createSandboxRecord failure left the page blank, because the config render path has no error branch in auto-launch mode. Stay in the 'launching' phase instead, where LaunchProgress renders the error with a retry button. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Problem
The installing mutex blocks the entire sandbox per tab, but Hermes sessions are independent -- multiple skills can run concurrently in one sandbox. This causes:
- create mutation dedup deletes Tab A's record → orphans Tab A's Daytona sandbox

Fix

Replace the installing mutex with a lighter concurrency model:
- acquireCreateLock mutation (atomic)
- claimSandbox mutation (writes lock) → getSandbox query (read-only)

Key changes
- claimSandbox mutation → getSandbox query (no write, multiple tabs read simultaneously)
- New acquireCreateLock mutation (atomically inserts a pending- placeholder, returns creating if another tab holds it)
- installing poolState → creating (only during cold sandbox creation)
- Removed dangerous dedup from the create mutation; the sandbox stays active throughout

Scenarios
- Two tabs open same skill (active): both take instant path
- Two tabs open different skills (active): both install concurrently
- Two tabs cold-create (new user): only first creates, second waits
- Tab opens while another installs: reads active sandbox, proceeds
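The per-scenario launch paths can be distilled into one decision sketch (the status values follow the getSandbox query; the function and path names are illustrative, not the component code):

```typescript
// Which path a tab takes, given the read-only getSandbox result and whether
// the requested skill is already installed on the active sandbox.
type Status = "found" | "creating" | null;

function launchPath(status: Status, skillInstalled: boolean):
    "instant" | "install" | "acquire-create-lock" | "wait" {
  if (status === "found") return skillInstalled ? "instant" : "install";
  if (status === "creating") return "wait";  // another tab is cold-creating
  return "acquire-create-lock";              // cold path: race for the lock
}
```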
Tests
118 tests passing.