
fix: multi-tab concurrency with concurrent skill installs + create lock #7

Merged

Jing-yilin merged 8 commits into main from fix/multi-tab-race on Apr 21, 2026
Conversation

@Jing-yilin (Contributor)

Problem

The installing mutex blocks the entire sandbox per tab, but Hermes sessions are independent -- multiple skills can run concurrently in one sandbox. This causes:

  1. Tab B blocked by Tab A's install lock → cold-creates a second sandbox
  2. create mutation dedup deletes Tab A's record → orphans Tab A's Daytona sandbox
  3. New user + multiple tabs → both cold-create simultaneously

Fix

Replace the installing mutex with a lighter concurrency model:

| Operation     | Old (installing lock)               | New                                           |
| ------------- | ----------------------------------- | --------------------------------------------- |
| Skill install | Mutex (blocks other tabs)           | No lock (concurrent uploads to separate dirs) |
| Cold create   | No protection (race)                | acquireCreateLock mutation (atomic)           |
| Read sandbox  | claimSandbox mutation (writes lock) | getSandbox query (read-only)                  |

Key changes

  • claimSandbox mutation → getSandbox query (no write, multiple tabs read simultaneously)
  • New acquireCreateLock mutation (atomically inserts pending- placeholder, returns creating if another tab holds it)
  • installing poolState → creating (only during cold sandbox creation)
  • Removed dangerous dedup from create mutation
  • Install path no longer sets/restores poolState -- sandbox stays active throughout
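The core of the new model is the check-then-insert in acquireCreateLock. A minimal sketch of those semantics, using an in-memory Map as a hypothetical stand-in for the Convex sandboxes table (in the real code, the check and insert run inside a single Convex mutation, which is what makes the pair atomic):

```typescript
// In-memory stand-in for the Convex sandboxes table, used only to
// illustrate the acquireCreateLock semantics sketched in this PR.
type SandboxRecord = {
  id: string;
  poolState: "active" | "creating";
  createdAt: number;
};

const sandboxes = new Map<string, SandboxRecord>(); // keyed by userId

type LockResult = "acquired" | "creating" | "exists";

function acquireCreateLock(userId: string, now: number): LockResult {
  const existing = sandboxes.get(userId);
  if (existing) {
    // A real sandbox already exists: no cold create needed.
    if (existing.poolState === "active") return "exists";
    // Another tab holds the create lock: this tab should wait.
    return "creating";
  }
  // Insert the "pending-" placeholder that acts as the create lock.
  sandboxes.set(userId, {
    id: `pending-${userId}`,
    poolState: "creating",
    createdAt: now,
  });
  return "acquired";
}
```

The first tab to call this gets "acquired" and proceeds to cold-create; every later tab for the same user gets "creating" and waits on the getSandbox query.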

Scenarios

| Scenario                                   | Result                                                |
| ------------------------------------------ | ----------------------------------------------------- |
| Two tabs, same skill, active sandbox       | Both take instant path (0ms each)                     |
| Two tabs, different skills, active sandbox | Both install concurrently to separate dirs            |
| Two tabs, no sandbox (new user)            | First acquires lock, second sees "creating" and waits |
| Tab opens while another installs           | Reads active sandbox, proceeds normally               |

Tests

118 tests passing.

Replace the installing mutex with a lighter concurrency model:

- Remove 'installing' poolState (was blocking entire sandbox per tab)
- Add 'creating' poolState (only for cold sandbox creation)
- Replace claimSandbox mutation with getSandbox query (read-only, no lock)
- Add acquireCreateLock mutation (atomic check + placeholder insert)
- Install path: no lock needed, multiple tabs can install different
  skills concurrently (each uploads to separate disk directory)
- Cold create path: acquireCreateLock ensures only one tab creates,
  others see 'creating' status and wait
- Remove dangerous dedup from create mutation

Scenarios now handled correctly:
- Two tabs open same skill (active): both take instant path
- Two tabs open different skills (active): both install concurrently
- Two tabs cold-create (new user): only first creates, second waits
- Tab opens while another installs: reads active sandbox, proceeds

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@vercel

vercel bot commented Apr 20, 2026

The latest updates on your projects.

| Project      | Deployment | Actions          | Updated (UTC)       |
| ------------ | ---------- | ---------------- | ------------------- |
| tryskills.sh | Ready      | Preview, Comment | Apr 20, 2026 7:38pm |

@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Waiting tabs never resume after the creator finishes

handleLaunch() returns as soon as getSandbox reports { status: "creating" } (src/app/[...skillPath]/page.tsx:175-179). In that path the component stays in phase === "launching", but nothing re-invokes handleLaunch when the query later flips to { status: "found" }. The comment says the query will auto-update, but that only rerenders the component; it does not restart launch because the auto-launch effect is gated on phase === "config" (src/app/[...skillPath]/page.tsx:289-306). Result: the second tab sits on the spinner indefinitely unless the user manually retries.

Refs: src/app/[...skillPath]/page.tsx:175, src/app/[...skillPath]/page.tsx:289

[high] The retry branch can spin forever on stale query/auth state

At src/app/[...skillPath]/page.tsx:185-190, both !lock and lock.status === "exists" recurse via void handleLaunch(config). That new call uses the same render's closed-over userSandbox value, so it never observes the eventual query update. This is especially easy to hit while useQuery(api.sandboxes.getSandbox, ...) is still undefined or still skipped because isAuthenticated has not flipped yet (src/app/[...skillPath]/page.tsx:49-52). A transient auth/network failure also lands in !lock, so we can end up hammering acquireCreateLock in a loop instead of waiting for auth/query state to settle.

Refs: src/app/[...skillPath]/page.tsx:49, src/app/[...skillPath]/page.tsx:185

[high] Stale pending placeholders now block launches for up to two days

getSandbox() treats any pending-* record as { status: "creating" } regardless of poolState (convex/sandboxes.ts:144-148). The cleanup cron no longer deletes stale pending records; it first marks poolState === "creating" as stopped and then keeps stopped records for two days (convex/crons.ts:30-39). If the creating tab crashes after acquireCreateLock() inserts the placeholder, every future launch will keep seeing "creating" until that stopped placeholder finally ages out.

Refs: convex/sandboxes.ts:144, convex/crons.ts:30


Verification note: bun test src/__tests__/lib/sandbox/types.test.ts passes on the PR head in a temp checkout. I did not use the full bun test run as review signal because the suite currently fails for unrelated harness/env issues in this repo (import.meta.glob, missing DOM globals, vi.mocked).

1. Waiting tabs auto-resume: useEffect watches userSandbox query and
   re-triggers handleLaunch when status changes from 'creating' to
   'found'. No more infinite spinner on second tab.

2. Remove recursive handleLaunch retry: 'exists'/'creating' from
   acquireCreateLock now sets state and returns. The useEffect handles
   re-triggering when the query updates. No more spin loop.

3. Stale pending placeholders expire after 2 minutes: getSandbox and
   acquireCreateLock ignore/delete pending records older than 2min.
   Crashed tabs no longer block future launches.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
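Fix 3 above introduces a TTL on the pending placeholder. A hypothetical sketch of that staleness rule (constant and helper names are illustrative, not the repo's actual identifiers):

```typescript
// Staleness rule from fix 3: a "pending-" placeholder older than the
// TTL is treated as expired, so getSandbox and acquireCreateLock can
// ignore or delete locks left behind by a crashed tab instead of
// blocking every future launch.
const PENDING_LOCK_TTL_MS = 2 * 60 * 1000; // raised to 5 minutes in a later round

function isStalePendingLock(
  recordId: string,
  createdAt: number,
  now: number
): boolean {
  return recordId.startsWith("pending-") && now - createdAt > PENDING_LOCK_TTL_MS;
}
```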
@Jing-yilin (Contributor, Author) left a comment

[high] Concurrent installs still race on shared sandbox config

The new install path explicitly allows multiple tabs to call installSkill() against the same sandbox with no lock (src/app/[...skillPath]/page.tsx:123-165). But installSkill() still rewrites the sandbox-global /home/daytona/.hermes/.env and /home/daytona/.hermes/config.yaml on every launch (src/lib/sandbox/daytona.ts:282-288), and same-skill launches still upload into the same skill directory (src/lib/sandbox/daytona.ts:276-297). That means two overlapping launches can still clobber each other: different model/provider/API-key selections become last-writer-wins, and even the same skill can be uploading into the same path concurrently. So the mutex removal fixed the tab-level blocking, but it also removed the only serialization around shared sandbox state.

When multiple tabs install concurrently, .env/config.yaml writes
could race. Since all tabs share the same user settings, skip
config writes when configHash matches (content would be identical).
Only write on first install or when user changes provider/model/key.

installSkill() now accepts { skipConfigWrite: boolean } option.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
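The skip decision boils down to comparing hashes of the config that would be written. A sketch under the assumption that the hash covers the provider/model/key selection (the repo's actual configHash scheme may differ):

```typescript
import { createHash } from "node:crypto";

// Illustrative hash over the user's provider/model/key selection.
function configHash(cfg: {
  provider: string;
  model: string;
  apiKey: string;
}): string {
  return createHash("sha256")
    .update(`${cfg.provider}\n${cfg.model}\n${cfg.apiKey}`)
    .digest("hex");
}

// When the hash already recorded for the sandbox matches the hash of
// the config this launch would write, the write would be byte-identical
// and can be skipped, removing the .env/config.yaml race between tabs.
function shouldSkipConfigWrite(
  storedHash: string | undefined,
  nextHash: string
): boolean {
  return storedHash === nextHash;
}
```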
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Waiting tabs never recover after a stale create lock expires

When handleLaunch() sees userSandbox.status === "creating" or acquireCreateLock() returns "creating", it parks the tab in phase === "launching" and returns (src/app/[...skillPath]/page.tsx:175-199). The only retry path is the effect at src/app/[...skillPath]/page.tsx:273-280, and that only re-runs launch when userSandbox.status === "found". After the creator crashes, getSandbox() starts returning null once the placeholder ages past the 2 minute cutoff (convex/sandboxes.ts:145-152), but nothing retries acquireCreateLock() from that null state. Result: tabs that were already waiting keep spinning indefinitely even though the stale lock has effectively expired; only a manual retry or a brand new tab can recover.

[medium] Concurrent installs can drop each other from installedSkills

The hotswap path still derives installedSkills from the stale sandbox.installedSkills snapshot captured before the upload finishes, then sends that full array back in updatePoolState() (src/app/[...skillPath]/page.tsx:156-164). On the server side, updatePoolState() replaces the whole field rather than unioning with the current document value (convex/sandboxes.ts:271-284). So if two tabs concurrently install different new skills, the last mutation wins and removes the other skill from the DB record. The sandbox on disk still has both installs, but future launches of the dropped skill miss the instant path and the dashboard shows an incomplete installed-skill list.

@Jing-yilin (Contributor, Author) left a comment

1. Waiting tabs now also retry when userSandbox becomes null (stale
   pending lock expired after 2min). The useEffect skips only when
   status is still 'creating'; null or 'found' both trigger retry.

2. Replace client-side installedSkills array replacement with
   server-side addInstalledSkill mutation that atomically reads
   current array and appends. Prevents concurrent installs from
   dropping each other's entries.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
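Fix 2's server-side append can be sketched against an in-memory stand-in for the sandbox document (the real addInstalledSkill mutation gets its atomicity from Convex running each mutation as a transaction; skill names below are illustrative):

```typescript
// In-memory stand-in for the Convex sandbox document, showing only the
// read-current-then-append shape that replaces the client-sent array.
const sandboxDoc: { installedSkills: string[] } = { installedSkills: [] };

function addInstalledSkill(skill: string): void {
  // Read the current array and append, instead of overwriting the field
  // with a snapshot the client captured before other tabs finished.
  if (!sandboxDoc.installedSkills.includes(skill)) {
    sandboxDoc.installedSkills.push(skill);
  }
}
```

Because each call reads the current array, two concurrent installs of different skills both land in the list instead of the last writer dropping the other's entry.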
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Placeholder handoff briefly advertises "no sandbox", so waiting tabs can cold-create a second sandbox

After a successful cold create, the creator deletes the pending-* record before it inserts the real sandbox record (src/app/[...skillPath]/page.tsx:247-249). Waiting tabs now deliberately retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:274-281). In that handoff window getSandbox() returns null, so a waiter re-enters handleLaunch(), acquireCreateLock() sees neither a real sandbox nor a fresh pending one (convex/sandboxes.ts:182-214), and it can acquire a brand-new create lock. That recreates the duplicate/orphaned-sandbox race this PR is trying to eliminate. The pending->real handoff needs to be atomic, or at least insert the real record before removing the placeholder.

[high] The new 2-minute stale-lock cutoff can expire during a healthy cold create

getSandbox() and acquireCreateLock() now treat any placeholder older than 2 minutes as stale (convex/sandboxes.ts:146-149, convex/sandboxes.ts:187-199), and waiting tabs retry as soon as the query becomes null (src/app/[...skillPath]/page.tsx:274-281). But createHermesSandbox() can legitimately run longer than that: the snapshot create path allows 120s, the fallback create path allows 300s, and waitForHealth() adds another 120s before counting upload/setup time (src/lib/sandbox/daytona.ts:109-153, src/lib/sandbox/daytona.ts:220, src/lib/sandbox/daytona.ts:352-364). Under a slow or fallback create, other tabs will misclassify the live placeholder as stale, delete it, and start a second sandbox while the first one is still booting. The create lock needs either a much longer TTL or a heartbeat/refresh mechanism while creation is in flight.

1. Insert real sandbox record BEFORE deleting placeholder to prevent
   null window. Waiting tabs see the real record immediately via
   getSandbox query, never a gap where both are absent.

2. Increase stale pending lock TTL from 2min to 5min. Cold create
   can take up to 5min (300s fallback + health check). The old 2min
   cutoff could expire during a healthy create, causing duplicate
   sandbox creation.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
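The ordering in fix 1 can be sketched with an in-memory stand-in for the sandboxes table (helper names are hypothetical):

```typescript
// Insert the real record first, then delete the "pending-" placeholder,
// so a waiting tab polling getSandbox never observes a window with no
// record at all -- the window that would let it grab a fresh create
// lock and cold-create a duplicate sandbox.
const records = new Map<string, { poolState: "active" | "creating" }>();

function finishColdCreate(userId: string, sandboxId: string): void {
  records.set(sandboxId, { poolState: "active" }); // 1. real record first
  records.delete(`pending-${userId}`);             // 2. then release the lock
}

function hasAnyRecord(userId: string, sandboxId: string): boolean {
  return records.has(sandboxId) || records.has(`pending-${userId}`);
}
```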
@Jing-yilin (Contributor, Author) left a comment

Codex Review

[high] Failed real-record insert still reopens the duplicate-create race

After createHermesSandbox() succeeds, the new handoff path sets the UI to running, swallows any createSandboxRecord() failure, and then unconditionally deletes the pending-* placeholder (src/app/[...skillPath]/page.tsx:241-265). If that mutation fails because of a transient auth/network problem, the Daytona sandbox is already alive but there is no real Convex record for it. Waiting tabs now explicitly retry whenever userSandbox becomes null (src/app/[...skillPath]/page.tsx:275-282), so as soon as the placeholder is removed they can acquire a fresh create lock and cold-create a second sandbox, orphaning the first one. This path needs to keep the placeholder until the real record is durably written, or tear the sandbox down when the insert fails.

Refs: src/app/[...skillPath]/page.tsx:241, src/app/[...skillPath]/page.tsx:248, src/app/[...skillPath]/page.tsx:264, src/app/[...skillPath]/page.tsx:276

…round 5)

If the real record insert fails after createHermesSandbox succeeds,
destroy the sandbox immediately instead of leaving it orphaned.
Only delete the placeholder after the real record is durably written.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
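A sketch of the round-5 ordering, with hypothetical callbacks standing in for the real Convex mutations and Daytona SDK calls:

```typescript
// Persist the real record first; on failure destroy the live sandbox
// rather than orphaning it; delete the placeholder only after the
// record is durably written.
async function finishCreateSafely(ops: {
  persistRecord: () => Promise<void>;
  destroySandbox: () => Promise<void>;
  deletePlaceholder: () => Promise<void>;
}): Promise<"persisted" | "destroyed"> {
  try {
    await ops.persistRecord(); // must succeed before the lock is released
  } catch {
    await ops.destroySandbox(); // no durable record -> don't leave it orphaned
    return "destroyed";
  }
  await ops.deletePlaceholder(); // safe: the real record now exists
  return "persisted";
}
```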
@Jing-yilin (Contributor, Author) left a comment

[high] Mixed-config concurrent installs can still publish the wrong configHash

When a launch starts from a sandbox whose hash already matches, we set sameConfig and call installSkill(..., { skipConfigWrite: true }) (src/app/[...skillPath]/page.tsx:93, src/app/[...skillPath]/page.tsx:147; src/lib/sandbox/daytona.ts:283-293). But the success path still unconditionally writes that launch's configHash back to Convex (src/app/[...skillPath]/page.tsx:157-163). So if tab A starts with the current config, tab B concurrently changes provider/model/key and rewrites the sandbox config files, and then A finishes last, Convex flips back to A's old hash even though the sandbox is still configured with B's settings. The next launch with A's settings can now take the instant path and reuse a sandbox running the wrong model/provider/key. This needs a reread/CAS before persisting the hash, or only updating configHash when this launch actually rewrote the config files.

[medium] Failed sandbox-record persistence leaves the page in a fake "running" state

After createHermesSandbox() returns we immediately set session, sandboxState = "running", and phase = "running" before trying to persist the real record (src/app/[...skillPath]/page.tsx:241-245). If createSandboxRecord() then fails, the catch only sets sandboxState/sandboxError (src/app/[...skillPath]/page.tsx:266-271). phase stays "running" and session stays populated, so the render path still shows ChatPanel instead of the retry/error UI even though we are tearing the sandbox down and have no durable record for it. That path should clear the session and move back to launching or config before surfacing the error.

1. Only update configHash in Convex when config files were actually
   written (skipConfigWrite=false). Prevents stale hash republication
   when concurrent installs with different configs overlap.

2. Clear session and reset phase to 'config' when createSandboxRecord
   fails. Previously the page stayed in 'running' phase showing
   ChatPanel while the sandbox was being destroyed.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
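Fix 1's rule can be sketched with an in-memory stand-in for the Convex-side configHash field (the real update is a Convex patch):

```typescript
// Only a launch that actually rewrote the config files publishes its
// hash, so a skip-write launch that finishes last cannot roll the
// stored hash back behind a concurrent tab's newer config.
let storedConfigHash: string | undefined;

function publishConfigHash(wroteConfig: boolean, launchHash: string): void {
  if (wroteConfig) {
    storedConfigHash = launchHash; // only writers update the hash
  }
  // skip-write launches leave the stored hash untouched
}
```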
@Jing-yilin (Contributor, Author) left a comment

[medium] Round 6 error handling now falls through to a blank screen

The new createSandboxRecord() failure path resets phase to "config" (src/app/[...skillPath]/page.tsx:266-275), but the config-phase render only covers sign-in, key loading, onboarding, userCancelled, or readyToAutoLaunch (src/app/[...skillPath]/page.tsx:421-460). In the normal auto-launch case, userCancelled.current is still false and autoLaunchLock is still set, so none of those branches render anything. That means a transient record-persist failure now drops the user onto an empty page with no visible error or retry path, instead of the launch error UI this change was trying to restore.

Keeping the page in phase === "launching" for this error, or adding an explicit config-phase fallback when sandboxState === "error", would preserve a recoverable UI.

Setting phase to 'config' on createSandboxRecord failure left the
page blank because the config render path has no error branch in
auto-launch mode. Stay in 'launching' phase where LaunchProgress
renders the error with retry button.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@Jing-yilin Jing-yilin merged commit 74f1693 into main Apr 21, 2026
2 checks passed