Skip to content

brotli-compress .socket.facts.json on upload#1291

Merged
Benjamin Barslev Nielsen (barslev) merged 4 commits into
v1.xfrom
barslev/compress-socket-facts-file
May 20, 2026
Merged

brotli-compress .socket.facts.json on upload#1291
Benjamin Barslev Nielsen (barslev) merged 4 commits into
v1.xfrom
barslev/compress-socket-facts-file

Conversation

@barslev
Copy link
Copy Markdown
Contributor

@barslev Benjamin Barslev Nielsen (barslev) commented Apr 30, 2026

Summary

Brotli-compress .socket.facts.json at upload time. Coana keeps writing plain JSON; only the wire bytes between socket-cli and depscan change. ~85% reduction on the simple-npm fixture; production mono-repo payloads see 60+ MB → ~10 MB.

How

  • New compressSocketFactsForUpload(scanPaths) in src/utils/coana.mts — async + streaming (createReadStream → createBrotliCompress → createWriteStream via pipeline). For any .socket.facts.json in scanPaths, writes a sibling .socket.facts.json.br next to the source file. Returns swapped paths + cleanup callback.
  • handle-create-new-scan.mts wraps fetchCreateOrgFullScan in try { compress; upload; } finally { cleanup; }.

Why sibling-write (not os.tmpdir(), not a temp subdir)

The SDK passes path.relative(cwd, brPath) as the multipart name. depscan's addStreamEntry rejects entries with .. traversal — a tmpdir path resolves to ../../../var/folders/... and gets dropped to unmatchedFiles. Sibling-write keeps the relative path inside cwd, and keeps directory shape symmetric with the plain .socket.facts.json upload (depscan strips the .br suffix at ingest, so <dir>/.socket.facts.json.br and <dir>/.socket.facts.json resolve to the same storage path).

Concurrent scans against the same source dir already collide on .socket.facts.json itself (coana writes to a single path), so the sibling .br doesn't introduce a new race.

Tests

  • 16 unit cases in src/utils/coana.test.mts (sibling-path round-trip, passthrough, idempotent cleanup that leaves the source intact, plus the existing extractTier1ReachabilityScanId / extractReachabilityErrors matrix).
  • e2e socket scan create --reach against simple-npm: scan_type=socket_tier1, unmatchedFiles: [], root manifest entries are .socket.facts.json and package.json (storage path matches the plain upload — dir-preservation symmetry holds).

Note

Medium Risk
Changes the scan creation upload pipeline to rewrite .socket.facts.json inputs into temporary Brotli-compressed .br siblings, which could break or alter scan uploads if path handling or cleanup fails.

Overview
Scan uploads now Brotli-compress Coana facts files. handleCreateNewScan calls new compressSocketFactsForUpload() to replace any .socket.facts.json entries in scanPaths with sibling .socket.facts.json.br files for the fetchCreateOrgFullScan multipart upload, then always removes those .br files in a finally block.

Adds a streaming compressor utility + tests. utils/coana.mts implements the brotli streaming pipeline and returns { paths, cleanup }, and utils/coana.test.mts adds unit coverage for path swapping, passthrough behavior, missing files, brotli round-trip, and idempotent cleanup (plus edge cases for the existing extractors).

Reviewed by Cursor Bugbot for commit bb96685. Configure here.

@jdalton
Copy link
Copy Markdown
Collaborator

This is rad!

@barslev Benjamin Barslev Nielsen (barslev) changed the title refactor: brotli-compress .socket.facts.json on upload brotli-compress .socket.facts.json on upload May 1, 2026
John-David Dalton (jdalton) added a commit that referenced this pull request May 6, 2026
…) (#1305)

Port of barslev's PR #1291 from v1.x to main. depscan's api-v0
multipart boundary streams brotli decode based on the .br filename
suffix, so the facts file uploads as ~10x smaller without changing
the on-disk format coana writes.

Adds packages/cli/src/utils/coana/compress-facts.mts with
compressSocketFactsForUpload(scanPaths) — streams brotli a sibling
.socket.facts.json.br next to each source file and returns swapped
paths plus a cleanup() callback. Sibling-write keeps the multipart
entry name inside cwd (depscan drops .. traversal entries).
handleCreateNewScan calls it just before fetchCreateOrgFullScan and
runs cleanup() in finally so the .br files are removed even on
upload failure.

Translation from v1.x to main:
- @socketsecurity/registry/lib/fs -> @socketsecurity/lib/fs
- fs.rm -> safeDelete (fleet-wide rule)
- constants default-import -> named import
  (DOT_SOCKET_DOT_FACTS_JSON already lives in
  packages/cli/src/constants.mts)
- v1.x added scan-id and reachability-error tests in the same file as
  the helper; main keeps utils/coana/extract-scan-id.mts separate and
  has no extractReachabilityErrors yet, so only the new helper's
  tests are added at test/unit/utils/coana/compress-facts.test.mts

Skipping pre-commit test suite via DISABLE_PRECOMMIT_TEST=1 because
test/unit/commands/analytics/output-analytics.test.mts has 3
date-dependent snapshot failures unrelated to this PR (snapshots
encode the literal date Apr 18/19/20/21 and fail on any other day).
Targeted vitest run on the touched files (32 tests in
utils/coana/ + handle-create-new-scan) passes 32/32.
@barslev Benjamin Barslev Nielsen (barslev) marked this pull request as ready for review May 19, 2026 11:20
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit bb96685. Configure here.

Comment thread src/utils/coana.mts
Copy link
Copy Markdown
Contributor

@mtorp Martin Torp (mtorp) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

Compress `.socket.facts.json` with brotli at upload time, just before
`fetchCreateOrgFullScan` POSTs the multipart `/full-scans` body. Coana
keeps writing plain JSON; the local readers
(`extractTier1ReachabilityScanId`, `extractReachabilityErrors`) keep
reading plain JSON; only the on-the-wire bytes between socket-cli and
depscan change. depscan transparently decompresses at the api-v0
multipart ingest boundary in a coordinated change.

Why:
* Mono-repo `.socket.facts.json` files routinely exceed 60MB. On a
  representative simple-npm fixture, brotli compresses 150,748 bytes
  to 19,971 (~86.8% reduction); production mono-repos see a much
  larger absolute saving.
* Coana has a second consumer downstream - teaching coana to write
  `.br` would break it. Brotli on the wire-only path keeps coana's
  contract invariant.
* `tier1ReachabilityScanId` and reachability-error reporting still
  read plain JSON locally; no brotli round-trip on those paths.
* Compression is a transport detail owned by the upload site; cleanup
  is one `finally` block.

Implementation:
* `src/utils/coana.mts` - new exported
  `compressSocketFactsForUpload(scanPaths)` returning `{ paths, cleanup }`.
  Per `.socket.facts.json` entry, `mkdtempSync` a fresh `.socket-br-XXX/`
  directory inside the source's parent dir (NOT under `os.tmpdir()` -
  see below), `brotliCompressSync` bytes to
  `${td}/.socket.facts.json.br`, swap the path. Other paths pass through.
  Missing facts paths pass through. Cleanup is idempotent with
  `force: true`.
* `src/commands/scan/handle-create-new-scan.mts` - wraps the
  `fetchCreateOrgFullScan` call in
  `try { compress; upload; } finally { cleanup(); }`. Cleanup runs on
  success, throw, or any abort path.

Why the temp lives next to the source:

The SDK computes `path.relative(cwd, brPath)` for the multipart name
field. depscan's multipart ingest (`addStreamEntry`) checks
`containsTraversal(unixified)` and bails on any `..` segment. A
tmpdir-based path resolves to `../../../var/folders/...`, gets dropped
to `unmatchedFiles`, and the SocketFacts content never lands in the
scan. Putting the temp inside `path.dirname(originalFactsPath)`
produces a relative path like `.socket-br-XXX/.socket.facts.json.br` -
traversal-free, ingests cleanly.

Tests:
* `src/utils/coana.test.mts` - 16 cases.
  - `compressSocketFactsForUpload` x 5: round-trip JSON via
    `brotliDecompressSync`, basename swap to `.socket.facts.json.br`,
    non-facts paths pass through, missing facts paths pass through,
    cleanup idempotent, mixed-entry order. Pins the contract that the
    temp lives in a subdir of the source's parent (traversal-free).
  - `extractTier1ReachabilityScanId` x 7: plain JSON, missing file,
    missing field, null, empty/blank, trim, numeric coercion.
  - `extractReachabilityErrors` x 4: extraction, missing file,
    no-components, no-inner-reachability.

This change requires the matching depscan multipart decompression
patch on the receiving side; that change ships first.
Previous form wrapped each `.br` in a per-scan `mkdtempSync` subdir
under the source dir for concurrency isolation. That created a
directory-handling asymmetry on depscan's side: a wire path of
`dirA/.socket.facts.json.br` flattened to `.socket.facts.json` at
root via depscan's basename-strip, while plain `dirA/.socket.facts.json`
preserved the dir.

depscan dropped the basename-strip in the corresponding PR; switch
to writing `.br` as a sibling of the source so wire and storage
paths match for both branches. Net effect: a brotli upload from
`<source>/.socket.facts.json` lands at `<source>/.socket.facts.json`
in the manifest tarball - identical to plain.

Concurrent-scan note: coana already writes to a single source path,
so the source `.socket.facts.json` itself is racy under concurrent
runs against the same dir; the sibling `.br` doesn't introduce a
new race that wasn't already there.

* `src/utils/coana.mts`: `compressSocketFactsForUpload` writes
  `${p}.br` next to the source; `cleanup()` does
  `rm(brPath, { force: true })` per file. Drops `mkdtemp` import
  that's no longer used.
* `src/utils/coana.test.mts`: directory-shape assertion replaced
  with `swappedPath === ${input}.br` (sibling). First test now
  also asserts the source survives `cleanup()`. 16 cases.
Per Cursor Bugbot finding on src/utils/coana.mts: if any brotli
pipeline in the `Promise.all` batch rejects, the helper rejected
before returning its `cleanup` callback, so already-completed
sibling `.br` files were orphaned next to their source.

Three changes that together make the helper failure-safe:

* Track the intended `.br` path BEFORE the pipeline starts. Previous
  code pushed to `brPaths` after `await pipeline(...)`, so a sibling
  that the pipeline started writing but didn't finish was invisible
  to cleanup. `rm({ force: true })` no-ops on missing paths, so
  tracking the intent is safe.
* Switch the parallel batch from `Promise.all` to `Promise.allSettled`.
  With `all`, the first rejection bubbled out while sibling pipelines
  were still writing bytes; calling `cleanup()` then would race the
  live writes and could `rm` a sibling only to have it re-created
  immediately. `allSettled` waits for every pipeline to settle, so
  cleanup sees a stable set of files to remove.
* On batch failure, run `cleanup()` internally before re-throwing so
  the caller (which only gets a chance to call cleanup when the
  helper resolves) doesn't have to. Add `recursive: true` to the rm
  call so a defensively-occupied directory at a `.br` path doesn't
  halt cleanup partway through the list.

Test added: `removes partial .br siblings if compression fails
mid-batch` — pre-occupies entry B's `.br` destination with a
directory so its `createWriteStream` rejects with EISDIR, then
asserts that entry A's `.br` is not orphaned on disk after the
batch rejects and both source files survive untouched.
@barslev Benjamin Barslev Nielsen (barslev) force-pushed the barslev/compress-socket-facts-file branch from dbed10b to 06746ea Compare May 20, 2026 07:39
Bump version to 1.1.98 and add changelog entry for the brotli
upload-compression refactor in this PR.
@barslev Benjamin Barslev Nielsen (barslev) force-pushed the barslev/compress-socket-facts-file branch from 06746ea to 084f311 Compare May 20, 2026 07:58
@barslev Benjamin Barslev Nielsen (barslev) merged commit 5c3cae4 into v1.x May 20, 2026
12 checks passed
@barslev Benjamin Barslev Nielsen (barslev) deleted the barslev/compress-socket-facts-file branch May 20, 2026 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants