document-redactor

⬇️ Download the tool

  • Download document-redactor.html: single HTML, ~262 KB, open locally
  • Download SHA-256 sidecar: integrity check, 89 bytes
  • View all releases: release notes, older builds

Verify the download first with shasum -a 256 -c document-redactor.html.sha256.
Then double-click the HTML file to open it in your browser.

See also: Korean README, usage guide, rules guide.

Offline DOCX redaction for legal work.
Open one local HTML file, review deterministic matches, and download a verified .redacted.docx
without sending the source document anywhere.

Important

This is the safety step before AI. document-redactor is intentionally not AI-powered. It is the local pre-upload filter you run before a contract, memo, pleading, or court document goes into any LLM.

At A Glance

  • One file: The shipped product is document-redactor.html. No installer, no backend, no asset tree, no auto-update channel.
  • Rule-based: Detection is deterministic, auditable, and regression-testable. No remote inference and no hidden model behavior.
  • Local-only: The app opens as a file:// page, uses a strict CSP, and is built around a zero-network runtime model.
  • Verified output: Redaction is not trusted blindly. The output DOCX is re-parsed and checked before download.

Current Guardrails

  • Input files are capped at 50 MB.
  • Any single decompressed ZIP entry is capped at 20 MB.
  • DOCX relationship files (*.rels) are checked during verification.
  • External http:// and https:// relationship targets are stripped from output, and selected literals found in relationship targets are repaired before verification.
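
The guardrails above can be pictured as a few small predicates. The following is a minimal sketch, not the project's actual code: the function names are hypothetical, and it assumes "MB" means 1024 × 1024 bytes, which the README does not specify.

```typescript
// Illustrative guardrail checks. Names and the MiB interpretation are assumptions.
const MAX_INPUT_BYTES = 50 * 1024 * 1024; // 50 MB cap on the input file
const MAX_ENTRY_BYTES = 20 * 1024 * 1024; // 20 MB cap per decompressed ZIP entry

type GuardrailResult = { ok: true } | { ok: false; reason: string };

function checkInputSize(byteLength: number): GuardrailResult {
  return byteLength <= MAX_INPUT_BYTES
    ? { ok: true }
    : { ok: false, reason: `input is ${byteLength} bytes; cap is ${MAX_INPUT_BYTES}` };
}

function checkEntrySize(entryName: string, decompressedBytes: number): GuardrailResult {
  return decompressedBytes <= MAX_ENTRY_BYTES
    ? { ok: true }
    : {
        ok: false,
        reason: `${entryName} decompresses to ${decompressedBytes} bytes; cap is ${MAX_ENTRY_BYTES}`,
      };
}

// True for relationship targets that point at an external http(s) URL,
// i.e. the kind of target the README says is stripped from output.
function isExternalHttpTarget(target: string): boolean {
  return /^https?:\/\//i.test(target.trim());
}
```

Checking each decompressed entry separately, rather than only the compressed upload, is what guards against a small archive that expands into a very large XML part.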

What Problem It Solves

Legal teams increasingly want to send contracts, pleadings, memos, and court documents into AI assistants for summary, issue spotting, or clause review. The blocker is obvious: those files contain company names, people, phone numbers, IDs, bank data, case references, and other strings you should not upload raw.

Manual redaction inside Word is slow, repetitive, and easy to get wrong.

document-redactor turns that pre-upload cleanup into a local workflow:

  1. Open one HTML file from disk.
  2. Drop a .docx.
  3. Review grouped candidates and inline highlights.
  4. Apply redaction.
  5. Download a verified .redacted.docx.

What It Is Vs. What It Is Not

| What it is | What it is not |
| --- | --- |
| An offline browser tool for legal DOCX redaction | A cloud redaction service |
| One downloadable HTML artifact plus a hash sidecar | An installer, daemon, or desktop app |
| A rule-based, deterministic review-and-redact pipeline | An AI model or probabilistic black box |
| A product whose artifact and source can be audited directly | A system you must trust without inspection |
| A pre-AI safety layer | A replacement for your downstream AI assistant |

The Workflow

flowchart TD
    A["📄 <b>Open</b><br/>document-redactor.html"] --> B["📥 <b>Drop</b><br/>your .docx file"]
    B --> C["⚙️ <b>Parse</b><br/>ZIP + XML<br/>locally in-browser"]
    C --> D["🔍 <b>Detect</b><br/>candidates via<br/>deterministic rules"]
    D --> E["👀 <b>Review</b><br/>by section +<br/>inline preview"]
    E --> F["✂️ <b>Apply</b> redaction<br/>+ scrub metadata<br/>+ strip fields"]
    F --> G{"✅ <b>Verify</b><br/>round-trip scan<br/>+ rels check"}
    G -->|clean| H["💾 <b>Download</b><br/>.redacted.docx<br/>+ SHA-256 sidecar"]
    G -->|leak detected| I["🔴 Risk review<br/>+ jump to<br/>survived item"]

    classDef default fill:#0f172a,stroke:#1e3a5f,stroke-width:2px,color:#f8fafc;
    classDef action fill:#0f766e,stroke:#14b8a6,color:#ffffff;
    classDef verify fill:#1d4ed8,stroke:#60a5fa,color:#ffffff;
    classDef success fill:#166534,stroke:#22c55e,color:#ffffff;
    classDef fail fill:#991b1b,stroke:#ef4444,color:#ffffff;
    class F action;
    class G verify;
    class H success;
    class I fail;
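
The Verify branch in the flowchart can be sketched as a re-scan of the generated output for every string that was supposed to be removed. This is an illustration only: the type and function names are hypothetical, and a plain substring search stands in for whatever scan the project actually runs.

```typescript
// Hypothetical shapes; the project's real types may differ.
type VerifyResult =
  | { status: "clean" }
  | { status: "leak"; survivors: string[] };

// Re-scan the text extracted from the generated DOCX for every string
// that was supposed to be redacted.
function verifyRoundTrip(outputText: string, redactedStrings: string[]): VerifyResult {
  const survivors = redactedStrings.filter(
    (s) => s.length > 0 && outputText.indexOf(s) !== -1,
  );
  return survivors.length === 0 ? { status: "clean" } : { status: "leak", survivors };
}

// Mirror the two outgoing edges of the flowchart's Verify node.
function nextStep(result: VerifyResult): "download" | "risk-review" {
  return result.status === "clean" ? "download" : "risk-review";
}
```

Returning the surviving strings, not just a boolean, is what would let a UI jump to the item that survived.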

Release Snapshot

  • Artifact: document-redactor.html
  • Current checked size: 262 KB (268,571 bytes)
  • Integrity sidecar: 89 bytes
  • Runtime network calls: 0
  • Automated coverage: 1,700+ tests

Release artifact as checked on April 30, 2026:

  • document-redactor.html SHA-256: 363d7c93008038a6e56137ab0a43251771f8911c7d7aad6e21cd6771a6a8003a
  • Verified locally with shasum -a 256 -c document-redactor.html.sha256
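
For readers who want to see what the sidecar contains, here is a minimal sketch of formatting and parsing one line of it. It assumes the sidecar uses the usual shasum / sha256sum text layout (64 hex characters, two spaces, the file name, a newline); the helper names are hypothetical.

```typescript
// Assumes the common shasum/sha256sum line layout:
// "<64 hex chars><space><space or *><filename>\n". Helper names are hypothetical.
interface SidecarEntry {
  hash: string;
  filename: string;
}

function formatSidecarLine(entry: SidecarEntry): string {
  return `${entry.hash}  ${entry.filename}\n`;
}

// Returns null when the line is not a well-formed SHA-256 checksum line.
function parseSidecarLine(line: string): SidecarEntry | null {
  const match = /^([0-9a-f]{64}) [ *](.+?)\r?\n?$/.exec(line);
  return match ? { hash: match[1], filename: match[2] } : null;
}

// True when a freshly computed digest matches the sidecar entry.
function matchesSidecar(entry: SidecarEntry, computedHexDigest: string): boolean {
  return entry.hash === computedHexDigest.toLowerCase();
}
```

Under that layout, a line for document-redactor.html is 64 + 2 + 22 + 1 = 89 characters long, which is consistent with the 89-byte sidecar size listed above.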

What The Current Release Does

  • Local DOCX traversal: Walks body, headers, footers, footnotes, endnotes, comments, and relationship references inside the DOCX package.
  • Structured review UX: Groups candidates by parties, aliases, identifiers, amounts, dates, entities, case/docket references, heuristics, and catch-all additions.
  • Verification-guided export: Re-checks the generated output, reports residual survivors clearly, and keeps warnings separate from verified-clean downloads.
  • Inline preview: Shows the document text with selection-aware highlights so review happens in context, not in a blind list.
  • OOXML leak hardening: Flattens risky field and hyperlink structures, strips comments, scrubs metadata, and normalizes redaction across split runs.
  • Manual recovery paths: Lets users add missed strings, reuse local policy JSON files, jump back to surviving items, and acknowledge residual risk when they still need the file.

For the public detection catalog, see docs/RULES_GUIDE.md.

Why This Architecture

Single HTML instead of a web app

  • Easier to use: download once, double-click, redact.
  • Easier to audit: one shipped artifact, not a service mesh.
  • Easier to distribute: GitHub Releases, USB, email, Kakao, shared drives.
  • Easier to trust: no backend means no server-side document path to defend.

Rule-based instead of ML or an LLM

  • Sensitive documents never need model inference.
  • Behavior is deterministic and explainable.
  • Regression testing is straightforward.
  • The artifact stays small enough to remain practical as a local HTML tool.

This choice is deliberate. The AI assistant comes after redaction, not inside it.
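
To make "deterministic and explainable" concrete, here is a small sketch of a rule table driving detection. The two rules are illustrative stand-ins invented for this example, not entries from the project's catalog (see docs/RULES_GUIDE.md for that), and the type names are hypothetical.

```typescript
// Illustrative rules only; these are not the project's real detection rules.
interface Rule {
  id: string;
  pattern: RegExp; // must carry the global flag
}

interface Candidate {
  ruleId: string;
  text: string;
  start: number;
  end: number;
}

const RULES: Rule[] = [
  { id: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { id: "digits-6-7", pattern: /\b\d{6}-\d{7}\b/g },
];

function detect(text: string, rules: Rule[] = RULES): Candidate[] {
  const found: Candidate[] = [];
  for (const rule of rules) {
    rule.pattern.lastIndex = 0; // reset regex state so repeated calls behave identically
    let m: RegExpExecArray | null;
    while ((m = rule.pattern.exec(text)) !== null) {
      found.push({ ruleId: rule.id, text: m[0], start: m.index, end: m.index + m[0].length });
      if (m[0].length === 0) rule.pattern.lastIndex++; // guard against empty matches
    }
  }
  // A stable order makes output easy to diff in regression tests.
  return found.sort((a, b) => a.start - b.start || a.ruleId.localeCompare(b.ruleId));
}
```

Because every candidate carries the id of the rule that produced it, a reviewer can trace any match back to a single pattern, and the same input yields the same list on every run.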

Raw OOXML handling instead of high-level document abstractions

DOCX files are ZIP archives of XML parts. Using JSZip plus direct WordprocessingML traversal gives the project the control it needs to:

  • detect matches across split text runs,
  • scan more than just the body text,
  • rewrite only the affected segments,
  • verify the exact output it produces.
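
Word often splits one visible string across several text runs, so a per-run search can miss it. A minimal sketch of the general technique follows: search the concatenated run text, then map each hit back to per-run slices. It models runs as plain strings and uses hypothetical names; the project's own traversal works on the XML directly and may differ.

```typescript
// Hypothetical model: each entry is the text of one run, in document order.
interface RunSpan {
  runIndex: number;
  from: number; // inclusive offset inside the run
  to: number; // exclusive offset inside the run
}

// Find every occurrence of `needle` in the concatenated run text and map
// it back to the per-run slices that would need rewriting.
function findAcrossRuns(runs: string[], needle: string): RunSpan[][] {
  if (needle.length === 0) return [];
  const joined = runs.join("");
  const hits: RunSpan[][] = [];
  let at = joined.indexOf(needle);
  while (at !== -1) {
    const end = at + needle.length;
    const spans: RunSpan[] = [];
    let offset = 0; // absolute offset of the current run's first character
    runs.forEach((run, runIndex) => {
      const from = Math.max(at, offset) - offset;
      const to = Math.min(end, offset + run.length) - offset;
      if (from < to) spans.push({ runIndex, from, to });
      offset += run.length;
    });
    hits.push(spans);
    at = joined.indexOf(needle, end);
  }
  return hits;
}
```

For example, with runs "Contact Ac", "me Co", and "rp today", a search for "Acme Corp" yields one hit made of three slices, one per run, so only those segments need to change.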

Svelte 5 plus single-file bundling

The UI needs to feel modern without blowing up the artifact. Svelte 5 and vite-plugin-singlefile give the project:

  • fast local interactivity,
  • a small runtime footprint,
  • one-file packaging that still supports a real review workflow.
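
One-file packaging implies that the built HTML should reference no external scripts or stylesheets. The sketch below shows one way such a check could look. The function name is hypothetical, and the regex scan is a simplification assumed for illustration: it is not an HTML parser and not the project's actual ship gate.

```typescript
// Hypothetical ship-gate helper: list external script and <link> references
// found in a built HTML string. A regex scan is a simplification.
function findExternalAssetRefs(html: string): string[] {
  const refs: string[] = [];
  const patterns = [
    /<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi,
    /<link\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi,
  ];
  for (const pattern of patterns) {
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(html)) !== null) {
      // data: URIs carry their payload inline, so they are not external.
      if (!/^data:/i.test(m[1])) refs.push(m[1]);
    }
  }
  return refs;
}
```

A build step could fail whenever this list is non-empty, which keeps the shipped file self-contained.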

Quick Start

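1. Download the release

Download document-redactor.html and its document-redactor.html.sha256 sidecar from the latest release.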

2. Verify the artifact

sha256sum -c document-redactor.html.sha256
# expected output:
# document-redactor.html: OK

If sha256sum is not available (for example, on macOS), use shasum instead:

shasum -a 256 -c document-redactor.html.sha256

3. Open the tool

Double-click document-redactor.html. It opens as a file:// page in your browser. There is no install step and no account setup.

4. Run a redaction

  • Drop a .docx
  • Review candidates
  • Click Apply and verify
  • Download {original}.redacted.docx

For a detailed walkthrough, see USAGE.md. For the Korean guide, see USAGE.ko.md.

Trust Model

| Layer | Mechanism | Why it matters |
| --- | --- | --- |
| Source | ESLint bans fetch, XMLHttpRequest, WebSocket, EventSource, sendBeacon, and similar primitives | Network code is stopped before it casually enters the app |
| Build | Single-file ship gate rejects external JS or CSS references and writes a SHA-256 sidecar | The release stays auditable as one artifact |
| Runtime | Embedded CSP uses default-src 'none' and connect-src 'none' | The browser blocks outbound requests at execution time |
| Export | Round-trip verification re-parses the generated DOCX | The app does not silently ship a leaky output |

Note

The privacy story here is not just policy language. It is enforced in source code, build rules, runtime policy, and export verification.
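
An auditor can check the runtime layer by reading the policy embedded in the HTML. The following is a minimal sketch of such a check, with hypothetical helper names. It inspects only the two directives named in the table and makes no claim about the rest of the project's policy.

```typescript
// Parse a CSP string into a directive -> sources map. Helper names are hypothetical.
function parseCsp(policy: string): Record<string, string[]> {
  const directives: Record<string, string[]> = {};
  policy.split(";").forEach((part) => {
    const tokens = part.trim().split(/\s+/).filter((t) => t.length > 0);
    if (tokens.length === 0) return;
    const name = tokens[0].toLowerCase();
    // Browsers ignore a repeated directive, so keep only the first occurrence.
    if (!Object.prototype.hasOwnProperty.call(directives, name)) {
      directives[name] = tokens.slice(1);
    }
  });
  return directives;
}

// True when both default-src and connect-src are exactly 'none'.
function hasNoneDefaultAndConnect(policy: string): boolean {
  const d = parseCsp(policy);
  const isNone = (name: string): boolean => {
    const sources = d[name];
    return sources !== undefined && sources.length === 1 && sources[0] === "'none'";
  };
  return isNone("default-src") && isNone("connect-src");
}
```

Passing the content of the page's Content-Security-Policy meta tag to hasNoneDefaultAndConnect gives a quick yes or no on whether the two directives are set as described.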

Tech Stack

| Layer | Choice | Why this choice |
| --- | --- | --- |
| Distribution | Single document-redactor.html + .sha256 | Simplest release artifact and easiest thing to verify |
| Package manager | Bun 1.x | Fast local workflow and a light toolchain |
| Build | Vite 8 | Clear plugin hooks and dependable modern bundling |
| Single-file packaging | vite-plugin-singlefile | Inlines JS and CSS into one shipped HTML file |
| UI | Svelte 5 | Fine-grained reactivity with a small runtime |
| DOCX engine | JSZip + raw OOXML traversal | Precise control over read, rewrite, and verify |
| Detection | Rule-based regex + structural classifiers | Deterministic, inspectable, lightweight |
| Verification | Round-trip scan + word-count sanity + SHA-256 | Catches leaks, flags suspicious over-redaction, verifies artifacts |
| Quality gates | Vitest + strict TypeScript + svelte-check | Strong regression safety for a trust-sensitive product |
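
The "word-count sanity" entry in the Verification row can be illustrated with a simple before-and-after comparison. This is a sketch under stated assumptions: the function names are hypothetical, words are split on whitespace, and the 0.5 default threshold is invented for the example, since the README does not say how the check is tuned.

```typescript
// Count whitespace-separated words; an empty or blank string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical sanity check: flag output whose word count dropped by more
// than maxDropRatio relative to the input. The 0.5 default is an assumption.
function looksOverRedacted(before: string, after: string, maxDropRatio = 0.5): boolean {
  const original = countWords(before);
  if (original === 0) return false; // nothing to compare against
  const drop = (original - countWords(after)) / original;
  return drop > maxDropRatio;
}
```

A check like this complements the leak scan: the leak scan catches text that should have been removed, while this one flags output that lost far more text than a redaction pass should remove.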

Public Repo Surface

Internal phase briefs and planning notes are intentionally being removed from the public repository.

Known Limitations

  • DOCX only. PDF requires a different pipeline.
  • The preview is review-oriented, not a pixel-faithful Word layout clone.
  • Standard is the only implemented redaction level today.
  • No OCR for text embedded inside images.
  • No traversal into embedded OLE objects.
  • No support for macros/VBA or for encrypted or password-protected DOCX packages.
  • No SmartArt or WordArt extraction.

Developer Workflow

git clone https://github.com/kipeum86/document-redactor.git
cd document-redactor
bun install
bun run test
bun run typecheck
bun run lint
bun run build
open dist/document-redactor.html  # macOS; on Linux use xdg-open

Notes:

  • For browser QA, test the built dist/document-redactor.html, not the dev server.
  • The repository currently carries 1,700+ automated tests across detection, DOCX rewriting, verification, UI state, and ship gates.
  • dist/ is ignored in git; releases should publish the built HTML and its .sha256 sidecar from CI or from a verified local build.

License

Apache License 2.0

Built by @kipeum86.

About

A privacy-preserving, in-browser DOCX redactor for Korean + English legal documents. An offline tool for de-identifying Word documents.
