scannerust

Blazingly fast license detection for your code.

scannerust scans source files for license declarations and reports each identified license with its SPDX expression, confidence score, and line range. It is written in Rust for speed and uses a database of 40,000+ license detection rules.

Features

Fast -- parallel file scanning with per-file timeouts
Comprehensive -- 40,000+ rules covering 2,000+ known licenses
ScanCode-compatible -- uses the ScanCode toolkit rule database and produces compatible JSON output
Index caching -- builds the rule index once, loads from cache in under a second
Library + CLI -- use as a Rust library or a standalone command-line tool
Multiple matching strategies -- hash, SPDX identifier, Aho-Corasick, and approximate sequence matching

Prerequisites

Rust 1.85+ (edition 2024) -- install via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Git

Installation

git clone https://github.com/user/scannerust.git
cd scannerust
cargo install --path .

Initialize license data

scannerust needs ScanCode's license and rule data (~40k files). There are two options:

Option A: Download via init (recommended for end users)

scannerust init

This downloads the data to ~/.local/share/scannerust/data/. To use a custom location:

scannerust init --data-dir /path/to/data
scannerust --data-dir /path/to/data <files...>

Option B: Git submodule (for development)

git submodule update --init vendor/scancode-toolkit

scannerust automatically detects the vendor data when present.

Quick start

# Scan a file
scannerust path/to/file.py

# Scan a directory recursively, output pretty-printed JSON
scannerust --json-pp results.json src/

# Pipe from stdin
cat LICENSE | scannerust

# Scan for copyrights
scannerust -c path/to/file.py

CLI reference

scannerust [OPTIONS] [FILES/DIRS...]

Reads from stdin if no files are given. Directories are scanned recursively.

Flag	Description
`--json FILE`	Write compact JSON output to FILE
`--json-pp FILE`	Write pretty-printed JSON output to FILE
`-l, --license`	Scan for licenses (default if no scan type specified)
`-c, --copyright`	Scan for copyrights
`--license-score N`	Minimum score 0-100 (default: 0)
`--license-text`	Include matched text in results
`--follow-references`	Resolve license references to local files (e.g. "See COPYING")
`--strip-root`	Strip common root directory from paths
`-n, --processes N`	Number of parallel threads (default: num_cpus - 1)
`--timeout SECS`	Per-file timeout in seconds (default: 120)
`--reindex`	Force index rebuild, ignoring cache
`--cache-dir DIR`	Custom cache directory
`-v, --verbose`	Verbose output
`-q, --quiet`	Suppress non-error output
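
Two of the defaults above can be sketched in a few lines. This is an illustration only, not scannerust's actual code: `workers_for`, `default_workers`, `Detection`, and `filter_by_score` are hypothetical names, the floor of one worker is an assumption, and the score is assumed to be a 0.0-1.0 fraction (as in the Library API example) that is compared against the 0-100 value given to --license-score.

```rust
use std::thread;

/// Worker count for a machine with `cores` logical CPUs: one less than the
/// core count, but never below one (the floor is an assumption).
fn workers_for(cores: usize) -> usize {
    cores.saturating_sub(1).max(1)
}

/// Default for `-n/--processes`, derived from the cores the OS reports.
fn default_workers() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    workers_for(cores)
}

/// Hypothetical detection record; `score` is a fraction in 0.0..=1.0.
#[derive(Debug, Clone, PartialEq)]
struct Detection {
    license_expression: String,
    score: f64,
}

/// Keep only detections whose score, as a percentage, reaches `min_score`
/// (the 0-100 value passed to `--license-score`).
fn filter_by_score(detections: Vec<Detection>, min_score: f64) -> Vec<Detection> {
    detections
        .into_iter()
        .filter(|d| d.score * 100.0 >= min_score)
        .collect()
}
```

With the default threshold of 0, `filter_by_score` keeps every detection, which matches the table's default.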

Subcommands

init -- Download license/rule data from ScanCode:

scannerust init                          # default location
scannerust init --data-dir /path/to/data # custom location
scannerust init --branch v32.3.0         # specific ScanCode version

compare -- Compare two scan result JSON files (e.g. to check for regressions):

scannerust compare baseline.json current.json
scannerust compare baseline.json current.json --format json
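
As a rough illustration of what a regression check between two scans involves, the sketch below diffs two in-memory results. It is not the compare subcommand's implementation: the `Scan` type, `FileDiff`, `scan_of`, and `compare_scans` are hypothetical, and JSON parsing is left out so the example depends only on the standard library.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory scan result: file path -> license expressions.
type Scan = BTreeMap<String, BTreeSet<String>>;

/// Differences for one file between the baseline and the current scan.
#[derive(Debug, PartialEq)]
struct FileDiff {
    path: String,
    missing: Vec<String>, // in the baseline, absent from the current scan
    added: Vec<String>,   // in the current scan, absent from the baseline
}

/// Build a scan from (path, expression) pairs.
fn scan_of(pairs: &[(&str, &str)]) -> Scan {
    let mut scan = Scan::new();
    for (path, expr) in pairs {
        scan.entry(path.to_string())
            .or_default()
            .insert(expr.to_string());
    }
    scan
}

/// Report every file whose set of license expressions changed,
/// in path order. Files with identical sets are omitted.
fn compare_scans(baseline: &Scan, current: &Scan) -> Vec<FileDiff> {
    let empty = BTreeSet::new();
    let paths: BTreeSet<&String> = baseline.keys().chain(current.keys()).collect();
    paths
        .into_iter()
        .filter_map(|path| {
            let old = baseline.get(path).unwrap_or(&empty);
            let new = current.get(path).unwrap_or(&empty);
            let missing: Vec<String> = old.difference(new).cloned().collect();
            let added: Vec<String> = new.difference(old).cloned().collect();
            if missing.is_empty() && added.is_empty() {
                None
            } else {
                Some(FileDiff { path: path.clone(), missing, added })
            }
        })
        .collect()
}
```

An empty result from `compare_scans` means the two scans agree on every file.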

Library API

use scannerust::detection::detect;
use scannerust::index::LicenseIndex;
use scannerust::models::{License, Rule};

// Wrapped in `main` so that `?` has a Result to return into. This assumes the
// crate's error types convert into Box<dyn std::error::Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load license and rule data
    let licenses = License::load_dir("vendor/scancode-toolkit/src/licensedcode/data/licenses")?;
    let rules = Rule::load_dir("vendor/scancode-toolkit/src/licensedcode/data/rules")?;

    // Build the index (~40k rules)
    let index = LicenseIndex::build(&licenses, rules)?;

    // Detect licenses in text; `det.score` is a fraction, printed as a percentage
    let (detections, _unknowns) = detect(&index, "Permission is hereby granted, free of charge...");
    for det in &detections {
        println!("{}: score={:.0}%, lines {}-{}",
            det.license_expression, det.score * 100.0,
            det.start_line, det.end_line);
    }
    Ok(())
}

The index can be cached to disk for fast subsequent loads:

// `cache_path`, `licenses_dir`, and `rules_dir` are paths supplied by the caller.
// Save to cache
index.save_cache(&cache_path, licenses_dir, rules_dir)?;

// Load from cache (validates data hasn't changed)
let index = LicenseIndex::load_cached(&cache_path, licenses_dir, rules_dir)?;

Output format

JSON output (--json / --json-pp) is compatible with ScanCode toolkit's output format. Each scanned file includes:

License detections with SPDX expressions, scores, and match details
Start/end line numbers for each detection
Optionally, the matched text (--license-text)
Copyright holders and authors (-c)

How it works

The scanner applies a multi-stage matching pipeline to each input file:

Hash matching -- fast exact match via content hashing
SPDX identifier matching -- detects SPDX-License-Identifier: expressions
Aho-Corasick matching -- exact token sequence matching against rule fragments
Approximate sequence matching -- fuzzy alignment for partial or modified license text
Post-processing -- merge overlapping matches, filter false positives, compute scores, group into detections
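
To make the first two stages concrete, here is a minimal standard-library sketch of hash matching followed by SPDX identifier matching. It is an illustration under stated assumptions, not scannerust's implementation: `tokenize`, `HashIndex`, `spdx_identifiers`, and `detect_early` are hypothetical names, the tokenizer is deliberately naive, and the tag match is case-sensitive for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Lowercase the text and split it into alphanumeric tokens, so whitespace
/// and punctuation differences do not change the hash.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hash a whole token sequence into a single u64.
fn token_hash(tokens: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    tokens.hash(&mut hasher);
    hasher.finish()
}

/// Stage 1: exact match of the entire input against known rule texts.
struct HashIndex {
    by_hash: HashMap<u64, String>,
}

impl HashIndex {
    /// `rules` holds hypothetical (license expression, rule text) pairs.
    fn new(rules: &[(&str, &str)]) -> Self {
        let by_hash = rules
            .iter()
            .map(|(expr, text)| (token_hash(&tokenize(text)), expr.to_string()))
            .collect();
        HashIndex { by_hash }
    }

    fn lookup(&self, text: &str) -> Option<&str> {
        self.by_hash
            .get(&token_hash(&tokenize(text)))
            .map(String::as_str)
    }
}

/// Stage 2: collect (1-based line number, expression) for each
/// `SPDX-License-Identifier:` tag, dropping a trailing `*/` if present.
fn spdx_identifiers(text: &str) -> Vec<(usize, String)> {
    const TAG: &str = "SPDX-License-Identifier:";
    text.lines()
        .enumerate()
        .filter_map(|(i, line)| {
            let pos = line.find(TAG)?;
            let expr = line[pos + TAG.len()..]
                .trim()
                .trim_end_matches("*/")
                .trim();
            if expr.is_empty() {
                None
            } else {
                Some((i + 1, expr.to_string()))
            }
        })
        .collect()
}

/// Try the cheap whole-file hash first, then fall back to SPDX tags.
fn detect_early(index: &HashIndex, text: &str) -> Vec<(usize, String)> {
    if let Some(expr) = index.lookup(text) {
        return vec![(1, expr.to_string())];
    }
    spdx_identifiers(text)
}
```

Ordering the stages from cheapest to most expensive means a file that is a verbatim license text can be identified by the hash lookup alone; whether the real pipeline stops early in that case is not stated in this README.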

Caching

The rule index is cached at ~/.cache/scannerust/index_cache.bin (or $XDG_CACHE_HOME/scannerust/). The cache is automatically invalidated when license or rule data files change. Use --reindex to force a rebuild, or --cache-dir to specify a custom location.
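
The cache lookup described above can be sketched as follows. The path logic follows the locations named in this section; everything else is an assumption for illustration: `cache_file` and `data_fingerprint` are hypothetical names, the cache file is assumed to sit directly inside a directory given with --cache-dir, and fingerprinting file names, sizes, and modification times is one possible way to detect changed data, not necessarily the one scannerust uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Resolve the index cache file. Precedence: explicit `--cache-dir` value,
/// then a non-empty `$XDG_CACHE_HOME`, then `~/.cache`.
fn cache_file(cache_dir_flag: Option<&str>, xdg_cache_home: Option<&str>, home: &str) -> PathBuf {
    if let Some(dir) = cache_dir_flag {
        // Assumption: the file is placed directly in the custom directory.
        return PathBuf::from(dir).join("index_cache.bin");
    }
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".cache"),
    };
    base.join("scannerust").join("index_cache.bin")
}

/// Fingerprint a data directory from (relative path, size, mtime seconds)
/// entries. Entries are sorted first so listing order does not matter.
fn data_fingerprint(files: &[(String, u64, u64)]) -> u64 {
    let mut sorted = files.to_vec();
    sorted.sort();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// A cache is usable only if the fingerprint stored with it still matches
/// the data on disk; otherwise the index must be rebuilt.
fn cache_is_fresh(stored_fingerprint: u64, files: &[(String, u64, u64)]) -> bool {
    stored_fingerprint == data_fingerprint(files)
}
```

In a real program the arguments to `cache_file` would come from the parsed command line and from `std::env::var`, and passing --reindex would simply skip the `cache_is_fresh` check.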

Known limitations

No PDF or binary file content extraction (binary files are skipped; archive/media files are detected and skipped)
Proprietary-license post-processing not yet implemented

License

Apache-2.0
