doubleopen-project/scannerust


scannerust

Blazingly fast license detection for your code.

Scans source files for license declarations and returns identified licenses with SPDX expressions, confidence scores, and line locations. Built in Rust for speed, powered by a database of 40,000+ license detection rules.

Features

  • Fast -- parallel file scanning with per-file timeouts
  • Comprehensive -- 40,000+ rules covering 2,000+ known licenses
  • ScanCode-compatible -- uses the ScanCode toolkit rule database and produces compatible JSON output
  • Index caching -- builds the rule index once, loads from cache in under a second
  • Library + CLI -- use as a Rust library or a standalone command-line tool
  • Multiple matching strategies -- hash, SPDX identifier, Aho-Corasick, and approximate sequence matching

Prerequisites

  • Rust 1.85+ (edition 2024) -- install via rustup:
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  • Git

Installation

git clone https://github.com/doubleopen-project/scannerust.git
cd scannerust
cargo install --path .

Initialize license data

scannerust needs ScanCode's license and rule data (~40k files). There are two options:

Option A: Download via init (recommended for end users)

scannerust init

This downloads the data to ~/.local/share/scannerust/data/. To use a custom location:

scannerust init --data-dir /path/to/data
scannerust --data-dir /path/to/data <files...>

Option B: Git submodule (for development)

git submodule update --init vendor/scancode-toolkit

scannerust automatically detects the vendor data when present.

Quick start

# Scan a file
scannerust path/to/file.py

# Scan a directory recursively, output pretty-printed JSON
scannerust --json-pp results.json src/

# Pipe from stdin
cat LICENSE | scannerust

# Scan for copyrights
scannerust -c path/to/file.py

CLI reference

scannerust [OPTIONS] [FILES/DIRS...]

Reads from stdin if no files are given. Directories are scanned recursively.

Flag                 Description
--json FILE          Write compact JSON output to FILE
--json-pp FILE       Write pretty-printed JSON output to FILE
-l, --license        Scan for licenses (default if no scan type specified)
-c, --copyright      Scan for copyrights
--license-score N    Minimum score 0-100 (default: 0)
--license-text       Include matched text in results
--follow-references  Resolve license references to local files (e.g. "See COPYING")
--strip-root         Strip common root directory from paths
-n, --processes N    Number of parallel threads (default: num_cpus - 1)
--timeout SECS       Per-file timeout in seconds (default: 120)
--reindex            Force index rebuild, ignoring cache
--cache-dir DIR      Custom cache directory
-v, --verbose        Verbose output
-q, --quiet          Suppress non-error output
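
The default for -n can be read as "one fewer than the available CPUs". The following is a minimal sketch of that rule using only the standard library. The function name default_processes and the clamp to at least one thread are assumptions made for illustration, not scannerust's actual code.

```rust
use std::thread;

/// Hypothetical helper: leave one core free, but never drop below one worker.
/// The lower bound of 1 is an assumption; the README only states "num_cpus - 1".
fn default_processes(cpus: usize) -> usize {
    cpus.saturating_sub(1).max(1)
}

fn main() {
    // available_parallelism() can fail on some platforms, so fall back to 1.
    let cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("{} CPUs -> {} worker threads", cpus, default_processes(cpus));
}
```

Passing -n explicitly overrides whatever default the tool computes.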

Subcommands

init -- Download license/rule data from ScanCode:

scannerust init                          # default location
scannerust init --data-dir /path/to/data # custom location
scannerust init --branch v32.3.0         # specific ScanCode version

compare -- Compare two scan result JSON files (e.g. to check for regressions):

scannerust compare baseline.json current.json
scannerust compare baseline.json current.json --format json

Library API

use scannerust::detection::detect;
use scannerust::index::LicenseIndex;
use scannerust::models::{License, Rule};

// Wrapped in main so that `?` compiles; this assumes the crate's error
// types convert into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load license and rule data
    let licenses = License::load_dir("vendor/scancode-toolkit/src/licensedcode/data/licenses")?;
    let rules = Rule::load_dir("vendor/scancode-toolkit/src/licensedcode/data/rules")?;

    // Build the index (~40k rules)
    let index = LicenseIndex::build(&licenses, rules)?;

    // Detect licenses in text
    let (detections, _unknowns) = detect(&index, "Permission is hereby granted, free of charge...");
    for det in &detections {
        // Scores are multiplied by 100 to print them as percentages
        println!("{}: score={:.0}%, lines {}-{}",
            det.license_expression, det.score * 100.0,
            det.start_line, det.end_line);
    }
    Ok(())
}

The index can be cached to disk for fast subsequent loads:

// Save to cache
index.save_cache(&cache_path, licenses_dir, rules_dir)?;

// Load from cache (validates data hasn't changed)
let index = LicenseIndex::load_cached(&cache_path, licenses_dir, rules_dir)?;

Output format

JSON output (--json / --json-pp) is compatible with ScanCode toolkit's output format. Each scanned file includes:

  • License detections with SPDX expressions, scores, and match details
  • Start/end line numbers for each detection
  • Optionally, the matched text (--license-text)
  • Copyright holders and authors (-c)

How it works

The scanner applies a multi-stage matching pipeline to each input file:

  1. Hash matching -- fast exact match via content hashing
  2. SPDX identifier matching -- detects SPDX-License-Identifier: expressions
  3. Aho-Corasick matching -- exact token sequence matching against rule fragments
  4. Approximate sequence matching -- fuzzy alignment for partial or modified license text
  5. Post-processing -- merge overlapping matches, filter false positives, compute scores, group into detections
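
To make the staging concrete, here is a minimal sketch of the first two stages only: a whole-text hash lookup followed by an SPDX-License-Identifier scan. It is an illustration under stated assumptions, not scannerust's implementation. The Match struct, the tokenization, the use of DefaultHasher, and the early return on a hash hit are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical match record; not scannerust's real type.
#[derive(Debug, PartialEq)]
struct Match {
    expression: String,
    stage: &'static str,
    line: usize,
}

/// Lowercased alphanumeric tokens, so case and punctuation do not affect the hash.
fn tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hash of the whole token sequence.
fn token_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    tokens(text).hash(&mut hasher);
    hasher.finish()
}

/// Stage 1: exact match of the entire text against known rule hashes.
fn hash_match(index: &HashMap<u64, String>, text: &str) -> Option<Match> {
    index.get(&token_hash(text)).map(|expr| Match {
        expression: expr.clone(),
        stage: "hash",
        line: 1,
    })
}

/// Stage 2: collect the expression after each SPDX-License-Identifier tag.
fn spdx_matches(text: &str) -> Vec<Match> {
    const TAG: &str = "SPDX-License-Identifier:";
    text.lines()
        .enumerate()
        .filter_map(|(i, line)| {
            let pos = line.find(TAG)?;
            // Drop a trailing block-comment terminator if present.
            let expr = line[pos + TAG.len()..].trim().trim_end_matches("*/").trim();
            if expr.is_empty() {
                return None;
            }
            Some(Match { expression: expr.to_string(), stage: "spdx", line: i + 1 })
        })
        .collect()
}

/// Assumed control flow: a whole-text hash hit short-circuits the later stages.
fn detect_sketch(index: &HashMap<u64, String>, text: &str) -> Vec<Match> {
    match hash_match(index, text) {
        Some(m) => vec![m],
        None => spdx_matches(text),
    }
}

fn main() {
    let mut index = HashMap::new();
    index.insert(token_hash("Permission is hereby granted"), "mit".to_string());
    let source = "// SPDX-License-Identifier: Apache-2.0 OR MIT\nfn main() {}";
    println!("{:?}", detect_sketch(&index, source));
}
```

Running the cheap exact stages first means the costlier Aho-Corasick and approximate stages only need to handle text that the earlier stages could not explain.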

Caching

The rule index is cached at ~/.cache/scannerust/index_cache.bin (or $XDG_CACHE_HOME/scannerust/). The cache is automatically invalidated when license or rule data files change. Use --reindex to force a rebuild, or --cache-dir to specify a custom location.
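
The lookup and invalidation described above can be sketched as follows. This is a hedged illustration: cache_file and fingerprint are hypothetical names, and fingerprinting the data files by (name, size, modification time) is an assumed strategy. The README states only that the cache is invalidated when the data files change, not how the change is detected.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Resolve the cache file: $XDG_CACHE_HOME/scannerust/ if set and non-empty,
/// otherwise ~/.cache/scannerust/.
fn cache_file(xdg_cache_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".cache"),
    };
    base.join("scannerust").join("index_cache.bin")
}

/// Assumed invalidation key: an order-independent hash over
/// (file name, size in bytes, modification time) for every data file.
fn fingerprint(files: &[(&str, u64, u64)]) -> u64 {
    let mut sorted = files.to_vec();
    sorted.sort();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let xdg = std::env::var("XDG_CACHE_HOME").ok();
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("cache file: {}", cache_file(xdg.as_deref(), &home).display());

    // A cache stored alongside `before` would be rebuilt once `after` differs.
    let before = fingerprint(&[("mit.LICENSE", 1023, 1_700_000_000)]);
    let after = fingerprint(&[("mit.LICENSE", 1050, 1_700_000_500)]);
    println!("rebuild needed: {}", before != after);
}
```

Storing such a key next to the serialized index lets a loader decide whether to reuse the cache without re-reading every rule file.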

Known limitations

  • No PDF or binary file content extraction (binary files are skipped; archive/media files are detected and skipped)
  • Proprietary-license post-processing not yet implemented

License

Apache-2.0
