⚡ Fast, reliable, low-latency web crawling for Rust 🕸️
- Fast: 100k+ pages in 1–10 minutes, benchmarked against the real internet.
- Low-latency: pages stream the moment they're fetched. No batching, no waiting.
- Reliable: battle-tested concurrency, automatic retries, proxy hedging, and graceful shutdown.
- Scales with you: one script to a fleet of workers, same API.
```toml
[dependencies]
spider = "2"
```

```rust
use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{} {}", page.status_code, page.get_url());
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
```

That's the whole program. Pages stream as they arrive. The crawler stops when there's nothing left to fetch.
Want JavaScript rendering? Add `features = ["chrome"]` and call `crawl_smart()` instead: Spider will use HTTP first and only spin up Chrome for pages that need it.
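A minimal sketch of that flow, assuming `spider = { version = "2", features = ["chrome"] }` in Cargo.toml:

```rust
use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{}", page.get_url());
        }
    });

    // Plain HTTP first; upgrades to headless Chrome only for
    // pages that need JavaScript rendering.
    website.crawl_smart().await;
    website.unsubscribe();
}
```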
- Scrapers that turn live websites into Markdown, plain text, JSON, or WARC archives.
- Pipelines that ingest pages into search indexes, vector stores, or databases as they're fetched: no temp directories, no full-crawl waits (see the sketch after this list).
- AI agents that browse, click, fill forms, and reason about pages using OpenAI, Gemini, or any OpenAI-compatible API.
- Monitors that re-crawl on a cron schedule and emit only what changed.
- Headless browser automation at scale, locally with Chrome or WebDriver, or remotely via Spider Cloud's managed browser pool.
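For the pipeline pattern, here's a small sketch of streaming ingest; `index_page` is a hypothetical stand-in for whatever search index, vector store, or database client you use:

```rust
use spider::{tokio, website::Website};

// Hypothetical ingest hook: replace with your indexer or store client.
async fn index_page(url: &str, html: &str) {
    println!("indexed {} ({} bytes)", url, html.len());
}

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(64).unwrap();

    let ingest = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Pages arrive as they are fetched: no temp directories,
            // no waiting for the full crawl to finish.
            index_page(page.get_url(), &page.get_html()).await;
        }
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = ingest.await;
}
```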
Streams in real time. Subscribe once, get every page as it lands. Spider uses Tokio broadcast channels and event-driven wakeups under the hood, so your consumer is never starved and the crawler never blocks on a slow downstream.
Handles JavaScript when it has to. "Smart mode" starts with plain HTTP and transparently upgrades to headless Chrome the moment it detects a page needs it. You don't pay the Chrome tax on pages that don't.
Survives the modern web. Real-browser fingerprint emulation, header spoofing, proxy rotation, request hedging across proxies, and stealth Chrome, all built in. For sites that fight back hard, Spider Cloud's Smart mode auto-detects Cloudflare, Akamai, Imperva, and friends and switches to its unblocker without a config change.
Respects the rules when you want it to. Robots.txt, sitemaps, per-path budgets, depth limits, allow/deny lists, glob and regex filters, cron schedules: all one method call away.
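For instance, a sketch of sitemap-only crawling, assuming the `sitemap` feature flag and spider's `crawl_sitemap` entry point:

```rust
use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Fetch only the URLs the sitemap lists: no link discovery.
    website.crawl_sitemap().await;

    // The visited set is available once the crawl completes.
    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```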
Stays out of your way. The whole crawler is one fluent builder. Defaults are good. There's nothing you have to configure to get started.
Scales when you need it. Adaptive concurrency, per-domain rate limiting, HTTP/2 multiplexing, io_uring on Linux, request coalescing, and an optional decentralized mode that splits work across worker processes over IPC.
Here's the builder in action:

```rust
use spider::website::Website;
use std::{collections::HashMap, time::Duration};

let mut website = Website::new("https://example.com")
    .with_limit(50)  // max pages to crawl
    .with_depth(10)  // how deep to follow links
    .with_delay(500) // polite pause between hits (ms)
    .with_request_timeout(Some(Duration::from_secs(30)))
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_user_agent(Some("MyBot/1.0"))
    .with_blacklist_url(Some(vec!["/admin".into()]))
    .with_whitelist_url(Some(vec!["/blog".into()]))
    .with_proxies(Some(vec!["http://proxy:8080".into()]))
    .with_budget(Some(HashMap::from([("/blog", 100), ("*", 1000)])))
    .with_caching(true)
    .with_stealth(true)
    .build()
    .unwrap();
```

Every option is documented in the Configuration API reference.
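Running it is the same as before: `crawl` streams pages to subscribers, while `scrape` buffers them in memory for retrieval afterwards. A short usage sketch (hedged: `scrape` and `get_pages` come from spider's docs, and availability can vary by version and feature flags):

```rust
// Usage sketch for the `website` configured above.
// `scrape` works like `crawl` but retains page contents in memory.
website.scrape().await;

if let Some(pages) = website.get_pages() {
    println!("scraped {} pages", pages.len());
    for page in pages.iter() {
        println!("{} ({} bytes)", page.get_url(), page.get_html().len());
    }
}
```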
If you don't want to run proxies, manage residential IPs, keep a Chrome pool warm, or chase Cloudflare update cycles, point Spider at Spider Cloud and the same code runs against a managed crawling backend. Free tier on signup.
```rust
use spider::configuration::{SpiderCloudConfig, SpiderCloudMode};

let cloud = SpiderCloudConfig::new("sk-...")
    .with_mode(SpiderCloudMode::Smart);

let mut website = Website::new("https://example.com")
    .with_spider_cloud_config(cloud)
    .build()?;
```

Five cloud modes (Proxy, Api, Unblocker, Fallback, Smart) let you trade cost for resilience without changing your crawler code.
| Use Spider as… | Command |
|---|---|
| A Rust library | `cargo add spider` |
| A command-line tool | `cargo install spider_cli` |
| A Node.js package | `npm i @spider-rs/spider-rs` |
| A Python package | `pip install spider_rs` |
| An MCP server (for Claude, Cursor, etc.) | `cargo install spider_mcp` |
| Managed crawling | Sign up at spider.cloud |
The workspace ships nine crates. Most users only need `spider` itself.

| Crate | What it's for |
|---|---|
| `spider` | The crawler. Start here. |
| `spider_cli` | A standalone command-line crawler. |
| `spider_worker` | A worker process for decentralized crawls. |
| `spider_agent` | An autonomous AI browsing agent (Chrome / WebDriver + LLMs). |
| `spider_mcp` | A Model Context Protocol server so AI tools can call Spider directly. |
| `spider_utils`, `spider_agent_types`, `spider_agent_html` | Supporting libraries. |
There are 50+ runnable examples in `examples/`: Chrome automation, screenshots, OpenAI extraction, WARC export, cron scheduling, decentralized crawling, sitemap-only mode, anti-bot setups, and more.
- API reference: https://docs.rs/spider
- Guides & recipes: https://spider.cloud/guides
- Examples: `./examples/`
- Development notes: `CLAUDE.md`
- 💬 Discord: questions, ideas, show-and-tell
- 🐛 GitHub Issues: bug reports and feature requests
- 🗞️ Spider blog: release notes and deep dives
Spider is open source and we love good pull requests. See CONTRIBUTING.md. For tests:

```bash
cargo test -p spider        # unit tests
RUN_LIVE_TESTS=1 cargo test # live network tests
```

MIT: use it for anything, commercial or otherwise.