Introduction
Welcome to the scrape-rs documentation. scrape-rs is a high-performance, cross-platform HTML parsing library with a pure Rust core and bindings for Python, Node.js, and WASM.
Why scrape-rs?
scrape-rs is designed to be 10-50x faster than popular HTML parsing libraries while maintaining a consistent, idiomatic API across all platforms.
Key Features
- Blazing Fast: Built on html5ever with SIMD-accelerated text processing
- Cross-Platform: Identical API for Rust, Python, Node.js, and WASM
- Memory Efficient: Arena-based DOM allocation with minimal overhead
- Spec-Compliant: Full HTML5 parsing with comprehensive CSS selector support
- Modern: Support for streaming parsing, compiled selectors, and parallel processing
Performance Highlights
| Operation | BeautifulSoup | Cheerio | scrape-rs | Speedup |
|---|---|---|---|---|
| Parse 1KB HTML | 0.23ms | 0.18ms | 0.024ms | 9.7-7.5x |
| Parse 100KB HTML | 18ms | 12ms | 1.8ms | 10-6.7x |
| CSS selector query | 0.80ms | 0.12ms | 0.006ms | 133-20x |
| Extract all links | 3.2ms | 0.85ms | 0.18ms | 17.8-4.7x |
Quick Example
Rust
use scrape_core::Soup;

fn main() -> Result<(), scrape_core::Error> {
    let html = r#"<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>"#;
    let soup = Soup::parse(html);

    let product = soup.find(".product")?.expect("product not found");
    let name = product.find("h2")?.expect("name not found").text();
    let price = product.find(".price")?.expect("price not found").text();

    println!("{}: {}", name, price);
    Ok(())
}
Python
from scrape_rs import Soup
html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>'
soup = Soup(html)
product = soup.find(".product")
name = product.find("h2").text
price = product.find(".price").text
print(f"{name}: {price}")
Node.js
import { Soup } from '@scrape-rs/scrape';
const html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>';
const soup = new Soup(html);
const product = soup.find(".product");
const name = product.find("h2").text;
const price = product.find(".price").text;
console.log(`${name}: ${price}`);
WASM
import init, { Soup } from '@scrape-rs/wasm';
await init();
const html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>';
const soup = new Soup(html);
const product = soup.find(".product");
const name = product.find("h2").text;
const price = product.find(".price").text;
console.log(`${name}: ${price}`);
Where to Go Next
- New to scrape-rs? Start with the Quick Start guide
- Migrating from another library? Check out the Migration Guides
- Need the API reference? See the Rust docs on docs.rs
Platform Support
| Platform | Status | Package |
|---|---|---|
| Rust | Stable | scrape-core |
| Python 3.10+ | Stable | fast-scrape |
| Node.js 18+ | Stable | @scrape-rs/scrape |
| WASM | Stable | @scrape-rs/wasm |
License
scrape-rs is dual-licensed under Apache 2.0 and MIT. See LICENSE-APACHE and LICENSE-MIT for details.
Installation
scrape-rs provides bindings for multiple platforms. Choose the installation method for your platform:
Rust
Add scrape-core to your Cargo.toml:
[dependencies]
scrape-core = "0.2"
Or use cargo add:
cargo add scrape-core
Feature Flags
scrape-core supports optional features:
[dependencies]
scrape-core = { version = "0.2", features = ["streaming", "parallel", "simd"] }
| Feature | Description | Default |
|---|---|---|
| `streaming` | Enable streaming parser with constant memory usage | No |
| `parallel` | Enable parallel batch processing with Rayon | No |
| `simd` | Enable SIMD-accelerated text processing | No |
| `serde` | Enable serialization support | No |
Python
Install via pip:
pip install fast-scrape
Or with uv:
uv pip install fast-scrape
Requirements
- Python 3.10 or later
- Supported platforms: Linux (x86_64, aarch64), macOS (x86_64, aarch64), Windows (x86_64)
Virtual Environment
We recommend using a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install fast-scrape
Node.js
Install via npm:
npm install @scrape-rs/scrape
Or with pnpm:
pnpm add @scrape-rs/scrape
Or with yarn:
yarn add @scrape-rs/scrape
Requirements
- Node.js 18 or later
- Supported platforms: Linux (x86_64, aarch64), macOS (x86_64, aarch64), Windows (x86_64)
TypeScript Support
TypeScript types are included automatically. No additional @types package is needed.
WASM (Browser)
Install via npm:
npm install @scrape-rs/wasm
Or with pnpm:
pnpm add @scrape-rs/wasm
Usage in Browser
import init, { Soup } from '@scrape-rs/wasm';
// Initialize WASM module (required once)
await init();
const soup = new Soup('<html>...</html>');
Requirements
- Modern browser with WASM support (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
- Bundle size: ~400KB (gzipped: ~120KB)
Webpack Configuration
If using Webpack, add to your config:
module.exports = {
experiments: {
asyncWebAssembly: true,
},
};
Vite Configuration
If using Vite, add vite-plugin-wasm:
npm install vite-plugin-wasm
import { defineConfig } from 'vite';
import wasm from 'vite-plugin-wasm';
export default defineConfig({
plugins: [wasm()],
});
Verifying Installation
After installation, verify it works:
Rust
cargo run --example basic
Or create a test file:
use scrape_core::Soup;

fn main() {
    let soup = Soup::parse("<html><body><h1>Hello</h1></body></html>");
    println!("{:?}", soup.find("h1"));
}
Python
python -c "from scrape_rs import Soup; print(Soup('<h1>Test</h1>').find('h1').text)"
Node.js
node -e "const {Soup} = require('@scrape-rs/scrape'); console.log(new Soup('<h1>Test</h1>').find('h1').text)"
WASM
Create a test HTML file:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>scrape-rs WASM Test</title>
</head>
<body>
<script type="module">
import init, { Soup } from './node_modules/@scrape-rs/wasm/scrape_wasm.js';
await init();
const soup = new Soup('<h1>Hello WASM</h1>');
const h1 = soup.find('h1');
console.log('Success:', h1.text);
document.body.innerHTML = `<p>Result: ${h1.text}</p>`;
</script>
</body>
</html>
Troubleshooting
Rust: Compilation Errors
If you see compilation errors, ensure you're using Rust 1.75 or later:
rustc --version
rustup update
Python: No Matching Distribution
If you get "no matching distribution found", ensure you're using Python 3.10+:
python --version
If on an unsupported platform, you can build from source:
pip install maturin
git clone https://github.com/bug-ops/scrape-rs.git
cd scrape-rs/crates/scrape-py
maturin develop --release
Node.js: Binary Not Found
If the native module fails to load, ensure your platform is supported:
node -p "process.platform + '-' + process.arch"
Supported: linux-x64, linux-arm64, darwin-x64, darwin-arm64, win32-x64
WASM: Module Not Found
Ensure your bundler is configured to handle WASM files. See platform-specific configuration above.
Next Steps
Now that you have scrape-rs installed, proceed to the Quick Start guide to learn the basics.
Quick Start
This guide will get you parsing and querying HTML in under 5 minutes.
Your First Program
Rust
use scrape_core::Soup;

fn main() {
    let html = r#"
    <html>
      <body>
        <div class="product">
          <h2>Laptop</h2>
          <span class="price">$999</span>
        </div>
      </body>
    </html>
    "#;
    let soup = Soup::parse(html);

    if let Ok(Some(product)) = soup.find(".product") {
        let name = product.find("h2")
            .ok()
            .flatten()
            .map(|t| t.text())
            .unwrap_or_default();
        let price = product.find(".price")
            .ok()
            .flatten()
            .map(|t| t.text())
            .unwrap_or_default();
        println!("Product: {}, Price: {}", name, price);
    }
}
Output:
Product: Laptop, Price: $999
Python
from scrape_rs import Soup
html = """
<html>
<body>
<div class="product">
<h2>Laptop</h2>
<span class="price">$999</span>
</div>
</body>
</html>
"""
soup = Soup(html)
product = soup.find(".product")
if product:
name = product.find("h2").text
price = product.find(".price").text
print(f"Product: {name}, Price: {price}")
Node.js
import { Soup } from '@scrape-rs/scrape';
const html = `
<html>
<body>
<div class="product">
<h2>Laptop</h2>
<span class="price">$999</span>
</div>
</body>
</html>
`;
const soup = new Soup(html);
const product = soup.find(".product");
if (product) {
const name = product.find("h2").text;
const price = product.find(".price").text;
console.log(`Product: ${name}, Price: ${price}`);
}
WASM
import init, { Soup } from '@scrape-rs/wasm';
await init();
const html = `
<html>
<body>
<div class="product">
<h2>Laptop</h2>
<span class="price">$999</span>
</div>
</body>
</html>
`;
const soup = new Soup(html);
const product = soup.find(".product");
if (product) {
const name = product.find("h2").text;
const price = product.find(".price").text;
console.log(`Product: ${name}, Price: ${price}`);
}
Core Concepts
Parsing
scrape-rs parses HTML into a document object model (DOM):
let soup = Soup::parse(html_string);
The parser is:
- Spec-compliant: Uses html5ever for HTML5 parsing
- Forgiving: Handles malformed HTML gracefully
- Fast: Parses 100KB in ~2ms on modern hardware
Finding Elements
Use CSS selectors to find elements:
// Find first matching element
let element = soup.find("div.product")?;

// Find all matching elements
let elements = soup.find_all("div.product")?;
Supported selectors:
- Type: `div`, `span`, `a`
- Class: `.product`, `.price`
- ID: `#main`, `#header`
- Attributes: `[href]`, `[type="text"]`
- Combinators: `div > span`, `h1 + p`, `div span`
- Pseudo-classes: `:first-child`, `:last-child`, `:nth-child(2n)`
Extracting Data
Once you have an element, extract its content:
let tag = soup.find("h1")?.unwrap();

// Get text content
let text = tag.text();

// Get HTML content
let html = tag.html();

// Get attribute value
if let Some(href) = tag.get("href") {
    println!("Link: {}", href);
}

// Check if attribute exists
if tag.has_attr("data-id") {
    // ...
}

// Check for CSS class
if tag.has_class("active") {
    // ...
}
Navigating the Tree
Traverse the DOM tree:
let tag = soup.find("span")?.unwrap();

// Parent element
if let Some(parent) = tag.parent() {
    println!("Parent: {}", parent.name().unwrap());
}

// Children
for child in tag.children() {
    println!("Child: {:?}", child.name());
}

// Next sibling
if let Some(next) = tag.next_sibling() {
    println!("Next: {:?}", next.name());
}

// Previous sibling
if let Some(prev) = tag.prev_sibling() {
    println!("Previous: {:?}", prev.name());
}
Common Patterns
Extract All Links
let soup = Soup::parse(html);

for link in soup.find_all("a[href]")? {
    if let Some(href) = link.get("href") {
        println!("{}", href);
    }
}
Extract Table Data
let soup = Soup::parse(html);

if let Ok(Some(table)) = soup.find("table") {
    for row in table.find_all("tr")? {
        let cells: Vec<String> = row
            .find_all("td")?
            .iter()
            .map(|cell| cell.text())
            .collect();
        println!("{:?}", cells);
    }
}
Filter by Attribute
let soup = Soup::parse(html);

// Find all images with alt text
for img in soup.find_all("img[alt]")? {
    let src = img.get("src").unwrap_or("");
    let alt = img.get("alt").unwrap_or("");
    println!("{}: {}", src, alt);
}
Extract Nested Data
let soup = Soup::parse(html);

for article in soup.find_all("article.post")? {
    let title = article
        .find(".title")?
        .map(|t| t.text())
        .unwrap_or_default();
    let author = article
        .find(".author")?
        .map(|t| t.text())
        .unwrap_or_default();
    let date = article
        .find("time")?
        .and_then(|t| t.get("datetime"))
        .unwrap_or("");
    println!("{} by {} on {}", title, author, date);
}
Error Handling
scrape-rs uses Result for fallible operations:
Rust
use scrape_core::{Soup, Error};

fn extract_title(html: &str) -> Result<String, Error> {
    let soup = Soup::parse(html);
    let title = soup
        .find("title")?
        .ok_or_else(|| Error::not_found("title"))?;
    Ok(title.text())
}
Python
from scrape_rs import Soup, ScrapeError
def extract_title(html):
try:
soup = Soup(html)
title = soup.find("title")
if not title:
raise ScrapeError("Title not found")
return title.text
except ScrapeError as e:
print(f"Error: {e}")
return None
Node.js
import { Soup, ScrapeError } from '@scrape-rs/scrape';
function extractTitle(html: string): string | null {
try {
const soup = new Soup(html);
const title = soup.find("title");
if (!title) {
throw new Error("Title not found");
}
return title.text;
} catch (error) {
console.error(`Error: ${error}`);
return null;
}
}
Performance Tips
For best performance:
- Reuse compiled selectors when querying many documents:
use scrape_core::compile_selector;

let selector = compile_selector("div.product")?;

for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process products...
}
- Use streaming for large files (requires the `streaming` feature):
use scrape_core::{StreamingSoup, StreamingConfig};

let mut streaming = StreamingSoup::with_config(StreamingConfig::default());

streaming.on_element("a[href]", |el| {
    println!("Found link: {:?}", el.get_attribute("href"));
    Ok(())
})?;
- Process in parallel (requires the `parallel` feature):
use scrape_core::parallel::parse_batch;

let results = parse_batch(&html_documents)?;
Next Steps
Now that you know the basics:
- Read Core Concepts for deeper understanding
- Explore the User Guide for advanced features
- Check Migration Guides if coming from another library
Core Concepts
This chapter explains the fundamental concepts behind scrape-rs.
Document Object Model (DOM)
When you parse HTML, scrape-rs builds a Document Object Model (DOM) - a tree representation of the HTML structure.
<html>
<body>
<div class="container">
<h1>Title</h1>
<p>Paragraph</p>
</div>
</body>
</html>
This becomes:
Document
└── html
└── body
└── div.container
├── h1 ("Title")
└── p ("Paragraph")
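The tree-building process can be sketched with Python's standard-library `html.parser` (illustrative only, not the scrape-rs API): each start tag pushes a node onto a stack, each end tag pops it, so nesting in the input becomes parent/child links in the tree.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a nested-dict tree, analogous to what an HTML parser produces."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)  # attach to current parent
        self.stack.append(node)                  # descend into the new node

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()                     # back up to the parent

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text nodes, like the default here
            self.stack[-1]["children"].append({"text": data.strip()})

builder = TreeBuilder()
builder.feed('<div class="container"><h1>Title</h1><p>Paragraph</p></div>')
div = builder.root["children"][0]
print(div["tag"], [c["tag"] for c in div["children"]])  # div ['h1', 'p']
```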
Arena Allocation
scrape-rs uses arena allocation for DOM nodes, which provides:
- Fast allocation: All nodes allocated in contiguous memory
- No reference counting: Zero overhead compared to `Rc<RefCell<T>>`
- Cache-friendly: Better CPU cache utilization
- Safe lifetimes: Rust's borrow checker prevents dangling references
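Conceptually, arena allocation replaces object references with indices into one flat buffer. A minimal Python sketch of the idea (hypothetical names, not the scrape-rs internals):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    tag: str
    parent: Optional[int] = None               # index into the arena, not a reference
    children: List[int] = field(default_factory=list)

class Arena:
    """All nodes live in one contiguous list; a handle is just an index."""
    def __init__(self):
        self.nodes: List[Node] = []

    def alloc(self, tag: str, parent: Optional[int] = None) -> int:
        self.nodes.append(Node(tag, parent))
        idx = len(self.nodes) - 1
        if parent is not None:
            self.nodes[parent].children.append(idx)
        return idx

arena = Arena()
root = arena.alloc("html")
body = arena.alloc("body", parent=root)
h1 = arena.alloc("h1", parent=body)

assert arena.nodes[root].children == [body]
assert arena.nodes[arena.nodes[h1].parent].tag == "body"
```

Because every node is reachable by index, there is no per-node reference counting, and dropping the arena frees the whole tree at once.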
Node Types
The DOM contains three types of nodes:
- Element nodes: HTML tags (`<div>`, `<span>`, etc.)
- Text nodes: Raw text content
- Comment nodes: HTML comments (if `include_comments` is enabled)
Parsing Modes
scrape-rs offers two parsing approaches:
DOM Parser
The default parser builds the entire document tree in memory:
let soup = Soup::parse(html);
Best for:
- Documents under 10MB
- Random access to elements
- Tree navigation (parent, siblings)
- Multiple queries on the same document
Memory usage: O(n) where n is document size
Streaming Parser
The streaming parser processes HTML incrementally:
let mut streaming = StreamingSoup::new();

streaming.on_element("a", |el| {
    println!("Link: {:?}", el.get_attribute("href"));
    Ok(())
})?;
Best for:
- Large documents (100MB+)
- Sequential processing
- One-pass extraction
- Memory-constrained environments
Memory usage: O(1) constant memory
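The streaming model can be illustrated with Python's stdlib `html.parser`, which also processes input incrementally through callbacks: the document is fed in chunks and never materialized as a tree.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Fires a callback per matching element instead of building a tree."""
    def __init__(self, on_link):
        super().__init__()
        self.on_link = on_link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.on_link(href)

links = []
parser = LinkCollector(links.append)

# Feed the document in chunks, as you would when reading a large file;
# the parser buffers incomplete tags across feed() calls.
for chunk in ['<a href="/a">A</a><a hre', 'f="/b">B</a>']:
    parser.feed(chunk)
parser.close()

print(links)  # ['/a', '/b']
```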
CSS Selectors
CSS selectors are patterns for matching elements:
Basic Selectors
// Type selector - matches tag name
soup.find("div")?

// Class selector - matches class attribute
soup.find(".product")?

// ID selector - matches id attribute
soup.find("#header")?

// Universal selector - matches all elements
soup.find("*")?
Attribute Selectors
// Has attribute
soup.find("[href]")?

// Exact match
soup.find("[type='text']")?

// Contains word
soup.find("[class~='active']")?

// Starts with
soup.find("[href^='https://']")?

// Ends with
soup.find("[src$='.png']")?

// Contains substring
soup.find("[href*='example']")?
Combinators
// Descendant - any level deep
soup.find("div span")?

// Child - direct children only
soup.find("ul > li")?

// Adjacent sibling - immediately following
soup.find("h1 + p")?

// General sibling - any following sibling
soup.find("h1 ~ p")?
Pseudo-classes
// First/last child
soup.find("li:first-child")?
soup.find("li:last-child")?

// Nth child
soup.find("li:nth-child(2)")?    // Second child
soup.find("li:nth-child(2n)")?   // Even children
soup.find("li:nth-child(2n+1)")? // Odd children

// Empty elements
soup.find("div:empty")?

// Negation
soup.find("input:not([type='hidden'])")?
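The `an+b` notation matches 1-based sibling indices `i` for which `i = a·n + b` holds for some integer `n ≥ 0`. A small Python sketch of the matching rule (illustrative only, not library code):

```python
def nth_child_matches(a: int, b: int, index: int) -> bool:
    """True if 1-based sibling `index` is matched by :nth-child(an+b), n = 0, 1, 2, ..."""
    if a == 0:
        return index == b
    n, rem = divmod(index - b, a)
    return rem == 0 and n >= 0

# :nth-child(2n) — the even positions
assert [i for i in range(1, 7) if nth_child_matches(2, 0, i)] == [2, 4, 6]
# :nth-child(2n+1) — the odd positions
assert [i for i in range(1, 7) if nth_child_matches(2, 1, i)] == [1, 3, 5]
```

The same rule covers negative steps such as `:nth-child(-n+3)` (the first three children), since `divmod` handles a negative `a` correctly.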
Selector Performance
Different selectors have different performance characteristics:
| Selector | Complexity | Notes |
|---|---|---|
| `#id` | O(1) | Uses ID index |
| `.class` | O(n) | Linear scan with early exit |
| `tag` | O(n) | Linear scan |
| `[attr]` | O(n) | Linear scan |
| `div > span` | O(n) | Depends on tree depth |
| `div span` | O(n²) | Checks all descendants |
Optimization tips:
- Start with an ID selector when possible: `#container .item`, not `.item`
- Use the child combinator (`>`) instead of the descendant combinator when appropriate
- Compile selectors for reuse
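Why `#id` can be O(1): the parser can build a hash index from `id` values to nodes at parse time, so an ID lookup is a single hash probe while other selectors must scan. A toy Python illustration of the idea (not the actual index implementation):

```python
# A toy document: a flat list of nodes, as a stand-in for a parsed DOM
nodes = [
    {"tag": "div", "id": "header"},
    {"tag": "div", "id": "main"},
    {"tag": "span", "id": None},
]

# Built once at parse time: id value -> node
id_index = {n["id"]: n for n in nodes if n["id"] is not None}

# #main: one hash lookup, O(1)
assert id_index["main"]["tag"] == "div"

# span: full scan over all nodes, O(n)
spans = [n for n in nodes if n["tag"] == "span"]
assert len(spans) == 1
```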
Compiled Selectors
For repeated queries, compile the selector once:
use scrape_core::compile_selector;

// Compile once
let selector = compile_selector("div.product")?;

// Reuse many times
for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process...
}
Performance benefit: ~50% faster for complex selectors
Element References
Elements are represented by the Tag type:
let tag = soup.find("div")?.unwrap();
Lifetime Relationship
Tag borrows from the Soup:
let soup = Soup::parse(html);
let tag = soup.find("div")?.unwrap(); // Borrows from soup

// soup cannot be modified or dropped while tag is in scope
This prevents dangling references at compile time.
Copy Semantics
Tag implements Copy, so it can be duplicated cheaply:
let tag1 = soup.find("div")?.unwrap();
let tag2 = tag1; // Copies, both valid
The copy is just a reference (pointer + ID), not the actual element data.
Text Extraction
Text can be extracted in different ways:
Deep Text
By default, text() returns the text of the element and all of its descendants:

let all_text = tag.text(); // Includes descendant text
Normalized Text
Text extraction automatically:
- Collapses multiple spaces into one
- Trims leading/trailing whitespace
- Converts newlines to spaces
To preserve whitespace, use configuration:
let config = SoupConfig::builder()
    .preserve_whitespace(true)
    .build();

let soup = Soup::parse_with_config(html, config);
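The default normalization amounts to collapsing whitespace runs (including newlines) to a single space and trimming the ends, equivalent to this pure-Python sketch:

```python
import re

def normalize(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) to one space and trim."""
    return re.sub(r"\s+", " ", text).strip()

assert normalize("  Hello\n   world \t!") == "Hello world !"
```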
Error Handling
scrape-rs uses explicit error types:
QueryError
Returned when a CSS selector is invalid:
match soup.find("div[") {
    Err(Error::InvalidSelector { selector }) => {
        println!("Bad selector: {}", selector);
    }
    _ => {}
}
NotFound
Not an error - use Option:
// Returns Ok(Some(tag)) or Ok(None), not Err
match soup.find(".missing")? {
    Some(tag) => println!("Found: {}", tag.text()),
    None => println!("Not found"),
}
Memory Efficiency
scrape-rs minimizes memory usage through:
String Interning
Tag names and attribute names are interned:
// Many <div> elements share the same "div" string
let divs = soup.find_all("div")?;

// Memory: O(1) for tag names, not O(n)
Compact Node Representation
Nodes use space-efficient layouts:
- Element: 64 bytes
- Text: 40 bytes
- Comment: 40 bytes
For comparison, BeautifulSoup's Python objects use 200-400 bytes per node.
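You can see where the Python-object overhead comes from with `sys.getsizeof`: even a bare dict-based node costs a few hundred bytes before counting the strings it holds. (This models a generic Python DOM node, not BeautifulSoup's exact layout.)

```python
import sys

# One node as a plain Python object graph: an outer dict plus its containers
node = {"name": "div", "attrs": {"class": "product"}, "children": []}

per_node = (sys.getsizeof(node)            # the node dict itself
            + sys.getsizeof(node["attrs"])  # the attribute dict
            + sys.getsizeof(node["children"]))  # the child list
print(per_node)  # a few hundred bytes, before counting the strings themselves
assert per_node > 100
```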
Zero-copy Text
Text content is stored as references into the original HTML string when possible, avoiding duplication.
Thread Safety
DOM Types
Soup and Tag are !Send and !Sync by design:
// This won't compile:
let soup = Soup::parse(html);
std::thread::spawn(move || {
    soup.find("div"); // Error: Soup is not Send
});
Rationale: the DOM uses interior mutability for caching, which makes it unsafe to share across threads.
Parallel Processing
Use the parallel module for multi-threaded parsing:
use scrape_core::parallel::parse_batch;

// Parses documents in parallel
let results = parse_batch(&documents)?;
Each document gets its own thread-local DOM.
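The same pattern in Python terms, using the stdlib `html.parser` as a stand-in for the scrape-rs API: each worker constructs its own parser, so no DOM state is ever shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text of the first <h1> element."""
    def __init__(self):
        super().__init__()
        self.in_h1, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_data(self, data):
        if self.in_h1:
            self.title += data
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

def parse_one(html: str) -> str:
    p = TitleGrabber()  # each task builds its own parser: nothing shared
    p.feed(html)
    return p.title

docs = [f"<h1>doc {i}</h1>" for i in range(4)]
with ThreadPoolExecutor() as pool:
    titles = list(pool.map(parse_one, docs))  # map preserves input order
assert titles == ["doc 0", "doc 1", "doc 2", "doc 3"]
```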
Configuration Options
Customize parsing behavior:
use scrape_core::SoupConfig;

let config = SoupConfig::builder()
    .max_depth(256)            // Limit nesting depth
    .strict_mode(true)         // Fail on malformed HTML
    .preserve_whitespace(true) // Keep whitespace-only text nodes
    .include_comments(true)    // Include comment nodes
    .build();

let soup = Soup::parse_with_config(html, config);
max_depth
Limits DOM tree depth to prevent stack overflow on deeply nested HTML.
Default: 512
strict_mode
When enabled, parsing fails on malformed HTML instead of attempting recovery.
Default: false (forgiving mode)
preserve_whitespace
Keeps text nodes that contain only whitespace.
Default: false (whitespace-only nodes removed)
include_comments
Includes HTML comments in the DOM tree.
Default: false (comments ignored)
Next Steps
Now that you understand the core concepts:
Parsing HTML
This chapter covers different ways to parse HTML documents with scrape-rs.
Basic Parsing
The simplest way to parse HTML is with Soup::parse():
use scrape_core::Soup;

let html = "<html><body><h1>Hello</h1></body></html>";
let soup = Soup::parse(html);
This uses default configuration and is suitable for most use cases.
Parsing Configuration
Customize parsing behavior with SoupConfig:
use scrape_core::{Soup, SoupConfig};

let config = SoupConfig::builder()
    .max_depth(256)
    .preserve_whitespace(true)
    .include_comments(true)
    .build();

let soup = Soup::parse_with_config(html, config);
Configuration Options
max_depth
Maximum nesting depth for DOM tree. Default: 512
let config = SoupConfig::builder()
    .max_depth(128)
    .build();
Use cases:
- Prevent stack overflow on malicious HTML
- Limit resource usage
- Enforce document structure constraints
preserve_whitespace
Whether to keep whitespace-only text nodes. Default: false
let config = SoupConfig::builder()
    .preserve_whitespace(true)
    .build();
When enabled:
<div>
<span>Text</span>
</div>
Preserves the newline and spaces around <span>.
When disabled (default), whitespace-only text nodes are removed.
include_comments
Whether to include comment nodes in DOM. Default: false
let config = SoupConfig::builder()
    .include_comments(true)
    .build();
Useful for:
- Processing conditional comments
- Extracting metadata from comments
- Preserving comments in modified HTML
Fragment Parsing
Parse HTML fragments without wrapping in <html><body>:
let soup = Soup::parse_fragment("<span>A</span><span>B</span>");
Fragment parsing:
- Does not add `<html>` or `<body>` wrappers
- Parses as if the content appeared inside `<body>`
- Useful for processing snippets
Context Element
Specify parsing context for special elements:
// Parse table rows without a <table> wrapper
let soup = Soup::parse_fragment_with_context("<tr><td>Data</td></tr>", "tbody");
Common contexts:
- `"body"` (default): Standard HTML elements
- `"table"`: Allows `<tr>` without `<tbody>`
- `"tbody"`: Allows `<tr>` directly
- `"tr"`: Allows `<td>` directly
- `"select"`: Allows `<option>` directly
Parsing from File
Read and parse from filesystem:
use std::path::Path;
use scrape_core::Soup;

let soup = Soup::from_file(Path::new("index.html"))?;
For large files, consider streaming instead:
use scrape_core::{StreamingSoup, StreamingConfig};

let mut streaming = StreamingSoup::new();
// Register handlers...
streaming.parse_file("large.html")?;
Parser Modes
DOM Parser (Default)
Builds complete document tree in memory:
let soup = Soup::parse(html);
Characteristics:
- Memory usage: O(n) where n = document size
- Allows random access
- Supports tree navigation (parent, siblings)
- Can query multiple times
- Best for documents < 10MB
Streaming Parser
Processes HTML incrementally with callbacks:
use scrape_core::StreamingSoup;

let mut streaming = StreamingSoup::new();

streaming.on_element("a[href]", |el| {
    if let Some(href) = el.get_attribute("href") {
        println!("Link: {}", href);
    }
    Ok(())
})?;

streaming.write(html.as_bytes())?;
streaming.end()?;
Characteristics:
- Memory usage: O(1) constant
- Sequential processing only
- No tree navigation
- One-pass extraction
- Best for documents > 100MB
Streaming parsing will be covered in Phase 20 Week 2.
Encoding
scrape-rs expects UTF-8 input. If your HTML uses a different encoding, convert first:
use encoding_rs::WINDOWS_1252;

let (decoded, _, _) = WINDOWS_1252.decode(bytes);
let soup = Soup::parse(&decoded);
For automatic encoding detection:
use chardet::detect;

// chardet returns (charset name, confidence, language)
let (encoding_name, _confidence, _language) = detect(bytes);
// Use encoding_rs to decode...
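The decode-first workflow is the same in Python terms: decode the raw bytes with the source encoding, then hand the resulting text to the parser. For example, Windows-1252 curly quotes are invalid as UTF-8:

```python
# 0x93/0x94 are curly quotes in Windows-1252 and invalid as UTF-8 start bytes
raw = b"\x93Widget\x94"

# Decode first, then pass the resulting str to the parser
html = raw.decode("cp1252")
assert html == "\u201cWidget\u201d"  # "Widget" wrapped in curly quotes

# The same bytes are rejected as UTF-8
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: convert before parsing
```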
Malformed HTML
scrape-rs handles malformed HTML gracefully:
Unclosed Tags
<div>
<span>Content
</div>
Parser automatically closes <span> before closing <div>.
Misnested Tags
<b><i>Text</b></i>
Parser restructures to valid nesting:
<b><i>Text</i></b><i></i>
Invalid Attributes
<div class"value">
Parser ignores malformed attributes but continues parsing.
Strict Mode
Enable strict mode to fail on malformed HTML:
let config = SoupConfig::builder()
    .strict_mode(true)
    .build();

match Soup::parse_with_config(bad_html, config) {
    Ok(soup) => { /* ... */ }
    Err(e) => eprintln!("Parse error: {}", e),
}
Parse Warnings
Access parse warnings from Phase 19:
use scrape_core::parser::{Html5everParser, Parser};

let parser = Html5everParser;
let result = parser.parse_with_warnings(html)?;

for warning in result.warnings() {
    println!("Warning: {} at line {}", warning.message(), warning.line());
}

let document = result.into_document();
Warnings include:
- Unexpected end tag
- Misnested tags
- Invalid attributes
- Encoding issues
Performance Considerations
Pre-allocation
For known document size, pre-allocate arena:
use scrape_core::parser::{Html5everParser, Parser, ParseConfig};

let parser = Html5everParser;
let config = ParseConfig::default();
let estimated_nodes = html.len() / 50; // Rough estimate

let document = parser.parse_with_config_and_capacity(html, &config, estimated_nodes)?;
Benefits:
- Reduces allocation overhead
- Improves parse speed by ~10-15%
- Useful when parsing many similar documents
Streaming for Large Documents
For documents over 100MB, use streaming:
let mut streaming = StreamingSoup::new();
// Process in constant memory
Next Steps
- Learn about Querying elements
Querying Elements
This chapter covers finding and selecting elements using CSS selectors.
Finding Elements
find() - First Match
Find the first element matching a selector:
use scrape_core::Soup;

let soup = Soup::parse(html);

// Returns Ok(Some(tag)) if found, Ok(None) if not found
match soup.find("div.product")? {
    Some(tag) => println!("Found: {}", tag.text()),
    None => println!("Not found"),
}
find_all() - All Matches
Find all elements matching a selector:
let tags = soup.find_all("div.product")?;

for tag in tags {
    println!("Product: {}", tag.text());
}
CSS Selector Syntax
Basic Selectors
// Type selector - matches tag name
soup.find("div")?

// Class selector
soup.find(".product")?

// ID selector
soup.find("#header")?

// Multiple classes
soup.find(".product.featured")?

// Compound selector
soup.find("div.product")?
Attribute Selectors
// Has attribute
soup.find("[href]")?

// Exact value
soup.find("[type='text']")?

// Contains word
soup.find("[class~='active']")?

// Starts with
soup.find("[href^='https://']")?

// Ends with
soup.find("[src$='.png']")?

// Contains substring
soup.find("[href*='example']")?

// Case-insensitive
soup.find("[type='TEXT' i]")?
Combinators
// Descendant - any level
soup.find("div span")?

// Child - direct children only
soup.find("ul > li")?

// Adjacent sibling - next element
soup.find("h1 + p")?

// General sibling - following elements
soup.find("h1 ~ p")?
Pseudo-classes
// First/last child
soup.find("li:first-child")?
soup.find("li:last-child")?

// Nth child
soup.find("li:nth-child(2)")?    // Second
soup.find("li:nth-child(2n)")?   // Even
soup.find("li:nth-child(2n+1)")? // Odd
soup.find("li:nth-child(odd)")?  // Odd (shorthand)
soup.find("li:nth-child(even)")? // Even (shorthand)

// Empty elements
soup.find("div:empty")?

// Negation
soup.find("input:not([type='hidden'])")?
Compiled Selectors
For repeated queries, compile the selector once:
use scrape_core::compile_selector;

let selector = compile_selector("div.product")?;

// Reuse for multiple documents
for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process products...
}
Performance improvement: ~50% faster for complex selectors
Selector Explanation
Use explain() to understand selector performance:
use scrape_core::explain;

let explanation = explain("div.product > span.price")?;
println!("Specificity: {:?}", explanation.specificity());
println!("Optimization hints: {:?}", explanation.hints());
With document context:
use scrape_core::explain_with_document;

let soup = Soup::parse(html);
let explanation = explain_with_document("div.product", soup.document())?;
println!("Matches: {}", explanation.match_count());
println!("Estimated cost: {}", explanation.estimated_cost());
Scoped Queries
Query within a specific element:
let container = soup.find("#products")?.unwrap();

// Find within container only
let products = container.find_all(".product")?;

for product in products {
    let name = product.find(".name")?.unwrap().text();
    println!("Product: {}", name);
}
Error Handling
use scrape_core::Error;

match soup.find("div[invalid") {
    Err(Error::InvalidSelector { selector }) => {
        eprintln!("Bad selector: {}", selector);
    }
    Ok(Some(tag)) => {
        // Process tag
    }
    Ok(None) => {
        // Not found
    }
}
Performance Tips
- Use ID selectors when possible (O(1) lookup)
- Prefer the child combinator (`>`) over the descendant combinator
- Compile selectors for reuse
- Use `find()` instead of `find_all()` when only one result is needed
Next Steps
- Read more in the Parsing guide
Migration Overview
scrape-rs provides a consistent API across platforms while being significantly faster than existing HTML parsing libraries. This guide helps you migrate from other popular libraries.
Why Migrate?
Performance
scrape-rs is 10-50x faster than most HTML parsing libraries:
| Library | Language | Parse 100KB | Query | Extract Links |
|---|---|---|---|---|
| BeautifulSoup | Python | 18ms | 0.80ms | 3.2ms |
| lxml | Python | 8ms | 0.15ms | 1.1ms |
| Cheerio | Node.js | 12ms | 0.12ms | 0.85ms |
| scraper | Rust | 3.2ms | 0.015ms | 0.22ms |
| scrape-rs | All | 1.8ms | 0.006ms | 0.18ms |
Memory Efficiency
scrape-rs uses arena allocation with compact node representation:
| Library | Memory per Node | 100KB Document |
|---|---|---|
| BeautifulSoup | ~300 bytes | ~15 MB |
| lxml | ~150 bytes | ~7.5 MB |
| Cheerio | ~200 bytes | ~10 MB |
| scrape-rs | ~50 bytes | ~2.5 MB |
Cross-Platform Consistency
Same API across Rust, Python, Node.js, and WASM:
# Python
soup = Soup(html)
div = soup.find("div.product")
// Node.js - identical API
const soup = new Soup(html);
const div = soup.find("div.product");
// Rust - identical API
let soup = Soup::parse(html);
let div = soup.find("div.product")?;
Migration Guides
Detailed migration guides for the following libraries are coming in Phase 20 Week 2:
- BeautifulSoup (Python) - 10-50x performance improvement
- Cheerio (Node.js) - 6-20x performance improvement
- lxml (Python) - Simpler API with comparable HTML performance
- scraper Crate (Rust) - Better performance with cross-platform bindings
When to migrate:
- Need Python/Node.js bindings
- Want streaming support
- Need better performance for large documents
Compatibility: ~80% API compatible
Migration Strategy
1. Side-by-Side Testing
Run both libraries in parallel during development:
# Python example
from bs4 import BeautifulSoup
from scrape_rs import Soup
# Parse with both
soup_bs4 = BeautifulSoup(html, 'lxml')
soup_scrape = Soup(html)
# Compare results
result_bs4 = soup_bs4.find("div", class_="product").text
result_scrape = soup_scrape.find("div.product").text
assert result_bs4.strip() == result_scrape.strip()
2. Gradual Rollout
Start with non-critical code paths:
import os

USE_SCRAPE_RS = os.getenv("USE_SCRAPE_RS", "false") == "true"
if USE_SCRAPE_RS:
from scrape_rs import Soup as Parser
else:
from bs4 import BeautifulSoup as Parser
soup = Parser(html)
3. Performance Testing
Benchmark before and after migration:
import time
from scrape_rs import Soup
start = time.time()
for html in documents:
soup = Soup(html)
results = soup.find_all("div.product")
# Process results...
end = time.time()
print(f"Processed {len(documents)} documents in {end - start:.2f}s")
Common Patterns
Query Syntax Differences
| Pattern | BeautifulSoup/lxml | Cheerio | scrape-rs |
|---|---|---|---|
| Find by class | find(class_="item") | $(".item") | find(".item") |
| Find by id | find(id="header") | $("#header") | find("#header") |
| Find by tag | find("div") | $("div") | find("div") |
| Find all | find_all("div") | $("div") | find_all("div") |
Text Extraction
| Library | Method | Whitespace Handling |
|---|---|---|
| BeautifulSoup | .get_text() or .text | Manual strip() needed |
| Cheerio | .text() | Automatic trim |
| lxml | .text_content() | Manual strip() needed |
| scrape-rs | .text | Automatic normalize |
Attribute Access
| Library | Get Attribute | Check Existence |
|---|---|---|
| BeautifulSoup | tag.get("href") or tag["href"] | tag.has_attr("href") |
| Cheerio | elem.attr("href") | elem.attr("href") !== undefined |
| lxml | elem.get("href") | "href" in elem.attrib |
| scrape-rs | tag.get("href") | tag.has_attr("href") |
Known Limitations
Not Supported
scrape-rs intentionally does not support:
- DOM Modification: The DOM is immutable after parsing
  - No `.append()`, `.insert()`, `.remove()` methods
  - Use HTML rewriting for output modification
- XML Parsing: Only HTML5 parsing is supported
  - No XML namespaces
  - No XML declaration handling
- Encoding Detection: Input must be UTF-8
  - Use chardet/encoding libraries before parsing
  - Or convert to UTF-8 first
Performance Trade-offs
scrape-rs optimizes for:
- Parse speed
- Query speed
- Memory efficiency
At the cost of:
- No modification support
- Immutable DOM only
Getting Help
If you encounter issues during migration:
- Read the API documentation
- Check the Getting Started guide
- Open an issue on GitHub
Detailed migration guides for BeautifulSoup, Cheerio, lxml, and scraper are coming in Phase 20 Week 2.