Introduction

Welcome to the scrape-rs documentation. scrape-rs is a high-performance, cross-platform HTML parsing library with a pure Rust core and bindings for Python, Node.js, and WASM.

Why scrape-rs?

scrape-rs is designed to be 10-50x faster than popular HTML parsing libraries while maintaining a consistent, idiomatic API across all platforms.

Key Features

  • Blazing Fast: Built on html5ever with SIMD-accelerated text processing
  • Cross-Platform: Identical API for Rust, Python, Node.js, and WASM
  • Memory Efficient: Arena-based DOM allocation with minimal overhead
  • Spec-Compliant: Full HTML5 parsing with comprehensive CSS selector support
  • Modern: Support for streaming parsing, compiled selectors, and parallel processing

Performance Highlights

| Operation          | BeautifulSoup | Cheerio | scrape-rs | Speedup   |
|--------------------|---------------|---------|-----------|-----------|
| Parse 1KB HTML     | 0.23ms        | 0.18ms  | 0.024ms   | 9.7-7.5x  |
| Parse 100KB HTML   | 18ms          | 12ms    | 1.8ms     | 10-6.7x   |
| CSS selector query | 0.80ms        | 0.12ms  | 0.006ms   | 133-20x   |
| Extract all links  | 3.2ms         | 0.85ms  | 0.18ms    | 17.8-4.7x |

Quick Example

Rust

use scrape_core::Soup;

fn main() -> Result<(), scrape_core::Error> {
    let html = r#"<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>"#;
    let soup = Soup::parse(html);

    let product = soup.find(".product")?.expect("product not found");
    let name = product.find("h2")?.expect("name not found").text();
    let price = product.find(".price")?.expect("price not found").text();

    println!("{}: {}", name, price);
    Ok(())
}

Python

from scrape_rs import Soup

html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>'
soup = Soup(html)

product = soup.find(".product")
name = product.find("h2").text
price = product.find(".price").text

print(f"{name}: {price}")

Node.js

import { Soup } from '@scrape-rs/scrape';

const html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>';
const soup = new Soup(html);

const product = soup.find(".product");
const name = product.find("h2").text;
const price = product.find(".price").text;

console.log(`${name}: ${price}`);

WASM

import init, { Soup } from '@scrape-rs/wasm';

await init();

const html = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>';
const soup = new Soup(html);

const product = soup.find(".product");
const name = product.find("h2").text;
const price = product.find(".price").text;

console.log(`${name}: ${price}`);

Platform Support

| Platform     | Status | Package           |
|--------------|--------|-------------------|
| Rust         | Stable | scrape-core       |
| Python 3.10+ | Stable | fast-scrape       |
| Node.js 18+  | Stable | @scrape-rs/scrape |
| WASM         | Stable | @scrape-rs/wasm   |

License

scrape-rs is dual-licensed under Apache 2.0 and MIT. See LICENSE-APACHE and LICENSE-MIT for details.

Installation

scrape-rs provides bindings for multiple platforms. Choose the installation method for your platform:

Rust

Add scrape-core to your Cargo.toml:

[dependencies]
scrape-core = "0.2"

Or use cargo add:

cargo add scrape-core

Feature Flags

scrape-core supports optional features:

[dependencies]
scrape-core = { version = "0.2", features = ["streaming", "parallel", "simd"] }

| Feature   | Description                                        | Default |
|-----------|----------------------------------------------------|---------|
| streaming | Enable streaming parser with constant memory usage | No      |
| parallel  | Enable parallel batch processing with Rayon        | No      |
| simd      | Enable SIMD-accelerated text processing            | No      |
| serde     | Enable serialization support                       | No      |

Python

Install via pip:

pip install fast-scrape

Or with uv:

uv pip install fast-scrape

Requirements

  • Python 3.10 or later
  • Supported platforms: Linux (x86_64, aarch64), macOS (x86_64, aarch64), Windows (x86_64)

Virtual Environment

We recommend using a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fast-scrape

Node.js

Install via npm:

npm install @scrape-rs/scrape

Or with pnpm:

pnpm add @scrape-rs/scrape

Or with yarn:

yarn add @scrape-rs/scrape

Requirements

  • Node.js 18 or later
  • Supported platforms: Linux (x86_64, aarch64), macOS (x86_64, aarch64), Windows (x86_64)

TypeScript Support

TypeScript types are included automatically. No additional @types package is needed.

WASM (Browser)

Install via npm:

npm install @scrape-rs/wasm

Or with pnpm:

pnpm add @scrape-rs/wasm

Usage in Browser

import init, { Soup } from '@scrape-rs/wasm';

// Initialize WASM module (required once)
await init();

const soup = new Soup('<html>...</html>');

Requirements

  • Modern browser with WASM support (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
  • Bundle size: ~400KB (gzipped: ~120KB)

Webpack Configuration

If using Webpack, add to your config:

module.exports = {
  experiments: {
    asyncWebAssembly: true,
  },
};

Vite Configuration

If using Vite, add vite-plugin-wasm:

npm install vite-plugin-wasm

Then register the plugin:
import { defineConfig } from 'vite';
import wasm from 'vite-plugin-wasm';

export default defineConfig({
  plugins: [wasm()],
});

Verifying Installation

After installation, verify it works:

Rust

cargo run --example basic

Or create a test file:

use scrape_core::Soup;

fn main() {
    let soup = Soup::parse("<html><body><h1>Hello</h1></body></html>");
    println!("{:?}", soup.find("h1"));
}

Python

python -c "from scrape_rs import Soup; print(Soup('<h1>Test</h1>').find('h1').text)"

Node.js

node -e "const {Soup} = require('@scrape-rs/scrape'); console.log(new Soup('<h1>Test</h1>').find('h1').text)"

WASM

Create a test HTML file:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>scrape-rs WASM Test</title>
</head>
<body>
    <script type="module">
        import init, { Soup } from './node_modules/@scrape-rs/wasm/scrape_wasm.js';

        await init();
        const soup = new Soup('<h1>Hello WASM</h1>');
        const h1 = soup.find('h1');
        console.log('Success:', h1.text);
        document.body.innerHTML = `<p>Result: ${h1.text}</p>`;
    </script>
</body>
</html>

Troubleshooting

Rust: Compilation Errors

If you see compilation errors, ensure you're using Rust 1.75 or later:

rustc --version
rustup update

Python: No Matching Distribution

If you get "no matching distribution found", ensure you're using Python 3.10+:

python --version

If on an unsupported platform, you can build from source:

pip install maturin
git clone https://github.com/bug-ops/scrape-rs.git
cd scrape-rs/crates/scrape-py
maturin develop --release

Node.js: Binary Not Found

If the native module fails to load, ensure your platform is supported:

node -p "process.platform + '-' + process.arch"

Supported: linux-x64, linux-arm64, darwin-x64, darwin-arm64, win32-x64

WASM: Module Not Found

Ensure your bundler is configured to handle WASM files. See platform-specific configuration above.

Next Steps

Now that you have scrape-rs installed, proceed to the Quick Start guide to learn the basics.

Quick Start

This guide will get you parsing and querying HTML in under 5 minutes.

Your First Program

Rust

use scrape_core::Soup;

fn main() {
    let html = r#"
        <html>
            <body>
                <div class="product">
                    <h2>Laptop</h2>
                    <span class="price">$999</span>
                </div>
            </body>
        </html>
    "#;

    let soup = Soup::parse(html);

    if let Ok(Some(product)) = soup.find(".product") {
        let name = product.find("h2")
            .ok()
            .flatten()
            .map(|t| t.text())
            .unwrap_or_default();

        let price = product.find(".price")
            .ok()
            .flatten()
            .map(|t| t.text())
            .unwrap_or_default();

        println!("Product: {}, Price: {}", name, price);
    }
}

Output:

Product: Laptop, Price: $999

Python

from scrape_rs import Soup

html = """
<html>
    <body>
        <div class="product">
            <h2>Laptop</h2>
            <span class="price">$999</span>
        </div>
    </body>
</html>
"""

soup = Soup(html)
product = soup.find(".product")

if product:
    name = product.find("h2").text
    price = product.find(".price").text
    print(f"Product: {name}, Price: {price}")

Node.js

import { Soup } from '@scrape-rs/scrape';

const html = `
<html>
    <body>
        <div class="product">
            <h2>Laptop</h2>
            <span class="price">$999</span>
        </div>
    </body>
</html>
`;

const soup = new Soup(html);
const product = soup.find(".product");

if (product) {
    const name = product.find("h2").text;
    const price = product.find(".price").text;
    console.log(`Product: ${name}, Price: ${price}`);
}

WASM

import init, { Soup } from '@scrape-rs/wasm';

await init();

const html = `
<html>
    <body>
        <div class="product">
            <h2>Laptop</h2>
            <span class="price">$999</span>
        </div>
    </body>
</html>
`;

const soup = new Soup(html);
const product = soup.find(".product");

if (product) {
    const name = product.find("h2").text;
    const price = product.find(".price").text;
    console.log(`Product: ${name}, Price: ${price}`);
}

Core Concepts

Parsing

scrape-rs parses HTML into a document object model (DOM):

#![allow(unused)]
fn main() {
let soup = Soup::parse(html_string);
}

The parser is:

  • Spec-compliant: Uses html5ever for HTML5 parsing
  • Forgiving: Handles malformed HTML gracefully
  • Fast: Parses 100KB in ~2ms on modern hardware

Finding Elements

Use CSS selectors to find elements:

#![allow(unused)]
fn main() {
// Find first matching element
let element = soup.find("div.product")?;

// Find all matching elements
let elements = soup.find_all("div.product")?;
}

Supported selectors:

  • Type: div, span, a
  • Class: .product, .price
  • ID: #main, #header
  • Attributes: [href], [type="text"]
  • Combinators: div > span, h1 + p, div span
  • Pseudo-classes: :first-child, :last-child, :nth-child(2n)

Extracting Data

Once you have an element, extract its content:

#![allow(unused)]
fn main() {
let tag = soup.find("h1")?.unwrap();

// Get text content
let text = tag.text();

// Get HTML content
let html = tag.html();

// Get attribute value
if let Some(href) = tag.get("href") {
    println!("Link: {}", href);
}

// Check if attribute exists
if tag.has_attr("data-id") {
    // ...
}

// Check for CSS class
if tag.has_class("active") {
    // ...
}
}

Traverse the DOM tree:

#![allow(unused)]
fn main() {
let tag = soup.find("span")?.unwrap();

// Parent element
if let Some(parent) = tag.parent() {
    println!("Parent: {}", parent.name().unwrap());
}

// Children
for child in tag.children() {
    println!("Child: {:?}", child.name());
}

// Next sibling
if let Some(next) = tag.next_sibling() {
    println!("Next: {:?}", next.name());
}

// Previous sibling
if let Some(prev) = tag.prev_sibling() {
    println!("Previous: {:?}", prev.name());
}
}

Common Patterns

Extract All Links

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);

for link in soup.find_all("a[href]")? {
    if let Some(href) = link.get("href") {
        println!("{}", href);
    }
}
}

Extract Table Data

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);

if let Ok(Some(table)) = soup.find("table") {
    for row in table.find_all("tr")? {
        let cells: Vec<String> = row
            .find_all("td")?
            .iter()
            .map(|cell| cell.text())
            .collect();
        println!("{:?}", cells);
    }
}
}

Filter by Attribute

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);

// Find all images with alt text
for img in soup.find_all("img[alt]")? {
    let src = img.get("src").unwrap_or("");
    let alt = img.get("alt").unwrap_or("");
    println!("{}: {}", src, alt);
}
}

Extract Nested Data

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);

for article in soup.find_all("article.post")? {
    let title = article
        .find(".title")?
        .map(|t| t.text())
        .unwrap_or_default();

    let author = article
        .find(".author")?
        .map(|t| t.text())
        .unwrap_or_default();

    let date = article
        .find("time")?
        .and_then(|t| t.get("datetime"))
        .unwrap_or("");

    println!("{} by {} on {}", title, author, date);
}
}

Error Handling

scrape-rs uses Result for fallible operations:

Rust

#![allow(unused)]
fn main() {
use scrape_core::{Soup, Error};

fn extract_title(html: &str) -> Result<String, Error> {
    let soup = Soup::parse(html);
    let title = soup
        .find("title")?
        .ok_or_else(|| Error::not_found("title"))?;
    Ok(title.text())
}
}

Python

from scrape_rs import Soup, ScrapeError

def extract_title(html):
    try:
        soup = Soup(html)
        title = soup.find("title")
        if not title:
            raise ScrapeError("Title not found")
        return title.text
    except ScrapeError as e:
        print(f"Error: {e}")
        return None

Node.js

import { Soup, ScrapeError } from '@scrape-rs/scrape';

function extractTitle(html: string): string | null {
    try {
        const soup = new Soup(html);
        const title = soup.find("title");
        if (!title) {
            throw new ScrapeError("Title not found");
        }
        return title.text;
    } catch (error) {
        console.error(`Error: ${error}`);
        return null;
    }
}

Performance Tips

For best performance:

  1. Reuse compiled selectors when querying many documents:
#![allow(unused)]
fn main() {
use scrape_core::compile_selector;

let selector = compile_selector("div.product")?;

for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process products...
}
}
  2. Use streaming for large files (requires the streaming feature):
#![allow(unused)]
fn main() {
use scrape_core::{StreamingSoup, StreamingConfig};

let mut streaming = StreamingSoup::with_config(
    StreamingConfig::default()
);

streaming.on_element("a[href]", |el| {
    println!("Found link: {:?}", el.get_attribute("href"));
    Ok(())
})?;
}
  3. Process in parallel (requires the parallel feature):
#![allow(unused)]
fn main() {
use scrape_core::parallel::parse_batch;

let results = parse_batch(&html_documents)?;
}

Next Steps

Now that you know the basics, continue with the chapters on core concepts and querying.

Core Concepts

This chapter explains the fundamental concepts behind scrape-rs.

Document Object Model (DOM)

When you parse HTML, scrape-rs builds a Document Object Model (DOM) - a tree representation of the HTML structure.

<html>
  <body>
    <div class="container">
      <h1>Title</h1>
      <p>Paragraph</p>
    </div>
  </body>
</html>

This becomes:

Document
└── html
    └── body
        └── div.container
            ├── h1 ("Title")
            └── p ("Paragraph")

Arena Allocation

scrape-rs uses arena allocation for DOM nodes, which provides:

  • Fast allocation: All nodes allocated in contiguous memory
  • No reference counting: Avoids the overhead of Rc<RefCell<T>>
  • Cache-friendly: Better CPU cache utilization
  • Safe lifetimes: Rust's borrow checker prevents dangling references

Node Types

The DOM contains three types of nodes:

  1. Element nodes: HTML tags (<div>, <span>, etc.)
  2. Text nodes: Raw text content
  3. Comment nodes: HTML comments (if include_comments is enabled)
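
As an illustration, an arena-allocated DOM with these three node kinds can be sketched in plain Rust. The names (`Arena`, `NodeKind`) and layout are invented for this sketch and are not scrape-rs's actual internals:

```rust
// Simplified arena sketch: all nodes live in one Vec and refer to each
// other by index, so there is no Rc/RefCell bookkeeping per node.
#[derive(Debug)]
enum NodeKind {
    Element { name: String },
    Text(String),
    Comment(String),
}

#[derive(Debug)]
struct Node {
    kind: NodeKind,
    parent: Option<usize>, // index into the arena
    children: Vec<usize>,  // indices into the arena
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Push a node, wire it to its parent, and return its arena index.
    fn alloc(&mut self, kind: NodeKind, parent: Option<usize>) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { kind, parent, children: Vec::new() });
        if let Some(p) = parent {
            self.nodes[p].children.push(id);
        }
        id
    }
}

fn main() {
    let mut arena = Arena::new();
    let div = arena.alloc(NodeKind::Element { name: "div".into() }, None);
    let text = arena.alloc(NodeKind::Text("Hello".into()), Some(div));
    assert_eq!(arena.nodes[div].children, vec![text]);
    println!("{:?}", arena.nodes[text].kind);
}
```

Because nodes are plain indices into contiguous memory, dropping the arena frees the whole tree at once.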

Parsing Modes

scrape-rs offers two parsing approaches:

DOM Parser

The default parser builds the entire document tree in memory:

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);
}

Best for:

  • Documents under 10MB
  • Random access to elements
  • Tree navigation (parent, siblings)
  • Multiple queries on the same document

Memory usage: O(n) where n is document size

Streaming Parser

The streaming parser processes HTML incrementally:

#![allow(unused)]
fn main() {
let mut streaming = StreamingSoup::new();
streaming.on_element("a", |el| {
    println!("Link: {:?}", el.get_attribute("href"));
    Ok(())
})?;
}

Best for:

  • Large documents (100MB+)
  • Sequential processing
  • One-pass extraction
  • Memory-constrained environments

Memory usage: O(1) constant memory
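
The callback model can be illustrated with a toy link scanner that never builds a tree. This is a naive substring scan for illustration only, not the real HTML parser, and it ignores quoting edge cases:

```rust
// Toy streaming-style extraction: invoke a handler for each href="..."
// value found in the input, without constructing any DOM.
fn for_each_href(html: &str, mut handler: impl FnMut(&str)) {
    let mut rest = html;
    while let Some(pos) = rest.find("href=\"") {
        let after = &rest[pos + 6..];
        if let Some(end) = after.find('"') {
            handler(&after[..end]);
            rest = &after[end + 1..];
        } else {
            break;
        }
    }
}

fn main() {
    let html = r#"<a href="/a">A</a><a href="/b">B</a>"#;
    let mut links = Vec::new();
    for_each_href(html, |href| links.push(href.to_string()));
    assert_eq!(links, vec!["/a", "/b"]);
}
```

The key property shared with the real streaming parser: work happens inside the callback as input is consumed, so memory stays bounded regardless of document size.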

CSS Selectors

CSS selectors are patterns for matching elements:

Basic Selectors

#![allow(unused)]
fn main() {
// Type selector - matches tag name
soup.find("div")?

// Class selector - matches class attribute
soup.find(".product")?

// ID selector - matches id attribute
soup.find("#header")?

// Universal selector - matches all elements
soup.find("*")?
}

Attribute Selectors

#![allow(unused)]
fn main() {
// Has attribute
soup.find("[href]")?

// Exact match
soup.find("[type='text']")?

// Contains word
soup.find("[class~='active']")?

// Starts with
soup.find("[href^='https://']")?

// Ends with
soup.find("[src$='.png']")?

// Contains substring
soup.find("[href*='example']")?
}

Combinators

#![allow(unused)]
fn main() {
// Descendant - any level deep
soup.find("div span")?

// Child - direct children only
soup.find("ul > li")?

// Adjacent sibling - immediately following
soup.find("h1 + p")?

// General sibling - any following sibling
soup.find("h1 ~ p")?
}

Pseudo-classes

#![allow(unused)]
fn main() {
// First/last child
soup.find("li:first-child")?
soup.find("li:last-child")?

// Nth child
soup.find("li:nth-child(2)")?      // Second child
soup.find("li:nth-child(2n)")?     // Even children
soup.find("li:nth-child(2n+1)")?   // Odd children

// Empty elements
soup.find("div:empty")?

// Negation
soup.find("input:not([type='hidden'])")?
}
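
The `an+b` arguments follow the standard CSS convention: a 1-based child index matches when it equals `a*n + b` for some integer `n >= 0`. A minimal matcher sketch (the function name is invented for illustration):

```rust
// Check whether a 1-based child index matches :nth-child(an+b),
// i.e. whether index = a*n + b for some integer n >= 0.
fn nth_child_matches(a: i64, b: i64, index: i64) -> bool {
    if a == 0 {
        return index == b;
    }
    let diff = index - b;
    diff % a == 0 && diff / a >= 0
}

fn main() {
    // 2n matches even children, 2n+1 matches odd children.
    assert!(nth_child_matches(2, 0, 4));
    assert!(!nth_child_matches(2, 0, 3));
    assert!(nth_child_matches(2, 1, 3));
    // -n+3 matches only the first three children.
    assert!(nth_child_matches(-1, 3, 2));
    assert!(!nth_child_matches(-1, 3, 4));
}
```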

Selector Performance

Different selectors have different performance characteristics:

| Selector   | Complexity | Notes                       |
|------------|------------|-----------------------------|
| #id        | O(1)       | Uses ID index               |
| .class     | O(n)       | Linear scan with early exit |
| tag        | O(n)       | Linear scan                 |
| [attr]     | O(n)       | Linear scan                 |
| div > span | O(n)       | Depends on tree depth       |
| div span   | O(n²)      | Checks all descendants      |

Optimization tips:

  • Start with ID selector when possible: #container .item not .item
  • Use child combinator (>) instead of descendant when appropriate
  • Compile selectors for reuse

Compiled Selectors

For repeated queries, compile the selector once:

#![allow(unused)]
fn main() {
use scrape_core::compile_selector;

// Compile once
let selector = compile_selector("div.product")?;

// Reuse many times
for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process...
}
}

Performance benefit: ~50% faster for complex selectors

Element References

Elements are represented by the Tag type:

#![allow(unused)]
fn main() {
let tag = soup.find("div")?.unwrap();
}

Lifetime Relationship

Tag borrows from the Soup:

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);
let tag = soup.find("div")?.unwrap();  // Borrows from soup
// soup cannot be modified or dropped while tag is in scope
}

This prevents dangling references at compile time.

Copy Semantics

Tag implements Copy, so it can be duplicated cheaply:

#![allow(unused)]
fn main() {
let tag1 = soup.find("div")?.unwrap();
let tag2 = tag1;  // Copies, both valid
}

The copy is just a reference (pointer + ID), not the actual element data.

Text Extraction

By default, text() returns the combined text of the element and all of its descendants:

#![allow(unused)]
fn main() {
// Text of this element and all descendants
let all_text = tag.text();
}

Normalized Text

Text extraction automatically:

  • Collapses multiple spaces into one
  • Trims leading/trailing whitespace
  • Converts newlines to spaces
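
These rules amount to standard whitespace collapsing. A minimal sketch of the behavior (an approximation for illustration, not the library's exact implementation):

```rust
// Collapse runs of whitespace (spaces, tabs, newlines) into single
// spaces and trim leading/trailing whitespace.
fn normalize_text(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    assert_eq!(normalize_text("  Hello\n   world  "), "Hello world");
}
```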

To preserve whitespace, use configuration:

#![allow(unused)]
fn main() {
let config = SoupConfig::builder()
    .preserve_whitespace(true)
    .build();
let soup = Soup::parse_with_config(html, config);
}

Error Handling

scrape-rs uses explicit error types:

QueryError

Returned when a CSS selector is invalid:

#![allow(unused)]
fn main() {
match soup.find("div[") {
    Err(Error::InvalidSelector { selector }) => {
        println!("Bad selector: {}", selector);
    }
    _ => {}
}
}

NotFound

Not an error - use Option:

#![allow(unused)]
fn main() {
// Returns Ok(Some(tag)) or Ok(None), not Err
match soup.find(".missing")? {
    Some(tag) => println!("Found: {}", tag.text()),
    None => println!("Not found"),
}
}

Memory Efficiency

scrape-rs minimizes memory usage through:

String Interning

Tag names and attribute names are interned:

#![allow(unused)]
fn main() {
// Many <div> elements share the same "div" string
let divs = soup.find_all("div")?;
// Memory: O(1) for tag names, not O(n)
}
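
A minimal interner sketch illustrates the idea (the names here are invented, not scrape-rs internals): each distinct string is stored once and handed out as a small integer symbol.

```rust
use std::collections::HashMap;

// Minimal string interner: a thousand <div> nodes share one "div".
struct Interner {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl Interner {
    fn new() -> Self {
        Interner { ids: HashMap::new(), names: Vec::new() }
    }

    // Return the existing symbol for `name`, or allocate a new one.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.ids.insert(name.to_string(), id);
        self.names.push(name.to_string());
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.names[id as usize]
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern("div");
    let b = interner.intern("div");
    assert_eq!(a, b); // same symbol, one stored string
    assert_eq!(interner.resolve(a), "div");
}
```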

Compact Node Representation

Nodes use space-efficient layouts:

  • Element: 64 bytes
  • Text: 40 bytes
  • Comment: 40 bytes

For comparison, BeautifulSoup's Python objects use 200-400 bytes per node.

Zero-copy Text

Text content is stored as references into the original HTML string when possible, avoiding duplication.
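
In Rust this pattern is typically expressed with `Cow`: borrow the input slice when it can be returned as-is, allocate only when the text must change. A small sketch (the `text_content` helper is hypothetical, not the library API):

```rust
use std::borrow::Cow;

// Return a borrowed slice when no cleanup is needed; allocate a new
// String only when the text actually has to be rewritten.
fn text_content(raw: &str) -> Cow<'_, str> {
    if raw.contains('\n') {
        Cow::Owned(raw.replace('\n', " "))
    } else {
        Cow::Borrowed(raw)
    }
}

fn main() {
    assert!(matches!(text_content("plain"), Cow::Borrowed(_)));
    assert!(matches!(text_content("two\nlines"), Cow::Owned(_)));
    assert_eq!(text_content("two\nlines"), "two lines");
}
```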

Thread Safety

DOM Types

Soup and Tag are !Send and !Sync by design:

#![allow(unused)]
fn main() {
// This won't compile:
let soup = Soup::parse(html);
std::thread::spawn(move || {
    soup.find("div");  // Error: Soup is not Send
});
}

Rationale: the DOM uses interior mutability for caching, which would be unsafe to share across threads.

Parallel Processing

Use the parallel module for multi-threaded parsing:

#![allow(unused)]
fn main() {
use scrape_core::parallel::parse_batch;

// Parses documents in parallel
let results = parse_batch(&documents)?;
}

Each document gets its own thread-local DOM.
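
The per-thread model can be illustrated with `std::thread::scope`, using a trivial stand-in for parsing; each document is handled entirely on its own thread, so nothing DOM-like is ever shared:

```rust
use std::thread;

fn main() {
    let documents = ["<p>a</p>", "<p>b</p><p>c</p>"];

    // Spawn one scoped thread per document; each thread owns all of
    // its intermediate state (here, just a tag count).
    let counts: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = documents
            .iter()
            .map(|doc| s.spawn(move || doc.matches("<p>").count()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    assert_eq!(counts, vec![1, 2]);
    println!("{:?}", counts);
}
```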

Configuration Options

Customize parsing behavior:

#![allow(unused)]
fn main() {
use scrape_core::SoupConfig;

let config = SoupConfig::builder()
    .max_depth(256)              // Limit nesting depth
    .strict_mode(true)           // Fail on malformed HTML
    .preserve_whitespace(true)   // Keep whitespace-only text nodes
    .include_comments(true)      // Include comment nodes
    .build();

let soup = Soup::parse_with_config(html, config);
}

max_depth

Limits DOM tree depth to prevent stack overflow on deeply nested HTML.

Default: 512

strict_mode

When enabled, parsing fails on malformed HTML instead of attempting recovery.

Default: false (forgiving mode)

preserve_whitespace

Keeps text nodes that contain only whitespace.

Default: false (whitespace-only nodes removed)

include_comments

Includes HTML comments in the DOM tree.

Default: false (comments ignored)

Next Steps

Now that you understand the core concepts, continue to the chapter on parsing HTML.

Parsing HTML

This chapter covers different ways to parse HTML documents with scrape-rs.

Basic Parsing

The simplest way to parse HTML is with Soup::parse():

#![allow(unused)]
fn main() {
use scrape_core::Soup;

let html = "<html><body><h1>Hello</h1></body></html>";
let soup = Soup::parse(html);
}

This uses default configuration and is suitable for most use cases.

Parsing Configuration

Customize parsing behavior with SoupConfig:

#![allow(unused)]
fn main() {
use scrape_core::{Soup, SoupConfig};

let config = SoupConfig::builder()
    .max_depth(256)
    .preserve_whitespace(true)
    .include_comments(true)
    .build();

let soup = Soup::parse_with_config(html, config);
}

Configuration Options

max_depth

Maximum nesting depth for DOM tree. Default: 512

#![allow(unused)]
fn main() {
let config = SoupConfig::builder()
    .max_depth(128)
    .build();
}

Use cases:

  • Prevent stack overflow on malicious HTML
  • Limit resource usage
  • Enforce document structure constraints

preserve_whitespace

Whether to keep whitespace-only text nodes. Default: false

#![allow(unused)]
fn main() {
let config = SoupConfig::builder()
    .preserve_whitespace(true)
    .build();
}

When enabled:

<div>
    <span>Text</span>
</div>

Preserves the newline and spaces around <span>.

When disabled (default), whitespace-only text nodes are removed.

include_comments

Whether to include comment nodes in DOM. Default: false

#![allow(unused)]
fn main() {
let config = SoupConfig::builder()
    .include_comments(true)
    .build();
}

Useful for:

  • Processing conditional comments
  • Extracting metadata from comments
  • Preserving comments in modified HTML

Fragment Parsing

Parse HTML fragments without wrapping in <html><body>:

#![allow(unused)]
fn main() {
let soup = Soup::parse_fragment("<span>A</span><span>B</span>");
}

Fragment parsing:

  • Does not add <html> or <body> wrappers
  • Parses as if content appeared inside <body>
  • Useful for processing snippets

Context Element

Specify parsing context for special elements:

#![allow(unused)]
fn main() {
// Parse table rows without <table> wrapper
let soup = Soup::parse_fragment_with_context("<tr><td>Data</td></tr>", "tbody");
}

Common contexts:

  • "body" (default): Standard HTML elements
  • "table": Allows <tr> without <tbody>
  • "tbody": Allows <tr> directly
  • "tr": Allows <td> directly
  • "select": Allows <option> directly

Parsing from File

Read and parse from filesystem:

#![allow(unused)]
fn main() {
use std::path::Path;
use scrape_core::Soup;

let soup = Soup::from_file(Path::new("index.html"))?;
}

For large files, consider streaming instead:

#![allow(unused)]
fn main() {
use scrape_core::{StreamingSoup, StreamingConfig};

let mut streaming = StreamingSoup::new();
// Register handlers...
streaming.parse_file("large.html")?;
}

Parser Modes

DOM Parser (Default)

Builds complete document tree in memory:

#![allow(unused)]
fn main() {
let soup = Soup::parse(html);
}

Characteristics:

  • Memory usage: O(n) where n = document size
  • Allows random access
  • Supports tree navigation (parent, siblings)
  • Can query multiple times
  • Best for documents < 10MB

Streaming Parser

Processes HTML incrementally with callbacks:

#![allow(unused)]
fn main() {
use scrape_core::StreamingSoup;

let mut streaming = StreamingSoup::new();

streaming.on_element("a[href]", |el| {
    if let Some(href) = el.get_attribute("href") {
        println!("Link: {}", href);
    }
    Ok(())
})?;

streaming.write(html.as_bytes())?;
streaming.end()?;
}

Characteristics:

  • Memory usage: O(1) constant
  • Sequential processing only
  • No tree navigation
  • One-pass extraction
  • Best for documents > 100MB

A dedicated chapter on streaming parsing is planned.

Encoding

scrape-rs expects UTF-8 input. If your HTML uses a different encoding, convert first:

#![allow(unused)]
fn main() {
use encoding_rs::WINDOWS_1252;

let (decoded, _, _) = WINDOWS_1252.decode(bytes);
let soup = Soup::parse(&decoded);
}

For automatic encoding detection:

#![allow(unused)]
fn main() {
use chardet::detect;

let (encoding_name, _confidence) = detect(bytes);
// Use encoding_rs to decode...
}

Malformed HTML

scrape-rs handles malformed HTML gracefully:

Unclosed Tags

<div>
    <span>Content
</div>

Parser automatically closes <span> before closing <div>.

Misnested Tags

<b><i>Text</b></i>

Parser restructures to valid nesting:

<b><i>Text</i></b><i></i>

Invalid Attributes

<div class"value">

Parser ignores malformed attributes but continues parsing.

Strict Mode

Enable strict mode to fail on malformed HTML:

#![allow(unused)]
fn main() {
let config = SoupConfig::builder()
    .strict_mode(true)
    .build();

match Soup::parse_with_config(bad_html, config) {
    Ok(soup) => { /* ... */ }
    Err(e) => eprintln!("Parse error: {}", e),
}
}

Parse Warnings

Access warnings collected during parsing:

#![allow(unused)]
fn main() {
use scrape_core::parser::{Html5everParser, Parser};

let parser = Html5everParser;
let result = parser.parse_with_warnings(html)?;

for warning in result.warnings() {
    println!("Warning: {} at line {}", warning.message(), warning.line());
}

let document = result.into_document();
}

Warnings include:

  • Unexpected end tag
  • Misnested tags
  • Invalid attributes
  • Encoding issues

Performance Considerations

Pre-allocation

For known document size, pre-allocate arena:

#![allow(unused)]
fn main() {
use scrape_core::parser::{Html5everParser, Parser, ParseConfig};

let parser = Html5everParser;
let config = ParseConfig::default();
let estimated_nodes = html.len() / 50;  // Rough estimate

let document = parser.parse_with_config_and_capacity(
    html,
    &config,
    estimated_nodes
)?;
}

Benefits:

  • Reduces allocation overhead
  • Improves parse speed by ~10-15%
  • Useful when parsing many similar documents

Streaming for Large Documents

For documents over 100MB, use streaming:

#![allow(unused)]
fn main() {
let mut streaming = StreamingSoup::new();
// Process in constant memory
}

Querying Elements

This chapter covers finding and selecting elements using CSS selectors.

Finding Elements

find() - First Match

Find the first element matching a selector:

#![allow(unused)]
fn main() {
use scrape_core::Soup;

let soup = Soup::parse(html);

// Returns Ok(Some(tag)) if found, Ok(None) if not found
match soup.find("div.product")? {
    Some(tag) => println!("Found: {}", tag.text()),
    None => println!("Not found"),
}
}

find_all() - All Matches

Find all elements matching a selector:

#![allow(unused)]
fn main() {
let tags = soup.find_all("div.product")?;

for tag in tags {
    println!("Product: {}", tag.text());
}
}

CSS Selector Syntax

Basic Selectors

#![allow(unused)]
fn main() {
// Type selector - matches tag name
soup.find("div")?

// Class selector
soup.find(".product")?

// ID selector
soup.find("#header")?

// Multiple classes
soup.find(".product.featured")?

// Compound selector
soup.find("div.product")?
}

Attribute Selectors

#![allow(unused)]
fn main() {
// Has attribute
soup.find("[href]")?

// Exact value
soup.find("[type='text']")?

// Contains word
soup.find("[class~='active']")?

// Starts with
soup.find("[href^='https://']")?

// Ends with
soup.find("[src$='.png']")?

// Contains substring
soup.find("[href*='example']")?

// Case-insensitive
soup.find("[type='TEXT' i]")?
}

Combinators

#![allow(unused)]
fn main() {
// Descendant - any level
soup.find("div span")?

// Child - direct children only
soup.find("ul > li")?

// Adjacent sibling - next element
soup.find("h1 + p")?

// General sibling - following elements
soup.find("h1 ~ p")?
}

Pseudo-classes

#![allow(unused)]
fn main() {
// First/last child
soup.find("li:first-child")?
soup.find("li:last-child")?

// Nth child
soup.find("li:nth-child(2)")?       // Second
soup.find("li:nth-child(2n)")?      // Even
soup.find("li:nth-child(2n+1)")?    // Odd
soup.find("li:nth-child(odd)")?     // Odd (shorthand)
soup.find("li:nth-child(even)")?    // Even (shorthand)

// Empty elements
soup.find("div:empty")?

// Negation
soup.find("input:not([type='hidden'])")?
}

Compiled Selectors

For repeated queries, compile the selector once:

#![allow(unused)]
fn main() {
use scrape_core::compile_selector;

let selector = compile_selector("div.product")?;

// Reuse for multiple documents
for html in documents {
    let soup = Soup::parse(html);
    let products = soup.find_all_compiled(&selector)?;
    // Process products...
}
}

Performance improvement: ~50% faster for complex selectors

Selector Explanation

Use explain() to understand selector performance:

#![allow(unused)]
fn main() {
use scrape_core::explain;

let explanation = explain("div.product > span.price")?;

println!("Specificity: {:?}", explanation.specificity());
println!("Optimization hints: {:?}", explanation.hints());
}

With document context:

#![allow(unused)]
fn main() {
use scrape_core::explain_with_document;

let soup = Soup::parse(html);
let explanation = explain_with_document("div.product", soup.document())?;

println!("Matches: {}", explanation.match_count());
println!("Estimated cost: {}", explanation.estimated_cost());
}

Scoped Queries

Query within a specific element:

#![allow(unused)]
fn main() {
let container = soup.find("#products")?.unwrap();

// Find within container only
let products = container.find_all(".product")?;

for product in products {
    let name = product.find(".name")?.unwrap().text();
    println!("Product: {}", name);
}
}

Error Handling

#![allow(unused)]
fn main() {
use scrape_core::Error;

match soup.find("div[invalid") {
    Err(Error::InvalidSelector { selector }) => {
        eprintln!("Bad selector: {}", selector);
    }
    Ok(Some(tag)) => {
        // Process tag
    }
    Ok(None) => {
        // Not found
    }
}
}

Performance Tips

  1. Use ID selectors when possible (O(1) lookup)
  2. Prefer the child combinator (>) over the descendant combinator
  3. Compile selectors for reuse across documents
  4. Use find() instead of find_all() when only one result is needed

Migration Overview

scrape-rs provides a consistent API across platforms while being significantly faster than existing HTML parsing libraries. This guide helps you migrate from other popular libraries.

Why Migrate?

Performance

scrape-rs is 10-50x faster than most HTML parsing libraries:

| Library       | Language | Parse 100KB | Query   | Extract Links |
|---------------|----------|-------------|---------|---------------|
| BeautifulSoup | Python   | 18ms        | 0.80ms  | 3.2ms         |
| lxml          | Python   | 8ms         | 0.15ms  | 1.1ms         |
| Cheerio       | Node.js  | 12ms        | 0.12ms  | 0.85ms        |
| scraper       | Rust     | 3.2ms       | 0.015ms | 0.22ms        |
| scrape-rs     | All      | 1.8ms       | 0.006ms | 0.18ms        |

Memory Efficiency

scrape-rs uses arena allocation with compact node representation:

| Library       | Memory per Node | 100KB Document |
|---------------|-----------------|----------------|
| BeautifulSoup | ~300 bytes      | ~15 MB         |
| lxml          | ~150 bytes      | ~7.5 MB        |
| Cheerio       | ~200 bytes      | ~10 MB         |
| scrape-rs     | ~50 bytes       | ~2.5 MB        |

Cross-Platform Consistency

Same API across Rust, Python, Node.js, and WASM:

# Python
soup = Soup(html)
div = soup.find("div.product")

// Node.js - identical API
const soup = new Soup(html);
const div = soup.find("div.product");

#![allow(unused)]
fn main() {
// Rust - identical API
let soup = Soup::parse(html);
let div = soup.find("div.product")?;
}

Migration Guides

Detailed migration guides for the following libraries are coming in Phase 20 Week 2:

  • BeautifulSoup (Python) - 10-50x performance improvement
  • Cheerio (Node.js) - 6-20x performance improvement
  • lxml (Python) - Simpler API with comparable HTML performance
  • scraper Crate (Rust) - Better performance with cross-platform bindings

When to migrate:

  • Need Python/Node.js bindings
  • Want streaming support
  • Need better performance for large documents

Compatibility: ~80% API compatible

Migration Strategy

1. Side-by-Side Testing

Run both libraries in parallel during development:

# Python example
from bs4 import BeautifulSoup
from scrape_rs import Soup

# Parse with both
soup_bs4 = BeautifulSoup(html, 'lxml')
soup_scrape = Soup(html)

# Compare results
result_bs4 = soup_bs4.find("div", class_="product").text
result_scrape = soup_scrape.find("div.product").text

assert result_bs4.strip() == result_scrape.strip()
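Raw string equality can fail on cosmetic whitespace differences alone, since the two libraries trim and normalize text differently. A small normalizing comparator (a hypothetical helper, not part of either library) keeps the parity check focused on content:

```python
def same_content(old: str, new: str) -> bool:
    # Collapse every run of whitespace before comparing, so differences
    # in trimming behavior between parsers don't cause false mismatches.
    return " ".join(old.split()) == " ".join(new.split())

# BeautifulSoup may return untrimmed text; scrape-rs returns normalized text
assert same_content("  Widget \n $19.99 ", "Widget $19.99")
```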

2. Gradual Rollout

Start with non-critical code paths:

import os

USE_SCRAPE_RS = os.getenv("USE_SCRAPE_RS", "false") == "true"

if USE_SCRAPE_RS:
    from scrape_rs import Soup as Parser
else:
    from bs4 import BeautifulSoup as Parser

soup = Parser(html)

3. Performance Testing

Benchmark before and after migration:

import time
from scrape_rs import Soup

start = time.perf_counter()
for html in documents:
    soup = Soup(html)
    results = soup.find_all("div.product")
    # Process results...
elapsed = time.perf_counter() - start

print(f"Processed {len(documents)} documents in {elapsed:.2f}s")
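To reduce run-to-run noise, repeat the measurement and take the median; a minimal stdlib-only harness (the `bench` helper is an illustrative name, not part of scrape-rs):

```python
import statistics
import time

def bench(fn, iterations=5):
    # Time fn over several runs and return the median, which is
    # less sensitive to one-off outliers than a single measurement.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

Call it with identical inputs for each library, e.g. `bench(lambda: Soup(html).find_all("div.product"))`, to get a fair before/after comparison.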

Common Patterns

Query Syntax Differences

| Pattern       | BeautifulSoup/lxml   | Cheerio      | scrape-rs        |
|---------------|----------------------|--------------|------------------|
| Find by class | `find(class_="item")`| `$(".item")` | `find(".item")`  |
| Find by id    | `find(id="header")`  | `$("#header")`| `find("#header")`|
| Find by tag   | `find("div")`        | `$("div")`   | `find("div")`    |
| Find all      | `find_all("div")`    | `$("div")`   | `find_all("div")`|

Text Extraction

| Library       | Method                    | Whitespace Handling     |
|---------------|---------------------------|-------------------------|
| BeautifulSoup | `.get_text()` or `.text`  | Manual `strip()` needed |
| Cheerio       | `.text()`                 | Automatic trim          |
| lxml          | `.text_content()`         | Manual `strip()` needed |
| scrape-rs     | `.text`                   | Automatic normalize     |
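"Automatic normalize" means internal runs of whitespace are collapsed as well as trimmed. A rough stand-in for that behavior (an approximation of the normalization, not scrape-rs code):

```python
import re

def normalize_text(text: str) -> str:
    # Collapse every run of whitespace to a single space and trim,
    # approximating what a normalizing .text accessor would return.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("  Widget\n\t$19.99  "))  # prints: Widget $19.99
```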

Attribute Access

LibraryGet AttributeCheck Existence
BeautifulSouptag.get("href") or tag["href"]tag.has_attr("href")
Cheerioelem.attr("href")elem.attr("href") !== undefined
lxmlelem.get("href")"href" in elem.attrib
scrape-rstag.get("href")tag.has_attr("href")

Known Limitations

Not Supported

scrape-rs intentionally does not support:

  1. DOM Modification: The DOM is immutable after parsing
     • No .append(), .insert(), .remove() methods
     • Use HTML rewriting for output modification
  2. XML Parsing: Only HTML5 parsing is supported
     • No XML namespaces
     • No XML declaration handling
  3. Encoding Detection: Input must be UTF-8
     • Use chardet/encoding libraries before parsing
     • Or convert to UTF-8 first
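Since input must already be UTF-8, decode bytes before constructing the parser. A minimal stdlib-only sketch (the `latin-1` fallback is an assumption; use a detection library such as chardet when the source encoding is unknown):

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> str:
    # Try strict UTF-8 first; fall back to a single-byte encoding
    # that can decode any byte sequence without raising.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

html = to_utf8(b"<p>caf\xe9</p>")  # latin-1 bytes -> "<p>café</p>"
```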

Performance Trade-offs

scrape-rs optimizes for:

  • Parse speed
  • Query speed
  • Memory efficiency

At the cost of:

  • No DOM modification (the tree is immutable after parsing)

Getting Help

If you encounter issues during migration:

  1. Read the API documentation
  2. Check the Getting Started guide
  3. Open an issue on GitHub

Detailed migration guides for BeautifulSoup, Cheerio, lxml, and scraper are coming in Phase 20 Week 2.
