Migration Overview

scrape-rs provides a consistent API across platforms while being significantly faster than existing HTML parsing libraries. This guide helps you migrate from other popular libraries.

Why Migrate?

Performance

scrape-rs is 10-50x faster than most HTML parsing libraries:

| Library       | Language | Parse 100KB | Query   | Extract Links |
|---------------|----------|-------------|---------|---------------|
| BeautifulSoup | Python   | 18ms        | 0.80ms  | 3.2ms         |
| lxml          | Python   | 8ms         | 0.15ms  | 1.1ms         |
| Cheerio       | Node.js  | 12ms        | 0.12ms  | 0.85ms        |
| scraper       | Rust     | 3.2ms       | 0.015ms | 0.22ms        |
| scrape-rs     | All      | 1.8ms       | 0.006ms | 0.18ms        |

Memory Efficiency

scrape-rs uses arena allocation with compact node representation:

| Library       | Memory per Node | 100KB Document |
|---------------|-----------------|----------------|
| BeautifulSoup | ~300 bytes      | ~15 MB         |
| lxml          | ~150 bytes      | ~7.5 MB        |
| Cheerio       | ~200 bytes      | ~10 MB         |
| scrape-rs     | ~50 bytes       | ~2.5 MB        |
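The idea behind the arena can be sketched in a few lines of Python (illustrative only, not the actual scrape-rs internals): all nodes live in one flat structure and refer to each other by integer index, so each node is a small fixed-size record instead of a separately allocated object.

```python
from dataclasses import dataclass, field

@dataclass
class Arena:
    """Flat node storage: every node is an index into parallel lists."""
    tags: list = field(default_factory=list)      # tag name per node
    parents: list = field(default_factory=list)   # parent index (-1 = root)
    children: list = field(default_factory=list)  # child-index lists

    def add(self, tag: str, parent: int = -1) -> int:
        idx = len(self.tags)
        self.tags.append(tag)
        self.parents.append(parent)
        self.children.append([])
        if parent >= 0:
            self.children[parent].append(idx)
        return idx

arena = Arena()
html_node = arena.add("html")
body = arena.add("body", parent=html_node)
div = arena.add("div", parent=body)

print(arena.tags[arena.parents[div]])  # "body"
```

Because references are plain integers rather than pointers to heap objects, the whole tree can be freed at once and nodes stay cache-friendly, which is where the per-node savings in the table come from.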

Cross-Platform Consistency

Same API across Rust, Python, Node.js, and WASM:

```python
# Python
soup = Soup(html)
div = soup.find("div.product")
```

```javascript
// Node.js - identical API
const soup = new Soup(html);
const div = soup.find("div.product");
```

```rust
// Rust - identical API
let soup = Soup::parse(html);
let div = soup.find("div.product")?;
```

Migration Guides

Detailed migration guides for the following libraries are coming in Phase 20 Week 2:

  • BeautifulSoup (Python) - 10-50x performance improvement
  • Cheerio (Node.js) - 6-20x performance improvement
  • lxml (Python) - Simpler API with comparable HTML performance
  • scraper Crate (Rust) - Better performance with cross-platform bindings

When to migrate:

  • Need Python/Node.js bindings
  • Want streaming support
  • Need better performance for large documents

Compatibility: ~80% of the existing API maps directly to scrape-rs

Migration Strategy

1. Side-by-Side Testing

Run both libraries in parallel during development:

```python
# Python example
from bs4 import BeautifulSoup
from scrape_rs import Soup

# Parse with both
soup_bs4 = BeautifulSoup(html, 'lxml')
soup_scrape = Soup(html)

# Compare results
result_bs4 = soup_bs4.find("div", class_="product").text
result_scrape = soup_scrape.find("div.product").text

assert result_bs4.strip() == result_scrape.strip()
```
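To scale the comparison beyond one document, a small harness can run both extractors over a corpus and collect mismatches. The `old_extract`/`new_extract` callables and the toy string-based extractors below are placeholders for your real bs4- and scrape-rs-based code:

```python
def compare_extractors(documents, old_extract, new_extract):
    """Run both extractors over a corpus and return any mismatches."""
    mismatches = []
    for i, html in enumerate(documents):
        old_result = old_extract(html)
        new_result = new_extract(html)
        if old_result != new_result:
            mismatches.append((i, old_result, new_result))
    return mismatches

# Stand-in extractors for illustration; in practice these would wrap
# BeautifulSoup and scrape_rs respectively.
docs = ["<div>widget</div>", "<div> widget </div>"]
old = lambda h: h.replace("<div>", "").replace("</div>", "").strip()
new = lambda h: " ".join(h.replace("<div>", " ").replace("</div>", " ").split())

print(compare_extractors(docs, old, new))  # [] -> both agree on this corpus
```

Logging the document index alongside both results makes it easy to pull the offending HTML back out of the corpus when the two libraries disagree.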

2. Gradual Rollout

Start with non-critical code paths:

```python
import os

USE_SCRAPE_RS = os.getenv("USE_SCRAPE_RS", "false") == "true"

if USE_SCRAPE_RS:
    from scrape_rs import Soup as Parser
else:
    from bs4 import BeautifulSoup as Parser

soup = Parser(html)
```
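The environment flag above is all-or-nothing. A common complement (a generic rollout pattern, not a scrape-rs feature) is a deterministic hash-based ramp, which routes a fixed percentage of keys (URLs, tenant IDs, etc.) to the new parser and always routes the same key the same way:

```python
import hashlib

def use_new_parser(key: str, rollout_percent: int) -> bool:
    """Deterministically send `rollout_percent`% of keys to the new parser."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent

# Stable per key: the same URL always gets the same parser.
print(use_new_parser("https://example.com/page/1", 0))    # False at 0%
print(use_new_parser("https://example.com/page/1", 100))  # True at 100%
```

Start at a low percentage, compare error rates and output diffs between the two cohorts, and ramp up as confidence grows.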

3. Performance Testing

Benchmark before and after migration:

```python
import time
from scrape_rs import Soup

start = time.perf_counter()  # monotonic clock; preferable to time.time() for timing
for html in documents:
    soup = Soup(html)
    results = soup.find_all("div.product")
    # Process results...
end = time.perf_counter()

print(f"Processed {len(documents)} documents in {end - start:.2f}s")
```
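A single timed pass can be noisy. Repeating the run and reporting the median gives more stable numbers; this is a generic benchmarking pattern, with the lambda below standing in for your real parse-and-query loop:

```python
import statistics
import time

def bench(fn, repeats: int = 5) -> float:
    """Return the median wall time in seconds over `repeats` runs of fn()."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Stand-in workload; replace with your real parse + query loop.
docs = ["<div class='product'>x</div>"] * 1000
median_s = bench(lambda: [d.count("product") for d in docs])
print(f"median: {median_s * 1000:.3f} ms")
```

The median is less sensitive than the mean to one-off spikes from GC pauses or OS scheduling, which matters for sub-millisecond measurements like those in the tables above.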

Common Patterns

Query Syntax Differences

| Pattern       | BeautifulSoup/lxml    | Cheerio      | scrape-rs         |
|---------------|-----------------------|--------------|-------------------|
| Find by class | `find(class_="item")` | `$(".item")` | `find(".item")`   |
| Find by id    | `find(id="header")`   | `$("#header")` | `find("#header")` |
| Find by tag   | `find("div")`         | `$("div")`   | `find("div")`     |
| Find all      | `find_all("div")`     | `$("div")`   | `find_all("div")` |
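When porting many call sites, a small helper can mechanically translate BeautifulSoup-style keyword arguments into CSS selectors. This is a migration-aid sketch, not part of any library, and it covers only the simple cases in the table:

```python
def to_css(tag=None, class_=None, id=None, **attrs):
    """Translate BeautifulSoup-style find() arguments into a CSS selector."""
    selector = tag or ""
    if id:
        selector += f"#{id}"
    if class_:
        selector += f".{class_}"
    for name, value in attrs.items():
        selector += f'[{name}="{value}"]'
    return selector

print(to_css("div", class_="product"))  # div.product
print(to_css(id="header"))              # #header
print(to_css("a", href="/docs"))        # a[href="/docs"]
```

Anything more exotic (regex matchers, callable filters, `string=` arguments) has no direct selector equivalent and needs a hand-written rewrite.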

Text Extraction

| Library       | Method                   | Whitespace Handling     |
|---------------|--------------------------|-------------------------|
| BeautifulSoup | `.get_text()` or `.text` | Manual `strip()` needed |
| Cheerio       | `.text()`                | Automatic trim          |
| lxml          | `.text_content()`        | Manual `strip()` needed |
| scrape-rs     | `.text`                  | Automatic normalization |
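If your existing code relied on raw text plus manual `strip()`, you can approximate scrape-rs-style normalization (collapse internal whitespace runs, trim the ends) with stdlib Python when comparing outputs. The exact rules scrape-rs applies may differ, so verify against real output:

```python
def normalize_text(raw: str) -> str:
    """Collapse whitespace runs to single spaces and trim the ends."""
    return " ".join(raw.split())

raw = "  Product\n   Name\t "
print(normalize_text(raw))  # "Product Name"
```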

Attribute Access

| Library       | Get Attribute                    | Check Existence                   |
|---------------|----------------------------------|-----------------------------------|
| BeautifulSoup | `tag.get("href")` or `tag["href"]` | `tag.has_attr("href")`          |
| Cheerio       | `elem.attr("href")`              | `elem.attr("href") !== undefined` |
| lxml          | `elem.get("href")`               | `"href" in elem.attrib`           |
| scrape-rs     | `tag.get("href")`                | `tag.has_attr("href")`            |

Known Limitations

Not Supported

scrape-rs intentionally does not support:

  1. DOM Modification: The DOM is immutable after parsing

    • No .append(), .insert(), .remove() methods
    • Use HTML rewriting for output modification
  2. XML Parsing: Only HTML5 parsing is supported

    • No XML namespaces
    • No XML declaration handling
  3. Encoding Detection: Input must be UTF-8

    • Use chardet/encoding libraries before parsing
    • Or convert to UTF-8 first
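For the UTF-8 requirement, decode bytes yourself before handing text to the parser. Libraries like chardet can guess unknown encodings; the stdlib-only sketch below (the helper name is illustrative) covers the common case of a declared charset with a UTF-8 fallback:

```python
def to_utf8_text(data: bytes, declared_encoding=None) -> str:
    """Decode raw bytes to str, trying the declared charset, then UTF-8."""
    for encoding in filter(None, [declared_encoding, "utf-8"]):
        try:
            return data.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue
    # Last resort: keep going, replacing undecodable bytes.
    return data.decode("utf-8", errors="replace")

latin1_bytes = "café".encode("latin-1")
print(to_utf8_text(latin1_bytes, "latin-1"))  # café
```

The `declared_encoding` would typically come from the HTTP `Content-Type` header or a `<meta charset>` sniff; if neither is available, a detection library is the safer choice.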

Performance Trade-offs

scrape-rs optimizes for:

  • Parse speed
  • Query speed
  • Memory efficiency

At the cost of:

  • No modification support
  • Immutable DOM only

Getting Help

If you encounter issues during migration:

  1. Read the API documentation
  2. Check the Getting Started guide
  3. Open an issue on GitHub

Detailed migration guides for BeautifulSoup, Cheerio, lxml, and scraper are coming in Phase 20 Week 2.