Migration Overview
scrape-rs provides a consistent API across platforms while being significantly faster than existing HTML parsing libraries. This guide helps you migrate from other popular libraries.
Why Migrate?
Performance
In the benchmarks below, scrape-rs parses a 100KB document about 2-10x faster than the other libraries and runs queries about 2.5-130x faster:
| Library | Language | Parse 100KB | Query | Extract Links |
|---|---|---|---|---|
| BeautifulSoup | Python | 18ms | 0.80ms | 3.2ms |
| lxml | Python | 8ms | 0.15ms | 1.1ms |
| Cheerio | Node.js | 12ms | 0.12ms | 0.85ms |
| scraper | Rust | 3.2ms | 0.015ms | 0.22ms |
| scrape-rs | All | 1.8ms | 0.006ms | 0.18ms |
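The speedup ratios can be read straight off the table. The following is a minimal sketch that divides each library's timings by the scrape-rs row. The numbers are copied from the table; the names `TIMINGS_MS` and `speedups` are illustrative and not part of any library.

```python
# Timings in milliseconds, copied from the table above:
# (parse 100KB, query, extract links)
TIMINGS_MS = {
    "BeautifulSoup": (18.0, 0.80, 3.2),
    "lxml": (8.0, 0.15, 1.1),
    "Cheerio": (12.0, 0.12, 0.85),
    "scraper": (3.2, 0.015, 0.22),
    "scrape-rs": (1.8, 0.006, 0.18),
}


def speedups(baseline: str = "scrape-rs") -> dict:
    """Return how many times slower each library is than the baseline, per column."""
    base = TIMINGS_MS[baseline]
    return {
        name: tuple(round(t / b, 1) for t, b in zip(times, base))
        for name, times in TIMINGS_MS.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (parse, query, links) in speedups().items():
        print(f"{name:14s} parse {parse}x  query {query}x  links {links}x")
```

The gap is widest against BeautifulSoup and narrowest against the scraper crate, so the expected gain depends heavily on which library you are migrating from.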
Memory Efficiency
scrape-rs uses arena allocation with compact node representation:
| Library | Memory per Node | 100KB Document |
|---|---|---|
| BeautifulSoup | ~300 bytes | ~15 MB |
| lxml | ~150 bytes | ~7.5 MB |
| Cheerio | ~200 bytes | ~10 MB |
| scrape-rs | ~50 bytes | ~2.5 MB |
Cross-Platform Consistency
Same API across Rust, Python, Node.js, and WASM:
# Python
soup = Soup(html)
div = soup.find("div.product")
// Node.js - identical API
const soup = new Soup(html);
const div = soup.find("div.product");
// Rust - identical API
let soup = Soup::parse(html);
let div = soup.find("div.product")?;
Migration Guides
Detailed migration guides for the following libraries are coming in Phase 20 Week 2:
- BeautifulSoup (Python) - 10-50x performance improvement
- Cheerio (Node.js) - 6-20x performance improvement
- lxml (Python) - Simpler API with comparable HTML performance
- scraper Crate (Rust) - Better performance with cross-platform bindings
When to migrate:
- Need Python/Node.js bindings
- Want streaming support
- Need better performance for large documents
Compatibility: ~80% API compatible
Migration Strategy
1. Side-by-Side Testing
Run both libraries in parallel during development:
# Python example
from bs4 import BeautifulSoup
from scrape_rs import Soup
# Parse with both
soup_bs4 = BeautifulSoup(html, 'lxml')
soup_scrape = Soup(html)
# Compare results
result_bs4 = soup_bs4.find("div", class_="product").text
result_scrape = soup_scrape.find("div.product").text
assert result_bs4.strip() == result_scrape.strip()
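The check above covers one selector on one document. A minimal sketch of scaling it to a corpus follows. The helper names `normalize_ws` and `compare_extractors` are hypothetical, and the two extractors are passed in as plain callables so the snippet itself does not depend on either library.

```python
from typing import Callable, Iterable


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace so formatting differences do not count as mismatches."""
    return " ".join(text.split())


def compare_extractors(
    documents: Iterable[str],
    old: Callable[[str], str],
    new: Callable[[str], str],
) -> list:
    """Run both extractors on every document and return the mismatches.

    Each mismatch is (document index, old result, new result), compared
    after whitespace normalization.
    """
    mismatches = []
    for index, html in enumerate(documents):
        expected = normalize_ws(old(html))
        actual = normalize_ws(new(html))
        if expected != actual:
            mismatches.append((index, expected, actual))
    return mismatches
```

In a real migration, `old` might wrap the BeautifulSoup call and `new` the scrape-rs call from the example above. An empty result means the two libraries agree on every document in the sample.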
2. Gradual Rollout
Start with non-critical code paths:
import os

# Opt in per environment; any value other than "true" keeps BeautifulSoup
USE_SCRAPE_RS = os.getenv("USE_SCRAPE_RS", "false") == "true"
if USE_SCRAPE_RS:
    from scrape_rs import Soup as Parser
else:
    from bs4 import BeautifulSoup as Parser
soup = Parser(html)
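Swapping the parser class is only half of the rollout, because the call sites differ between the two libraries (`find("div", class_="product")` versus `find("div.product")`). One option is to hide both behind a single function and let the flag choose the backend. The sketch below is an assumption-laden illustration: the function names are hypothetical, the flag parsing is slightly more lenient than the exact string comparison above, and it assumes scrape-rs `find` returns `None` when nothing matches.

```python
import os
from typing import Callable, Mapping, Optional


def use_scrape_rs(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the USE_SCRAPE_RS flag; anything other than "true" keeps the old parser."""
    env = os.environ if env is None else env
    return env.get("USE_SCRAPE_RS", "false").strip().lower() == "true"


def _first_text_bs4(html: str, selector: str) -> str:
    # Imported lazily so only the active backend has to be installed.
    from bs4 import BeautifulSoup

    tag = BeautifulSoup(html, "lxml").select_one(selector)
    return tag.get_text() if tag is not None else ""


def _first_text_scrape_rs(html: str, selector: str) -> str:
    # API as shown in this guide: Soup(html).find(selector).text
    from scrape_rs import Soup

    tag = Soup(html).find(selector)
    # Assumption: find() returns None when nothing matches.
    return tag.text if tag is not None else ""


def get_first_text(env: Optional[Mapping[str, str]] = None) -> Callable[[str, str], str]:
    """Return the text-extraction function for the backend selected by the flag."""
    return _first_text_scrape_rs if use_scrape_rs(env) else _first_text_bs4
```

Call sites then use `get_first_text()(html, "div.product")` and never need to know which library is active, which keeps the rollback path to a single environment variable.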
3. Performance Testing
Benchmark before and after migration:
import time
from scrape_rs import Soup

# perf_counter() is a monotonic, high-resolution clock suited to benchmarking
start = time.perf_counter()
for html in documents:
    soup = Soup(html)
    results = soup.find_all("div.product")
    # Process results...
end = time.perf_counter()
print(f"Processed {len(documents)} documents in {end - start:.2f}s")
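A single timed pass is noisy, so it can help to repeat the run and compare the same workload across both libraries. The following is a minimal, library-agnostic sketch; `benchmark` and its parameters are hypothetical names, and the workload is passed in as a callable.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark(
    parse_and_query: Callable[[str], object],
    documents: Sequence[str],
    repeats: int = 5,
) -> dict:
    """Time `parse_and_query` over all documents, `repeats` times.

    Returns the best and median wall-clock seconds for one full pass.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for html in documents:
            parse_and_query(html)
        timings.append(time.perf_counter() - start)
    return {"best": min(timings), "median": statistics.median(timings)}
```

Running it once with a BeautifulSoup workload and once with a scrape-rs workload on the same `documents` gives directly comparable numbers. The median is usually the safer figure to report, since the best run can flatter either side.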
Common Patterns
Query Syntax Differences
| Pattern | BeautifulSoup | Cheerio | scrape-rs |
|---|---|---|---|
| Find by class | find(class_="item") | $(".item") | find(".item") |
| Find by id | find(id="header") | $("#header") | find("#header") |
| Find by tag | find("div") | $("div") | find("div") |
| Find all | find_all("div") | $("div") | find_all("div") |
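Because scrape-rs takes CSS selectors where BeautifulSoup takes keyword arguments, much of a migration is mechanical translation. A minimal sketch of that mapping follows. `to_selector` is a hypothetical helper, and treating `class_="a b"` as "has both classes" is a simplification: BeautifulSoup itself matches a multi-class string against the exact class attribute value.

```python
from typing import Optional


def to_selector(
    tag: Optional[str] = None,
    class_: Optional[str] = None,
    id: Optional[str] = None,  # mirrors BeautifulSoup's keyword name
) -> str:
    """Build a CSS selector from BeautifulSoup-style find() arguments."""
    selector = tag or ""
    if id:
        selector += f"#{id}"
    if class_:
        # Simplification: each whitespace-separated name becomes its own class selector.
        selector += "".join(f".{name}" for name in class_.split())
    return selector or "*"


print(to_selector("div", class_="product"))  # div.product
print(to_selector(id="header"))              # #header
```

A helper like this is mainly useful for auditing existing call sites; once the selectors are known, writing them inline is clearer.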
Text Extraction
| Library | Method | Whitespace Handling |
|---|---|---|
| BeautifulSoup | .get_text() or .text | Manual strip() needed |
| Cheerio | .text() | Manual .trim() needed |
| lxml | .text_content() | Manual strip() needed |
| scrape-rs | .text | Automatic normalize |
Attribute Access
| Library | Get Attribute | Check Existence |
|---|---|---|
| BeautifulSoup | tag.get("href") or tag["href"] | tag.has_attr("href") |
| Cheerio | elem.attr("href") | elem.attr("href") !== undefined |
| lxml | elem.get("href") | "href" in elem.attrib |
| scrape-rs | tag.get("href") | tag.has_attr("href") |
Known Limitations
Not Supported
scrape-rs intentionally does not support:
- DOM Modification: The DOM is immutable after parsing
  - No .append(), .insert(), .remove() methods
  - Use HTML rewriting for output modification
- XML Parsing: Only HTML5 parsing is supported
  - No XML namespaces
  - No XML declaration handling
- Encoding Detection: Input must be UTF-8
  - Use chardet/encoding libraries before parsing
  - Or convert to UTF-8 first
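For the UTF-8 requirement, the conversion step can be as small as the sketch below. It uses only the standard library. `decode_to_text` is a hypothetical helper, the `cp1252` fallback is an arbitrary default chosen for illustration, and a detection library such as chardet could supply a better guess for the fallback encoding.

```python
def decode_to_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes for a UTF-8-only parser.

    Tries UTF-8 first (dropping a byte-order mark if present), then a
    caller-chosen fallback encoding, replacing undecodable bytes rather
    than raising.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")
```

The resulting `str` can then be passed to the parser. If the source declares its encoding (for example in an HTTP `Content-Type` header), passing that as `fallback` is likely to be more reliable than a fixed default.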
Performance Trade-offs
scrape-rs optimizes for:
- Parse speed
- Query speed
- Memory efficiency
At the cost of:
- No DOM modification: the parsed tree is immutable
Getting Help
If you encounter issues during migration:
- Read the API documentation
- Check the Getting Started guide
- Open an issue on GitHub