Parsing HTML
This chapter covers different ways to parse HTML documents with scrape-rs.
Basic Parsing
The simplest way to parse HTML is with Soup::parse():
```rust
use scrape_core::Soup;

let html = "<html><body><h1>Hello</h1></body></html>";
let soup = Soup::parse(html);
```
This uses default configuration and is suitable for most use cases.
Parsing Configuration
Customize parsing behavior with SoupConfig:
```rust
use scrape_core::{Soup, SoupConfig};

let config = SoupConfig::builder()
    .max_depth(256)
    .preserve_whitespace(true)
    .include_comments(true)
    .build();
let soup = Soup::parse_with_config(html, config);
```
Configuration Options
max_depth
Maximum nesting depth for the DOM tree. Default: 512
```rust
let config = SoupConfig::builder()
    .max_depth(128)
    .build();
```
Use cases:
- Prevent stack overflow on malicious HTML
- Limit resource usage
- Enforce document structure constraints
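To make the depth limit concrete, here is a standalone sketch (standard library only, not the scrape-rs parser) that estimates the nesting depth of a snippet; documents deeper than `max_depth` are exactly what the limit guards against. The helper name and the naive tag scan are illustrative assumptions:

```rust
/// Rough nesting-depth estimate for well-formed markup. Illustration
/// only: void elements like <br> are not handled, and this is not a
/// real HTML parser.
fn max_nesting_depth(html: &str) -> usize {
    let (mut depth, mut max) = (0usize, 0usize);
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        rest = &rest[start + 1..];
        let end = match rest.find('>') {
            Some(e) => e,
            None => break,
        };
        let tag = &rest[..end];
        rest = &rest[end + 1..];
        if tag.starts_with('/') {
            // Closing tag: step back up one level.
            depth = depth.saturating_sub(1);
        } else if !tag.ends_with('/') && !tag.starts_with('!') {
            // Opening tag: descend one level.
            depth += 1;
            max = max.max(depth);
        }
    }
    max
}
```

A hostile document built from `"<div>".repeat(100_000)` would blow far past the default limit of 512; rejecting it early keeps recursion depth and memory bounded.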
preserve_whitespace
Whether to keep whitespace-only text nodes. Default: false
```rust
let config = SoupConfig::builder()
    .preserve_whitespace(true)
    .build();
```
When enabled:
```html
<div>
    <span>Text</span>
</div>
```
Preserves the newline and spaces around <span>.
When disabled (default), whitespace-only text nodes are removed.
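The default behaviour boils down to dropping text nodes whose content is entirely whitespace; a minimal std-only sketch of that check (illustrative, not the actual scrape-rs internals):

```rust
/// Keep a text node if whitespace is being preserved, or if the node
/// has any non-whitespace content. Illustration of the default rule.
fn keep_text_node(text: &str, preserve_whitespace: bool) -> bool {
    preserve_whitespace || !text.trim().is_empty()
}
```

With the markup above, the newline-and-indent text nodes around `<span>` fail this check under the default configuration and are dropped.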
include_comments
Whether to include comment nodes in the DOM. Default: false
```rust
let config = SoupConfig::builder()
    .include_comments(true)
    .build();
```
Useful for:
- Processing conditional comments
- Extracting metadata from comments
- Preserving comments in modified HTML
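Extracting metadata from comments, for example, amounts to scanning comment nodes once they are in the DOM. As a std-only illustration of the idea (not the scrape-rs API), here is a naive scan for comment bodies in raw markup:

```rust
/// Naive scan for <!-- ... --> bodies. Illustration only: a real
/// parser also handles edge cases such as bogus comments and
/// comment-like text inside <script>.
fn comment_bodies(html: &str) -> Vec<&str> {
    let mut bodies = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find("<!--") {
        rest = &rest[start + 4..];
        match rest.find("-->") {
            Some(end) => {
                bodies.push(rest[..end].trim());
                rest = &rest[end + 3..];
            }
            None => break, // Unterminated comment
        }
    }
    bodies
}
```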
Fragment Parsing
Parse HTML fragments without wrapping in <html><body>:
```rust
let soup = Soup::parse_fragment("<span>A</span><span>B</span>");
```
Fragment parsing:
- Does not add `<html>` or `<body>` wrappers
- Parses as if the content appeared inside `<body>`
- Useful for processing snippets
Context Element
Specify parsing context for special elements:
```rust
// Parse table rows without a <table> wrapper
let soup = Soup::parse_fragment_with_context("<tr><td>Data</td></tr>", "tbody");
```
Common contexts:
"body"(default): Standard HTML elements"table": Allows<tr>without<tbody>"tbody": Allows<tr>directly"tr": Allows<td>directly"select": Allows<option>directly
Parsing from File
Read and parse from filesystem:
```rust
use std::path::Path;
use scrape_core::Soup;

let soup = Soup::from_file(Path::new("index.html"))?;
```
For large files, consider streaming instead:
```rust
use scrape_core::{StreamingSoup, StreamingConfig};

let mut streaming = StreamingSoup::new();
// Register handlers...
streaming.parse_file("large.html")?;
```
Parser Modes
DOM Parser (Default)
Builds complete document tree in memory:
```rust
let soup = Soup::parse(html);
```
Characteristics:
- Memory usage: O(n) where n = document size
- Allows random access
- Supports tree navigation (parent, siblings)
- Can query multiple times
- Best for documents < 10MB
Streaming Parser
Processes HTML incrementally with callbacks:
```rust
use scrape_core::StreamingSoup;

let mut streaming = StreamingSoup::new();
streaming.on_element("a[href]", |el| {
    if let Some(href) = el.get_attribute("href") {
        println!("Link: {}", href);
    }
    Ok(())
})?;
streaming.write(html.as_bytes())?;
streaming.end()?;
```
Characteristics:
- Memory usage: O(1) constant
- Sequential processing only
- No tree navigation
- One-pass extraction
- Best for documents > 100MB
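The O(1) memory behaviour comes from pushing fixed-size chunks into the parser rather than loading the whole document. A std-only sketch of such a read loop (the `sink` parameter stands in for a `write`-style consumer like `StreamingSoup::write`, an illustrative assumption):

```rust
use std::io::Read;

/// Feed a reader to a sink in fixed-size chunks, so peak memory is
/// bounded by the buffer size regardless of document size.
fn feed_in_chunks<R: Read>(
    mut reader: R,
    mut sink: impl FnMut(&[u8]),
) -> std::io::Result<usize> {
    let mut buf = [0u8; 8192];
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        sink(&buf[..n]);
        total += n;
    }
    Ok(total)
}
```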
Streaming parsing will be covered in Phase 20 Week 2.
Encoding
scrape-rs expects UTF-8 input. If your HTML uses a different encoding, convert first:
```rust
use encoding_rs::WINDOWS_1252;

let (decoded, _, _) = WINDOWS_1252.decode(bytes);
let soup = Soup::parse(&decoded);
```
For automatic encoding detection:
```rust
use chardet::detect;

let (encoding_name, _confidence, _language) = detect(bytes);
// Use encoding_rs to decode...
```
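Full charset detection is often unnecessary: a BOM sniff plus a UTF-8 validity check (standard library only) settles the common case before any detection crate is involved. A hedged sketch:

```rust
/// Cheap pre-check before full charset detection: accept input that
/// carries a UTF-8 BOM or already validates as UTF-8.
fn looks_like_utf8(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xEF, 0xBB, 0xBF]) || std::str::from_utf8(bytes).is_ok()
}
```

Only when this check fails is it worth running detection and transcoding.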
Malformed HTML
scrape-rs handles malformed HTML gracefully:
Unclosed Tags
```html
<div>
    <span>Content
</div>
```
The parser automatically closes the unclosed `<span>` before closing `<div>`.
Misnested Tags
```html
<b><i>Text</b></i>
```
The parser restructures this into valid nesting (via the HTML5 adoption agency algorithm):

```html
<b><i>Text</i></b>
```
Invalid Attributes
```html
<div class"value">
```
The parser recovers from malformed attributes and continues parsing.
Strict Mode
Enable strict mode to fail on malformed HTML:
```rust
let config = SoupConfig::builder()
    .strict_mode(true)
    .build();

match Soup::parse_with_config(bad_html, config) {
    Ok(soup) => { /* ... */ }
    Err(e) => eprintln!("Parse error: {}", e),
}
```
Parse Warnings
Access parse warnings from Phase 19:
```rust
use scrape_core::parser::{Html5everParser, Parser};

let parser = Html5everParser;
let result = parser.parse_with_warnings(html)?;

for warning in result.warnings() {
    println!("Warning: {} at line {}", warning.message(), warning.line());
}

let document = result.into_document();
```
Warnings include:
- Unexpected end tag
- Misnested tags
- Invalid attributes
- Encoding issues
Performance Considerations
Pre-allocation
When the document size is known in advance, pre-allocate the arena:
```rust
use scrape_core::parser::{Html5everParser, Parser, ParseConfig};

let parser = Html5everParser;
let config = ParseConfig::default();

// Rough estimate: one node per ~50 bytes of HTML
let estimated_nodes = html.len() / 50;
let document = parser.parse_with_config_and_capacity(html, &config, estimated_nodes)?;
```
Benefits:
- Reduces allocation overhead
- Improves parse speed by ~10-15%
- Useful when parsing many similar documents
Streaming for Large Documents
For documents over 100MB, use streaming:
```rust
let mut streaming = StreamingSoup::new();
// Process in constant memory
```
Next Steps
- Learn about Querying elements