Structured Data Crawler
Crawl public web pages and convert unstructured HTML content into clean, deterministic, schema-first structured records.
This Actor is designed for teams that need reliable, reusable data, not raw pages. It extracts declared fields from public web pages and emits normalized records suitable for analytics, indexing, or downstream automation.
What This Actor Does
- Crawls public, accessible HTML pages
- Extracts predefined structured fields
- Normalizes outputs into a stable schema
- Deduplicates records deterministically
- Produces one record per page
This Actor is schema-first, not exploratory.
Core Guarantees
- Same input → same output (deterministic)
- Fields are never omitted (nullable allowed)
- No hidden defaults or inference
- No external services or APIs
- No LLM usage
If a value cannot be extracted, it is explicitly set to null.
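As a sketch of what these guarantees imply, the record below is assembled with every schema field present and missing values explicitly nulled, and the content hash that drives deterministic deduplication is computed from the raw page. This is illustrative TypeScript, not the Actor's actual source; the function name and input shapes are assumptions, though the field names match the output schema documented below.

```typescript
import { createHash } from "node:crypto";

// The full output schema. Every key is always emitted; missing values become null.
interface StructuredRecord {
  title: string | null;
  url: string;
  domain: string;
  publishedDate: string | null;
  author: string | null;
  entityType: "article";
  summaryText: string | null;
  sourceHash: string;
  extractedAt: string;
}

// Hypothetical normalizer: takes whatever was extracted and fills gaps explicitly.
function normalize(
  url: string,
  html: string,
  extracted: Partial<StructuredRecord>
): StructuredRecord {
  return {
    title: extracted.title ?? null,                 // never omitted, only nulled
    url,
    domain: new URL(url).hostname,
    publishedDate: extracted.publishedDate ?? null,
    author: extracted.author ?? null,
    entityType: "article",
    summaryText: extracted.summaryText ?? null,
    // SHA-256 over the raw content: identical input always yields the same
    // hash, which is what makes deduplication deterministic.
    sourceHash: createHash("sha256").update(html).digest("hex"),
    extractedAt: new Date().toISOString(),
  };
}
```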
Input Parameters
Preset (Optional)
- preset — none (default) or articles-v0.1
Built-in presets allow regression testing and validation.
Seed URLs
- seedUrlsText — One URL per line to crawl. Required when preset is none.
Crawl Options
- maxPages — Maximum number of pages to process (default: 10)
- dedupe — Enable deterministic deduplication using content hash (default: true)
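Taken together, a run configuration might look like the JSON below. The values are illustrative; the newline-separated seedUrlsText shape follows from "one URL per line":

```json
{
  "preset": "none",
  "seedUrlsText": "https://example.com/blog/post-1\nhttps://example.com/blog/post-2",
  "maxPages": 10,
  "dedupe": true
}
```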
Link Discovery (Depth = 1)
When link discovery is enabled (it is applied implicitly):
- Links are extracted from seed pages only
- Only same-domain URLs are followed
- Crawl depth is strictly limited to 1
- No recursion or infinite crawling
This increases coverage while preserving determinism.
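The TypeScript sketch below shows the kind of filtering this implies: links are resolved against the seed page, restricted to the seed's domain, and never followed beyond depth 1. Names and structure are illustrative, not the Actor's source.

```typescript
// Illustrative only: collect candidate links from a seed page, keep same-domain
// URLs, and never enqueue links found on discovered pages (depth stays at 1).
function discoverLinks(seedUrl: string, hrefs: string[], maxPages: number): string[] {
  const seedHost = new URL(seedUrl).hostname;
  const discovered = new Set<string>();
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, seedUrl); // resolve relative links against the seed
    } catch {
      continue; // skip malformed hrefs
    }
    if (resolved.hostname !== seedHost) continue; // same-domain only
    discovered.add(resolved.href);
    if (discovered.size >= maxPages) break; // bounded by maxPages, no recursion
  }
  return [...discovered];
}
```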
Output Schema
Each dataset row contains:
- title — Page title (or null)
- url — Page URL
- domain — Source domain
- publishedDate — ISO date or null
- author — Author name or null
- entityType — article
- summaryText — Meta description or null
- sourceHash — SHA-256 content hash
- extractedAt — ISO timestamp
All fields are always present.
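An example dataset row, with invented values, could look like this:

```json
{
  "title": "Example Article Title",
  "url": "https://example.com/blog/post-1",
  "domain": "example.com",
  "publishedDate": "2024-05-01",
  "author": null,
  "entityType": "article",
  "summaryText": "A short meta description pulled from the page.",
  "sourceHash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "extractedAt": "2024-05-02T09:15:00.000Z"
}
```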
Supported Content
- Public articles
- Case study pages
- Documentation pages
- Reports hosted as HTML
What This Actor Does NOT Do
- No JavaScript rendering
- No authenticated or paywalled access
- No LLM enrichment or summarization
- No classification or inference
- No infinite crawling
This Actor prioritizes trust and stability over breadth.
Legal & Compliance
- Only public, accessible pages are processed
- No personal data intentionally stored
- No cookies, sessions, or tracking
- GDPR-friendly by design
Performance
- Typical runtime: under 2 minutes for small crawls
- Memory usage: under 4 GB
- Deterministic and unattended execution
- Suitable for scheduled and automated runs
Versioning
- v0.1 — Single-page structured extraction
- v0.2 — Same-domain link discovery (depth = 1)
Future changes will be versioned explicitly to preserve schema trust.
Final Note
Most crawlers extract pages.
This Actor extracts records.
It exists to create data surfaces reliable enough to build systems on — without needing to re-scrape or reinterpret the source content.
