Structured Data Crawler

Under maintenance · Developer: Lone · Maintained by Community · Pricing: from $0.01 / 1,000 results
Crawl public web pages and convert unstructured HTML content into clean, deterministic, schema-first structured records.

This Actor is designed for teams that need reliable, reusable data, not raw pages. It extracts declared fields from public web pages and emits normalized records suitable for analytics, indexing, or downstream automation.


What This Actor Does

  • Crawls public, accessible HTML pages
  • Extracts predefined structured fields
  • Normalizes outputs into a stable schema
  • Deduplicates records deterministically
  • Produces one record per page

This Actor is schema-first, not exploratory.


Core Guarantees

  • Same input → same output (deterministic)
  • Fields are never omitted (nullable allowed)
  • No hidden defaults or inference
  • No external services or APIs
  • No LLM usage

If a value cannot be extracted, it is explicitly set to null.


Input Parameters

Preset (Optional)

  • preset
    • none (default)
    • articles-v0.1

Built-in presets enable regression testing and validation.


Seed URLs

  • seedUrlsText
    One URL per line to crawl. Required when preset is none.
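
As a rough sketch, the seedUrlsText value (one URL per line) could be normalized into a clean list like this. The field name comes from the actor's input; the helper name and trimming behavior are illustrative assumptions.

```python
def parse_seed_urls(seed_urls_text: str) -> list[str]:
    """Split a multiline seedUrlsText value into URLs,
    dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in seed_urls_text.splitlines() if line.strip()]

parse_seed_urls("https://example.com/a\n\n  https://example.com/b\n")
# → ["https://example.com/a", "https://example.com/b"]
```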

Crawl Options

  • maxPages
    Maximum number of pages to process (default: 10)

  • dedupe
    Enable deterministic deduplication using content hash (default: true)
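
A minimal sketch of how maxPages and content-hash deduplication could interact. The actor documents a SHA-256 content hash; exactly which bytes are hashed is an assumption here (raw HTML), as is the helper name.

```python
import hashlib

def dedupe_pages(pages: list[tuple[str, str]], max_pages: int = 10,
                 dedupe: bool = True) -> list[str]:
    """Return URLs of at most max_pages pages, skipping pages whose
    SHA-256 content hash was already seen (deterministic dedupe)."""
    seen: set[str] = set()
    kept: list[str] = []
    for url, html in pages:
        if len(kept) >= max_pages:
            break
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if dedupe and digest in seen:
            continue  # identical content already emitted
        seen.add(digest)
        kept.append(url)
    return kept

dedupe_pages([("a", "<p>x</p>"), ("b", "<p>x</p>"), ("c", "<p>y</p>")])
# → ["a", "c"]
```

Because the hash depends only on page content, the same input always yields the same kept set, matching the determinism guarantee.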


Link Discovery

When same-domain link discovery is enabled:

  • Links are extracted from seed pages only
  • Only same-domain URLs are followed
  • Crawl depth is strictly limited to 1
  • No recursion or infinite crawling

This increases coverage while preserving determinism.
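
The same-domain, depth-1 rule above can be sketched with the standard library. The function name is hypothetical; the filtering logic mirrors the stated behavior: resolve each link against the seed page and keep only those on the seed's domain.

```python
from urllib.parse import urljoin, urlparse

def same_domain_links(seed_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against the seed page and keep only links on the
    seed's domain. Depth stays at 1: results are crawled but never expanded."""
    seed_domain = urlparse(seed_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(seed_url, href)  # handles relative links
        if urlparse(absolute).netloc == seed_domain:
            kept.append(absolute)
    return kept

same_domain_links("https://example.com/blog",
                  ["/post-1", "https://other.com/x", "post-2"])
# → ["https://example.com/post-1", "https://example.com/post-2"]
```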


Output Schema

Each dataset row contains:

  • title — Page title (or null)
  • url — Page URL
  • domain — Source domain
  • publishedDate — ISO date or null
  • author — Author name or null
  • entityType — Always article
  • summaryText — Meta description or null
  • sourceHash — SHA-256 content hash
  • extractedAt — ISO timestamp

All fields are always present.
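
A sketch of a record that satisfies this schema, with missing values kept as explicit None (null) rather than omitted. The field names match the documented schema; the helper name and the choice to hash the raw HTML are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlparse

def build_record(url: str, html: str, title=None, published_date=None,
                 author=None, summary_text=None) -> dict:
    """Emit one schema-complete record per page.
    Unextractable fields stay as explicit None, never omitted."""
    return {
        "title": title,
        "url": url,
        "domain": urlparse(url).netloc,
        "publishedDate": published_date,
        "author": author,
        "entityType": "article",
        "summaryText": summary_text,
        "sourceHash": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "extractedAt": datetime.now(timezone.utc).isoformat(),
    }
```

Every call produces all nine keys, so downstream consumers can rely on the shape without per-record checks.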


Supported Content

  • Public articles
  • Case study pages
  • Documentation pages
  • Reports hosted as HTML


What This Actor Does NOT Do

  • No JavaScript rendering
  • No authenticated or paywalled access
  • No LLM enrichment or summarization
  • No classification or inference
  • No infinite crawling

This Actor prioritizes trust and stability over breadth.


Privacy & Compliance

  • Only public, accessible pages are processed
  • No personal data is intentionally stored
  • No cookies, sessions, or tracking
  • GDPR-friendly by design

Performance

  • Typical runtime: under 2 minutes for small crawls
  • Memory usage: under 4 GB
  • Deterministic and unattended execution
  • Suitable for scheduled and automated runs

Versioning

  • v0.1 — Single-page structured extraction
  • v0.2 — Same-domain link discovery (depth = 1)

Future changes will be versioned explicitly to preserve schema trust.


Final Note

Most crawlers extract pages.
This Actor extracts records.

It exists to create data surfaces reliable enough to build systems on — without needing to re-scrape or reinterpret the source content.