🤖 Semantic Web Crawler & Schema-Enhanced Extractor

The Semantic Web Crawler is the ultimate tool for transforming arbitrary websites into structured, analytics-ready datasets, without requiring custom code per site. It performs a depth-controlled crawl, intelligently renders pages, and extracts meaningful semantic signals and structured data (schema.org) from every page.

It's designed to give your team consistent, rich, and measurable data about a website's structure and content quality for use in SEO, data science, and research.


✨ Why Use This Actor? (The Value Proposition)

This Actor moves beyond basic text scraping to provide context and structure, which is vital for modern data workflows.

  • SEO & Content Strategy: Map information architecture, internal links, and content depth. Identify thin/filler content and improve topical coverage.
  • Data & Analytics: Build site-wide corpora with consistent features for dashboards, trend analysis, and ML/NLP tasks. Benchmark and compare competing sites.
  • Product & Research: Power search and recommendations with clean text and semantic cues. Validate the presence and quality of schema.org structured data.

🚀 Main Features & Data Extraction

The core value lies in the rich, standardized JSON record output for every successfully crawled page.

🧭 Controlled & Flexible Crawling

The Actor is built for resilience and control:

  • Depth-Limited Crawling: Starts from one or more startUrls and follows internal links up to a configurable maxDepth (see the sketch after this list).
  • Resilience & Performance: Includes granular controls for maxConcurrency, maxRetries, and timeouts.
  • Proxy Support: Full integration with Apify Proxy (via groups like RESIDENTIAL) and custom proxy URLs.
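Conceptually, depth-limited crawling is a breadth-first traversal that stops expanding links beyond maxDepth and queues at most maxLinksPerPage links from each page. The Python sketch below only illustrates that idea; it is not the Actor's implementation, and fetch_internal_links is a hypothetical helper that returns the normalized same-domain links found on a page.

from collections import deque

def crawl(start_urls, max_depth, max_links_per_page, fetch_internal_links):
    # fetch_internal_links(url) -> list of same-domain URLs (hypothetical helper).
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # the Actor emits one JSON record per page at this point
        if depth >= max_depth:
            continue  # do not follow links past the configured depth
        for link in fetch_internal_links(url)[:max_links_per_page]:
            if link not in seen:  # deduplicate so each page is queued only once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited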

📊 Rich Extraction Per Page

The final output JSON record contains the following comprehensive data:

  • Structured Data: schema_json_ld, the automatically collected, merged, and cleaned schema.org JSON-LD data.
  • Content: markdown_content (clean, structured output in Markdown format), clean_text (noise-reduced text content), and a unique content_hash for change detection.
  • Semantic Structure: Detailed link graph, headings hierarchy, tables, and lists.
  • Content Blocks: Categorization of content into per-block types (e.g., heading, paragraph, list, table, quote) with word counts and link/image presence.
  • Metrics: text_metrics (word, sentence, and paragraph counts, averages, and normalized top keywords) and technical_metrics (HTTP status code and HTML size).

⚙️ How to Use (Quick Start)

The Actor requires minimal setup. The only mandatory field is startUrls, the list of pages where the crawl begins.

1. Set Start URLs

Specify the entry point(s) for the crawl.
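
For example, a minimal input that starts the crawl from a single site:

{
  "startUrls": [
    "https://www.apify.com/"
  ]
}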

2. Configure Crawl Depth

Set maxDepth to define how many link hops deep the crawler should go (e.g., 1 to crawl only pages linked directly from the start URLs, 5 for a deep site analysis).

3. Run with Example Input

Use the following JSON structure to crawl apify.com up to depth 5 using Residential proxies:

{
  "maxConcurrency": 5,
  "maxDepth": 5,
  "maxLinksPerPage": 5,
  "maxRetries": 3,
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  },
  "startUrls": [
    "https://www.apify.com/"
  ]
}
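
You can also launch runs programmatically with the same input. The sketch below uses the apify-client Python package; the Actor ID string is a placeholder for illustration, so copy the real ID from this Actor's page in the Apify Store.

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "maxConcurrency": 5,
    "maxDepth": 5,
    "maxLinksPerPage": 5,
    "maxRetries": 3,
    "proxy": {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]},
    "startUrls": ["https://www.apify.com/"],
}

# "datafusionx/semantic-web-crawler" is a placeholder Actor ID, not the real one.
run = client.actor("datafusionx/semantic-web-crawler").call(run_input=run_input)

# Each dataset item is one JSON record per crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["technical_metrics"]["status_code"])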

📋 Input Schema (Full Parameters)

Field            Type           Default  Description
startUrls        Array of URLs  N/A      REQUIRED. The starting point(s) for the crawl.
maxDepth         Integer        1        Maximum link depth to crawl from the start URLs.
maxLinksPerPage  Integer        50       Maximum number of internal links to queue from any single page.
maxConcurrency   Integer        5        Maximum number of pages to process simultaneously. Lower this for smaller sites.
maxRetries       Integer        2        Maximum number of times to retry a failed request.
proxy            Object         N/A      Proxy settings (useApifyProxy, groups, countryCode).

📁 Data Output Structure (Sample Record)

The Actor saves one JSON record per page to the Dataset. This output is standardized for seamless downstream integration.

[
  {
    "url": "https://www.apify.com/actor-name",
    "technical_metrics": {
      "status_code": 200,
      "html_size_kb": 120
    },
    "clean_text": "This is the noise-reduced text content of the page...",
    "content_hash": "a54f0d67b...",
    "markdown_content": "# Actor Name\n\nThis is the markdown version of the content...",
    "schema_json_ld": {
      "@context": "https://schema.org",
      "@type": "WebPage",
      "name": "Actor Page Title",
      "topLevel": "WebPage"
    },
    "semantic_structure": {
      "headings": { "h1": 1, "h2": 3, "h3": 5 },
      "links": { "internal_count": 45, "external_count": 12 }
    },
    "text_metrics": {
      "word_count": 520,
      "top_keywords": ["actor", "data", "web", "crawl"]
    },
    "content_blocks": [
      { "type": "heading", "word_count": 3, "label": "Main Feature" },
      { "type": "paragraph", "word_count": 45 }
    ]
  }
]
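
Because every record shares this shape, downstream filtering needs no per-site code. As a small sketch (assuming the dataset has been exported to a local items.json file, and treating the 150-word threshold as an arbitrary example value), you could flag thin pages and pages missing schema.org data like this:

import json

# Load the exported dataset (assumed filename for this example).
with open("items.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    words = rec.get("text_metrics", {}).get("word_count", 0)
    has_schema = bool(rec.get("schema_json_ld"))
    if words < 150 or not has_schema:  # 150 is an arbitrary example threshold
        print(f"{rec['url']}: {words} words, schema present: {has_schema}")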

🛠️ Technical Notes

  • Compliance: The crawler respects the target site's robots.txt rules, including Crawl-delay directives.
  • Efficiency: URLs are normalized and redirect targets are recorded, so the same page is not fetched or processed more than once.
  • Content Hash: The content_hash field lets you implement change tracking and skip unchanged documents in subsequent runs (see the sketch below).
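
A minimal change-tracking sketch using content_hash, assuming you persist the url-to-hash map from the previous run in a local hashes.json file (a hypothetical convention, not something the Actor manages for you):

import json
from pathlib import Path

HASH_FILE = Path("hashes.json")  # assumed local store from the previous run
previous = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

def changed(records):
    # Keep only pages whose content hash differs from the last run.
    return [r for r in records if previous.get(r["url"]) != r["content_hash"]]

def save_hashes(records):
    # Persist the new hashes so the next run can diff against them.
    HASH_FILE.write_text(json.dumps({r["url"]: r["content_hash"] for r in records}))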

💬 Support and Contact

We are actively developing and improving this Semantic Web Crawler.

General Support & Feedback

If you encounter a bug, have a feature suggestion, or need help integrating the data:

  • Please open an Issue Ticket directly on the Apify platform.

Custom Solutions & Enterprise Use

For large-scale projects, custom integrations, or requirements that call for bespoke development, consulting, or guaranteed support (SLA) for this Actor, please contact our team directly at contact@datafusionnow.com for a custom quote.