# 🤖 Semantic Web Crawler & Schema-Enhanced Extractor
The Semantic Web Crawler is the ultimate tool for transforming arbitrary websites into structured, analytics-ready datasets, without requiring custom code per site. It performs a depth-controlled crawl, intelligently renders pages, and extracts meaningful semantic signals and structured data (schema.org) from every page.
It's designed to give your team consistent, rich, and measurable data about a website's structure and content quality for use in SEO, data science, and research.
## ✨ Why Use This Actor? (The Value Proposition)
This Actor moves beyond basic text scraping to provide context and structure, which is vital for modern data workflows.
- SEO & Content Strategy: Map information architecture, internal links, and content depth. Identify thin/filler content and improve topical coverage.
- Data & Analytics: Build site-wide corpora with consistent features for dashboards, trend analysis, and ML/NLP tasks. Benchmark and compare competing sites.
- Product & Research: Power search and recommendations with clean text and semantic cues. Validate the presence and quality of schema.org structured data.
## 🚀 Main Features & Data Extraction
The core value lies in the rich, standardized JSON record output for every successfully crawled page.
### 🧭 Controlled & Flexible Crawling
The Actor is built for resilience and control:
- Depth-Limited Crawling: Starts from one or more `startUrls` and follows internal links up to a configurable `maxDepth`.
- Resilience & Performance: Includes granular controls for `maxConcurrency`, `maxRetries`, and timeouts.
- Proxy Support: Full integration with Apify Proxy (via groups like `RESIDENTIAL`) and custom proxy URLs; see the example configurations below.
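For reference, the `proxy` object appears to follow Apify's standard proxy-input format, as in the Quick Start example further down. A hedged sketch of the Apify Proxy variant:

```json
{
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

And with custom proxy URLs instead of Apify Proxy (`proxyUrls` is the conventional field name in Apify's proxy configuration; confirm against this Actor's input schema):

```json
{
  "proxy": {
    "useApifyProxy": false,
    "proxyUrls": ["http://user:password@proxy.example.com:8000"]
  }
}
```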
### 📊 Rich Extraction Per Page
The final output JSON record contains the following comprehensive data:
- Structured Data: `schema_json_ld` contains automatically collected, merged, and cleaned schema.org JSON-LD data.
- Content: `markdown_content` provides clean, structured content in Markdown format; `clean_text` provides noise-reduced text plus a unique `content_hash` for change detection.
- Semantic Structure: Detailed link graph, headings hierarchy, tables, and lists.
- Content Blocks: Categorization of content into per-block types (e.g., `heading`, `paragraph`, `list`, `table`, `quote`) with word counts and link/image presence.
- Metrics: `text_metrics` covers word, sentence, and paragraph counts, averages, and normalized top keywords; `technical_metrics` covers HTTP status code and HTML size.
## ⚙️ How to Use (Quick Start)
The Actor requires minimal setup. The only mandatory setting is the list of websites to start crawling.
### 1. Set Start URLs
Specify the entry point(s) for the crawl.
### 2. Configure Crawl Depth
Set `maxDepth` to define how many link hops deep the crawler should go from the start URLs (e.g., `1` to crawl only pages linked directly from the homepage, `5` for a deep site analysis).
### 3. Run with Example Input
Use the following JSON structure to crawl apify.com up to depth 5 using Residential proxies:
{"maxConcurrency": 5,"maxDepth": 5,"maxLinksPerPage": 5,"maxRetries": 3,"proxy": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]},"startUrls": ["https://www.apify.com/"]}
## 📋 Input Schema (Full Parameters)
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array of URLs | N/A | REQUIRED. The starting point(s) for the crawl. |
| `maxDepth` | Integer | 1 | Maximum link depth to crawl from the start URLs. |
| `maxLinksPerPage` | Integer | 50 | Maximum number of internal links to queue from any single page. |
| `maxConcurrency` | Integer | 5 | Maximum number of pages to process simultaneously. Lower this for smaller sites. |
| `maxRetries` | Integer | 2 | Maximum number of times to retry a failed request. |
| `proxy` | Object | N/A | Proxy settings (`useApifyProxy`, groups, country code). |
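Assuming the defaults above apply whenever a field is omitted, the smallest valid input reduces to just the start URLs:

```json
{
  "startUrls": ["https://www.example.com/"]
}
```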
## 📄 Data Output Structure (Sample Record)
The Actor saves one JSON record per page to the Dataset. This output is standardized for seamless downstream integration.
[{"url": "https://www.apify.com/actor-name","technical_metrics": {"status_code": 200,"html_size_kb": 120},"clean_text": "This is the noise-reduced text content of the page...","content_hash": "a54f0d67b...","markdown_content": "# Actor Name\n\nThis is the markdown version of the content...","schema_json_ld": {"@context": "https://schema.org","@type": "WebPage","name": "Actor Page Title","topLevel": "WebPage"},"semantic_structure": {"headings": { "h1": 1, "h2": 3, "h3": 5 },"links": { "internal_count": 45, "external_count": 12 }},"text_metrics": {"word_count": 520,"top_keywords": ["actor", "data", "web", "crawl"]},"content_blocks": [{ "type": "heading", "word_count": 3, "label": "Main Feature" },{ "type": "paragraph", "word_count": 45 }]}]
## 🛠️ Technical Notes
- Compliance: The crawler respects the target site's `robots.txt` rules and crawl-delay directives.
- Efficiency: URLs are normalized, and redirect destinations are logged to avoid unnecessary reprocessing.
- Content Hash: The `content_hash` field lets you implement change tracking and skip unchanged documents in subsequent runs; a sketch follows this list.
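A minimal change-tracking sketch, assuming you persist the previous run's hashes yourself (the `hashes.json` filename and comparison logic are illustrative, not provided by the Actor):

```python
import json
from pathlib import Path

HASH_FILE = Path("hashes.json")  # illustrative local store of url -> content_hash

# Hashes saved from the previous run (empty on the first run).
previous = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

# Records exported from the current run's dataset.
with open("dataset.json", encoding="utf-8") as f:
    records = json.load(f)

# A page counts as changed if its hash is new or differs from last time.
changed = [r["url"] for r in records if previous.get(r["url"]) != r["content_hash"]]
print(f"{len(changed)} pages changed since the last run")

# Persist the current hashes for the next comparison.
HASH_FILE.write_text(json.dumps({r["url"]: r["content_hash"] for r in records}))
```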
## 💬 Support and Contact
We are actively developing and improving this Semantic Web Crawler.
### General Support & Feedback
If you encounter a bug, have a feature suggestion, or need help integrating the data:
- Please open an Issue Ticket directly on the Apify platform.
### Custom Solutions & Enterprise Use
For large-scale projects, custom integrations, bespoke development, consulting, or guaranteed support (SLA) for this Actor, please contact our team directly at contact@datafusionnow.com for a custom quote.