# 🤖 Semantic Web Crawler & Schema-Enhanced Extractor
The Semantic Web Crawler is the ultimate tool for transforming arbitrary websites into structured, analytics-ready datasets, without requiring custom code per site. It performs a depth-controlled crawl, intelligently renders pages, and extracts meaningful semantic signals and structured data (schema.org) from every page.
It's designed to give your team consistent, rich, and measurable data about a website's structure and content quality for use in SEO, data science, and research.
## ✨ Why Use This Actor? (The Value Proposition)
This Actor moves beyond basic text scraping to provide context and structure, which is vital for modern data workflows.
- SEO & Content Strategy: Map information architecture, internal links, and content depth. Identify thin/filler content and improve topical coverage.
- Data & Analytics: Build site-wide corpora with consistent features for dashboards, trend analysis, and ML/NLP tasks. Benchmark and compare competing sites.
- Product & Research: Power search and recommendations with clean text and semantic cues. Validate the presence and quality of schema.org structured data.
## 🚀 Main Features & Data Extraction
The core value lies in the rich, standardized JSON record output for every successfully crawled page.
### 🧭 Controlled & Flexible Crawling
The Actor is built for resilience and control:
- Depth-Limited Crawling: Starts from one or more `startUrls` and follows internal links up to a configurable `maxDepth`.
- Resilience & Performance: Includes granular controls for `maxConcurrency`, `maxRetries`, and timeouts.
- Proxy Support: Full integration with Apify Proxy (via groups like `RESIDENTIAL`) and custom proxy URLs; see the example configurations below.
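For reference, the `proxy` object appears to follow Apify's standard proxy-input format, as in the Quick Start example further down. A hedged sketch of the Apify Proxy variant:

```json
{
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

And with custom proxy URLs instead of Apify Proxy (`proxyUrls` is the conventional field name in Apify's proxy configuration; confirm against this Actor's input schema):

```json
{
  "proxy": {
    "useApifyProxy": false,
    "proxyUrls": ["http://user:password@proxy.example.com:8000"]
  }
}
```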
### 📊 Rich Extraction Per Page
The final output JSON record contains the following comprehensive data:
- Structured Data: `schema_json_ld` contains automatically collected, merged, and cleaned schema.org JSON-LD data.
- Content: `markdown_content` provides clean, structured content in Markdown format; `clean_text` provides noise-reduced text plus a unique `content_hash` for change detection.
- Semantic Structure: Detailed link graph, headings hierarchy, tables, and lists.
- Content Blocks: Categorization of content into per-block types (e.g., `heading`, `paragraph`, `list`, `table`, `quote`) with word counts and link/image presence.
- Metrics: `text_metrics` covers word, sentence, and paragraph counts, averages, and normalized top keywords; `technical_metrics` covers HTTP status code and HTML size.
## ⚙️ How to Use (Quick Start)
The Actor requires minimal setup. The only mandatory setting is the list of websites to start crawling.
### 1. Set Start URLs
Specify the entry point(s) for the crawl.
### 2. Configure Crawl Depth
Set `maxDepth` to define how many link hops deep the crawler should go from the start URLs (e.g., `1` to crawl only pages linked directly from the homepage, `5` for a deep site analysis).
### 3. Run with Example Input
Use the following JSON structure to crawl apify.com up to depth 5 using Residential proxies:
{"maxConcurrency": 5,"maxDepth": 5,"maxLinksPerPage": 5,"maxRetries": 3,"proxy": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]},"startUrls": ["https://www.apify.com/"]}
## 📋 Input Schema (Full Parameters)
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array of URLs | N/A | REQUIRED. The starting point(s) for the crawl. |
| `maxDepth` | Integer | 1 | Maximum link depth to crawl from the start URLs. |
| `maxLinksPerPage` | Integer | 50 | Maximum number of internal links to queue from any single page. |
| `maxConcurrency` | Integer | 5 | Maximum number of pages to process simultaneously. Lower this for smaller sites. |
| `maxRetries` | Integer | 2 | Maximum number of times to retry a failed request. |
| `proxy` | Object | N/A | Proxy settings (`useApifyProxy`, groups, country code). |
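Assuming the defaults above apply whenever a field is omitted, the smallest valid input reduces to just the start URLs:

```json
{
  "startUrls": ["https://www.example.com/"]
}
```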
## 📄 Data Output Structure (Sample Record)
The Actor saves one JSON record per page to the Dataset. This output is standardized for seamless downstream integration.
[{"url": "https://www.apify.com/actor-name","technical_metrics": {"status_code": 200,"html_size_kb": 120},"clean_text": "This is the noise-reduced text content of the page...","content_hash": "a54f0d67b...","markdown_content": "# Actor Name\n\nThis is the markdown version of the content...","schema_json_ld": {"@context": "https://schema.org","@type": "WebPage","name": "Actor Page Title","topLevel": "WebPage"},"semantic_structure": {"headings": { "h1": 1, "h2": 3, "h3": 5 },"links": { "internal_count": 45, "external_count": 12 }},"text_metrics": {"word_count": 520,"top_keywords": ["actor", "data", "web", "crawl"]},"content_blocks": [{ "type": "heading", "word_count": 3, "label": "Main Feature" },{ "type": "paragraph", "word_count": 45 }]}]
## 🛠️ Technical Notes
- Compliance: The crawler respects the target site's `robots.txt` rules and crawl-delay directives.
- Efficiency: URLs are normalized, and redirect destinations are logged to avoid unnecessary reprocessing.
- Content Hash: The `content_hash` field lets you implement change tracking and skip unchanged documents in subsequent runs; a sketch follows this list.
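A minimal change-tracking sketch, assuming you persist the previous run's hashes yourself (the `hashes.json` filename and comparison logic are illustrative, not provided by the Actor):

```python
import json
from pathlib import Path

HASH_FILE = Path("hashes.json")  # illustrative local store of url -> content_hash

# Hashes saved from the previous run (empty on the first run).
previous = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

# Records exported from the current run's dataset.
with open("dataset.json", encoding="utf-8") as f:
    records = json.load(f)

# A page counts as changed if its hash is new or differs from last time.
changed = [r["url"] for r in records if previous.get(r["url"]) != r["content_hash"]]
print(f"{len(changed)} pages changed since the last run")

# Persist the current hashes for the next comparison.
HASH_FILE.write_text(json.dumps({r["url"]: r["content_hash"] for r in records}))
```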
## 💬 Support and Contact
We are actively developing and improving this Semantic Web Crawler.
### General Support & Feedback
If you encounter a bug, have a feature suggestion, or need help integrating the data:
- Please open an Issue Ticket directly on the Apify platform.
### Custom Solutions & Enterprise Use
For large-scale projects, custom integrations, bespoke development, consulting, or guaranteed support (SLA) for this Actor, please contact our team directly at contact@datafusionnow.com for a custom quote.