Pricing

from $1.50 / 1,000 results

Website Content Crawler

Rating

0.0

(0)

Developer

Syed Rupom

Actor stats

Last modified

a month ago

Features

Full site crawling: Follows internal links up to a configurable depth
Smart content extraction: Auto-detects the main content and strips navigation, headers, footers, and ads
Multiple output formats: Markdown (AI-ready), plain text, or raw HTML
JavaScript rendering: Puppeteer-based crawling handles React, Vue, and other dynamically rendered sites
JSON-LD extraction: Captures structured data schemas embedded in pages
Configurable depth & page limits: Control exactly how much to crawl
Custom selectors: Target specific content areas or remove specific elements
Subdomain support: Optionally follow links to subdomains
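
The depth, page-limit, and subdomain options can be pictured as a breadth-first walk over a site's links. The following is a minimal sketch of that idea, not the Actor's actual implementation: `crawl_order`, `same_site`, and the `SITE` link graph are hypothetical names, pages are simulated by an in-memory dictionary instead of being fetched, and a "subdomain" is assumed to mean a host ending in the start URL's host.

```python
from collections import deque
from urllib.parse import urlparse


def same_site(url: str, start_url: str, include_subdomains: bool = False) -> bool:
    """Return True if `url` is on the start URL's host (or, optionally, a subdomain of it)."""
    host = urlparse(url).hostname or ""
    start_host = urlparse(start_url).hostname or ""
    if host == start_host:
        return True
    return include_subdomains and host.endswith("." + start_host)


def crawl_order(start_url, links, max_depth=3, max_pages=100, include_subdomains=False):
    """Breadth-first walk over `links` (a dict mapping URL -> outgoing URLs).

    Returns (url, depth) pairs in visiting order, honoring both limits.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # at the depth limit: record the page but do not follow its links
        for target in links.get(url, []):
            if target not in seen and same_site(target, start_url, include_subdomains):
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical link graph standing in for fetched pages.
SITE = {
    "https://docs.example.com": [
        "https://docs.example.com/intro",
        "https://api.docs.example.com/v1",
        "https://other.example.org/",
    ],
    "https://docs.example.com/intro": ["https://docs.example.com/setup"],
}
```

With `max_depth=0` the walk returns only the start URL, and raising `max_depth` or enabling `include_subdomains` widens the set of pages visited.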

Output Fields Per Page

Field	Description
`url`	Original URL
`loaded_url`	Final URL after redirects
`title`	Page `<title>`
`description`	Meta description
`author`	Author meta tag
`keywords`	Meta keywords
`og_image`	Open Graph image URL
`canonical`	Canonical URL
`lang`	Page language code
`h1`	Main heading
`h2s`	Top subheadings (up to 10)
`text`	Clean plain text (when `outputFormat` is `text`)
`markdown`	Markdown-formatted content (when `outputFormat` is `markdown`)
`html`	Content HTML (when `outputFormat` is `html`)
`json_ld`	JSON-LD structured data objects
`depth`	Crawl depth from start URL
`referrer`	Page that linked here
`load_time_ms`	Page load time in ms
`status_code`	HTTP status code
`links_found`	Number of links on the page
`crawled_at`	ISO timestamp
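
As an illustration of working with these fields, the sketch below filters crawled records down to pages that loaded successfully and carry content. It assumes, based on the table, that only the content field matching the chosen output format is populated. The helper names and the sample records are hypothetical.

```python
from typing import Iterable, Iterator


def page_content(page: dict) -> str:
    """Return whichever content field the record carries, or an empty string."""
    for field in ("markdown", "text", "html"):
        value = page.get(field)
        if value:
            return value
    return ""


def usable_pages(pages: Iterable[dict]) -> Iterator[dict]:
    """Yield pages with a 200 status code and non-blank content."""
    for page in pages:
        if page.get("status_code") == 200 and page_content(page).strip():
            yield page


# Hypothetical sample records using field names from the table above.
sample = [
    {"url": "https://docs.example.com", "status_code": 200, "depth": 0,
     "title": "Docs", "markdown": "# Docs\n\nWelcome."},
    {"url": "https://docs.example.com/missing", "status_code": 404, "depth": 1,
     "title": "Not found", "markdown": "Page not found"},
    {"url": "https://docs.example.com/empty", "status_code": 200, "depth": 1,
     "title": "Empty", "text": "   "},
]

kept = [page["url"] for page in usable_pages(sample)]
```

Here `kept` contains only the first URL, because the second record has a 404 status and the third has blank content.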

Input

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "maxDepth": 3,
  "includeSubdomains": false,
  "outputFormat": "markdown",
  "extractSelector": "article",
  "removeSelectors": [".sidebar", ".related-posts"],
  "proxyConfiguration": {"useApifyProxy": false}
}
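
One way to submit this input programmatically is the `apify-client` Python package. The sketch below is an assumption-laden example rather than official usage for this Actor: the Actor ID is not stated on this page, so `actor_id` must be supplied by the caller; the `APIFY_TOKEN` environment variable name is a convention assumed here; and the dict-style access to `defaultDatasetId` reflects the 1.x releases of `apify-client`.

```python
import os


def build_run_input(start_url: str, max_pages: int = 100, max_depth: int = 3) -> dict:
    """Build an input object using the same keys as the JSON example above."""
    return {
        "startUrls": [{"url": start_url}],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "includeSubdomains": False,
        "outputFormat": "markdown",
        "proxyConfiguration": {"useApifyProxy": False},
    }


def run_crawler(actor_id: str, run_input: dict) -> list:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_crawler("<username>/<actor-name>", build_run_input("https://docs.example.com"))`, with the placeholder replaced by the Actor's real ID from its Store page.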

Use Cases

AI Training Data: Extract clean, structured web content at scale
RAG Pipelines: Feed documentation sites into vector databases (Pinecone, Qdrant, Weaviate)
Custom ChatGPT: Build knowledge bases from product documentation
Content Auditing: Extract and analyze all text across a website
Competitive Research: Extract competitor content for analysis
Documentation Indexing: Index technical docs for search
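
For the RAG use case, crawled pages are usually split into overlapping chunks before embedding. The sketch below shows one simple character-based approach. The chunk size, overlap, record shape, and function names are illustrative assumptions, and the vector database upload step is left out because it depends on the chosen store.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split `text` into chunks of at most `max_chars`, each overlapping the previous by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def page_to_chunks(page: dict, max_chars: int = 1000, overlap: int = 100) -> list[dict]:
    """Turn one crawled page into chunk records that keep the source URL and title."""
    body = page.get("markdown") or page.get("text") or ""
    return [
        {
            "id": f"{page['url']}#{i}",
            "url": page["url"],
            "title": page.get("title", ""),
            "content": chunk,
        }
        for i, chunk in enumerate(chunk_text(body, max_chars, overlap))
    ]
```

Keeping `url` and `title` on every chunk lets a retrieval step cite the page each passage came from.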

Tips

Set `maxDepth: 0` to scrape only the start URLs without following links
Use `extractSelector: "main"` to target only the main content area
Set `outputFormat: "markdown"` for best results with AI/LLM ingestion
Most public sites work without proxies; enable proxies for rate-limited sites
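
To make the effect of targeting `main` concrete, the sketch below keeps only the text inside a `<main>` element and skips common boilerplate tags nested within it. It is a rough approximation using Python's standard `html.parser` and matching on tag names only. The Actor itself is described as Puppeteer-based and accepts selectors, so its real behavior may differ. `MainTextExtractor` and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text inside <main>, skipping nested boilerplate elements."""

    SKIP = {"nav", "header", "footer", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.main_depth = 0  # greater than 0 while inside <main>
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside <main> and outside any skipped element.
        if self.main_depth and not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the space-joined text found inside <main>."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Text in navigation, footers, and sidebars is dropped, which is the same goal the `extractSelector` and `removeSelectors` inputs serve.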
