Website Content Crawler
Status: Under maintenance
Pricing: Pay per usage
Developer: Syed Rupom (maintained by Community)
Crawl any website and extract clean, structured content. Outputs plain text, Markdown, or raw HTML — optimized for AI/LLM applications, RAG pipelines, documentation indexing, chatbot training, and content analysis.
Features
- Full site crawling: Follows internal links up to configurable depth
- Smart content extraction: Auto-detects main content, strips nav/header/footer/ads
- Multiple output formats: Markdown (AI-ready), plain text, or raw HTML
- JavaScript rendering: Full Puppeteer-based crawling handles React, Vue, and dynamic sites
- JSON-LD extraction: Captures structured data schemas embedded in pages
- Configurable depth & page limits: Control exactly how much to crawl
- Custom selectors: Target specific content areas or remove specific elements
- Subdomain support: Optionally follow links to subdomains
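The depth limit and subdomain option above can be pictured as a small link filter. The following is a minimal sketch, not the Actor's actual implementation: the function names `in_scope` and `plan_links` are hypothetical, and it assumes a "subdomain" means any host ending in `.<start host>`.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(link: str, start_url: str, include_subdomains: bool = False) -> bool:
    """Return True if `link` belongs to the site rooted at `start_url`."""
    parsed = urlparse(link)
    if parsed.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, tel:, etc.
    link_host = (parsed.hostname or "").lower()
    start_host = (urlparse(start_url).hostname or "").lower()
    if not link_host or not start_host:
        return False
    if link_host == start_host:
        return True
    # Assumption: a subdomain is any host that ends with ".<start host>".
    return include_subdomains and link_host.endswith("." + start_host)


def plan_links(page_url, hrefs, start_url, depth, max_depth, include_subdomains=False):
    """Resolve the hrefs found on a page and keep those worth visiting at depth + 1."""
    if depth >= max_depth:
        return []  # depth limit reached: do not follow any links from this page
    seen = set()
    kept = []
    for href in hrefs:
        # Make the link absolute and drop the #fragment so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute in seen or not in_scope(absolute, start_url, include_subdomains):
            continue
        seen.add(absolute)
        kept.append(absolute)
    return kept
```

With `max_depth` set to 0, `plan_links` returns an empty list for the start pages, so only the start URLs themselves are scraped.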
Output Fields Per Page
| Field | Description |
|---|---|
| `url` | Original URL |
| `loaded_url` | Final URL after redirects |
| `title` | Page `<title>` |
| `description` | Meta description |
| `author` | Author meta tag |
| `keywords` | Meta keywords |
| `og_image` | Open Graph image URL |
| `canonical` | Canonical URL |
| `lang` | Page language code |
| `h1` | Main heading |
| `h2s` | Top subheadings (up to 10) |
| `text` | Clean plain text (format=text) |
| `markdown` | Markdown-formatted content (format=markdown) |
| `html` | Content HTML (format=html) |
| `json_ld` | JSON-LD structured data objects |
| `depth` | Crawl depth from start URL |
| `referrer` | Page that linked here |
| `load_time_ms` | Page load time in ms |
| `status_code` | HTTP status code |
| `links_found` | Number of links on the page |
| `crawled_at` | ISO timestamp |
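When consuming these records downstream, it can help to pick the content field without hard-coding the output format. The sketch below assumes that only the field matching the chosen format (`markdown`, `text`, or `html`) is populated on each record; the helper names `page_content` and `was_redirected` are illustrative, not part of the Actor.

```python
from typing import Any, Mapping, Tuple

# Checked in this order; assumption: only one of these is filled per record.
CONTENT_FIELDS = ("markdown", "text", "html")


def page_content(record: Mapping[str, Any]) -> Tuple[str, str]:
    """Return (field_name, content) for the first non-empty content field."""
    for field in CONTENT_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            return field, value
    return "", ""  # no content field was populated


def was_redirected(record: Mapping[str, Any]) -> bool:
    """True when the final URL differs from the URL that was requested."""
    loaded = record.get("loaded_url")
    return bool(loaded) and loaded != record.get("url")


# Hypothetical record shaped like the table above (values are made up).
example = {
    "url": "https://docs.example.com/start",
    "loaded_url": "https://docs.example.com/start/",
    "title": "Getting started",
    "markdown": "# Getting started\n\nInstall the tool.",
    "depth": 1,
    "status_code": 200,
}
field, content = page_content(example)
```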
Input
{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "maxDepth": 3,
  "includeSubdomains": false,
  "outputFormat": "markdown",
  "extractSelector": "article",
  "removeSelectors": [".sidebar", ".related-posts"],
  "proxyConfiguration": {"useApifyProxy": false}
}
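If the input is assembled programmatically, a small builder can catch obvious mistakes before a run starts. This is a sketch under assumptions: `build_input` is a hypothetical helper, the accepted `outputFormat` values (`markdown`, `text`, `html`) are inferred from the output-field table rather than from a published schema, and the numeric checks are plausible guesses, not documented limits.

```python
import json

# Assumption: these are the accepted outputFormat values.
ALLOWED_FORMATS = {"markdown", "text", "html"}


def build_input(start_urls, max_pages, max_depth, output_format="markdown", **extra):
    """Build an input object with the keys shown in the example above."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    if max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported outputFormat: {output_format!r}")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "outputFormat": output_format,
    }
    # Optional keys such as extractSelector or removeSelectors pass through as-is.
    run_input.update(extra)
    return run_input


run_input = build_input(
    ["https://docs.example.com"],
    max_pages=100,
    max_depth=3,
    extractSelector="article",
    removeSelectors=[".sidebar", ".related-posts"],
)
payload = json.dumps(run_input, indent=2)  # JSON text ready to paste or send
```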
Use Cases
- AI Training Data: Extract clean, structured web content at scale
- RAG Pipelines: Feed documentation sites into vector databases (Pinecone, Qdrant, Weaviate)
- Custom ChatGPT: Build knowledge bases from product documentation
- Content Auditing: Extract and analyze all text across a website
- Competitive Research: Extract competitor content for analysis
- Documentation Indexing: Index technical docs for search
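For the RAG use case, crawled Markdown usually needs to be split into chunks before embedding. The sketch below shows one simple heading-aware approach; it is an illustration rather than something the Actor does for you, and `chunk_markdown`, `to_documents`, and the 1000-character default are assumptions.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")  # ATX-style Markdown headings


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list:
    """Split Markdown at headings, also breaking when a chunk would exceed max_chars.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in markdown.splitlines():
        starts_section = bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 accounts for the newline
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def to_documents(record: dict, max_chars: int = 1000) -> list:
    """Turn one crawled page into chunk documents that carry their source URL."""
    return [
        {
            "id": f"{record['url']}#{index}",
            "text": chunk,
            "source": record["url"],
            "title": record.get("title", ""),
        }
        for index, chunk in enumerate(chunk_markdown(record.get("markdown", ""), max_chars))
    ]
```

Each resulting document can then be embedded and upserted into whichever vector database is in use; keeping `source` and `title` alongside the text makes it possible to cite the original page in answers.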
Tips
- Set `maxDepth: 0` to only scrape the start URLs without following links
- Use `extractSelector: "main"` to target only the main content area
- Set `outputFormat: "markdown"` for best results with AI/LLM ingestion
- Most public sites work without proxies; enable proxies for rate-limited sites