Website Content Crawler
Pricing: from $20.00 / 1,000 results
Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!
Developer: ParseForge
# 🕸️ Website Content Crawler

🚀 Crawl an entire website and export clean Markdown in seconds. Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.

📌 Last updated: 2026-04-21 · 📄 18 fields per page · 🗺️ Sitemap auto-seed · 🤖 Robots-aware · 🌐 HTTP + browser fallback
The Website Content Crawler walks any website from a starting URL, following internal links up to a configurable depth. It parses sitemap.xml and sitemap_index.xml to discover thousands of URLs instantly, respects robots.txt, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown plus 17 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.
Built-in include and exclude regex filters let you narrow the crawl to /docs/, skip /auth/, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |
## 🔍 What the Website Content Crawler does
Six crawl workflows in a single run:
- 🗺️ Sitemap auto-seed. Parses `sitemap.xml` and index files to discover every public URL in seconds.
- 🤖 Robots.txt aware. Respects disallow rules for the `*` and `apify` user-agents.
- 🌐 Browser fallback. Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
- 📝 Markdown extraction. Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
- 🔗 Link analytics. Counts internal and outbound links per page for site-structure analysis.
- 🚦 Include/exclude patterns. Regex filters to control which URLs enter the queue.
Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.
💡 Why it matters: RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.
## 🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action.

## ⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `startUrls` | array of URLs | required | One or more starting URLs for the crawl. |
| `maxDepth` | integer | `2` | Link hops from the start URLs (`0` = start URLs only). |
| `maxItems` | integer | `10` | Pages returned. Free plan caps at 10, paid plan at 1,000,000. |
| `sameDomain` | boolean | `true` | Stay within the starting domain. |
| `includeSubdomains` | boolean | `true` | Follow subdomains of the root host. |
| `renderingType` | string | `"http"` | `http`, `browser`, or `auto` (browser fallback when HTTP content is thin). |
| `useSitemap` | boolean | `true` | Seed queue from `sitemap.xml`. |
| `respectRobotsTxt` | boolean | `true` | Skip URLs disallowed by robots.txt. |
| `includeUrlPatterns` | array of regex | `[]` | Only URLs matching any pattern are crawled. |
| `excludeUrlPatterns` | array of regex | `[]` | URLs matching any pattern are skipped. |
Example: crawl documentation with sitemap seeding.
{"startUrls": [{ "url": "https://docs.apify.com" }],"maxDepth": 3,"maxItems": 500,"useSitemap": true,"respectRobotsTxt": true,"renderingType": "auto"}
Example: blog crawl with URL filters.
{"startUrls": [{ "url": "https://example.com" }],"maxDepth": 5,"maxItems": 200,"includeUrlPatterns": ["/blog/"],"excludeUrlPatterns": ["/tag/", "/page/"]}
⚠️ Good to Know: concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only fetching returns thin content, because browser rendering is about 3x slower per page.
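To illustrate how the include/exclude filters interact, here is a small sketch of the assumed matching logic (a URL is queued when it matches any include pattern, or the include list is empty, and matches no exclude pattern; the exact semantics inside the Actor may differ):

```python
import re

include = [r"/blog/"]
exclude = [r"/tag/", r"/page/"]

def should_crawl(url: str) -> bool:
    """Assumed filter logic: pass include (if any), then fail on any exclude."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)

print(should_crawl("https://example.com/blog/hello"))   # True
print(should_crawl("https://example.com/blog/tag/ai"))  # False: excluded by /tag/
print(should_crawl("https://example.com/about"))        # False: no include match
```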
## 📤 Output

Each record contains 18 content and metadata fields, plus an `error` field that is populated only when a page fails. Download the dataset as CSV, Excel, JSON, or XML.

### 🧾 Schema
| Field | Type | Example |
|---|---|---|
| `url` | string | `"https://docs.apify.com/platform/actors"` |
| `depth` | number | `1` |
| `parentUrl` | string \| null | `"https://docs.apify.com"` |
| `title` | string \| null | `"Actors \| …"` |
| `description` | string \| null | `"Learn how Apify Actors package scrapers."` |
| `markdown` | string | `"# Actors\n\nAn Actor is..."` |
| `text` | string | `"Actors An Actor is..."` |
| `wordCount` | number | `860` |
| `language` | string \| null | `"en"` |
| `author` | string \| null | `"Apify"` |
| `publishedTime` | ISO 8601 \| null | `"2024-08-15T00:00:00Z"` |
| `siteName` | string \| null | `"Apify Documentation"` |
| `imageUrl` | string \| null | `"https://.../og.png"` |
| `outboundLinks` | number | `14` |
| `internalLinks` | number | `42` |
| `httpStatus` | number | `200` |
| `responseTimeMs` | number | `210` |
| `crawledAt` | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| `error` | string \| null | `"Timeout"` on failure |
### 📦 Sample records
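One illustrative record matching the schema above (all values are made up):

```json
{
  "url": "https://example.com/blog/hello-world",
  "depth": 1,
  "parentUrl": "https://example.com/blog/",
  "title": "Hello World",
  "description": "An example post.",
  "markdown": "# Hello World\n\nWelcome to the blog...",
  "text": "Hello World Welcome to the blog...",
  "wordCount": 412,
  "language": "en",
  "author": "Jane Doe",
  "publishedTime": "2026-01-15T00:00:00Z",
  "siteName": "Example Blog",
  "imageUrl": "https://example.com/og.png",
  "outboundLinks": 6,
  "internalLinks": 18,
  "httpStatus": 200,
  "responseTimeMs": 185,
  "crawledAt": "2026-04-21T12:00:00.000Z",
  "error": null
}
```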
## ✨ Why choose this Actor
| Capability | Description |
|---|---|
| Sitemap auto-seeding | Discovers thousands of URLs from `sitemap.xml` instantly. |
| Robots-aware | Respects disallow rules out of the box. |
| HTTP plus browser | Auto fallback to Playwright when JavaScript matters. |
| Clean Markdown | Strips nav, footer, aside, and scripts. Preserves content structure. |
| Link graph | Counts internal and outbound links per page for site analysis. |
| Fast | 100 pages in under a minute with HTTP concurrency of 10. |
| No credentials | Runs on any publicly accessible site. |
Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.
## 📊 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ✅ Website Content Crawler (this Actor) | $5 free credit, then pay-per-use | Any public site | Live per run | depth, patterns, sitemap, robots | ~2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | Days |
Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.
## 🚀 How to use

1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
2. 🔍 Open the Actor. Go to the Website Content Crawler page on the Apify Store.
3. 🎯 Set input. Pick one or more start URLs, a depth limit, and `maxItems`.
4. ▶️ Run it. Click Start and let the Actor walk the site.
5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
## 🔄 Automating Website Content Crawler
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📖 See the Apify API documentation for full details.
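As a minimal sketch of starting a run over Apify's REST API with only the standard library (the Actor ID and token below are placeholders; the real ID is shown on the Actor's API tab):

```python
import json
from urllib import request

# Placeholder values -- substitute your own before running.
ACTOR_ID = "parseforge~website-content-crawler"
API_TOKEN = "<YOUR_APIFY_TOKEN>"

run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxDepth": 2,
    "maxItems": 100,
    "renderingType": "auto",
}

# POST /v2/acts/{actorId}/runs starts an Actor run.
url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?token={API_TOKEN}"
req = request.Request(
    url,
    data=json.dumps(run_input).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # uncomment with a real token to start the run
print(url.split("?")[0])
```

The official `apify-client` packages wrap this endpoint and also handle polling and dataset downloads for you.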
The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.
## 🔌 Integrate with any app
Website Content Crawler connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications
- Airbyte - Pipe content into your warehouse
- GitHub - Trigger runs from commits
- Google Drive - Export Markdown to Docs
You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.
## 🎯 Recommended Actors

- 🤖 RAG Web Browser - Search or fetch URLs with LLM-ready output
- 📰 Smart Article Extractor - Extract clean article text from news sites
- 🔍 Google Search Scraper - SERP results with rank and description
- 📧 Contact Info Scraper - Emails, phones, and socials from URLs
- 📸 URL Screenshot Tool - Full-page screenshots as PNG, JPEG, or PDF

💡 Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.
📞 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.

⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.