Website Content Crawler
Pricing: from $20.00 / 1,000 results
Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!
Developer: ParseForge
Rating: 0.0 (0 reviews)
Actor stats: 0 bookmarked · 4 total users · 1 monthly active user
Last modified: 4 days ago
🕸️ Website Content Crawler
🚀 Crawl an entire website and export clean Markdown in seconds. Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.
📅 Last updated: 2026-04-21 · 📊 19 fields per page · 🗺️ Sitemap auto-seed · 🤖 Robots-aware · 🌐 HTTP + browser fallback
The Website Content Crawler walks any website from a starting URL, following internal links up to a configurable depth. It parses `sitemap.xml` and `sitemap_index.xml` to discover thousands of URLs instantly, respects `robots.txt`, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown and plain text plus 17 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.
Built-in include and exclude regex filters let you narrow the crawl to /docs/, skip /auth/, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |
🔍 What the Website Content Crawler does
Six crawl workflows in a single run:
- 🗺️ Sitemap auto-seed. Parses `sitemap.xml` and index files to discover every public URL in seconds.
- 🤖 Robots.txt aware. Respects disallow rules for the `*` and `apify` user-agents.
- 🌐 Browser fallback. Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
- 📝 Markdown extraction. Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
- 🔗 Link analytics. Counts internal and outbound links per page for site-structure analysis.
- 🚦 Include/exclude patterns. Regex filters to control which URLs enter the queue.
Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.
💡 Why it matters: RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.
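The include/exclude filtering described above can be sketched in a few lines of Python. This is an illustrative approximation, not the Actor's actual source; in particular, giving exclude patterns precedence over include patterns is an assumption.

```python
import re

def should_enqueue(url, include_patterns, exclude_patterns):
    """Decide whether a discovered URL enters the crawl queue.

    Sketch of the documented behavior: any exclude match skips the URL;
    if include patterns are given, the URL must match at least one.
    """
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return True

# Keep /blog/ posts but skip tag and pagination pages.
urls = [
    "https://example.com/blog/hello-world",
    "https://example.com/blog/tag/news",
    "https://example.com/about",
]
kept = [u for u in urls if should_enqueue(u, ["/blog/"], ["/tag/", "/page/"])]
# kept == ["https://example.com/blog/hello-world"]
```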
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `startUrls` | array of URLs | required | One or more starting URLs for the crawl. |
| `maxDepth` | integer | 2 | Link hops from the start URLs (0 = start URLs only). |
| `maxItems` | integer | 10 | Pages returned. Free plan caps at 10, paid plan at 1,000,000. |
| `sameDomain` | boolean | true | Stay within the starting domain. |
| `includeSubdomains` | boolean | true | Follow subdomains of the root host. |
| `renderingType` | string | "http" | `http`, `browser`, or `auto` (browser fallback when HTTP content is thin). |
| `useSitemap` | boolean | true | Seed queue from sitemap.xml. |
| `respectRobotsTxt` | boolean | true | Skip URLs disallowed by robots.txt. |
| `includeUrlPatterns` | array of regex | [] | Only URLs matching any pattern are crawled. |
| `excludeUrlPatterns` | array of regex | [] | URLs matching any pattern are skipped. |
Example: crawl documentation with sitemap seeding.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 3,
  "maxItems": 500,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "renderingType": "auto"
}
```

Example: blog crawl with URL filters.

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 5,
  "maxItems": 200,
  "includeUrlPatterns": ["/blog/"],
  "excludeUrlPatterns": ["/tag/", "/page/"]
}
```
⚠️ Good to Know: concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only returns thin content, because browser rendering is about 3x slower per page.
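For a sense of what the `useSitemap` option involves, here is a minimal Python sketch of extracting `<loc>` URLs from a `sitemap.xml` document. It is illustrative only: the real Actor also walks `sitemap_index.xml` files and fetches sitemaps over HTTP.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_bytes):
    """Extract <loc> entries from a sitemap.xml document (seed URLs)."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.apify.com/platform/actors</loc></url>
  <url><loc>https://docs.apify.com/platform/storage</loc></url>
</urlset>"""

seeds = urls_from_sitemap(sample)
# seeds == ["https://docs.apify.com/platform/actors",
#           "https://docs.apify.com/platform/storage"]
```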
📄 Output
Each record contains 19 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 url | string | "https://docs.apify.com/platform/actors" |
| 🪜 depth | number | 1 |
| 🔙 parentUrl | string \| null | "https://docs.apify.com" |
| 🏷️ title | string \| null | "Actors \| ..." |
| 📝 description | string \| null | "Learn how Apify Actors package scrapers." |
| 📄 markdown | string | "# Actors\n\nAn Actor is..." |
| 💬 text | string | "Actors An Actor is..." |
| 🔢 wordCount | number | 860 |
| 🌍 language | string \| null | "en" |
| 🧑 author | string \| null | "Apify" |
| 📅 publishedTime | ISO 8601 \| null | "2024-08-15T00:00:00Z" |
| 🏢 siteName | string \| null | "Apify Documentation" |
| 🖼️ imageUrl | string \| null | "https://.../og.png" |
| ↗️ outboundLinks | number | 14 |
| ↘️ internalLinks | number | 42 |
| 🔢 httpStatus | number | 200 |
| ⏱️ responseTimeMs | number | 210 |
| 📅 crawledAt | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| ❌ error | string \| null | "Timeout" on failure |
📦 Sample records
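An illustrative record matching the schema above (values mirror the schema's own examples; fields marked null are simply absent on this hypothetical page):

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "depth": 1,
  "parentUrl": "https://docs.apify.com",
  "title": "Actors | Apify Documentation",
  "description": "Learn how Apify Actors package scrapers.",
  "markdown": "# Actors\n\nAn Actor is...",
  "text": "Actors An Actor is...",
  "wordCount": 860,
  "language": "en",
  "author": null,
  "publishedTime": null,
  "siteName": "Apify Documentation",
  "imageUrl": null,
  "outboundLinks": 14,
  "internalLinks": 42,
  "httpStatus": 200,
  "responseTimeMs": 210,
  "crawledAt": "2026-04-21T12:00:00.000Z",
  "error": null
}
```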
✨ Why choose this Actor
| Capability | |
|---|---|
| 🗺️ | Sitemap auto-seeding. Discovers thousands of URLs from sitemap.xml instantly. |
| 🤖 | Robots-aware. Respects disallow rules out of the box. |
| 🌐 | HTTP plus browser. Auto fallback to Playwright when JavaScript matters. |
| 📝 | Clean Markdown. Strips nav, footer, aside, and scripts. Preserves content structure. |
| 🔗 | Link graph. Counts internal and outbound links per page for site analysis. |
| ⚡ | Fast. 100 pages in under a minute with HTTP concurrency of 10. |
| 🚫 | No credentials. Runs on any publicly accessible site. |
🔑 Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.
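The robots-aware behavior listed above can be approximated entirely with Python's standard library. This is a conceptual sketch using `urllib.robotparser`, not the Actor's actual implementation:

```python
from urllib import robotparser

# Parse a robots.txt body the way respectRobotsTxt does conceptually.
rules = robotparser.RobotFileParser()
rules.parse("""\
User-agent: *
Disallow: /auth/
""".splitlines())

allowed = rules.can_fetch("apify", "https://example.com/docs/intro")  # True
blocked = rules.can_fetch("apify", "https://example.com/auth/login")  # False
```

URLs failing `can_fetch` simply never enter the request queue, which is why disallowed pages produce no dataset records at all rather than error rows.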
🏆 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ✅ Website Content Crawler (this Actor) | $5 free credit, then pay-per-use | Any public site | Live per run | depth, patterns, sitemap, robots | ⚡ 2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | 🐢 Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | 🕐 Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | 🐢 Days |
Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.
🚀 How to use
- 🔑 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 📂 Open the Actor. Go to the Website Content Crawler page on the Apify Store.
- 🎯 Set input. Pick one or more start URLs, a depth limit, and `maxItems`.
- ▶️ Run it. Click Start and let the Actor walk the site.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🌍 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔎 Perplexity
- 🚀 Copilot
❓ Frequently Asked Questions
💳 Do I need a paid Apify plan to run this actor?
No. You can start right now on the free Apify plan, which includes $5 in free monthly credit. That is enough to run this actor several times and explore the output before committing to anything. Paid plans unlock higher limits, more concurrent runs, and larger datasets. Create a free Apify account here to get started.
🚨 What happens if my run fails or returns no results?
Failed runs are not charged. If the source site changes, proxies get rate-limited, or a specific input matches nothing, re-run the actor or open our contact form and we will investigate. You can also check the run log in the Apify console to see why the run stopped.
📈 How many items can I scrape per run?
Free users are limited to 10 items per run so you can preview the output and confirm the actor works for your use case. Paid users can raise `maxItems` up to 1,000,000 per run. Upgrade here if you need full scale.
🔄 How fresh is the data?
Every run fetches live data at the moment of execution. There is no cache or delay: the records you get reflect what the source returned at that moment. Schedule the actor to maintain a rolling snapshot of the data you need.
🧑‍💻 Can I call this actor from my own code?
Yes. Apify exposes every actor as a REST endpoint and ships first-class SDKs for Node.js and Python. You can start a run, read the dataset, and handle webhooks from your own app in a few lines. All you need is your Apify API token.
📤 How do I export the data?
Every Apify dataset can be downloaded in one click from the console as CSV, JSON, JSONL, Excel, HTML, XML, or RSS. You can also pull results programmatically via the Apify API or stream them into BigQuery, S3, and other destinations through built-in integrations.
📅 Can I schedule the actor to run automatically?
Yes. Use the Apify scheduler to run the actor on any cadence, from hourly to monthly. Results are saved to your dataset and can be delivered to webhooks, email, Slack, cloud storage, or automation tools such as Zapier and Make.
🔁 Automating Website Content Crawler
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.
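A minimal Python sketch of driving the Actor with the `apify-client` package. The `build_input` helper and the `actor_id` argument are illustrative placeholders; substitute the real Actor ID from the Store and your own input values.

```python
def build_input(start_url, max_depth=2, max_items=100):
    """Assemble an input object with the fields documented above."""
    return {
        "startUrls": [{"url": start_url}],
        "maxDepth": max_depth,
        "maxItems": max_items,
        "renderingType": "auto",
    }

def crawl(api_token, actor_id, start_url):
    """Start a run and return all dataset items.

    Requires `pip install apify-client` and a valid Apify API token;
    imported inside the function so the sketch loads without the package.
    """
    from apify_client import ApifyClient

    client = ApifyClient(api_token)
    run = client.actor(actor_id).call(run_input=build_input(start_url))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `crawl("MY_APIFY_TOKEN", "parseforge/website-content-crawler", "https://docs.apify.com")` would block until the run finishes and return the crawled records as a list of dicts.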
🔌 Integrate with any app
Website Content Crawler connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications
- Airbyte - Pipe content into your warehouse
- GitHub - Trigger runs from commits
- Google Drive - Export Markdown to Docs
You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.
🔗 Recommended Actors
- 🤖 RAG Web Browser - Search or fetch URLs with LLM-ready output
- 📰 Smart Article Extractor - Extract clean article text from news sites
- 🔍 Google Search Scraper - SERP results with rank and description
- 📧 Contact Info Scraper - Emails, phones, and socials from URLs
- 📸 URL Screenshot Tool - Full-page screenshots as PNG, JPEG, or PDF
💡 Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.
📞 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.
