Website Content Crawler

Pricing

from $20.00 / 1,000 results


Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!


Developer: ParseForge Β· Maintained by Community


πŸ•ΈοΈ Website Content Crawler

πŸš€ Crawl an entire website and export clean Markdown in seconds. Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.

πŸ•’ Last updated: 2026-04-21 Β· πŸ“Š 19 fields per page Β· πŸ—ΊοΈ Sitemap auto-seed Β· πŸ€– Robots-aware Β· 🌐 HTTP + browser fallback

The Website Content Crawler walks any website from a starting URL, following internal links up to a configurable depth. It parses sitemap.xml and sitemap_index.xml to discover thousands of URLs instantly, respects robots.txt, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown plus 18 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.

Built-in include and exclude regex filters let you narrow the crawl to /docs/, skip /auth/, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.
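The include/exclude filters reduce to a simple predicate: a URL enters the queue only if it matches at least one include pattern (when any are given) and matches no exclude pattern. A minimal sketch of that logic in Python (the function name and exact matching semantics here are an illustration, not the Actor's actual source):

```python
import re

def url_allowed(url, include_patterns=(), exclude_patterns=()):
    """Hypothetical re-implementation of the include/exclude filter logic."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False  # an include list was given, but nothing matched
    if any(re.search(p, url) for p in exclude_patterns):
        return False  # an exclude pattern matched
    return True

# Blog crawl: keep /blog/ URLs, skip tag and pagination pages.
print(url_allowed("https://example.com/blog/post-1", ["/blog/"], ["/tag/", "/page/"]))  # True
print(url_allowed("https://example.com/blog/tag/ai", ["/blog/"], ["/tag/", "/page/"]))  # False
```

With no patterns at all, every discovered URL is allowed, which matches the empty-array defaults in the input table below.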

| 🎯 Target Audience | πŸ’‘ Primary Use Cases |
| --- | --- |
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |

πŸ“‹ What the Website Content Crawler does

Six crawl workflows in a single run:

  • πŸ—ΊοΈ Sitemap auto-seed. Parses sitemap.xml and index files to discover every public URL in seconds.
  • πŸ€– Robots.txt aware. Respects disallow rules for the * and apify user-agents.
  • 🌐 Browser fallback. Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
  • πŸ“ Markdown extraction. Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
  • πŸ”— Link analytics. Counts internal and outbound links per page for site-structure analysis.
  • 🚦 Include/exclude patterns. Regex filters to control which URLs enter the queue.

Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.

πŸ’‘ Why it matters: RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action.


βš™οΈ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| `startUrls` | array of URLs | required | One or more starting URLs for the crawl. |
| `maxDepth` | integer | `2` | Link hops from the start URLs (`0` = start URLs only). |
| `maxItems` | integer | `10` | Pages returned. Free plan caps at 10, paid plan at 1,000,000. |
| `sameDomain` | boolean | `true` | Stay within the starting domain. |
| `includeSubdomains` | boolean | `true` | Follow subdomains of the root host. |
| `renderingType` | string | `"http"` | `http`, `browser`, or `auto` (browser fallback when HTTP content is thin). |
| `useSitemap` | boolean | `true` | Seed the queue from `sitemap.xml`. |
| `respectRobotsTxt` | boolean | `true` | Skip URLs disallowed by `robots.txt`. |
| `includeUrlPatterns` | array of regex | `[]` | Only URLs matching any pattern are crawled. |
| `excludeUrlPatterns` | array of regex | `[]` | URLs matching any pattern are skipped. |

Example: crawl documentation with sitemap seeding.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 3,
  "maxItems": 500,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "renderingType": "auto"
}
```

Example: blog crawl with URL filters.

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 5,
  "maxItems": 200,
  "includeUrlPatterns": ["/blog/"],
  "excludeUrlPatterns": ["/tag/", "/page/"]
}
```

⚠️ Good to Know: concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only returns thin content, because browser rendering is about 3x slower per page.


πŸ“Š Output

Each record contains 19 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| πŸ”— `url` | string | `"https://docs.apify.com/platform/actors"` |
| πŸͺœ `depth` | number | `1` |
| 🏠 `parentUrl` | string \| null | `"https://docs.apify.com"` |
| 🏷️ `title` | string \| null | `"Actors"` |
| πŸ“ `description` | string \| null | `"Learn how Apify Actors package scrapers."` |
| πŸ“ƒ `markdown` | string | `"# Actors\n\nAn Actor is..."` |
| πŸ’¬ `text` | string | `"Actors An Actor is..."` |
| πŸ”’ `wordCount` | number | `860` |
| 🌍 `language` | string \| null | `"en"` |
| πŸ§‘ `author` | string \| null | `"Apify"` |
| πŸ“… `publishedTime` | ISO 8601 \| null | `"2024-08-15T00:00:00Z"` |
| 🏒 `siteName` | string \| null | `"Apify Documentation"` |
| πŸ–ΌοΈ `imageUrl` | string \| null | `"https://.../og.png"` |
| ↗️ `outboundLinks` | number | `14` |
| β†˜οΈ `internalLinks` | number | `42` |
| 🟒 `httpStatus` | number | `200` |
| ⏱️ `responseTimeMs` | number | `210` |
| πŸ•’ `crawledAt` | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ `error` | string \| null | `"Timeout"` on failure |

πŸ“¦ Sample records
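An illustrative record assembled from the example column of the schema above (values are hypothetical, not real output):

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "depth": 1,
  "parentUrl": "https://docs.apify.com",
  "title": "Actors",
  "description": "Learn how Apify Actors package scrapers.",
  "markdown": "# Actors\n\nAn Actor is...",
  "text": "Actors An Actor is...",
  "wordCount": 860,
  "language": "en",
  "author": "Apify",
  "publishedTime": "2024-08-15T00:00:00Z",
  "siteName": "Apify Documentation",
  "imageUrl": "https://example.com/og.png",
  "outboundLinks": 14,
  "internalLinks": 42,
  "httpStatus": 200,
  "responseTimeMs": 210,
  "crawledAt": "2026-04-21T12:00:00.000Z",
  "error": null
}
```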


✨ Why choose this Actor

  β€’ πŸ—ΊοΈ Sitemap auto-seeding. Discovers thousands of URLs from sitemap.xml instantly.
  β€’ πŸ€– Robots-aware. Respects disallow rules out of the box.
  β€’ 🌐 HTTP plus browser. Auto fallback to Playwright when JavaScript matters.
  β€’ πŸ“ Clean Markdown. Strips nav, footer, aside, and scripts. Preserves content structure.
  β€’ πŸ”— Link graph. Counts internal and outbound links per page for site analysis.
  β€’ ⚑ Fast. 100 pages in under a minute with HTTP concurrency of 10.
  β€’ 🚫 No credentials. Runs on any publicly accessible site.

πŸ“Š Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.


πŸ“ˆ How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Website Content Crawler (this Actor) | $5 free credit, then pay-per-use | Any public site | Live per run | Depth, patterns, sitemap, robots | ⚑ 2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | 🐒 Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | πŸ•’ Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | 🐒 Days |

Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.


πŸš€ How to use

  1. πŸ“ Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Website Content Crawler page on the Apify Store.
  3. 🎯 Set input. Pick one or more start URLs, a depth limit, and maxItems.
  4. πŸš€ Run it. Click Start and let the Actor walk the site.
  5. πŸ“₯ Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


πŸ’Ό Business use cases

🧠 AI Knowledge Bases

  • Feed product docs into a vector database
  • Sync internal wikis into a RAG index
  • Refresh chatbot context on a schedule
  • Build training corpora from public sites

πŸ“ˆ SEO & Content Audits

  • Inventory every public page on a site
  • Map internal and outbound link structure
  • Detect orphan and 404 pages
  • Compare competitor content footprints

πŸ“š Documentation Mirroring

  • Archive documentation for offline use
  • Snapshot support portals for compliance
  • Monitor API reference changes over time
  • Build plain-Markdown docs archives

πŸ§‘β€πŸ”¬ Research Corpora

  • Extract text datasets from academic sites
  • Gather news archives by domain
  • Build language modeling corpora
  • Snapshot regulatory content for analysis

πŸ”Œ Automating Website Content Crawler

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟒 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • πŸ“š See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.



πŸ”Œ Integrate with any app

Website Content Crawler connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications
  • Airbyte - Pipe content into your warehouse
  • GitHub - Trigger runs from commits
  • Google Drive - Export Markdown to Docs

You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.
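Before a crawled page lands in a vector database, its Markdown is usually split into chunks. A minimal sketch of heading-based chunking on the Actor's `markdown` and `url` fields (the chunking strategy and payload shape are assumptions for illustration, not part of the Actor's output):

```python
def markdown_to_chunks(record):
    """Split a crawled record's Markdown into per-heading chunks for embedding."""
    chunks, current = [], []
    for line in record["markdown"].splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Attach the source URL so downstream answers can cite their origin.
    return [{"text": c, "source": record["url"]} for c in chunks if c]

record = {"url": "https://example.com/docs", "markdown": "# Intro\n\nHello.\n\n# Usage\n\nRun it."}
for chunk in markdown_to_chunks(record):
    print(chunk["source"], "->", chunk["text"].splitlines()[0])
```

Each chunk can then be embedded and upserted into whatever vector store your RAG pipeline uses.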


πŸ’‘ Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.


πŸ†˜ Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.