Website Content Crawler

Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!

Pricing: from $20.00 / 1,000 results
Rating: 0.0 (0)
Developer: ParseForge (Maintained by Community)
Actor stats: 0 bookmarked Β· 4 total users Β· 1 monthly active user Β· last modified 4 days ago

πŸ•ΈοΈ Website Content Crawler

πŸš€ Crawl an entire website and export clean Markdown in seconds. Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.

πŸ•’ Last updated: 2026-04-21 Β· πŸ“Š 18 fields per page Β· πŸ—ΊοΈ Sitemap auto-seed Β· πŸ€– Robots-aware Β· 🌐 HTTP + browser fallback

The Website Content Crawler walks any website from a starting URL, following internal links up to a configurable depth. It parses sitemap.xml and sitemap_index.xml to discover thousands of URLs instantly, respects robots.txt, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown plus 17 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.

Built-in include and exclude regex filters let you narrow the crawl to /docs/, skip /auth/, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.
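The link-following behavior described above (a depth limit plus a same-domain check) can be pictured as a plain breadth-first walk. This is an illustrative model only, not the Actor's actual implementation; the `links` map stands in for real fetching and link extraction:

```python
from collections import deque
from urllib.parse import urlparse

def crawl_order(start_url, links, max_depth=2, same_domain=True):
    """Breadth-first walk of a link graph honoring maxDepth and sameDomain.

    `links` maps each URL to the URLs it links to: a toy stand-in for
    real fetching and link extraction.
    """
    root_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # never enqueue links past the depth limit
        for nxt in links.get(url, []):
            if nxt in seen:
                continue  # each page is crawled once
            if same_domain and urlparse(nxt).netloc != root_host:
                continue  # sameDomain=true keeps the crawl on the start host
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order
```

With `max_depth=1`, only the start URL and the pages it links to directly are visited, matching the `maxDepth` semantics described in the input section.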

| 🎯 Target Audience | πŸ’‘ Primary Use Cases |
| --- | --- |
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |

πŸ“‹ What the Website Content Crawler does

Six crawl workflows in a single run:

  • πŸ—ΊοΈ Sitemap auto-seed. Parses sitemap.xml and index files to discover every public URL in seconds.
  • πŸ€– Robots.txt aware. Respects disallow rules for the `*` and `apify` user-agents.
  • 🌐 Browser fallback. Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
  • πŸ“ Markdown extraction. Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
  • πŸ”— Link analytics. Counts internal and outbound links per page for site-structure analysis.
  • 🚦 Include/exclude patterns. Regex filters to control which URLs enter the queue.
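The robots.txt behavior above can be illustrated with Python's standard-library parser. This is a sketch of the rule check only (the robots.txt body here is hypothetical; the Actor fetches and caches the real file per host):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /auth/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, agent: str = "*") -> bool:
    """True when the parsed rules permit `agent` to crawl `url`."""
    return parser.can_fetch(agent, url)
```

A URL under `/auth/` is skipped while a `/docs/` page passes, mirroring the disallow handling described above.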

Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.

πŸ’‘ Why it matters: RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action.


βš™οΈ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| `startUrls` | array of URLs | required | One or more starting URLs for the crawl. |
| `maxDepth` | integer | `2` | Link hops from the start URLs (`0` = start URLs only). |
| `maxItems` | integer | `10` | Pages returned. Free plan caps at 10, paid plan at 1,000,000. |
| `sameDomain` | boolean | `true` | Stay within the starting domain. |
| `includeSubdomains` | boolean | `true` | Follow subdomains of the root host. |
| `renderingType` | string | `"http"` | `http`, `browser`, or `auto` (browser fallback when HTTP content is thin). |
| `useSitemap` | boolean | `true` | Seed queue from sitemap.xml. |
| `respectRobotsTxt` | boolean | `true` | Skip URLs disallowed by robots.txt. |
| `includeUrlPatterns` | array of regex | `[]` | Only URLs matching any pattern are crawled. |
| `excludeUrlPatterns` | array of regex | `[]` | URLs matching any pattern are skipped. |

Example: crawl documentation with sitemap seeding.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 3,
  "maxItems": 500,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "renderingType": "auto"
}
```

Example: blog crawl with URL filters.

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 5,
  "maxItems": 200,
  "includeUrlPatterns": ["/blog/"],
  "excludeUrlPatterns": ["/tag/", "/page/"]
}
```
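The two pattern lists are regular expressions matched against each discovered URL. A minimal sketch of the assumed semantics (a URL must match at least one include pattern when the list is non-empty, and no exclude pattern):

```python
import re

def url_passes(url, include_patterns=(), exclude_patterns=()):
    """Apply include/exclude regex filters as described by the input options.

    Assumed semantics based on the option descriptions above, not the
    Actor's internals.
    """
    # With include patterns present, at least one must match.
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    # Any matching exclude pattern drops the URL.
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    return True
```

With the blog example above, `https://example.com/blog/post-1` enters the queue while `https://example.com/blog/tag/ai` is skipped.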

⚠️ Good to Know: concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only fetching returns thin content, because browser rendering is roughly 3x slower per page.


πŸ“Š Output

Each record contains 18 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| πŸ”— `url` | string | `"https://docs.apify.com/platform/actors"` |
| πŸͺœ `depth` | number | `1` |
| 🏠 `parentUrl` | string \| null | `"https://docs.apify.com"` |
| 🏷️ `title` | string \| null | `"Actors"` |
| πŸ“ `description` | string \| null | `"Learn how Apify Actors package scrapers."` |
| πŸ“ƒ `markdown` | string | `"# Actors\n\nAn Actor is..."` |
| πŸ’¬ `text` | string | `"Actors An Actor is..."` |
| πŸ”’ `wordCount` | number | `860` |
| 🌍 `language` | string \| null | `"en"` |
| πŸ§‘ `author` | string \| null | `"Apify"` |
| πŸ“… `publishedTime` | ISO 8601 \| null | `"2024-08-15T00:00:00Z"` |
| 🏒 `siteName` | string \| null | `"Apify Documentation"` |
| πŸ–ΌοΈ `imageUrl` | string \| null | `"https://.../og.png"` |
| ↗️ `outboundLinks` | number | `14` |
| β†˜οΈ `internalLinks` | number | `42` |
| 🟒 `httpStatus` | number | `200` |
| ⏱️ `responseTimeMs` | number | `210` |
| πŸ•’ `crawledAt` | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ `error` | string \| null | `"Timeout"` on failure |

πŸ“¦ Sample records
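A hypothetical record assembled from the example values in the schema above would look like:

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "depth": 1,
  "parentUrl": "https://docs.apify.com",
  "title": "Actors",
  "description": "Learn how Apify Actors package scrapers.",
  "markdown": "# Actors\n\nAn Actor is...",
  "text": "Actors An Actor is...",
  "wordCount": 860,
  "language": "en",
  "author": "Apify",
  "publishedTime": "2024-08-15T00:00:00Z",
  "siteName": "Apify Documentation",
  "imageUrl": "https://.../og.png",
  "outboundLinks": 14,
  "internalLinks": 42,
  "httpStatus": 200,
  "responseTimeMs": 210,
  "crawledAt": "2026-04-21T12:00:00.000Z",
  "error": null
}
```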


✨ Why choose this Actor

  • πŸ—ΊοΈ Sitemap auto-seeding. Discovers thousands of URLs from sitemap.xml instantly.
  • πŸ€– Robots-aware. Respects disallow rules out of the box.
  • 🌐 HTTP plus browser. Auto fallback to Playwright when JavaScript matters.
  • πŸ“ Clean Markdown. Strips nav, footer, aside, and scripts. Preserves content structure.
  • πŸ”— Link graph. Counts internal and outbound links per page for site analysis.
  • ⚑ Fast. 100 pages in under a minute with HTTP concurrency of 10.
  • 🚫 No credentials. Runs on any publicly accessible site.

πŸ“Š Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.


πŸ“ˆ How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Website Content Crawler (this Actor) | $5 free credit, then pay-per-use | Any public site | Live per run | Depth, patterns, sitemap, robots | ⚑ 2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | 🐒 Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | πŸ•’ Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | 🐒 Days |

Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.


πŸš€ How to use

  1. πŸ“ Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Website Content Crawler page on the Apify Store.
  3. 🎯 Set input. Pick one or more start URLs, a depth limit, and maxItems.
  4. πŸš€ Run it. Click Start and let the Actor walk the site.
  5. πŸ“₯ Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


πŸ’Ό Business use cases

🧠 AI Knowledge Bases

  • Feed product docs into a vector database
  • Sync internal wikis into a RAG index
  • Refresh chatbot context on a schedule
  • Build training corpora from public sites

πŸ“ˆ SEO & Content Audits

  • Inventory every public page on a site
  • Map internal and outbound link structure
  • Detect orphan and 404 pages
  • Compare competitor content footprints

πŸ“š Documentation Mirroring

  • Archive documentation for offline use
  • Snapshot support portals for compliance
  • Monitor API reference changes over time
  • Build plain-Markdown docs archives

πŸ§‘β€πŸ”¬ Research Corpora

  • Extract text datasets from academic sites
  • Gather news archives by domain
  • Build language modeling corpora
  • Snapshot regulatory content for analysis


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

πŸŽ“ Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🀝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

πŸ§ͺ Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input


❓ Frequently Asked Questions

πŸ’³ Do I need a paid Apify plan to run this actor?

No. You can start right now on the free Apify plan, which includes $5 in free monthly credit. That is enough to run this actor several times and explore the output before committing to anything. Paid plans unlock higher limits, more concurrent runs, and larger datasets. Create a free Apify account here to get started.

🚨 What happens if my run fails or returns no results?

Failed runs are not charged. If the source site changes, proxies get rate-limited, or a specific input matches nothing, re-run the actor or open our contact form and we will investigate. You can also check the run log in the Apify console to see why the run stopped.

πŸ“ How many items can I scrape per run?

Free users are limited to 10 items per run so you can preview the output and confirm the actor works for your use case. Paid users can raise maxItems up to 1,000,000 per run. Upgrade here if you need full scale.

πŸ•’ How fresh is the data?

Every run fetches live data at the moment of execution. There is no cache or delay: the records you get reflect what the source returned at that moment. Schedule the actor to maintain a rolling snapshot of the data you need.

πŸ§‘β€πŸ’» Can I call this actor from my own code?

Yes. Apify exposes every actor as a REST endpoint and ships first-class SDKs for Node.js and Python. You can start a run, read the dataset, and handle webhooks from your own app in a few lines. All you need is your Apify API token.

πŸ“€ How do I export the data?

Every Apify dataset can be downloaded in one click from the console as CSV, JSON, JSONL, Excel, HTML, XML, or RSS. You can also pull results programmatically via the Apify API or stream them into BigQuery, S3, and other destinations through built-in integrations.

πŸ“… Can I schedule the actor to run automatically?

Yes. Use the Apify scheduler to run the actor on any cadence, from hourly to monthly. Results are saved to your dataset and can be delivered to webhooks, email, Slack, cloud storage, or automation tools such as Zapier and Make.


πŸ”Œ Automating Website Content Crawler

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟒 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • πŸ“š See the Apify API documentation for full details.
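As a stdlib-only sketch of what those clients do under the hood: Apify exposes each Actor as a REST endpoint you can POST run input to. The actor ID `parseforge~website-content-crawler` and the token below are placeholder assumptions; in practice the official `apify-client` packages wrap these endpoints for you.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_endpoint(actor_id: str, token: str) -> str:
    """Build the run-start URL for an Actor (ID format: user~actor-name)."""
    return f"{API_BASE}/acts/{actor_id}/runs?token={token}"

def start_run(actor_id: str, token: str, run_input: dict) -> dict:
    """POST the run input to start an Actor run and return the run object."""
    req = urllib.request.Request(
        run_endpoint(actor_id, token),
        data=json.dumps(run_input).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call such as `start_run("parseforge~website-content-crawler", token, {"startUrls": [{"url": "https://docs.apify.com"}]})` would start a run; the returned run object includes the dataset ID to read results from.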

The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.

πŸ”Œ Integrate with any app

Website Content Crawler connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications
  • Airbyte - Pipe content into your warehouse
  • GitHub - Trigger runs from commits
  • Google Drive - Export Markdown to Docs

You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.


πŸ’‘ Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.


πŸ†˜ Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.