Website Content Crawler

Pricing

from $20.00 / 1,000 results


Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!


Developer: ParseForge Β· Maintained by Community


πŸ•ΈοΈ Website Content Crawler

πŸš€ Crawl an entire website and export clean Markdown in seconds. Seed from sitemaps, respect robots.txt, and fall back to a real browser for JavaScript-heavy pages. No API key, no registration, no manual pipeline code.

πŸ•’ Last updated: 2026-04-21 Β· πŸ“Š 19 fields per page Β· πŸ—ΊοΈ Sitemap auto-seed Β· πŸ€– Robots-aware Β· 🌐 HTTP + browser fallback

The Website Content Crawler walks any website from a starting URL, following internal links up to a configurable depth. It parses sitemap.xml and sitemap_index.xml to discover thousands of URLs instantly, respects robots.txt, and can switch to a headless browser when HTTP-only fetching returns thin content. Every crawled page comes back as clean Markdown plus 18 metadata fields, ready for RAG pipelines, knowledge bases, and content audits.

Built-in include and exclude regex filters let you narrow the crawl to /docs/, skip /auth/, or ignore query-heavy URLs. Concurrency defaults to 10 parallel fetches, so a 100-page crawl typically finishes in about a minute. The output uses a consistent schema across HTTP and browser modes, so downstream consumers never have to know which fetch strategy was used.
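The include/exclude filters reduce to a simple predicate: a URL enters the queue only if it matches at least one include pattern (when any are given) and matches no exclude pattern. A minimal sketch of that logic in Python (the function name and exact matching semantics here are an illustration, not the Actor's actual source):

```python
import re

def url_allowed(url, include_patterns=(), exclude_patterns=()):
    """Hypothetical re-implementation of the include/exclude filter logic."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False  # an include list was given, but nothing matched
    if any(re.search(p, url) for p in exclude_patterns):
        return False  # an exclude pattern matched
    return True

# Blog crawl: keep /blog/ URLs, skip tag and pagination pages.
print(url_allowed("https://example.com/blog/post-1", ["/blog/"], ["/tag/", "/page/"]))  # True
print(url_allowed("https://example.com/blog/tag/ai", ["/blog/"], ["/tag/", "/page/"]))  # False
```

With no patterns at all, every discovered URL is allowed, which matches the empty-array defaults in the input table below.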

| 🎯 Target Audience | πŸ’‘ Primary Use Cases |
| --- | --- |
| AI app teams, knowledge engineers, SEO specialists, documentation writers, research scientists, content archivists | RAG knowledge bases, docs mirroring, SEO audits, competitor content analysis, research corpus assembly |

πŸ“‹ What the Website Content Crawler does

Six crawl workflows in a single run:

  • πŸ—ΊοΈ Sitemap auto-seed. Parses sitemap.xml and index files to discover every public URL in seconds.
  • πŸ€– Robots.txt aware. Respects disallow rules for the * and apify user-agents.
  • 🌐 Browser fallback. Uses Playwright when a page returns thin content, handling JavaScript-heavy sites automatically.
  • πŸ“ Markdown extraction. Clean headings, paragraphs, lists, blockquotes, and code blocks. Navigation and footers stripped.
  • πŸ”— Link analytics. Counts internal and outbound links per page for site-structure analysis.
  • 🚦 Include/exclude patterns. Regex filters to control which URLs enter the queue.

Every page ships with title, description, language, author, publishedTime, siteName, og:image, link counts, HTTP status, response time, depth, parent URL, and a timestamp.

πŸ’‘ Why it matters: RAG pipelines, SEO audits, and knowledge bases all start with a clean crawl. Doing it yourself means writing link discovery, sitemap parsers, robots.txt logic, and a Markdown cleaner. This Actor ships all of that pre-packaged.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing sitemap seeding and browser fallback in action.


βš™οΈ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| `startUrls` | array of URLs | required | One or more starting URLs for the crawl. |
| `maxDepth` | integer | `2` | Link hops from the start URLs (`0` = start URLs only). |
| `maxItems` | integer | `10` | Pages returned. Free plan caps at 10, paid plan at 1,000,000. |
| `sameDomain` | boolean | `true` | Stay within the starting domain. |
| `includeSubdomains` | boolean | `true` | Follow subdomains of the root host. |
| `renderingType` | string | `"http"` | `http`, `browser`, or `auto` (browser fallback when HTTP content is thin). |
| `useSitemap` | boolean | `true` | Seed the queue from `sitemap.xml`. |
| `respectRobotsTxt` | boolean | `true` | Skip URLs disallowed by `robots.txt`. |
| `includeUrlPatterns` | array of regex | `[]` | Only URLs matching any pattern are crawled. |
| `excludeUrlPatterns` | array of regex | `[]` | URLs matching any pattern are skipped. |

Example: crawl documentation with sitemap seeding.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 3,
  "maxItems": 500,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "renderingType": "auto"
}
```

Example: blog crawl with URL filters.

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 5,
  "maxItems": 200,
  "includeUrlPatterns": ["/blog/"],
  "excludeUrlPatterns": ["/tag/", "/page/"]
}
```

⚠️ Good to Know: concurrency is capped at 10 parallel fetches to stay polite. Use browser mode only when HTTP-only returns thin content, because browser rendering is about 3x slower per page.


πŸ“Š Output

Each record contains 19 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| πŸ”— `url` | string | `"https://docs.apify.com/platform/actors"` |
| πŸͺœ `depth` | number | `1` |
| 🏠 `parentUrl` | string \| null | `"https://docs.apify.com"` |
| 🏷️ `title` | string \| null | `"Actors"` |
| πŸ“ `description` | string \| null | `"Learn how Apify Actors package scrapers."` |
| πŸ“ƒ `markdown` | string | `"# Actors\n\nAn Actor is..."` |
| πŸ’¬ `text` | string | `"Actors An Actor is..."` |
| πŸ”’ `wordCount` | number | `860` |
| 🌍 `language` | string \| null | `"en"` |
| πŸ§‘ `author` | string \| null | `"Apify"` |
| πŸ“… `publishedTime` | ISO 8601 \| null | `"2024-08-15T00:00:00Z"` |
| 🏒 `siteName` | string \| null | `"Apify Documentation"` |
| πŸ–ΌοΈ `imageUrl` | string \| null | `"https://.../og.png"` |
| ↗️ `outboundLinks` | number | `14` |
| β†˜οΈ `internalLinks` | number | `42` |
| 🟒 `httpStatus` | number | `200` |
| ⏱️ `responseTimeMs` | number | `210` |
| πŸ•’ `crawledAt` | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ `error` | string \| null | `"Timeout"` on failure |

πŸ“¦ Sample records
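An illustrative record assembled from the example column of the schema above (values are hypothetical, not real output):

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "depth": 1,
  "parentUrl": "https://docs.apify.com",
  "title": "Actors",
  "description": "Learn how Apify Actors package scrapers.",
  "markdown": "# Actors\n\nAn Actor is...",
  "text": "Actors An Actor is...",
  "wordCount": 860,
  "language": "en",
  "author": "Apify",
  "publishedTime": "2024-08-15T00:00:00Z",
  "siteName": "Apify Documentation",
  "imageUrl": "https://example.com/og.png",
  "outboundLinks": 14,
  "internalLinks": 42,
  "httpStatus": 200,
  "responseTimeMs": 210,
  "crawledAt": "2026-04-21T12:00:00.000Z",
  "error": null
}
```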


✨ Why choose this Actor

  β€’ πŸ—ΊοΈ Sitemap auto-seeding. Discovers thousands of URLs from sitemap.xml instantly.
  β€’ πŸ€– Robots-aware. Respects disallow rules out of the box.
  β€’ 🌐 HTTP plus browser. Auto fallback to Playwright when JavaScript matters.
  β€’ πŸ“ Clean Markdown. Strips nav, footer, aside, and scripts. Preserves content structure.
  β€’ πŸ”— Link graph. Counts internal and outbound links per page for site analysis.
  β€’ ⚑ Fast. 100 pages in under a minute with HTTP concurrency of 10.
  β€’ 🚫 No credentials. Runs on any publicly accessible site.

πŸ“Š Clean crawling is the difference between a RAG pipeline that answers correctly and one that returns garbled navigation text. This Actor does the cleaning for you.


πŸ“ˆ How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Website Content Crawler (this Actor) | $5 free credit, then pay-per-use | Any public site | Live per run | Depth, patterns, sitemap, robots | ⚑ 2 min |
| Generic open-source spiders | Free | Raw HTML | Your schedule | Manual coding | 🐒 Days |
| Cloud crawler platforms | $$$+/month | Full enterprise | Managed | Visual rules | πŸ•’ Hours |
| DIY Playwright scripts | Free | Your code | Your maintenance | Whatever you build | 🐒 Days |

Pick this Actor when you want a clean, RAG-ready crawl with sitemap discovery and zero infrastructure.


πŸš€ How to use

  1. πŸ“ Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Website Content Crawler page on the Apify Store.
  3. 🎯 Set input. Pick one or more start URLs, a depth limit, and maxItems.
  4. πŸš€ Run it. Click Start and let the Actor walk the site.
  5. πŸ“₯ Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


πŸ’Ό Business use cases

🧠 AI Knowledge Bases

  • Feed product docs into a vector database
  • Sync internal wikis into a RAG index
  • Refresh chatbot context on a schedule
  • Build training corpora from public sites

πŸ“ˆ SEO & Content Audits

  • Inventory every public page on a site
  • Map internal and outbound link structure
  • Detect orphan and 404 pages
  • Compare competitor content footprints

πŸ“š Documentation Mirroring

  • Archive documentation for offline use
  • Snapshot support portals for compliance
  • Monitor API reference changes over time
  • Build plain-Markdown docs archives

πŸ§‘β€πŸ”¬ Research Corpora

  • Extract text datasets from academic sites
  • Gather news archives by domain
  • Build language modeling corpora
  • Snapshot regulatory content for analysis

πŸ”Œ Automating Website Content Crawler

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟒 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • πŸ“š See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases aligned with the source site.



πŸ”Œ Integrate with any app

Website Content Crawler connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications
  • Airbyte - Pipe content into your warehouse
  • GitHub - Trigger runs from commits
  • Google Drive - Export Markdown to Docs

You can also use webhooks to push freshly crawled content into vector databases and RAG pipelines.
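Before a crawled page lands in a vector database, its Markdown is usually split into chunks. A minimal sketch of heading-based chunking on the Actor's `markdown` and `url` fields (the chunking strategy and payload shape are assumptions for illustration, not part of the Actor's output):

```python
def markdown_to_chunks(record):
    """Split a crawled record's Markdown into per-heading chunks for embedding."""
    chunks, current = [], []
    for line in record["markdown"].splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Attach the source URL so downstream answers can cite their origin.
    return [{"text": c, "source": record["url"]} for c in chunks if c]

record = {"url": "https://example.com/docs", "markdown": "# Intro\n\nHello.\n\n# Usage\n\nRun it."}
for chunk in markdown_to_chunks(record):
    print(chunk["source"], "->", chunk["text"].splitlines()[0])
```

Each chunk can then be embedded and upserted into whatever vector store your RAG pipeline uses.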


πŸ’‘ Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.


πŸ†˜ Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any website or crawler framework. Only publicly accessible pages are crawled. Robots.txt rules are respected by default. Always honor the terms of service of the sites you crawl.