Website Content Crawler for RAG
Pricing
from $0.01 / result
Crawl documentation sites, help centers, blogs, and websites, then extract clean markdown, text, or HTML for RAG pipelines, vector databases, and LLM applications.
Developer: yun qing
Last modified: 10 days ago
Built for:
- AI engineers
- RAG developers
- Knowledge base teams
- Developer tooling teams
Why use this Actor?
- Crawl from start URLs or sitemap URLs
- Keep the crawl inside your target scope
- Filter out PDFs and non-HTML files
- Store clean HTML separately for downstream processing
- Export markdown, text, or HTML depending on your ingestion workflow
Typical use cases
- Crawl product documentation into a vector database
- Ingest help center content into an internal knowledge base
- Extract clean website content for LLM applications
- Capture docs and blog content for search or analysis
What makes it useful for content ingestion
- sitemap mode for docs and help center sites
- scope control to avoid crawling unrelated pages
- PDF and file filtering to keep the output focused
- clean HTML storage for downstream parsing and chunking
- markdown, text, and HTML outputs for different pipelines
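The scope control mentioned above can be sketched as a same-origin, path-prefix check. This is a hypothetical helper (`inScope` is not the Actor's actual implementation), shown only to illustrate the idea of keeping a crawl inside its target scope:

```typescript
// Illustrative scope check: keep a candidate link only if it shares the
// start URL's origin and stays under the start URL's directory prefix.
function inScope(candidate: string, startUrl: string): boolean {
  const start = new URL(startUrl);
  const link = new URL(candidate, start); // resolves relative links too
  if (link.origin !== start.origin) return false;
  // Treat the start URL's directory as the crawl scope.
  const prefix = start.pathname.endsWith("/")
    ? start.pathname
    : start.pathname.slice(0, start.pathname.lastIndexOf("/") + 1);
  return link.pathname.startsWith(prefix);
}
```

A real crawler would typically combine a check like this with glob or regex include/exclude patterns.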
Recommended first run
If this is your first run, start with:
- 1 start URL or 1 sitemap URL
- contentFormat: markdown
- a conservative maxDepth
- file filtering enabled
Good first-run targets:
- a product docs site
- a help center
- a blog section
Example workflows
1. Docs site to RAG
Use the Actor to crawl a documentation site, then send the markdown or clean HTML output into your chunking and embedding pipeline.
Best for:
- internal developer docs
- product documentation
- public API docs
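The "send markdown into your chunking pipeline" step above can be sketched with a minimal, purely illustrative paragraph-based splitter (the `maxChars` budget is an assumption; production pipelines usually chunk by headings or tokens instead):

```typescript
// Split markdown into chunks of at most maxChars characters, breaking on
// blank lines so paragraphs stay intact. Illustrative only.
function chunkMarkdown(markdown: string, maxChars: number): string[] {
  const paragraphs = markdown.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const next = current ? current + "\n\n" + p : p;
    if (next.length > maxChars && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = p;
    } else {
      current = next;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each resulting chunk would then be embedded and upserted into the vector database of your choice.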
2. Help center to knowledge base
Crawl support articles from a help center and export them as clean text or markdown for:
- internal search
- support copilots
- FAQ assistants
3. Website content extraction for LLM apps
Collect structured content from blogs, docs, and product pages to build:
- retrieval systems
- internal knowledge tools
- content analysis workflows
Typical input
{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlMode": "website",
  "contentFormat": "markdown",
  "maxDepth": 2,
  "excludeFileExtensions": [".pdf", ".zip", ".doc", ".docx", ".ppt", ".pptx"]
}
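If you assemble this input programmatically, a small sketch like the following can help. The field names come from the example above; the `buildInput` helper and its defaults are illustrative assumptions, not part of the Actor:

```typescript
// Assumed input shape, mirroring the example input above.
interface CrawlerInput {
  startUrls: { url: string }[];
  crawlMode: "website" | "sitemap";
  contentFormat: "markdown" | "text" | "html";
  maxDepth: number;
  excludeFileExtensions: string[];
}

// Hypothetical helper: build a conservative first-run input, with overrides
// for anything the caller wants to change.
function buildInput(url: string, overrides: Partial<CrawlerInput> = {}): CrawlerInput {
  return {
    startUrls: [{ url }],
    crawlMode: "website",
    contentFormat: "markdown",
    maxDepth: 2,
    excludeFileExtensions: [".pdf", ".zip", ".doc", ".docx", ".ppt", ".pptx"],
    ...overrides,
  };
}
```

The resulting object can be passed as the run input when starting the Actor via the Apify API or client libraries.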
Local development
pnpm actor:dev websiteContentCrawler --example 0 --force-input
pnpm actor:dev websiteContentCrawler --example 2 --force-input
Notes:
- input-examples.json is used by local actor:dev
- Apify platform automated testing uses the prefill values from .actor/input_schema.json
- The schema uses a public default URL so automated testing can pass without relying on localhost
Build
pnpm actor:build websiteContentCrawler
Publish
pnpm actor:push websiteContentCrawler
pnpm actor:push websiteContentCrawler --dry-run
pnpm actor:push websiteContentCrawler --sync-meta --prefer-local-meta
Dataset Output
Each dataset item includes:
- url
- title
- description
- content
- contentFormat
- cleanHtml
- markdown
- text
- html
- wordCount
- language
- canonicalUrl
- depth
- httpStatusCode
- crawledAt
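Downstream consumers can model this item shape in TypeScript. The field names are taken from the list above, but the types are assumptions inferred from the names, and the `isDatasetItem` guard is a hypothetical convenience that only spot-checks a few fields:

```typescript
// Assumed types for the dataset item; optional fields depend on contentFormat.
interface DatasetItem {
  url: string;
  title: string;
  description: string;
  content: string;
  contentFormat: "markdown" | "text" | "html";
  cleanHtml?: string;
  markdown?: string;
  text?: string;
  html?: string;
  wordCount: number;
  language: string;
  canonicalUrl: string;
  depth: number;
  httpStatusCode: number;
  crawledAt: string;
}

// Minimal runtime guard; checks only the fields most pipelines rely on.
function isDatasetItem(value: unknown): value is DatasetItem {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.url === "string" &&
    typeof v.title === "string" &&
    typeof v.content === "string"
  );
}
```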
Crawl Modes
- website: start from startUrls, then follow links recursively
- sitemap: load URLs from sitemapUrls, or fall back to origin + /sitemap.xml
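The sitemap fallback described above (origin + /sitemap.xml) can be sketched as follows; `fallbackSitemapUrl` is a hypothetical helper illustrating the derivation, not the Actor's code:

```typescript
// Given any start URL, derive the fallback sitemap location at the site origin.
function fallbackSitemapUrl(startUrl: string): string {
  const origin = new URL(startUrl).origin; // e.g. https://docs.apify.com
  return new URL("/sitemap.xml", origin).toString();
}
```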
Separate Clean HTML Storage
- CLEAN_HTML_INDEX stores the mapping between page URL and KVS record key
- Individual cleaned HTML records are stored as CLEAN_HTML_000001, CLEAN_HTML_000002, and so on
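The record keys imply a zero-padded counter; a small sketch of how such keys could be generated (the six-digit padding is inferred from the example keys above):

```typescript
// Build the KVS record key for the nth cleaned HTML page,
// e.g. 1 -> "CLEAN_HTML_000001".
function cleanHtmlKey(n: number): string {
  return `CLEAN_HTML_${String(n).padStart(6, "0")}`;
}
```

Reading a page's clean HTML then means looking up its key in CLEAN_HTML_INDEX and fetching that record from the key-value store.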