Website Content Extractor for RAG: Markdown, HTML, Text
Pricing
from $0.01 / result
Turn docs sites, help centers, blogs, and websites into clean markdown, text, or HTML for RAG, AI knowledge bases, and internal search. Crawl from start URLs or sitemaps and keep the crawl in scope.
- Rating: 5.0 (1)
- Developer: nezha
- Bookmarked: 1
- Total users: 17
- Monthly active users: 4
- Last modified: 3 days ago
What this Actor does
Most teams do not need "a crawler." They need a fast way to turn a website into usable content for:
- embeddings and chunking pipelines
- internal search and AI assistants
- help center or docs ingestion
- markdown, text, or HTML exports without manual copy-paste
This Actor turns website pages into a structured content dataset with cleaned page text, markdown, HTML, headings, crawl metadata, and optional clean HTML records in the key-value store.
Quick start
- Paste a docs site, help center, or website URL into Website or Docs URLs.
- Keep `crawlMode: sitemap`, `maxPages: 3`, and `outputFormat: markdown` for the first run.
- Click Run.
- Download the dataset or use the API output directly.
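If you prefer to prepare the input outside the console, the same first-run settings map onto a plain JSON input object. A minimal sketch in Python; the field names follow the quick-start settings above, the start URL is a placeholder for your own site, and `startUrls` uses the standard Apify start-URL shape:

```python
import json

# First-run input mirroring the Quick start settings above.
# The URL is a placeholder; use your own docs or help-center URL.
run_input = {
    "startUrls": [{"url": "https://docs.example.com"}],
    "crawlMode": "sitemap",
    "maxPages": 3,
    "outputFormat": "markdown",
}

# Paste this into the Actor's JSON input editor, or send it via the API.
print(json.dumps(run_input, indent=2))
```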
Use cases
Docs site to RAG
Crawl developer docs, product docs, or API docs, then export markdown or clean HTML ready for chunking, embeddings, and retrieval.
Help center to AI support
Extract support articles as clean text or markdown for internal search, support copilots, and FAQ assistants.
Website to knowledge base
Capture blog posts, product pages, and guide content as structured text with titles, headings, canonical URLs, and crawl metadata.
Output preview
Here is a simplified preview of the extracted dataset:
| URL | Title | Format | Words | Language | Depth |
|---|---|---|---|---|---|
| /academy/web-scraping-for-beginners | Web scraping for beginners | markdown | 1842 | en | 1 |
| /academy/api-integration-guide | API integration guide | markdown | 1267 | en | 1 |
| /academy/rag-pipeline-basics | RAG pipeline basics | markdown | 2135 | en | 1 |
The same record can also include:
| Extra field group | Example value |
|---|---|
| Content outputs | content, markdown, text, html |
| Structure signals | title, description, headings, canonicalUrl |
| Crawl metadata | depth, httpStatusCode, language, wordCount, crawledAt |
| Clean HTML storage | CLEAN_HTML_INDEX plus separate clean HTML records |
| Run diagnostics | OUTPUT_SUMMARY, FAILED_PAGES, SKIPPED_PAGES |
Typical fields include:
- page identity: `url`, `title`, `description`, `canonicalUrl`
- main content outputs: `content`, `markdown`, `text`, `html`, `cleanHtml`
- page structure: `headings`
- crawl metadata: `contentFormat`, `wordCount`, `language`, `depth`, `httpStatusCode`, `crawledAt`
- run-level outputs: `OUTPUT_SUMMARY`, `FAILED_PAGES`, `SKIPPED_PAGES`, `CLEAN_HTML_INDEX`
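With those fields in hand, a typical next step is chunking the `markdown` output for embeddings. A minimal sketch, assuming records shaped like the field list above; heading-based splitting is just one reasonable strategy:

```python
import re

def chunk_markdown(record: dict, max_words: int = 300) -> list[dict]:
    """Split a record's markdown field into heading-bounded chunks,
    carrying url and title along as retrieval metadata."""
    # Split at the start of each markdown heading line (zero-width split).
    sections = re.split(r"(?m)^(?=#{1,6} )", record.get("markdown", ""))
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Window long sections so no chunk exceeds max_words.
        for i in range(0, len(words), max_words):
            chunks.append({
                "url": record["url"],
                "title": record["title"],
                "text": " ".join(words[i:i + max_words]),
            })
    return chunks

# Hypothetical record trimmed to the fields the chunker uses.
record = {
    "url": "/academy/rag-pipeline-basics",
    "title": "RAG pipeline basics",
    "markdown": "# Intro\nSome text.\n## Details\nMore text here.",
}
chunks = chunk_markdown(record)
```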
Examples
Option 1: Crawl directly from website pages
Best when you want to start from one section and follow links recursively.
```json
{
  "startUrls": [{ "url": "https://docs.apify.com/academy" }],
  "crawlMode": "website",
  "outputFormat": "markdown",
  "maxPages": 20,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "saveCleanHtml": true
}
```
Option 2: Crawl from sitemap URLs
Best when the target site already has a sitemap and you want broader coverage with cleaner URL discovery.
```json
{
  "startUrls": [{ "url": "https://docs.apify.com/academy" }],
  "crawlMode": "sitemap",
  "sitemapUrls": ["https://docs.apify.com/sitemap.xml"],
  "maxPages": 50,
  "maxDepth": 0,
  "outputFormat": "markdown",
  "sameDomainOnly": true,
  "saveCleanHtml": true
}
```
Best practices
This Actor does more than return a list of URLs.
- You get the main content in markdown, text, and HTML.
- You get structure signals such as titles, headings, descriptions, and canonical URLs.
- You get crawl metadata such as word count, depth, language, status code, and crawl time.
- You can store clean HTML separately for downstream parsing or chunking.
- You also get run diagnostics for failed pages, skipped pages, and summary totals.
That combination makes the output useful not just for scraping, but for ingestion, QA, chunking, embeddings, search, and AI application pipelines.
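As one concrete way to use that crawl metadata for ingestion QA, the sketch below partitions records before embedding. The field names (`httpStatusCode`, `wordCount`) follow the output description above; the word-count cut-off is illustrative:

```python
def select_ingestible(records: list[dict], min_words: int = 50):
    """Partition crawled records into those worth embedding (fetched OK,
    non-trivial length) and those to skip or review."""
    kept, skipped = [], []
    for rec in records:
        ok = rec.get("httpStatusCode") == 200 and rec.get("wordCount", 0) >= min_words
        (kept if ok else skipped).append(rec)
    return kept, skipped

# Hypothetical records trimmed to the fields the filter uses.
sample = [
    {"url": "/a", "httpStatusCode": 200, "wordCount": 1842},
    {"url": "/b", "httpStatusCode": 404, "wordCount": 0},
    {"url": "/c", "httpStatusCode": 200, "wordCount": 12},
]
kept, skipped = select_ingestible(sample)
```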
API access
Developers can run this Actor programmatically through the Apify API or the Apify Python and JavaScript clients.
- API reference: Apify API
- Client docs: Apify clients
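As a sketch of what a programmatic call can look like, the snippet below builds a request for the Apify API v2 `run-sync-get-dataset-items` endpoint using only the standard library. The Actor ID shown is a placeholder; copy the real one from this Actor's API tab:

```python
import json
from urllib.parse import urlencode

# Placeholder Actor ID; replace with the ID from this Actor's API tab.
ACTOR_ID = "username~website-content-extractor"

def build_run_request(token: str, run_input: dict) -> tuple[str, bytes]:
    """Build the URL and JSON body for a synchronous Actor run that
    returns dataset items directly (Apify API v2)."""
    url = (
        f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
        f"?{urlencode({'token': token})}"
    )
    return url, json.dumps(run_input).encode("utf-8")

url, body = build_run_request("MY_APIFY_TOKEN", {"crawlMode": "sitemap", "maxPages": 3})
# POST `body` to `url` with Content-Type: application/json,
# e.g. via urllib.request or any HTTP client.
```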